{"id":3735,"date":"2022-02-08T01:15:22","date_gmt":"2022-02-08T01:15:22","guid":{"rendered":"https:\/\/wiki.thomasandsofia.com\/?p=3735"},"modified":"2022-02-08T11:06:17","modified_gmt":"2022-02-08T11:06:17","slug":"bqfbd-efficient-schema-design-in-bigquery","status":"publish","type":"post","link":"https:\/\/wiki.thomasandsofia.com\/?p=3735","title":{"rendered":"BQFBD &#8211; Efficient Schema Design in BigQuery"},"content":{"rendered":"<p><a href=\"\/bigquery-for-big-data\/\">Main Menu<\/a><\/p>\n<h1>Section 5: Efficient Schema Design in BigQuery<\/h1>\n<h2>25. Design an Efficient schema for BigQuery Tables<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22735695#questions<\/p>\n<h3>Nested and Repeated Columns<\/h3>\n<ul>\n<li>Works best when <strong>denormalized<\/strong>\n<ul>\n<li>Rather than preserving relational schema, such as star or snowflake<\/li>\n<\/ul>\n<\/li>\n<li>Normalization\n<ul>\n<li>Technique to eliminate redundant data to minimize the insertion, deletion and update anomalies<\/li>\n<li>Saves a lot of space<\/li>\n<li>Involved multiple tables where\n<ul>\n<li>the core data is organized into one table<\/li>\n<li>Related data is stored in different tables.<\/li>\n<\/ul>\n<\/li>\n<li>Data is often fetched by including multiple tables through relationships to each other\n<ul>\n<li>Usually on some key: order_id or customer_id<\/li>\n<\/ul>\n<\/li>\n<li>Reduces redundancy and preserves data integrity<\/li>\n<li>Queries run slower: More tables to join, the queries have to jump from table to table<\/li>\n<\/ul>\n<\/li>\n<li>Denormalization\n<ul>\n<li>Strategy of allowing duplicate files values for a column in a table to gain processing performance<\/li>\n<li>Put everything in 1 tables and don&#8217;t care about redundancy.<\/li>\n<li>Performance improved because no need to jump between tables.<\/li>\n<li>Denormalized data can be processed in parallel using columnar processing.<\/li>\n<li>Downside:\n<ul>\n<li>Can reduce performance where there are 1 &#8211; to &#8211; Many relationships\n<ul>\n<li>Grouping by Order_Id where there is a lot of repetitive data<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>BQ supports both Nested and Repeated data\n<ul>\n<li>Nested is synonym for Struct data type<\/li>\n<li>Repeated is like an Array\n<ul>\n<li>In short, you have an array of structs.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Nested and Repeated columns maintain relationships without impacting performance of a normalized schema<\/li>\n<\/ul>\n<p>Example:<\/p>\n<p><a href=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nested-and-repeated-json.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3737\" src=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nested-and-repeated-json.png\" alt=\"\" width=\"904\" height=\"450\" srcset=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nested-and-repeated-json.png 904w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nested-and-repeated-json-300x149.png 300w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nested-and-repeated-json-768x382.png 768w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nested-and-repeated-json-150x75.png 150w\" sizes=\"auto, (max-width: 904px) 100vw, 904px\" \/><\/a><\/p>\n<pre>{\r\n  orderId: 1,\r\n  storeLoc: Newyork,\r\n  orderAmt: 1450,\r\n  custId: C01,\r\n  products: [\r\n   {\r\n     prodId: 1234,\r\n     prodCat: Cosmetics\r\n     prodName: Shampoo\r\n     prodPrice: 140\r\n   },{\r\n     prodId: 421\r\n     prodCat: Kitchen\r\n     prodName: Grinder\r\n     prodPrice: 240\r\n   }\r\n ]\r\n},\r\n{\r\n  orderId: 2,\r\n  storeLoc: Newyork,\r\n...<\/pre>\n<ul>\n<li>The Products field is a Nested and Repeated field\n<ul>\n<li>Repeated because it is an array<\/li>\n<li>Nested because it is a Struct<\/li>\n<\/ul>\n<\/li>\n<li>These can be imported from\n<ul>\n<li>JSON, Avro, Firestore exports and Datastore export files<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>26. Nested and Repeated Columns<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22730017#questions<\/p>\n<ul>\n<li>When importing a JSON, you should be able to use Schema Auto Detect.\n<ul>\n<li>The downside is all fields will be nullable.<\/li>\n<li>This is not good because we would require orderId and productId to be filled.<\/li>\n<li>Will need to create the schema manually.<\/li>\n<\/ul>\n<\/li>\n<li>First, specify the normal fields<\/li>\n<\/ul>\n<p><a href=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nrmalfields.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3740\" src=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nrmalfields.png\" alt=\"\" width=\"766\" height=\"221\" srcset=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nrmalfields.png 766w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nrmalfields-300x87.png 300w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nrmalfields-150x43.png 150w\" sizes=\"auto, (max-width: 766px) 100vw, 766px\" \/><\/a><\/p>\n<ul>\n<li>Then create the Repeated Array<\/li>\n<li>Type: RECORD = Struct<\/li>\n<li>Click the [+] symbol to add the array elements<\/li>\n<\/ul>\n<p><a href=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/struct-fields.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3741\" src=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/struct-fields.png\" alt=\"\" width=\"769\" height=\"200\" srcset=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/struct-fields.png 769w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/struct-fields-300x78.png 300w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/struct-fields-150x39.png 150w\" sizes=\"auto, (max-width: 769px) 100vw, 769px\" \/><\/a><\/p>\n<ul>\n<li>Finally, click [Create table]<\/li>\n<\/ul>\n<h3>How to query this table<\/h3>\n<p><a href=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nestedandrepeatedtable.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3742\" src=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nestedandrepeatedtable.png\" alt=\"\" width=\"1031\" height=\"172\" srcset=\"https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nestedandrepeatedtable.png 1031w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nestedandrepeatedtable-300x50.png 300w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nestedandrepeatedtable-1024x171.png 1024w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nestedandrepeatedtable-768x128.png 768w, https:\/\/wiki.thomasandsofia.com\/wp-content\/uploads\/2022\/02\/nestedandrepeatedtable-150x25.png 150w\" sizes=\"auto, (max-width: 1031px) 100vw, 1031px\" \/><\/a><\/p>\n<p><strong>Query this table where productName = Table<\/strong><\/p>\n<ul>\n<li>You have to &#8220;unnest&#8221; the array so each &#8220;flattened&#8221; row gets attached with its root columns<\/li>\n<\/ul>\n<pre>SELECT * from `PROJECT.DATASET.TABLE` CROSS JOIN unnest(products) as my_field WHERE my_field.productName = \"TABLE\"<\/pre>\n<ul>\n<li>You can now access all flattened columns as normal columns directly using &#8216;my_field.COLNAME&#8217; and apply any mathematical, string, or any function on it.<\/li>\n<\/ul>\n<h3>CROSS JOIN<\/h3>\n<ul>\n<li>is an expensive process so only select the columns you need<\/li>\n<\/ul>\n<pre>SELECT orderId, storeLocation, my_field.* from `PROJECT.DATASET.TABLE` CROSS JOIN unnest(products) as my_field WHERE my_field.productName = \"TABLE\"<\/pre>\n<p>You can replace CROSS JOIN with a comma<\/p>\n<pre>SELECT orderId, storeLocation, my_field.* from <code>PROJECT.DATASET.TABLE<\/code>, unnest(products) as my_field WHERE my_field.productName = \"TABLE\"<\/pre>\n<h3>Other Schemas<\/h3>\n<ul>\n<li>BQ also supports\n<ul>\n<li>3NF<\/li>\n<li>Snowflake<\/li>\n<li>Star<\/li>\n<\/ul>\n<\/li>\n<li>Use Nested and Repeated when your data naturally fits this data type\n<ul>\n<li>Best for performance<\/li>\n<\/ul>\n<\/li>\n<li>Use the others if that is the format it is provided in<\/li>\n<\/ul>\n<h2>Assignment #2<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/practice\/1257626#questions<\/p>\n<pre>{\"id\":\"1\",\"first_name\":\"John\",\"last_name\":\"Doe\",\"dob\":\"1968-01-22\",\"addresses\":[{\"status\":\"current\",\"address\":\"123 First Avenue\",\"city\":\"Seattle\",\"state\":\"WA\",\"zip\":\"11111\",\"numberOfYears\":\"1\"},{\"status\":\"previous\",\"address\":\"456 Main Street\",\"city\":\"Portland\",\"state\":\"OR\",\"zip\":\"22222\",\"numberOfYears\":\"5\"}]}\r\n{\"id\":\"2\",\"first_name\":\"Jane\",\"last_name\":\"Doe\",\"dob\":\"1980-10-16\",\"addresses\":[{\"status\":\"current\",\"address\":\"789 Any Avenue\",\"city\":\"New York\",\"state\":\"NY\",\"zip\":\"33333\",\"numberOfYears\":\"2\"},{\"status\":\"previous\",\"address\":\"321 Main Street\",\"city\":\"Hoboken\",\"state\":\"NJ\",\"zip\":\"44444\",\"numberOfYears\":\"3\"}]}<\/pre>\n<p><strong>Assignment: Pull all fields where city = &#8220;Portland&#8221;<\/strong><\/p>\n<pre>select id, first_name, last_name, dob, addr.* from `multi-340710.repnest.assg2`, unnest(addresses) as addr where city=\"Portland\"<\/pre>\n<p><strong>Observations<\/strong><\/p>\n<ul>\n<li>You have to add the primary (non-nested) fields manually.<\/li>\n<li>&#8220;Portland&#8221; is case sensitive.\u00a0 This is different from MySQL<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Main Menu Section 5: Efficient Schema Design in BigQuery 25. Design an Efficient schema for BigQuery Tables https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22735695#questions Nested and Repeated Columns Works best when denormalized Rather than preserving relational schema, such as star or snowflake Normalization Technique to eliminate redundant data to minimize the insertion, deletion and update anomalies Saves a lot of space ..<\/p>\n<div class=\"clear-fix\"><\/div>\n<p><a href=\"https:\/\/wiki.thomasandsofia.com\/?p=3735\" title=\"read more...\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[77,78],"tags":[],"class_list":["post-3735","post","type-post","status-publish","format-standard","hentry","category-bigquery","category-bigquery-for-big-data-engineers"],"_links":{"self":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3735","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3735"}],"version-history":[{"count":4,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3735\/revisions"}],"predecessor-version":[{"id":3746,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3735\/revisions\/3746"}],"wp:attachment":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3735"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3735"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3735"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}