BQFBD – Efficient Schema Design in BigQuery

February 8, 2022 BigQuery, BigQuery for Big Data Engineers

Section 5: Efficient Schema Design in BigQuery

25. Design an Efficient schema for BigQuery Tables

https://www.udemy.com/course/bigquery/learn/lecture/22735695#questions

Nested and Repeated Columns

Works best when denormalized
- Rather than preserving relational schema, such as star or snowflake
Normalization
- Technique to eliminate redundant data to minimize the insertion, deletion and update anomalies
- Saves a lot of space
- Involved multiple tables where
  - the core data is organized into one table
  - Related data is stored in different tables.
- Data is often fetched by including multiple tables through relationships to each other
  - Usually on some key: order_id or customer_id
- Reduces redundancy and preserves data integrity
- Queries run slower: More tables to join, the queries have to jump from table to table
Denormalization
- Strategy of allowing duplicate files values for a column in a table to gain processing performance
- Put everything in 1 tables and don’t care about redundancy.
- Performance improved because no need to jump between tables.
- Denormalized data can be processed in parallel using columnar processing.
- Downside:
  - Can reduce performance where there are 1 – to – Many relationships
    - Grouping by Order_Id where there is a lot of repetitive data
BQ supports both Nested and Repeated data
- Nested is synonym for Struct data type
- Repeated is like an Array
  - In short, you have an array of structs.
Nested and Repeated columns maintain relationships without impacting performance of a normalized schema

Example:

{
  orderId: 1,
  storeLoc: Newyork,
  orderAmt: 1450,
  custId: C01,
  products: [
   {
     prodId: 1234,
     prodCat: Cosmetics
     prodName: Shampoo
     prodPrice: 140
   },{
     prodId: 421
     prodCat: Kitchen
     prodName: Grinder
     prodPrice: 240
   }
 ]
},
{
  orderId: 2,
  storeLoc: Newyork,
...

The Products field is a Nested and Repeated field
- Repeated because it is an array
- Nested because it is a Struct
These can be imported from
- JSON, Avro, Firestore exports and Datastore export files

26. Nested and Repeated Columns

https://www.udemy.com/course/bigquery/learn/lecture/22730017#questions

When importing a JSON, you should be able to use Schema Auto Detect.
- The downside is all fields will be nullable.
- This is not good because we would require orderId and productId to be filled.
- Will need to create the schema manually.
First, specify the normal fields

Then create the Repeated Array
Type: RECORD = Struct
Click the [+] symbol to add the array elements

Finally, click [Create table]

How to query this table

Query this table where productName = Table

You have to “unnest” the array so each “flattened” row gets attached with its root columns

SELECT * from `PROJECT.DATASET.TABLE` CROSS JOIN unnest(products) as my_field WHERE my_field.productName = "TABLE"

You can now access all flattened columns as normal columns directly using ‘my_field.COLNAME’ and apply any mathematical, string, or any function on it.

CROSS JOIN

is an expensive process so only select the columns you need

SELECT orderId, storeLocation, my_field.* from `PROJECT.DATASET.TABLE` CROSS JOIN unnest(products) as my_field WHERE my_field.productName = "TABLE"

You can replace CROSS JOIN with a comma

SELECT orderId, storeLocation, my_field.* from PROJECT.DATASET.TABLE, unnest(products) as my_field WHERE my_field.productName = "TABLE"

Other Schemas

BQ also supports
- 3NF
- Snowflake
- Star
Use Nested and Repeated when your data naturally fits this data type
- Best for performance
Use the others if that is the format it is provided in

Assignment #2

https://www.udemy.com/course/bigquery/learn/practice/1257626#questions

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}

Assignment: Pull all fields where city = “Portland”

select id, first_name, last_name, dob, addr.* from `multi-340710.repnest.assg2`, unnest(addresses) as addr where city="Portland"

Observations

You have to add the primary (non-nested) fields manually.
“Portland” is case sensitive. This is different from MySQL

BQFBD – Efficient Schema Design in BigQuery

Section 5: Efficient Schema Design in BigQuery

25. Design an Efficient schema for BigQuery Tables

Nested and Repeated Columns

26. Nested and Repeated Columns

How to query this table

CROSS JOIN

Other Schemas

Assignment #2

LEAVE A COMMENT Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta