BQFBD – Clustered Tables in BigQuery

February 11, 2022 BigQuery, BigQuery for Big Data Engineers

Section 9: Clustered Tables in BigQuery

40. What is Clustering

https://www.udemy.com/course/bigquery/learn/lecture/22755755#questions

Clustered tables
- aka Bucketing in Hive
Further break down Partitioned Tables
Work best for very large unpartitioned databases, or very large partitions.
When a table or partitions is bucketed on a column, all data in that row will be moved to that bucket
Buckets can be grouped similar to Integer partitions. You can specify more than one value for the Bucket column.
Use filters on both the Partition and Bucket columns to significantly reduce the number of rows a query has to scan.
If the table/partition is < 1 GB, then clustering isn’t required.
Clustering can be done on a standard table or a partitioned table
Table Options
- Standard, non-partitioned, non clustered table
- Partitioned Table
- Clustered Table
- Partitioned + Clustered Table

Cluster Storage

Partitions are like Directories
- Buckets are physical files stored in those directories

41. When to use Clustering, Partitioning, or Both

https://www.udemy.com/course/bigquery/learn/lecture/22755771#questions

Partitioning Alone
- Need strict cost guarantees before running the query
- Need to manage Partition level Expirations
- Need to manage Partition level DML operations
Partitioning with Clusters
- No need of strict cost guarantees before running the query
- Very large Partitions
- Frequent filter and aggregate queries on the table
- BQ automatically sorts the data in the table based on clustering columns
Clustering alone
- Data is unique
- Average partition size is > 1GB
  - (He actually said < 1GB here, so which is it?)

42. Create Clustered Table

https://www.udemy.com/course/bigquery/learn/lecture/22756007#questions

Schema
- name string
- surname string
- id string
- count float
- date date
Partitioning
- date
Clustering order
- name, surname

43, Dos and Don’ts for Clustering

https://www.udemy.com/course/bigquery/learn/lecture/22756029#questions

Create clusters in the order that the fields are created for optimal performance
- Example: name, surname NOT surname, name
Do not use clustered columns in complex queries. (Same as Partitioned Tables)
- Example: Don’t do this: select * from `proj.ds.table` where cast(count as string) = "5"
Do not compare cluster columns to other columns
- Same as Partitioned tables

Quotas and Limitations

Only Standard SQL is supported for querying and writing
Can modify/remove clustering columns, but it will be for new data only
- I guess you would need to delete all the data in the cluster to completely remove it.
Clustering must be top-level columns only
- DATE, Boolean, Geography, Int64, Numeric, String, Timestamp
- No Struct or Array fields
- Same as Partition Tables
Max 4 clustering columns.
All quotas that apply to Partitioned Tables apply to clustering

LEAVE A COMMENT Cancel reply