Section 9: Clustered Tables in BigQuery
40. What is Clustering
https://www.udemy.com/course/bigquery/learn/lecture/22755755#questions
- Clustered tables
- aka Bucketing in Hive
- Further break down Partitioned Tables
- Work best for very large unpartitioned databases, or very large partitions.
- When a table or partitions is bucketed on a column, all data in that row will be moved to that bucket
- Buckets can be grouped similar to Integer partitions. You can specify more than one value for the Bucket column.
- Use filters on both the Partition and Bucket columns to significantly reduce the number of rows a query has to scan.
- If the table/partition is < 1 GB, then clustering isn’t required.
- Clustering can be done on a standard table or a partitioned table
- Table Options
- Standard, non-partitioned, non clustered table
- Partitioned Table
- Clustered Table
- Partitioned + Clustered Table
Cluster Storage
- Partitions are like Directories
- Buckets are physical files stored in those directories
41. When to use Clustering, Partitioning, or Both
https://www.udemy.com/course/bigquery/learn/lecture/22755771#questions
- Partitioning Alone
- Need strict cost guarantees before running the query
- Need to manage Partition level Expirations
- Need to manage Partition level DML operations
- Partitioning with Clusters
- No need of strict cost guarantees before running the query
- Very large Partitions
- Frequent filter and aggregate queries on the table
- BQ automatically sorts the data in the table based on clustering columns
- Clustering alone
- Data is unique
- Average partition size is > 1GB
- (He actually said < 1GB here, so which is it?)
42. Create Clustered Table
https://www.udemy.com/course/bigquery/learn/lecture/22756007#questions
- Schema
- name string
- surname string
- id string
- count float
- date date
- Partitioning
- date
- Clustering order
- name, surname
43, Dos and Don’ts for Clustering
https://www.udemy.com/course/bigquery/learn/lecture/22756029#questions
- Create clusters in the order that the fields are created for optimal performance
- Example: name, surname NOT surname, name
- Do not use clustered columns in complex queries. (Same as Partitioned Tables)
- Example: Don’t do this:
select * from `proj.ds.table` where cast(count as string) = "5"
- Example: Don’t do this:
- Do not compare cluster columns to other columns
- Same as Partitioned tables
Quotas and Limitations
- Only Standard SQL is supported for querying and writing
- Can modify/remove clustering columns, but it will be for new data only
- I guess you would need to delete all the data in the cluster to completely remove it.
- Clustering must be top-level columns only
- DATE, Boolean, Geography, Int64, Numeric, String, Timestamp
- No Struct or Array fields
- Same as Partition Tables
- Max 4 clustering columns.
- All quotas that apply to Partitioned Tables apply to clustering