BQFBD – Clustered Tables in BigQuery

  BigQuery, BigQuery for Big Data Engineers

Main Menu

Section 9: Clustered Tables in BigQuery

40. What is Clustering

https://www.udemy.com/course/bigquery/learn/lecture/22755755#questions

  • Clustered tables
    • aka Bucketing in Hive
  • Further break down Partitioned Tables
  • Work best for very large unpartitioned databases, or very large partitions.
  • When a table or partitions is bucketed on a column, all data in that row will be moved to that bucket
  • Buckets can be grouped similar to Integer partitions.  You can specify more than one value for the Bucket column.
  • Use filters on both the Partition and Bucket columns to significantly reduce the number of rows a query has to scan.
  • If the table/partition is < 1 GB, then clustering isn’t required.
  • Clustering can be done on a standard table or a partitioned table
  • Table Options
    • Standard, non-partitioned, non clustered table
    • Partitioned Table
    • Clustered Table
    • Partitioned + Clustered Table

Cluster Storage

  • Partitions are like Directories
    • Buckets are physical files stored in those directories

41. When to use Clustering, Partitioning, or Both

https://www.udemy.com/course/bigquery/learn/lecture/22755771#questions

  • Partitioning Alone
    • Need strict cost guarantees before running the query
    • Need to manage Partition level Expirations
    • Need to manage Partition level DML operations
  • Partitioning with Clusters
    • No need of strict cost guarantees before running the query
    • Very large Partitions
    • Frequent filter and aggregate queries on the table
    • BQ automatically sorts the data in the table based on clustering columns
  • Clustering alone
    • Data is unique
    • Average partition size is > 1GB
      • (He actually said < 1GB here, so which is it?)

42. Create Clustered Table

https://www.udemy.com/course/bigquery/learn/lecture/22756007#questions

  • Schema
    • name string
    • surname string
    • id string
    • count float
    • date date
  • Partitioning
    • date
  • Clustering order
    • name, surname

 

43, Dos and Don’ts for Clustering

https://www.udemy.com/course/bigquery/learn/lecture/22756029#questions

  • Create clusters in the order that the fields are created for optimal performance
    • Example: name, surname NOT surname, name
  • Do not use clustered columns in complex queries.  (Same as Partitioned Tables)
    • Example: Don’t do this: select * from `proj.ds.table` where cast(count as string) = "5"
  • Do not compare cluster columns to other columns
    • Same as Partitioned tables

Quotas and Limitations

  • Only Standard SQL is supported for querying and writing
  • Can modify/remove clustering columns, but it will be for new data only
    • I guess you would need to delete all the data in the cluster to completely remove it.
  • Clustering must be top-level columns only
    • DATE, Boolean, Geography, Int64, Numeric, String, Timestamp
    • No Struct or Array fields
    • Same as Partition Tables
  • Max 4 clustering columns.
  • All quotas that apply to Partitioned Tables apply to clustering

 

LEAVE A COMMENT