BQFBD – Partitioned Tables in BigQuery

  BigQuery, BigQuery for Big Data Engineers


Section 8: Partitioned Tables in BigQuery

33. What is Partitioning and its benefits

https://www.udemy.com/course/bigquery/learn/lecture/22751511#questions

Partition Tables

  • Special tables divided into segments, called partitions, based on specific criteria.
  • Queries can be performed on these partitions, reducing the scan size.
    • Makes table management easy
    • Enhances query performance
    • Cost effective
  • Group the data and put an index on each group
    • Example: Frequent queries on ‘department’
      • Group the data by department, making sure all similar department data is stored in the same location.
    • Only a smaller subset of the whole table is required for queries.

Partition Table Advantages

  • Partitioning improves query performance
  • Cost effective
  • Increases parallelism
  • Partitions can be managed independently
  • Different partitions can live on different disk sub-systems
    • Less frequently accessed data can live on slower disks
  • Helpful for upserts on bulk data
    • Common for data warehousing for doing daily incremental loads.
    • Example 1: Huge, non-partitioned table in the data warehouse
      • Source table holds the current month's data to date, with daily upserts
        • If today is the 21st, the source table holds 21 days of data
      • When loading this source, every record in the source gets compared to every record in the data warehouse, which may contain years and years of data.
    • Example 2: Same, but data warehouse is partitioned
      • Drop the current month partition from the Data warehouse
      • Replace it with the new data
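The drop-and-replace step from Example 2 can be sketched in BigQuery DML. All table and column names here (`proj.dw.sales`, `proj.staging.sales_current_month`, `sale_date`) are hypothetical placeholders:

```sql
-- Drop the current month's partition by deleting its rows
DELETE FROM `proj.dw.sales`
WHERE sale_date >= DATE("2022-02-01") AND sale_date < DATE("2022-03-01");

-- Replace it with the fresh month-to-date data from the source
INSERT INTO `proj.dw.sales`
SELECT * FROM `proj.staging.sales_current_month`;
```

Because the DELETE filters on the partition column, only the current month's partition is touched, rather than scanning years of history.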

34. Ingestion Time Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22751515#questions

  • Tables can be partitioned by:
    • Ingestion time
    • Date/Timestamp column
    • Integer Column
  • Partitioning and Partition type is set during table creation
  • These can be viewed in the table Details after creation

Ingestion Time

    • BQ automatically loads data into daily, date-based partitions based on the arrival date
      • UTC Only!

Example:

  • Create table same as any other.
  • Partition and cluster settings
    • Partition by ingestion time
    • Partitioning type:
      • By day
        • Default
        • Best for large tables spanning many months and years
      • By hour
        • Best for smaller date ranges (< 6mo)
        • Good for high volume data
        • Will adversely affect performance on very large tables spanning many months and years.
    • Partition filter: [ X ] Require partition filter
      • This option will REQUIRE a query to use a WHERE clause.
      • May reduce costs and improve performance.
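The console settings above correspond roughly to DDL along these lines (dataset/table names are hypothetical; `require_partition_filter` matches the "Require partition filter" checkbox):

```sql
-- Daily ingestion-time partitioning with a required partition filter
CREATE TABLE `proj.ds.events` (name STRING, count INT64)
PARTITION BY _PARTITIONDATE
OPTIONS (require_partition_filter = TRUE);

-- Hourly variant
CREATE TABLE `proj.ds.events_hourly` (name STRING, count INT64)
PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR);
```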

Pseudo columns

  • These tables will automatically create new pseudo columns depending on the partitioning type
    • Daily
      • _PARTITIONTIME containing the full timestamp, truncated to the day boundary
      • _PARTITIONDATE containing only the date
        • Convenient for filtering by date without timestamp arithmetic
    • Hourly
      • Only _PARTITIONTIME, truncated to the hour; there is no _PARTITIONDATE
  • Can use any standard filtering operators on these columns
    • AND, OR, >, <, =, !=, etc.
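For example, a range filter on the pseudo column prunes the scan to the matching daily partitions (names hypothetical):

```sql
SELECT name
FROM `proj.ds.table`
WHERE _PARTITIONTIME >= TIMESTAMP("2022-02-01")
  AND _PARTITIONTIME < TIMESTAMP("2022-03-01")
```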

Viewing pseudo-column data in a query

  • You must use an alias for the column name
select _PARTITIONTIME as pt, _PARTITIONDATE as pd, name from `proj.ds.table` where _PARTITIONDATE = DATE("YYYY-MM-DD")

ROW  pt                       pd          name
1    2022-02-09 00:00:00 UTC  2022-02-09  Mary
2    2022-02-09 00:00:00 UTC  2022-02-09  Jim

__PARTITIONS_SUMMARY__ Meta Tables

  • These tables contain the metadata about partitions in a time-partitioned table.
    • Queries have to use Legacy SQL
    • There will be 1 row for each partition, so each new date upload will get a new row
      • Multiple uploads during the same day will share the same row
      • They will just update the last_modified columns.
select * from [ds.table$__PARTITIONS_SUMMARY__]
  • Row
  • project_id: proj
  • dataset_id: ds
  • table_id: table
  • partition_id: YYYYMMDD (no hyphens)
  • creation_time: UNIX_TIMESTAMP
  • creation_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC
  • last_modified_time: UNIX_TIMESTAMP
  • last_modified_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC

Using Non-UTC Timestamps

  • Extra columns
    • Create an extra column in your table that stores the current timestamp
    • Handle the difference in your query using both the pseudo column and your current-timestamp column
    • select * from `proj.ds.table` where _PARTITIONTIME=TIMESTAMP_ADD(MY_TIMESTAMP_FIELD, INTERVAL 8 HOUR)
  • Use Partition Decorators to load data into a specific partition

35. Date or Timestamp Column Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22751521#questions

Creating the Table

  • Create the table as usual
  • The partitioning field must be:
    • DATE or TIMESTAMP
    • Top-level field
      • Cannot be a ‘leaf’ field from a Struct
    • REQUIRED or NULLABLE
      • Cannot be REPEATED
  • Partitioning
    • Partition by field: Select your column
      • It will be available in the dropdown
  • The pseudo column will NOT be created since you are using an existing field
    • You WILL still get the __PARTITIONS_SUMMARY__ table
  • Upload your data
    • Data that is NULL or Invalid
      • NULL data will be stored in the __NULL__ partition
        • select * from [ds.table$__NULL__]
      • Invalid data will be stored in the __UNPARTITIONED__ partition
        • select * from [ds.table$__UNPARTITIONED__]
select * from `proj.ds.table` where DATE_FIELD = DATE("YYYY-MM-DD")
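The creation steps above can also be expressed as DDL (hypothetical names; `txn_date` stands in for your top-level DATE column):

```sql
CREATE TABLE `proj.ds.sales` (
  txn_date DATE,   -- top-level DATE column used for partitioning
  amount NUMERIC
)
PARTITION BY txn_date;
```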

Streaming Data

  • Streaming data is originally recorded in the __UNPARTITIONED__ partition
  • Streaming records are NOT stored to the disk as soon as they are received
    • Recv > write, recv > write, … would be VERY inefficient.
  • They are instead first placed in a buffer.
  • When the buffer is full, or set time has passed, the contents are written in batches
  • SLA says the data must be available as soon as it is received, so queries against it will still work, even if it has not yet been written.
    • Query will hit both the table and the buffer
    • To the user, it appears the data is only coming from the table
  • Once the data is written to disk, it will be recorded in the appropriate partition

36. Integer based Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22751527#questions

Integer Partitions

  • Requires at least one top-level INTEGER column
    • The same rules apply as for DATE and TIMESTAMP columns
  • Provide a Start, End and Interval
    • Start is Inclusive, End is Exclusive
      • Start will always be included in the Partitioning
      • End may or may not be, depending on the Interval size.
    • Example: Start = 2, End = 105, Interval = 10
      • Partition1 = 2-11
      • Partition2 = 12-21
      • Partition10 = 92-101
    • Data with -1, 1, 102, 105, 200 will be written to the __UNPARTITIONED__ partition
    • Null values will be written to the __NULL__ partition
select * from [ds.table$__PARTITIONS_SUMMARY__]
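The Start/End/Interval example above maps to RANGE_BUCKET DDL (hypothetical names):

```sql
-- Partitions cover [2, 102) in steps of 10;
-- values outside the range land in __UNPARTITIONED__, NULLs in __NULL__
CREATE TABLE `proj.ds.emp` (dept INT64, name STRING)
PARTITION BY RANGE_BUCKET(dept, GENERATE_ARRAY(2, 105, 10));
```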

Example Query

select * from `proj.ds.table` where MY_INT between 30 and 50
  • This example will only scan the ’30’, ’40’ and ’50’ partitions

Partitioned Tables Pricing and Limitations

  • There is no extra cost to create Partitioned tables.
    • You’re still charged for your storage and queries
  • Limitations
    • Max Partitions allowed per table: 4000
    • Max partitions modified by a single job = 4000
      • Any job that affects more than 4000 partitions is rejected by BQ
    • Max partitions modified per day
      • These include all operations
        • Load, query, DML Insert, updates, deletes that MODIFY partitions
        • Read only queries do not apply
      • Ingestion Time : 5000
      • Column partitioned : 30,000

37. ALTER, COPY operations on Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22755517#questions

Modifying Partitioned Tables

  • No different than other tables.

Partition Expirations

alter table `proj.ds.table` set options (partition_expiration_days=10)

  • Not currently supported in the console
  • You Cannot apply different expiration times to individual partitions
    • When you update the partition expiration, the settings apply to all partitions, regardless of when they are created
  • Table expiration takes precedence. If the table expires, all partitions will be deleted regardless of their expiration date

Levels of Expiration

  • Dataset Level
  • Table Level
  • Partition level
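Each level can be set with `ALTER ... SET OPTIONS`; a sketch with hypothetical names (option names follow BigQuery DDL):

```sql
-- Dataset level: default partition expiration for newly created tables
ALTER SCHEMA `proj.ds`
SET OPTIONS (default_partition_expiration_days = 30);

-- Table level: expires the whole table, all partitions included
ALTER TABLE `proj.ds.table`
SET OPTIONS (expiration_timestamp = TIMESTAMP "2030-01-01 00:00:00 UTC");

-- Partition level: expires individual partitions N days after creation
ALTER TABLE `proj.ds.table`
SET OPTIONS (partition_expiration_days = 10);
```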

Copying Partition Tables

  • No different than standard tables
  • Data, partitions, metadata will be copied
  • Copy a non-partitioned table into a partitioned table.
    • Example: A new partition will be created for the current date
  • Copy a partitioned table into a non-partitioned table
    • Data is appended or overwritten
  • Copy multiple tables (partitioned or non-partitioned) into a partitioned table
    • You cannot copy a mixture of partitioned and non-partitioned tables in the same job
    • If copying multiple partitioned tables, the partition specs for all source tables must match the destination table’s specs.
      • if destination table partitions on ColX, then all source tables must also partition on ColX
    • Source and Destination tables must be in datasets in the same location
    • You can copy select partitions.
      • Not supported by console

38. DML operations on Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22755691#questions

Ingestion Time Partitions

# Delete an entire partition
delete from `proj.ds.table` where _PARTITIONTIME = TIMESTAMP("YYYY-MM-DD")

# Insert a row
insert into `proj.ds.table` (_PARTITIONTIME, name, gender, count) values (TIMESTAMP("YYYY-MM-DD"), "Sophia", "F", 100)

# Update a row
update `proj.ds.table` set name = "Sofia" where name = "Sophia" AND _PARTITIONTIME = TIMESTAMP("YYYY-MM-DD")
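The daily-upsert pattern from Section 33 is often written as a single MERGE restricted to the target partition. A sketch with hypothetical names (`proj.ds.staging` is an assumed source table), assuming _PARTITIONTIME can be supplied in the INSERT clause as in the example above:

```sql
MERGE `proj.ds.table` t
USING `proj.ds.staging` s
ON t.name = s.name
   AND t._PARTITIONTIME = TIMESTAMP("2022-02-09")  -- prunes to one partition
WHEN MATCHED THEN
  UPDATE SET count = s.count
WHEN NOT MATCHED THEN
  INSERT (_PARTITIONTIME, name, gender, count)
  VALUES (TIMESTAMP("2022-02-09"), s.name, s.gender, s.count)
```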

39. Best Practices for Partitioning

https://www.udemy.com/course/bigquery/learn/lecture/22751531#questions

Best Practices

  • There must be at least one filter that only references the partition column. (Obviously)
  • Avoid combining the partition filter with other filters in the WHERE clause using OR.
    • i.e. avoid OR; combining with AND is fine
  • Do not include any expressions using other columns in the filter
    • Example: dept is the INTEGER Partition field
    • select name from `proj.ds.table` where dept + count = 4
  • Filters that include subqueries can’t be used to limit the number of partitions scanned
    • Example: select name from `proj.ds.table` where dept = (select max(dept) from proj.ds.table)
    • This would be OK if we replaced the subquery with a direct value.
  • Always use the pseudo column by itself on one side of a comparison
    • i.e. avoid wrapping the partition column in calculations
    • Example: Don’t do this
    • select name from `proj.ds.table` where TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 1 DAY) > "YYYY-MM-DD"
    • Do this instead:
    • select name from `proj.ds.table` where _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP("YYYY-MM-DD"), INTERVAL 1 DAY)
  • Isolate the Partition column while expressing a filter.
    • Don’t compare the Partition column with other columns in the WHERE clause
    • Example: select name from `proj.ds.table` where dept = SOME_OTHER_INTEGER_COLUMN
  • Do not create too many partitions.
    • Every partition has to have metadata maintained.
    • As the number of partitions increases, so does the metadata overhead.
    • If you’re partitioning years’ worth of data by the hour, performance will be about the same as if you had not partitioned at all.
      • In this case, you are better off using Cluster tables.  See next section.

 
