BQFBD – Partitioned Tables in BigQuery

  BigQuery, BigQuery for Big Data Engineers


Section 8: Partitioned Tables in BigQuery

33. What is Partitioning and its benefits

https://www.udemy.com/course/bigquery/learn/lecture/22751511#questions

Partition Tables

  • Special tables divided into segments, called partitions, based on specific criteria.
  • Queries can be performed on these partitions, reducing the scan size.
    • Makes table management easy
    • Enhances query performance
    • Cost effective
  • Group the data and put an index on each group
    • Example: Frequent queries on ‘department’
      • Group the data by department, making sure all similar department data is stored in the same location.
    • Only a smaller subset of the whole table is required for queries.

Partition Table Advantages

  • Partitioning improves query performance
  • Cost effective
  • Increases parallelism
  • Partitions can be managed independently
  • Different partitions can live on different disk sub-systems
    • Less frequently accessed data can live on slower disks
  • Helpful for upserts on bulk data
    • Common for data warehousing for doing daily incremental loads.
    • Example 1: Huge, non-partitioned table in the data warehouse
      • Source table holds the current month's data to date, with daily upserts
        • If today is the 21st, the source table holds 21 days of data
      • When loading this source, every record in the source gets compared to every record in the data warehouse, which may contain years and years of data.
    • Example 2: Same, but data warehouse is partitioned
      • Drop the current month partition from the Data warehouse
      • Replace it with the new data
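The drop-and-replace step from Example 2 can be sketched in BigQuery DML. All table and column names here (`proj.dw.sales`, `proj.staging.sales_current_month`, `sale_date`) are hypothetical placeholders:

```sql
-- Drop the current month's partition by deleting its rows
DELETE FROM `proj.dw.sales`
WHERE sale_date >= DATE("2022-02-01") AND sale_date < DATE("2022-03-01");

-- Replace it with the fresh month-to-date data from the source
INSERT INTO `proj.dw.sales`
SELECT * FROM `proj.staging.sales_current_month`;
```

Because the DELETE filters on the partition column, only the current month's partition is touched, rather than scanning years of history.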

34. Ingestion Time Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22751515#questions

  • Tables can be partitioned by:
    • Ingestion time
    • Date/Timestamp column
    • Integer Column
  • Partitioning and Partition type is set during table creation
  • These can be viewed in the table Details after creation

Ingestion Time

    • BQ automatically loads data into daily, date-based partitions based on the arrival date
      • UTC Only!

Example:

  • Create table same as any other.
  • Partition and cluster settings
    • Partition by ingestion time
    • Partitioning type:
      • By day
        • Default
        • Best for large tables spanning many months and years
      • By hour
        • Best for smaller date ranges (< 6mo)
        • Good for high volume data
        • Will adversely affect performance on very large tables spanning many months and years.
    • Partition filter: [ X ] Require partition filter
      • This option will REQUIRE a query to use a WHERE clause.
      • May reduce costs and improve performance.
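The console settings above correspond roughly to DDL along these lines (dataset/table names are hypothetical; `require_partition_filter` matches the "Require partition filter" checkbox):

```sql
-- Daily ingestion-time partitioning with a required partition filter
CREATE TABLE `proj.ds.events` (name STRING, count INT64)
PARTITION BY _PARTITIONDATE
OPTIONS (require_partition_filter = TRUE);

-- Hourly variant
CREATE TABLE `proj.ds.events_hourly` (name STRING, count INT64)
PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR);
```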

Pseudo columns

  • These tables will automatically create new pseudo columns depending on the partitioning type
    • Daily
      • _PARTITIONTIME containing the full timestamp, truncated to the day boundary
      • _PARTITIONDATE containing only the date
        • Convenient for filtering by date without timestamp arithmetic
    • Hourly
      • Only _PARTITIONTIME, truncated to the hour; there is no _PARTITIONDATE
  • Can use any standard filtering operators on these columns
    • AND, OR, >, <, =, !=, etc.
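For example, a range filter on the pseudo column prunes the scan to the matching daily partitions (names hypothetical):

```sql
SELECT name
FROM `proj.ds.table`
WHERE _PARTITIONTIME >= TIMESTAMP("2022-02-01")
  AND _PARTITIONTIME < TIMESTAMP("2022-03-01")
```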

Viewing pseudo-column data in a query

  • You must use an alias for the column name
select _PARTITIONTIME as pt, _PARTITIONDATE as pd, name from `proj.ds.table` where _PARTITIONDATE = DATE("YYYY-MM-DD")

ROW  pt                       pd          name
1    2022-02-09 00:00:00 UTC  2022-02-09  Mary
2    2022-02-09 00:00:00 UTC  2022-02-09  Jim

__PARTITIONS_SUMMARY__ Meta Tables

  • These tables contain the metadata about partitions in a time-partitioned table.
    • Queries have to use Legacy SQL
    • There will be 1 row for each partition, so each new date upload will get a new row
      • Multiple uploads during the same day will share the same row
      • They will just update the last_modified columns.
select * from [ds.table$__PARTITIONS_SUMMARY__]
  • Row
  • project_id: proj
  • dataset_id: ds
  • table_id: table
  • partition_id: YYYYMMDD (no hyphens)
  • creation_time: UNIX_TIMESTAMP
  • creation_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC
  • last_modified_time: UNIX_TIMESTAMP
  • last_modified_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC

Using Non-UTC Timestamps

  • Extra columns
    • Create an extra column in your table that stores the current timestamp
    • Handle the difference in your query using both the pseudo column and your current-timestamp column
    • select * from `proj.ds.table` where _PARTITIONTIME=TIMESTAMP_ADD(MY_TIMESTAMP_FIELD, INTERVAL 8 HOUR)
  • Use Partition Decorators to load data into a specific partition

35. Date or Timestamp Column Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22751521#questions

Creating the Table

  • Create the table as usual
  • The partitioning field must be:
    • DATE or TIMESTAMP
    • Top-level field
      • Cannot be a ‘leaf’ field from a Struct
    • REQUIRED or NULLABLE
      • Cannot be REPEATED
  • Partitioning
    • Partition by field: Select your column
      • It will be available in the dropdown
  • The pseudo column will NOT be created since you are using an existing field
    • You WILL still get the __PARTITIONS_SUMMARY__ table
  • Upload your data
    • Data that is NULL or Invalid
      • NULL data will be stored in the __NULL__ partition
        • select * from [ds.table$__NULL__]
      • Invalid data will be stored in the __UNPARTITIONED__ partition
        • select * from [ds.table$__UNPARTITIONED__]
select * from `proj.ds.table` where DATE_FIELD = DATE("YYYY-MM-DD")
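The creation steps above can also be expressed as DDL (hypothetical names; `txn_date` stands in for your top-level DATE column):

```sql
CREATE TABLE `proj.ds.sales` (
  txn_date DATE,   -- top-level DATE column used for partitioning
  amount NUMERIC
)
PARTITION BY txn_date;
```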

Streaming Data

  • Streaming data is originally recorded in the __UNPARTITIONED__ partition
  • Streaming records are NOT stored to the disk as soon as they are received
    • Recv > write, recv > write, … would be VERY inefficient.
  • They are instead first placed in a buffer.
  • When the buffer is full, or set time has passed, the contents are written in batches
  • SLA says the data must be available as soon as it is received, so queries against it will still work, even if it has not yet been written.
    • Query will hit both the table and the buffer
    • To the user, it appears the data is only coming from the table
  • Once the data is written to disk, it will be recorded in the appropriate partition

36. Integer based Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22751527#questions

Integer Partitions

  • Requires at least one top-level INTEGER column
    • The same rules apply as for DATE and TIMESTAMP columns
  • Provide a Start, End and Interval
    • Start is Inclusive, End is Exclusive
      • Start will always be included in the Partitioning
      • End may or may not be, depending on the Interval size.
    • Example: Start = 2, End = 105, Interval = 10
      • Partition1 = 2-11
      • Partition2 = 12-21
      • Partition10 = 92-101
    • Data with -1, 1, 102, 105, 200 will be written to the __UNPARTITIONED__ partition
    • Null values will be written to the __NULL__ partition
select * from [ds.table$__PARTITIONS_SUMMARY__]
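The Start/End/Interval example above maps to RANGE_BUCKET DDL (hypothetical names):

```sql
-- Partitions cover [2, 102) in steps of 10;
-- values outside the range land in __UNPARTITIONED__, NULLs in __NULL__
CREATE TABLE `proj.ds.emp` (dept INT64, name STRING)
PARTITION BY RANGE_BUCKET(dept, GENERATE_ARRAY(2, 105, 10));
```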

Example Query

select * from `proj.ds.table` where MY_INT between 30 and 50
  • This example will only scan the ’30’, ’40’ and ’50’ partitions

Partitioned Tables Pricing and Limitations

  • There is no extra cost to create Partitioned tables.
    • You’re still charged for your storage and queries
  • Limitations
    • Max Partitions allowed per table: 4000
    • Max partitions modified by a single job = 4000
      • Any job that affects more than 4000 partitions is rejected by BQ
    • Max partitions modified per day
      • These include all operations
        • Load, query, DML Insert, updates, deletes that MODIFY partitions
        • Read only queries do not apply
      • Ingestion Time : 5000
      • Column partitioned : 30,000

37. ALTER, COPY operations on Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22755517#questions

Modifying Partitioned Tables

  • No different than other tables.

Partition Expirations

alter table `proj.ds.table` set options (partition_expiration_days=10)

  • Not currently supported in the console
  • You Cannot apply different expiration times to individual partitions
    • When you update the partition expiration, the settings apply to all partitions, regardless of when they are created
  • Table expiration takes precedence. If the table expires, all partitions will be deleted regardless of their expiration date

Levels of Expiration

  • Dataset Level
  • Table Level
  • Partition level
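Each level can be set with `ALTER ... SET OPTIONS`; a sketch with hypothetical names (option names follow BigQuery DDL):

```sql
-- Dataset level: default partition expiration for newly created tables
ALTER SCHEMA `proj.ds`
SET OPTIONS (default_partition_expiration_days = 30);

-- Table level: expires the whole table, all partitions included
ALTER TABLE `proj.ds.table`
SET OPTIONS (expiration_timestamp = TIMESTAMP "2030-01-01 00:00:00 UTC");

-- Partition level: expires individual partitions N days after creation
ALTER TABLE `proj.ds.table`
SET OPTIONS (partition_expiration_days = 10);
```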

Copying Partition Tables

  • No different than standard tables
  • Data, partitions, metadata will be copied
  • Copy a non-partitioned table into a partitioned table.
    • Example: A new partition will be created for the current date
  • Copy a partitioned table into a non-partitioned table
    • Data is appended or overwritten
  • Copy multiple tables (partitioned or non-partitioned) into a partitioned table
    • You cannot copy a mixture of partitioned and non-partitioned tables in the same job
    • If copying multiple partitioned tables, the partition specs for all source tables must match the destination table’s specs.
      • if destination table partitions on ColX, then all source tables must also partition on ColX
    • Source and Destination tables must be in datasets in the same location
    • You can copy select partitions.
      • Not supported by console

38. DML operations on Partitioned Tables

https://www.udemy.com/course/bigquery/learn/lecture/22755691#questions

Ingestion Time Partitions

# Delete an entire partition
delete from `proj.ds.table` where _PARTITIONTIME = TIMESTAMP("YYYY-MM-DD")

# Insert a row
insert into `proj.ds.table` (_PARTITIONTIME, name, gender, count) values (TIMESTAMP("YYYY-MM-DD"), "Sophia", "F", 100)

# Update a row
update `proj.ds.table` set name = "Sofia" where name = "Sophia" AND _PARTITIONTIME = TIMESTAMP("YYYY-MM-DD")
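The daily-upsert pattern from Section 33 is often written as a single MERGE restricted to the target partition. A sketch with hypothetical names (`proj.ds.staging` is an assumed source table), assuming _PARTITIONTIME can be supplied in the INSERT clause as in the example above:

```sql
MERGE `proj.ds.table` t
USING `proj.ds.staging` s
ON t.name = s.name
   AND t._PARTITIONTIME = TIMESTAMP("2022-02-09")  -- prunes to one partition
WHEN MATCHED THEN
  UPDATE SET count = s.count
WHEN NOT MATCHED THEN
  INSERT (_PARTITIONTIME, name, gender, count)
  VALUES (TIMESTAMP("2022-02-09"), s.name, s.gender, s.count)
```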

39. Best Practices for Partitioning

https://www.udemy.com/course/bigquery/learn/lecture/22751531#questions

Best Practices

  • There must be at least one filter that only references the partition column. (Obviously)
  • Avoid combining the partition filter with other filters in the WHERE clause using OR.
    • i.e. avoid OR; combining with AND is fine
  • Do not include any expressions using other columns in the filter
    • Example: dept is the INTEGER Partition field
    • select name from `proj.ds.table` where dept + count = 4
  • Filters that include subqueries can’t be used to limit the number of partitions scanned
    • Example: select name from `proj.ds.table` where dept = (select max(dept) from proj.ds.table)
    • This would be OK if we replaced the subquery with a direct value.
  • Always use the pseudo column by itself on one side of a comparison
    • i.e. avoid wrapping the partition column in calculations
    • Example: Don’t do this
    • select name from `proj.ds.table` where TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 1 DAY) > "YYYY-MM-DD"
    • Do this instead:
    • select name from `proj.ds.table` where _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP("YYYY-MM-DD"), INTERVAL 1 DAY)
  • Isolate the Partition column while expressing a filter.
    • Don’t compare the Partition column with other columns in the WHERE clause
    • Example: select name from `proj.ds.table` where dept = SOME_OTHER_INTEGER_COLUMN
  • Do not create too many partitions.
    • Every partition has to have metadata maintained.
    • As the number of partitions increases, so does the metadata overhead.
    • If you’re partitioning years’ worth of data by the hour, performance will be about the same as if you had not partitioned at all.
      • In this case, you are better off using Cluster tables.  See next section.

 
