Section 8: Partitioned Tables in BigQuery
33. What is Partitioning and its benefits
https://www.udemy.com/course/bigquery/learn/lecture/22751511#questions
Partition Tables
- Special tables divided into segments, called partitions, based on specific criteria.
- Queries can be performed on these partitions, reducing the scan size.
- Makes table management easy
- Enhances query performance
- Cost effective
- Group the data and put an index on each group
- Example: Frequent queries on ‘department’
- Group the data by department, making sure all similar department data is stored in the same location.
- Only a smaller subset of the whole table is required for queries.
Partition Table Advantages
- Partitioning improves query performance
- Cost effective
- Increases parallelism
- Independent partitions can be managed independently
- Different partitions can live on different disk sub-systems
- Less frequently accessed data can live on slower disks
- Helpful for upserts on bulk data
- Common for data warehousing for doing daily incremental loads.
- Example 1: Huge non-partitioned table in the data warehouse
- Source table holds the current month's data to date, with daily upserts
- If today is the 21st, the source table holds 21 days of data
- When this source is loaded, every source record must be compared against every record in the data warehouse, which may contain years of data
- Example 2: Same, but data warehouse is partitioned
- Drop the current month partition from the Data warehouse
- Replace it with the new data
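A minimal sketch of the partitioned approach, assuming a hypothetical warehouse table `proj.ds.warehouse` partitioned on a DATE column load_date and a hypothetical staging table `proj.ds.stage` holding the current month's data:
# Hypothetical sketch: remove the current month's data, then reload from staging
delete from `proj.ds.warehouse` where load_date >= DATE("2022-02-01")
# Re-insert the fresh month-to-date data
insert into `proj.ds.warehouse` select * from `proj.ds.stage`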
34. Ingestion Time Partitioned Tables
https://www.udemy.com/course/bigquery/learn/lecture/22751515#questions
- Tables can be partitioned by:
- Ingestion time
- Date/Timestamp column
- Integer Column
- Partitioning and the partition type are set during table creation
- These can be viewed in the table Details after creation
Ingestion Time
- BQ automatically loads data into daily, date-based partitions based on the arrival date
- UTC Only!
Example:
- Create table same as any other.
- Partition and cluster settings
- Partition by ingestion time
- Partitioning type:
- By day
- Default
- Best for large tables spanning many months and years
- By hour
- Best for smaller date ranges (< 6mo)
- Good for high volume data
- Will adversely affect performance on very large tables spanning many months and years.
- Partition filter: [ X ] Require partition filter
- This option REQUIRES every query to include a WHERE clause that filters on the partitioning column.
- May reduce costs and improve performance.
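The same console setup can also be written as DDL. A minimal sketch, assuming a hypothetical table `proj.ds.names`:
# Hypothetical DDL: daily ingestion-time partitioning with a required partition filter
create table `proj.ds.names` (name STRING, gender STRING, count INT64)
partition by _PARTITIONDATE
options (require_partition_filter = true)
# For hourly partitions, use: partition by TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR)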
Pseudo columns
- These tables will automatically create new pseudo columns depending on the type
- Daily
- _PARTITIONTIME containing the ingestion date as a TIMESTAMP (truncated to midnight UTC for daily partitions)
- _PARTITIONDATE containing only the date
- Either column can be used in filters to prune partitions; _PARTITIONDATE is convenient for comparing against DATE values
- Hourly
- only _PARTITIONTIME for hourly tables.
- Can use any standard filtering operators on these columns
- AND, OR, >, <, =, !=, etc.
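For example, a range filter on the pseudo column (hypothetical table name) scans only the matching daily partitions:
select name from `proj.ds.table` where _PARTITIONTIME >= TIMESTAMP("2022-02-01") AND _PARTITIONTIME < TIMESTAMP("2022-03-01")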
Viewing pseudo-column data in a query
- You must use an alias for the column name
select _PARTITIONTIME as pt, _PARTITIONDATE as pd, name from `proj.ds.table` where _PARTITIONDATE = DATE("YYYY-MM-DD")
Row | pt | pd | name
1 | 2022-02-09 00:00:00 UTC | 2022-02-09 | Mary
2 | 2022-02-09 00:00:00 UTC | 2022-02-09 | Jim
__PARTITIONS_SUMMARY__ Meta Tables
- These tables contain the metadata about partitions in a time-partitioned table.
- Queries have to use Legacy SQL
- There will be 1 row for each partition, so each new date upload will get a new row
- Multiple uploads during the same day will share the same row
- They will just update the last_modified columns.
select * from [ds.table$__PARTITIONS_SUMMARY__]
- Row
- project_id: proj
- dataset_id: ds
- table_id: table
- partition_id: YYYYMMDD (no hyphens)
- creation_time: UNIX_TIMESTAMP
- creation_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC
- last_modified_time: UNIX_TIMESTAMP
- last_modified_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC
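For example, the newest partition can be found from the meta table (Legacy SQL, hypothetical dataset/table names):
select MAX(partition_id) from [ds.table$__PARTITIONS_SUMMARY__]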
Using Non-UTC Timestamps
- Extra columns
- Create an extra column in your table that stores the timestamp in your (non-UTC) time zone
- Handle the time-zone difference in your query by comparing the pseudo column with that timestamp column
select * from `proj.ds.table` where _PARTITIONTIME = TIMESTAMP_ADD(MY_TIMESTAMP_FIELD, INTERVAL 8 HOUR)
- Use partition decorators (table$YYYYMMDD) to load data into a specific partition
35. Date or Timestamp Column Partitioned Tables
https://www.udemy.com/course/bigquery/learn/lecture/22751521#questions
Creating the Table
- Create the table as usual
- The partitioning field must be:
- DATE or TIMESTAMP
- Top-level field
- Cannot be a ‘leaf’ field from a Struct
- REQUIRED or NULLABLE
- Cannot be REPEATED
- Partitioning
- Partition by field: Select your column (it will be available in the dropdown); see the DDL sketch after this list
- The pseudo columns will NOT be created, since you are partitioning on an existing field
- You WILL still get the __PARTITIONS_SUMMARY__ table
- Upload your data
- Data that is NULL or Invalid
- NULL data will be stored in the __NULL__ partition
select * from [ds.table$__NULL__]
- Invalid data will be stored in the __UNPARTITIONED__ partition
select * from [ds.table$__UNPARTITIONED__]
select * from `proj.ds.table` where DATE_FIELD = DATE("YYYY-MM-DD")
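The console steps above map to DDL like this (a sketch, assuming a hypothetical table with a top-level DATE column named DATE_FIELD):
# Hypothetical DDL: partition on an existing DATE column
create table `proj.ds.table2` (name STRING, DATE_FIELD DATE)
partition by DATE_FIELD
options (require_partition_filter = true)
# For a TIMESTAMP column, partition on its date: partition by DATE(TS_FIELD)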
Streaming Data
- Streaming data is originally recorded in the __UNPARTITIONED__ partition
- Streaming records are NOT stored to the disk as soon as they are received
- A receive-then-write round trip for every single record would be VERY inefficient.
- They are instead first placed in a buffer.
- When the buffer is full, or a set time has passed, the contents are written in batches
- The SLA says data must be available as soon as it is received, so queries will still work even if the data has not yet been written.
- Query will hit both the table and the buffer
- To the user, it appears the data is only coming from the table
- Once the data is written to disk, it will be recorded in the appropriate partition
36. Integer based Partitioned Tables
https://www.udemy.com/course/bigquery/learn/lecture/22751527#questions
Integer Partitions
- Requires at least one top-level INTEGER column to partition on
- The same field rules apply as for DATE and TIMESTAMP partitioning
- Provide a Start, End and Interval
- Start is Inclusive, End is Exclusive
- Start will always be included in the Partitioning
- End may or may not be, depending on the Interval size.
- Example: Start = 2, End = 105, Interval = 10
- Partition1 = 2-11
- Partition2 = 12-21
- …
- Partition10 = 92-101
- Data with -1, 1, 102, 105, 200 will be written to the __UNPARTITIONED__ partition
- Null values will be written to the __NULL__ partition
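In DDL, the Start, End and Interval from the example above map to RANGE_BUCKET with GENERATE_ARRAY (hypothetical table and column names):
# Hypothetical DDL: integer-range partitioning with Start = 2, End = 105, Interval = 10
create table `proj.ds.int_table` (name STRING, MY_INT INT64)
partition by RANGE_BUCKET(MY_INT, GENERATE_ARRAY(2, 105, 10))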
select * from [ds.table$__PARTITIONS_SUMMARY__]
Example Query
select * from `proj.ds.table` where MY_INT between 30 and 50
- This example will only scan the three partitions whose ranges cover the values 30–50
Partitioned Tables Pricing and Limitations
- There is no extra cost to create Partitioned tables.
- You’re still charged for your storage and queries
- Limitations
- Max Partitions allowed per table: 4000
- Max partitions modified by a single job = 4000
- Any job that affects more than 4000 partitions is rejected by BQ
- Max partitions modified per day
- These include all operations
- Load, query, DML Insert, updates, deletes that MODIFY partitions
- Read only queries do not apply
- Ingestion Time : 5000
- Column partitioned : 30,000
37. ALTER, COPY operations on Partitioned Tables
https://www.udemy.com/course/bigquery/learn/lecture/22755517#questions
Modifying Partitioned Tables
- No different than other tables.
Partition Expirations
alter table `proj.ds.table` set options (partition_expiration_days=10)
- Not currently supported in the console
- You Cannot apply different expiration times to individual partitions
- When you update the partition expiration, the setting applies to all partitions, regardless of when they were created
- Table expiration takes precedence. If the table expires, all partitions will be deleted regardless of their expiration date
Levels of Expiration
- Dataset Level
- Table Level
- Partition level
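A sketch of setting the first two levels (hypothetical names; per the note above, individual partitions cannot be given their own expiration):
# Dataset level: default partition expiration for tables created in the dataset
alter schema `proj.ds` set options (default_partition_expiration_days = 30)
# Table level: partition expiration for one table
alter table `proj.ds.table` set options (partition_expiration_days = 10)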
Copying Partition Tables
- No different than standard tables
- Data, partitions, metadata will be copied
- Copy a non-partitioned table into a partitioned table
- Example: with an ingestion-time partitioned destination, the data lands in the partition for the current date
- Copy a partitioned table into a non-partitioned table
- Data is appended or overwritten
- Copy multiple non-partitioned or partitioned tables into a partitioned table
- You cannot copy a mixture of partitioned and non-partitioned tables in the same job
- If copying multiple partitioned tables, the partition specs for all source tables must match the destination table’s specs.
- if destination table partitions on ColX, then all source tables must also partition on ColX
- Source and Destination tables must be in datasets in the same location
- You can copy individual partitions by addressing them with partition decorators (e.g. table$YYYYMMDD)
- Not supported in the console
38. DML operations on Partitioned Tables
https://www.udemy.com/course/bigquery/learn/lecture/22755691#questions
Ingestion Time Partitions
# Delete an entire partition
delete from `proj.ds.table` where _PARTITIONTIME = TIMESTAMP("YYYY-MM-DD")
# Insert a row
insert into `proj.ds.table` (_PARTITIONTIME, name, gender, count) values (TIMESTAMP("YYYY-MM-DD"), "Sophia", "F", 100)
# Update a row
update `proj.ds.table` set name = "Sofia" where name = "Sophia" AND _PARTITIONTIME = TIMESTAMP("YYYY-MM-DD")
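For date/timestamp column partitioned tables, the same DML applies, filtering on the partitioning column itself (hypothetical column name DATE_FIELD):
# Delete an entire partition of a column-partitioned table
delete from `proj.ds.table` where DATE_FIELD = DATE("YYYY-MM-DD")
# Update rows within one partition
update `proj.ds.table` set count = count + 1 where name = "Sophia" AND DATE_FIELD = DATE("YYYY-MM-DD")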
39. Best Practices for Partitioning
https://www.udemy.com/course/bigquery/learn/lecture/22751531#questions
Best Practices
- To get partition pruning, there must be at least one filter that references only the partition column. (Obviously)
- Avoid combining your main partition filter with other filters in a way that defeats pruning
- i.e. avoid OR; AND is fine
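- Example (hypothetical columns): AND keeps partition pruning, OR forces a full scan
# partitions pruned
select name from `proj.ds.table` where _PARTITIONDATE = DATE("YYYY-MM-DD") AND gender = "F"
# no pruning: every partition is scanned
select name from `proj.ds.table` where _PARTITIONDATE = DATE("YYYY-MM-DD") OR gender = "F"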
- Do not include any expressions using other columns in the filter
- Example: dept is the INTEGER Partition field
select name from `proj.ds.table` where dept + count = 4
- Filters that include subqueries can’t be used to limit the number of partitions scanned
- Example:
select name from `proj.ds.table` where dept = (select max(dept) from `proj.ds.table`)
- This would be OK if we replaced the subquery with a direct value.
- Always use the pseudo column by itself on one side of the comparison
- i.e. avoid wrapping the partition column in calculations or functions
- Example: Don’t do this
select name from `proj.ds.table` where TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 1 DAY) > TIMESTAMP("YYYY-MM-DD")
- Do this instead:
select name from `proj.ds.table` where _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP("YYYY-MM-DD"), INTERVAL 1 DAY)
- Isolate the Partition column while expressing a filter.
- Don’t compare the Partition column with other columns in the WHERE clause
- Example:
select name from `proj.ds.table` where dept = SOME_OTHER_INTEGER_COLUMN
- Do not create too many partitions.
- Every partition has to have metadata maintained.
- As the number of partitions increases, so does the metadata overhead.
- If you partition years' worth of data by the hour, performance can end up about the same as if you had not partitioned at all.
- In that case, you are better off using clustered tables. See the next section.