{"id":3758,"date":"2022-02-09T12:06:45","date_gmt":"2022-02-09T12:06:45","guid":{"rendered":"https:\/\/wiki.thomasandsofia.com\/?p=3758"},"modified":"2022-02-11T00:34:32","modified_gmt":"2022-02-11T00:34:32","slug":"bqfbd-partitioned-tables-in-bigquery","status":"publish","type":"post","link":"https:\/\/wiki.thomasandsofia.com\/?p=3758","title":{"rendered":"BQFBD &#8211; Partitioned Tables in BigQuery"},"content":{"rendered":"<p><a href=\"\/bigquery-for-big-data\/\">Main Menu<\/a><\/p>\n<h1>Section 8: Partitioned Tables in BigQuery<\/h1>\n<h2>33. What is Partitioning and its benefits<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22751511#questions<\/p>\n<h3>Partition Tables<\/h3>\n<ul>\n<li>Special tables divided into segments, called partitions, based on specific criteria.<\/li>\n<li>Queries can be performed on these partitions, reducing the scan size.\n<ul>\n<li>Makes table management easy<\/li>\n<li>Enhances query performance<\/li>\n<li>Cost effective<\/li>\n<\/ul>\n<\/li>\n<li>Group the data and put an index on each group\n<ul>\n<li>Example: Frequent queries on &#8216;department&#8217;\n<ul>\n<li>Group the data by department, making sure all similar department data is stored in the same location.<\/li>\n<\/ul>\n<\/li>\n<li>Only a smaller subset of the whole table is required for queries.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Partition Table Advantages<\/h3>\n<ul>\n<li>Partitioning improves query performance<\/li>\n<li>Cost effective<\/li>\n<li>Increases parallelism<\/li>\n<li>Independent partitions can be managed independently<\/li>\n<li>Different partitions can live on different disk sub-systems\n<ul>\n<li>Less frequently accessed data can live on slower disks<\/li>\n<\/ul>\n<\/li>\n<li>Helpful for upserts on bulk data\n<ul>\n<li>Common for data warehousing for doing daily incremental loads.<\/li>\n<li>Example 1:Huge table in Data warehouse, non partitioned\n<ul>\n<li>Source table holds current month data to date with daily 
upserts\n<ul>\n<li>if today is the 21st, the source table holds 21 days of data<\/li>\n<\/ul>\n<\/li>\n<li>When loading this source, every record in the source gets compared to every record in the data warehouse, which may contain years and years of data.<\/li>\n<\/ul>\n<\/li>\n<li>Example 2: Same, but the data warehouse is partitioned\n<ul>\n<li>Drop the current month&#8217;s partition from the data warehouse<\/li>\n<li>Replace it with the new data<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>34. Ingestion Time Partitioned Tables<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22751515#questions<\/p>\n<ul>\n<li>Tables can be partitioned by:\n<ul>\n<li>Ingestion time<\/li>\n<li>Date\/Timestamp column<\/li>\n<li>Integer column<\/li>\n<\/ul>\n<\/li>\n<li>Partitioning and the partition type are set during table creation<\/li>\n<li>These can be viewed in the table Details after creation<\/li>\n<\/ul>\n<h3>Ingestion Time<\/h3>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>BQ automatically loads data into daily, date-based partitions based on the arrival date\n<ul>\n<li>UTC only!<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h4>Example:<\/h4>\n<ul>\n<li>Create the table the same as any other.<\/li>\n<li><strong>Partition and cluster settings<\/strong>\n<ul>\n<li>Partition by ingestion time<\/li>\n<li>Partitioning type:\n<ul>\n<li>By day\n<ul>\n<li>Default<\/li>\n<li>Best for large tables spanning many months and years<\/li>\n<\/ul>\n<\/li>\n<li>By hour\n<ul>\n<li>Best for smaller date ranges (&lt; 6mo)<\/li>\n<li>Good for high-volume data<\/li>\n<li>Will adversely affect performance on very large tables spanning many months and years.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Partition filter: [ X ] Require partition filter\n<ul>\n<li>This option REQUIRES every query to use a WHERE clause that filters on the partition.<\/li>\n<li>May reduce costs and improve performance.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Pseudo columns<\/h3>\n<ul>\n<li>These tables will automatically create new 
pseudo columns depending on the partitioning type\n<ul>\n<li>Daily\n<ul>\n<li>_PARTITIONTIME containing the partition boundary as a TIMESTAMP (midnight UTC of the partition date, as in the sample output below)<\/li>\n<li>_PARTITIONDATE containing the same value as a DATE<\/li>\n<\/ul>\n<\/li>\n<li>Hourly\n<ul>\n<li>only _PARTITIONTIME for hourly tables.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Can use any standard filtering operators on these columns\n<ul>\n<li>AND, OR, &gt;, &lt;, =, !=, etc.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Viewing pseudo-column data in a query<\/h3>\n<ul>\n<li>You must use an alias for the column name<\/li>\n<\/ul>\n<pre>select _PARTITIONTIME as pt, _PARTITIONDATE as pd, name from `proj.ds.table` where _PARTITIONDATE = DATE(\"YYYY-MM-DD\")\r\n\r\nROW  pt                       pd          name\r\n1    2022-02-09 00:00:00 UTC  2022-02-09  Mary\r\n2    2022-02-09 00:00:00 UTC  2022-02-09  Jim<\/pre>\n<h3>__PARTITIONS_SUMMARY__ Meta Tables<\/h3>\n<ul>\n<li>These tables contain the metadata about partitions in a time-partitioned table.\n<ul>\n<li>Queries have to use Legacy SQL<\/li>\n<li>There will be 1 row for each partition, so each new date&#8217;s upload will get a new row\n<ul>\n<li>Multiple uploads during the same day will share the same row<\/li>\n<li>They will just update the last_modified columns.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre>select * from [ds.table$__PARTITIONS_SUMMARY__]<\/pre>\n<ul>\n<li>Row<\/li>\n<li>project_id: proj<\/li>\n<li>dataset_id: ds<\/li>\n<li>table_id: table<\/li>\n<li>partition_id: YYYYMMDD (no hyphens)<\/li>\n<li>creation_time: UNIX_TIMESTAMP<\/li>\n<li>creation_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC<\/li>\n<li>last_modified_time: UNIX_TIMESTAMP<\/li>\n<li>last_modified_timestamp: YYYY-MM-DD HH:MM:SS.MMM UTC<\/li>\n<\/ul>\n<h3>Using Non-UTC Timestamps<\/h3>\n<ul>\n<li>Extra columns\n<ul>\n<li>Create an extra column in your table that stores the current timestamp<\/li>\n<li>Handle the differences in your query using the pseudo column 
and the current timestamp column<\/li>\n<li><code>select * from `proj.ds.table` where _PARTITIONTIME=TIMESTAMP_ADD(MY_TIMESTAMP_FIELD, INTERVAL 8 HOUR)<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Use partition decorators to load data into a specific partition<\/li>\n<\/ul>\n<h2>35. Date or Timestamp Column Partitioned Tables<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22751521#questions<\/p>\n<h3>Creating the Table<\/h3>\n<ul>\n<li>Create the table as usual<\/li>\n<li>The partitioning field must be:\n<ul>\n<li>DATE or TIMESTAMP<\/li>\n<li>A top-level field\n<ul>\n<li>Cannot be a &#8216;leaf&#8217; field from a Struct<\/li>\n<\/ul>\n<\/li>\n<li>REQUIRED or NULLABLE\n<ul>\n<li>Cannot be REPEATED<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Partitioning\n<ul>\n<li>Partition by field: Select your column\n<ul>\n<li>It will be available in the dropdown<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>The pseudo columns will NOT be created since you are using an existing field\n<ul>\n<li>You WILL still get the __PARTITIONS_SUMMARY__ table<\/li>\n<\/ul>\n<\/li>\n<li>Upload your data\n<ul>\n<li>Data that is NULL or invalid\n<ul>\n<li>NULL data will be stored in the __NULL__ partition\n<ul>\n<li><code>select * from [ds.table$__NULL__]<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Invalid data will be stored in the __UNPARTITIONED__ partition\n<ul>\n<li><code>select * from [ds.table$__UNPARTITIONED__]<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre>select * from `proj.ds.table` where DATE_FIELD = DATE(\"YYYY-MM-DD\")<\/pre>\n<h3>Streaming Data<\/h3>\n<ul>\n<li>Streaming data is originally <strong>recorded<\/strong> in the __UNPARTITIONED__ partition<\/li>\n<li>Streaming records are NOT stored to disk as soon as they are received\n<ul>\n<li>Receive &gt; write, receive &gt; write, &#8230; (one disk write per record) would be VERY inefficient.<\/li>\n<\/ul>\n<\/li>\n<li>They are instead first placed in a buffer.<\/li>\n<li>When the buffer is full, or a set time has passed, the contents 
are written in batches<\/li>\n<li>The SLA says the data must be available as soon as it is received, so queries against it will still work, even if it has not yet been written.\n<ul>\n<li>The query will hit both the table and the buffer<\/li>\n<li>To the user, it appears the data is only coming from the table<\/li>\n<\/ul>\n<\/li>\n<li>Once the data is written to disk, it will be recorded in the appropriate partition<\/li>\n<\/ul>\n<h2>36. Integer based Partitioned Tables<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22751527#questions<\/p>\n<h3>Integer Partitions<\/h3>\n<ul>\n<li>Requires an INTEGER column to partition on\n<ul>\n<li>The same rules apply as for DATE and TIMESTAMP<\/li>\n<\/ul>\n<\/li>\n<li>Provide a Start, End and Interval\n<ul>\n<li>Start is inclusive, End is exclusive\n<ul>\n<li>Start will always be included in the partitioning<\/li>\n<li>Values just below End may or may not be, depending on the Interval size.<\/li>\n<\/ul>\n<\/li>\n<li>Example: Start = 2, End = 105, Interval = 10\n<ul>\n<li>Partition1 = 2-11<\/li>\n<li>Partition2 = 12-21<\/li>\n<li>&#8230;<\/li>\n<li>Partition10 = 92-101<\/li>\n<\/ul>\n<\/li>\n<li>Data with -1, 1, 102, 105, 200 will be written to the __UNPARTITIONED__ partition<\/li>\n<li>Null values will be written to the __NULL__ partition<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre>select * from [ds.table$__PARTITIONS_SUMMARY__]<\/pre>\n<p><strong>Example Query<\/strong><\/p>\n<pre>select * from `proj.ds.table` where MY_INT between 30 and 50<\/pre>\n<ul>\n<li>This example will only scan the partitions covering values 30&#8211;50 (the &#8217;30&#8217;, &#8217;40&#8217; and &#8217;50&#8217; partitions for ranges starting at 0; partitions are identified by their start value)<\/li>\n<\/ul>\n<h3>Partitioned Tables Pricing and Limitations<\/h3>\n<ul>\n<li>There is no extra cost to create partitioned tables.\n<ul>\n<li>You&#8217;re still charged for your storage and queries<\/li>\n<\/ul>\n<\/li>\n<li>Limitations\n<ul>\n<li>Max partitions allowed per table: 4000<\/li>\n<li>Max partitions modified by a single job: 4000\n<ul>\n<li>Any job that affects more than 4000 
partitions is rejected by BQ<\/li>\n<\/ul>\n<\/li>\n<li>Max partitions modified per day\n<ul>\n<li>This includes all operations that MODIFY partitions\n<ul>\n<li>Load, query, DML insert, update, and delete jobs<\/li>\n<li>Read-only queries do not count<\/li>\n<\/ul>\n<\/li>\n<li>Ingestion time: 5000<\/li>\n<li>Column partitioned: 30,000<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>37. ALTER, COPY operations on Partitioned Tables<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22755517#questions<\/p>\n<h3>Modifying Partitioned Tables<\/h3>\n<ul>\n<li>No different from other tables.<\/li>\n<\/ul>\n<h3>Partition Expirations<\/h3>\n<p><code>alter table `proj.ds.table` set options (partition_expiration_days=10)<\/code><\/p>\n<ul>\n<li>Not currently supported in the console<\/li>\n<li>You cannot apply different expiration times to individual partitions\n<ul>\n<li>When you update the partition expiration, the setting applies to all partitions, regardless of when they were created<\/li>\n<\/ul>\n<\/li>\n<li>Table expiration takes precedence. 
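<\/li>\n<\/ul>\n<p>Sketch of the precedence, using the same placeholder table name as above (partition_expiration_days and expiration_timestamp are both real ALTER TABLE options; the values here are made up):<\/p>\n<pre># Partitions expire 10 days after their partition date\r\nalter table `proj.ds.table` set options (partition_expiration_days=10)\r\n\r\n# Table-level expiration; this wins over the partition expiration above\r\nalter table `proj.ds.table` set options (expiration_timestamp=TIMESTAMP \"2030-01-01 00:00:00 UTC\")<\/pre>\n<ul>\n<li>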
If the table expires, all partitions will be deleted regardless of their expiration date<\/li>\n<\/ul>\n<h3>Levels of Expiration<\/h3>\n<ul>\n<li>Dataset level<\/li>\n<li>Table level<\/li>\n<li>Partition level<\/li>\n<\/ul>\n<h3>Copying Partitioned Tables<\/h3>\n<ul>\n<li>No different from standard tables<\/li>\n<li>Data, partitions, and metadata will be copied<\/li>\n<li>Copy a non-partitioned table into a partitioned table\n<ul>\n<li>Example: A new partition will be created for the current date<\/li>\n<\/ul>\n<\/li>\n<li>Copy a partitioned table into a non-partitioned table\n<ul>\n<li>Data is appended or overwritten<\/li>\n<\/ul>\n<\/li>\n<li>Copy multiple (partitioned or non-partitioned) tables into a partitioned table\n<ul>\n<li>You cannot copy a mixture of partitioned and non-partitioned tables in the same job<\/li>\n<li>If copying multiple partitioned tables, the partition specs for all source tables must match the destination table&#8217;s specs.\n<ul>\n<li>If the destination table partitions on ColX, then all source tables must also partition on ColX<\/li>\n<\/ul>\n<\/li>\n<li>Source and destination tables must be in datasets in the same location<\/li>\n<li>You can copy selected partitions.\n<ul>\n<li>Not supported by the console<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>38. DML operations on Partitioned Tables<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22755691#questions<\/p>\n<h3>Ingestion Time Partitions<\/h3>\n<pre># Delete an entire partition\r\ndelete from `proj.ds.table` where _PARTITIONTIME = TIMESTAMP(\"YYYY-MM-DD\")\r\n\r\n# Insert a row\r\ninsert into `proj.ds.table` (_PARTITIONTIME, name, gender, count) values (TIMESTAMP(\"YYYY-MM-DD\"), \"Sophia\", \"F\", 100)\r\n\r\n# Update a row\r\nupdate `proj.ds.table` set name = \"Sofia\" where name = \"Sophia\" AND _PARTITIONTIME = TIMESTAMP(\"YYYY-MM-DD\")<\/pre>\n<h2>39. 
Best Practices for Partitioning<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22751531#questions<\/p>\n<h3>Best Practices<\/h3>\n<ul>\n<li>There must be at least one filter that only references the partition column. (Obviously)<\/li>\n<li>Avoid combining the main partition filter with extra filters in the WHERE clause.\n<ul>\n<li>i.e. avoid OR; AND should be fine<\/li>\n<\/ul>\n<\/li>\n<li>Do not include any expressions using other columns in the filter\n<ul>\n<li>Example: dept is the INTEGER partition field<\/li>\n<li><code>select name from `proj.ds.table` where dept + count = 4<\/code><\/li>\n<li>This filter cannot be used to prune partitions.<\/li>\n<\/ul>\n<\/li>\n<li>Filters that include subqueries can&#8217;t be used to limit the number of partitions scanned\n<ul>\n<li>Example: <code>select name from `proj.ds.table` where dept = (select max(dept) from proj.ds.table) <\/code><\/li>\n<li>This would be OK if we replaced the subquery with a direct value.<\/li>\n<\/ul>\n<\/li>\n<li>Always use the pseudo column by itself on one side of a comparison\n<ul>\n<li>i.e. 
avoid wrapping the partition column in calculations<\/li>\n<li>Example: Don&#8217;t do this<\/li>\n<li><code>select name from `proj.ds.table` where TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 1 DAY) &gt; \"YYYY-MM-DD\"<\/code><\/li>\n<li>Do this instead:<\/li>\n<li><code>select name from `proj.ds.table` where _PARTITIONTIME &gt; TIMESTAMP_SUB(TIMESTAMP(\"YYYY-MM-DD\"), INTERVAL 1 DAY)<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Isolate the partition column when expressing a filter.\n<ul>\n<li>Don&#8217;t compare the partition column with other columns in the WHERE clause<\/li>\n<li>Example: <code>select name from `proj.ds.table` where dept = SOME_OTHER_INTEGER_COLUMN<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Do not create too many partitions.\n<ul>\n<li>Every partition has metadata that must be maintained.<\/li>\n<li>As the number of partitions increases, so does the metadata overhead.<\/li>\n<li>If you&#8217;re partitioning years&#8217; worth of data by the hour, performance will be about the same as if you had not partitioned at all.\n<ul>\n<li>In this case, you are better off using clustered tables. See the next section.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Main Menu Section 8: Partitioned Tables in BigQuery 33. What is Partitioning and its benefits https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22751511#questions Partition Tables Special tables divided into segments, called partitions, based on specific criteria. Queries can be performed on these partitions, reducing the scan size. 
Makes table management easy Enhances query performance Cost effective Group the data and put an ..<\/p>\n<div class=\"clear-fix\"><\/div>\n<p><a href=\"https:\/\/wiki.thomasandsofia.com\/?p=3758\" title=\"read more...\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[77,78],"tags":[],"class_list":["post-3758","post","type-post","status-publish","format-standard","hentry","category-bigquery","category-bigquery-for-big-data-engineers"],"_links":{"self":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3758","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3758"}],"version-history":[{"count":7,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3758\/revisions"}],"predecessor-version":[{"id":3767,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3758\/revisions\/3767"}],"wp:attachment":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3758"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3758"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3758"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}