BQFBD – Materialized Views in BigQuery

  BigQuery, BigQuery for Big Data Engineers

Main Menu

50% Complete!!

Section 12: Materialized Views in BigQuery

51. What are Materialized Views

https://www.udemy.com/course/bigquery/learn/lecture/22758503#overview

Materialized Views

  • Stored physically on disk
    • These do occupy space and you will be charged
    • They do not rely on the base table unless you re-run or refresh the view query
  • Must be in the same Dataset at the source tables!
  • Inherit the expiration times of the source
  • Provide the latest data including streaming tables
  • Faster than normal views
  • Smart tuning
    • If a query or subquery on a table can be served with an already created Materialized View, BQ reroutes the query to use Materialized view for efficiency.
    • SELECT name, SUM(count) as total FROM `proj.ds.table` GROUP BY name
      • If this or a subquery of this can use a materialized view (mv), it will
    • Uses less bytes
  • Use Cases
    • ETL or BI pipelines
    • Highly computational queries like aggregations

Views vs. Materialized Views (Not part of the class)

Views

  • A short/simplified version of the query used to create the view.
  • Every time you run a query against the view, you are actually running the query that created the view in the background, then running your query against those results.
  • Your data is always up to date
  • If the “creation query” is expensive or slow, so will all queries against the view.
  • You can create a View in a different dataset than the original table(s)
    • This allows you to set different access permissions on the View
    • Example: You can grant access to the View while blocking access to the actual table

Materialized Views

  • Also a short/simplified version of the query used to create the view.
  • Materialized Views are actual tables created using the “create query”
  • Queries against the MV are ran against the table created, so are faster than Views
  • Your data can get stale if the MV are not refreshed
  • You cannot create a MV in a different dataset than the original tables.
  • You can use Partitioned Tables and Clustering

52. Create a Materialized View

https://www.udemy.com/course/bigquery/learn/lecture/22758507#overview

Note: Materialized Views are still in Beta and features and functionality are still under development

Create the view

  • Standard SQL
    • Cannot use Table Joins
    • Limited set of aggregation functions
      • APPROX_COUNT_DISTINCT
      • ARRAY_AGG
      • HLL_COUNT.INIT
      • MAX, MIN, SUM, COUNT, AVG
    • Stacked functions are not supported
      • ROUND(AVG(count))
  • CREATE MATERIALIZED VIEW proj.ds.table_mv as SELECT name, SUM(count) as total FROM proj.ds.table group by name
    • I don’t see any back-tics on the table names….
  • You cannot use `HAVING` clause.  To use this, use it on the MV.
  • UNLESS Auto-refresh is disabled, an auto-refresh will start running as soon as the initial build query runs.
  • Once it completes creation, it will appear in the table list.
  • You can create Partitioned and Clustered MVs
CREATE MATERIALIZED VIEW 
 proj.ds.mv 
 PARTITION BY date AS
SELECT 
 name, 
 SUM(count) as total,
 _PARTITIONDATE as date
FROM 
 proj.ds.partitioned_table 
 WHERE _PARTITONDATE = "YYYY_MM_DD"
GROUP BY
 1,3
  • 1, 3 refers to the position of the columns being returned

Smart Tuning Demonstration

  • Locate the code for the MV
    • MV > Details > Scroll down to Query
    • Copy/paste the query into the editor and run it.
      • Note that the source table is the original table and not an MV
    • Once the query completes, click the “Execution Details” tab near “Results”
    • Scroll down and expand a job. You’ll see it used the MV and not the original source table

To prevent smart tuning:

  • Add a where clause to the query
SELECT 
 name, 
 SUM(code) as total_count
FROM 
 proj.ds.table
  WHERE rand() < 1
GROUP BY 
 name

53. ALTER a Materialized View

https://www.udemy.com/course/bigquery/learn/lecture/22759071?start=0#overview

Allowed

  • CREATE MATERIALIZED VIEW…
  • ALTER MATERIALIZED VIEW `proj.ds.mv` SET OPTIONS…
    • (enable_refresh = true)
  • DROP MATERIALIZED VIEW `proj.ds.mv`

No others are allowed

Not allowed

  • Running COPY, IMPORT or EXPORT jobs where either the source or the destination is a materialized view
  • Writing query results into a materialized view
  • Using the BQ Storage API

 

54. Design an optimized query for Materialized View

https://www.udemy.com/course/bigquery/learn/lecture/22759107#overview

How to design a query such that it always hits the MV

Sub queries are supported

 

CREATE MATERIALIZED VIEW
  proj.ds.mv
AS SELECT
  name, 
  total_count
FROM (
  SELECT
    name,
    count(*) as total_count
  FROM
    proj.ds.table
  GROUP BY
    name
)

Use the grouping keys from the MV

  • Notice how you are actually calling the Table, not the Materialized View
  • The field ‘total_count’ does not exist on the table, so it will use the Materialized View instead.
SELECT
  name,
  total_count
FROM
  proj.ds.table

Performing computation on grouping keys

SELECT
  name,
  AVG(total_count) as ave_count
FROM
  proj.ds.table

Ad-hoc query contains filters on grouping columns OR direct filters of MV definition

Create MV

CREATE MATERIALIZED VIEW
  proj.ds.mv
AS SELECT
  DATE(shipdate) as shipdate,
  SUM(price) as sum_price
FROM 
  proj.ds.table
WHERE
  shipping_mode = "land"
GROUP BY 1

Ad-Hoc Query

SELECT
  shipdate,
  SUM(price) as sum_price
FROM
  proj.ds.table
WHERE
  [optional filter AND ]
  shipping_mode = "land"
GROUP BY
  1

Ad-hoc query contains a filter which is a subset filter of the MV definition

  • While the filter is not exactly the same, it is a subset of the Create view.

Create MV

CREATE MATERIALIZED VIEW
  proj.ds.mv
AS SELECT
  price,
  COUNT(*) as cnt
FROM 
  proj.ds.table
WHERE
  price IS NOT NULL
GROUP BY 1

Ad-hoc Query

SELECT
  price,
  COUNT(*) as cnt
FROM
  proj.ds.table
WHERE
  price > 100
GROUP BY 1

 

55. Auto and Manual Refreshes of Materialized Views

https://www.udemy.com/course/bigquery/learn/lecture/22766733#overview

Updates

  • If an update (alter) is performed prior to the next refresh
    • The portions that have changed will be invalidated and reread
    • If on a partition, only the partition will be invalidated and reread
  • If only inserts
    • The changes are added to the materialized view

Manual Refresh

  • REFRESH_MATERIALIZED_VIEW
    • BQ looks for the changes in the base table and applies those changes to the MV
    • CALL BQ.REFRESH_MATERIALIZED_VIEW('proj.ds.mv_table_name')
      • BQ is literal. It is not an abbreviation for ‘BigQuery’ in this case
      • The table name in the command is in single quotes, not back ticks.
      • View ‘Details’ to see the refresh time

Automatic

  • By default, Materialized Views are updated within 5 minutes of changes to the base table
    • Not ‘Every 5 minutes’
  • If constant changes are made to the base table
    • Refreshes are performed per Refresh Interval (ms) which is default 1,800,000 ms (30 minutes)
    • ‘Details’

Setting the Refresh Interval

  • Minimum refresh cap = 1 minute
  • Maximum = 7 days

At time of creation

CREATE MATERIALIZED VIEW
  proj.ds.mv
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS SELECT ...

Alter an existing MV

ALTER MATERIALIZED VIEW 
  `proj.ds.mv` 
SET OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)

Keeping the MV up to date

  • BQ ensures your query replies are ALWAYS up to date

Appends Only

  1. Fetch all materialized data
  2. Read the delta
  3. Adjust the aggregations as per delta
  4. Show results

Updates or Deletions

  • Materialized View might not be scanned
  • Query will hit the base table only
  • You might have to wait until the next refresh before the queries start using the MV again.

56. Limitations and Quotas of Materialized Views

https://www.udemy.com/course/bigquery/learn/lecture/22766737#overview

  • You cannot copy a MV either as a source or destination of a copy job
  • Cannot export data
  • Cannot load data into MVs using load query
  • Cannot write data to it – READ ONLY
  • No DML statements
  • SHOULD be in the same dataset location
  • Supports limited aggregation functions in query definitions
  • Only 1 reference table.  No joins during creation
  • Cannot be nested
  • Only Standard SQL
  • Maximum 20 MVs per base table

 

57. Best Practices

https://www.udemy.com/course/bigquery/learn/lecture/22766743#overview

  • Do not create Materialized views for every permutation of a query
    • Create them for broader use.
  • If your users often filter on a date range, add a date range filter on your query that includes that set of dates.
  • if the base table is large and partitioned, your MV should be partitioned as well.
    • If the MV is about the same size or larger than a partition, then partition the MV
  • Refreshes are expensive.  Do some analysis to determine the necessary refresh interval
    • if the base table is updated relatively infrequently, then refresh less often.
  • Batch DML statements and manual refresh your MVs after.

Work around for Joins on Materialized Views

  1. Create the MV on the aggregated view
  2. Run your query as shown

 

 

 

LEAVE A COMMENT