BQFBD – Using BigQuery Dashboard Options

Section 4: Using BigQuery Dashboard options

19. Running queries with various query settings

https://www.udemy.com/course/bigquery/learn/lecture/22729977#questions

Select top 10 male names from the table

select name, count from `PROJECT.DATASET.TABLE` where gender = "M" order by count desc limit 10

Notes:

  • “PROJECT.DATASET.TABLE” is enclosed in backticks (the ` character, on the same key as ~). Copy/paste often converts these to single quotes, so double-check them to make sure your query doesn’t fail.
  • Example table: quantum-balm-340112.troberts_dataset_oo1.names

Query results

  • Query complete (0.3 sec elapsed, 36.5 KB processed)
    • This query results in some billing because data was processed (a dry-run sketch for estimating this up front follows this list).
  • Running the query again will result in 0.0 sec with no bytes processed because the result was cached.
    • No billing on a cached reply!
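
To see that “bytes processed” number without paying for the run, you can dry-run the query first. A minimal sketch using the google-cloud-bigquery Python client, against the example table above:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name, count
    FROM `quantum-balm-340112.troberts_dataset_oo1.names`
    WHERE gender = "M"
    ORDER BY count DESC
    LIMIT 10
"""
# A dry run validates the query and reports the bytes it WOULD process,
# without actually running it or billing anything.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes")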

More Options

Format

  • Reformats your query string into a Python-ish indented layout, with column names and expressions indented beneath the SQL keywords.
SELECT
  name,
  count
FROM
  `PROJECT.DATASET.TABLE`
WHERE
  ...

Query Settings

Query engine

  • BigQuery engine
    • The BigQuery engine expects and processes table data in batches.
    • Used here because the table already exists (batch data, not streaming)
  • Cloud Dataflow engine
    • Used on streaming data, such as data from Pub/Sub.
    • This will be covered in depth later.

Destination

  • By default, BQ saves all query results to a table, temp or permanent
  • Temporary table
    • Randomly named and saved in a special dataset
    • Not available for sharing or querying, and not visible to users
    • You are NOT charged for storing temporary tables
    • Responsible for caching the query results
    • Saved for ~24 hours
  • Permanent tables
    • Can be new or existing tables in any dataset that you have access to.
      • Provide the project name, dataset name, and table name
    • The table is created automatically if it does not already exist
    • Results are stored in the table with the selected Write preference (see the sketch after this list)
      • Write if empty
      • Append to table
      • Overwrite table
    • You WILL be charged for this storage
    • The table being queried must be in the same location as the dataset that contains the destination table.
  • Results size
    • Overrides the maximum bytes the query can return
      • Default ~10GB
    • Checking this box allows any number of bytes to be returned/written to the destination table.
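
The destination table and write preference can also be set per job from the Python client. A minimal sketch; the PROJECT.DATASET names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # Placeholder destination table; it is created if it does not exist.
    destination="PROJECT.DATASET.top_names",
    # Write preference: WRITE_EMPTY ("Write if empty"),
    # WRITE_APPEND ("Append to table"), or WRITE_TRUNCATE ("Overwrite table").
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = 'SELECT name, count FROM `PROJECT.DATASET.TABLE` WHERE gender = "M"'
client.query(sql, job_config=job_config).result()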

Resource Management

  • Job priority
    • Interactive (default)
      • Start running immediately – basically on-demand
    • Batch
      • Start as soon as idle resources are available in the BQ shared resource pool, which usually happens within a few minutes.
      • Batch jobs do not count against concurrent query rate limits.
        • You can submit any number of queries in a single shot; they are queued and run as soon as resources are available in the shared pool.
      • If it hasn’t started within 24 hrs, BQ changes the job priority to interactive.
  • Cache preference
    • Use cached results (default) reuses the results from a previous run of the same query, unless the tables being queried have changed since then.
    • If you don’t want to use cached results, uncheck this box.
    • Both job priority and cache preference can also be set from client code (see the sketch after this list).
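
Both knobs map directly onto query job configuration fields in the Python client. A minimal sketch (the query and table name are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # Batch priority: queued until idle resources free up in the shared pool.
    priority=bigquery.QueryPriority.BATCH,
    # "Use cached results" (the default); set False to force a fresh run.
    use_query_cache=True,
)
job = client.query("SELECT name, count FROM `PROJECT.DATASET.TABLE`", job_config=job_config)
job.result()  # blocks until the batch job eventually runs and finishes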

20. Caching features & limitations

https://www.udemy.com/course/bigquery/learn/lecture/22731425#questions

*This is really a continuation of Lesson 19. Running queries with various query settings.

  • To retrieve cached data, the query must be exactly the same as a previous query
  • Data is NOT cached:
    • When the Destination table is specified to store the results
    • If the tables/views used in the query have changed since the results were cached.
    • If the tables have streaming ingestion.
    • If the query uses non-deterministic functions
      • Examples:
          • CURRENT_TIMESTAMP()
          • NOW()
          • CURRENT_USER()
    • If the query is run against an external data source like Bigtable or Cloud Storage
    • If the result set is greater than the maximum response size
      • 10GB compressed by default.
  • It is possible to restrict queries to return only cached results.
    • If no cached result is found, these queries will fail.
    • Very useful if you have a large team firing the same queries over and over.
    • Not available in the UI yet.
    • Available via the command line or the Python client (see the sketch below).
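
A sketch of the cache-only behavior. On the command line this is the --require_cache flag of bq query; through the API and the Python client, my understanding of the docs is that setting the create disposition to CREATE_NEVER makes the query fail unless it can be served entirely from cache:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # CREATE_NEVER: no results table may be created, so the job errors out
    # unless BigQuery can answer the query from the cache.
    create_disposition=bigquery.CreateDisposition.CREATE_NEVER,
)
job = client.query("SELECT name, count FROM `PROJECT.DATASET.TABLE`", job_config=job_config)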

Additional Settings

  • SQL Dialect
    • https://learnsql.com/blog/what-sql-dialect-to-learn/
    • Standard
      • The default, and the one used for most work
    • Legacy
  • Processing location
    • Where do you want to process the query?
    • Keep Auto-select
      • Data is processed in the same location where it is stored.
      • Selecting a location other than where the data lives will cause your query to fail…
        • Yup, in practice Auto-select is the only choice that works.

Advanced Options

Encryption

  • Google-managed key
  • Customer-managed key

Maximum Bytes billed

  • Specifies the maximum number of bytes the query is allowed to process (and therefore bill)
  • If the query would require more than the specified number, it fails.
    • You will not be charged for failed queries.
  • Test by setting your limit really low. 🙂 (See the sketch below.)
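
The same cap can be set per job from the Python client. A minimal sketch with a deliberately low limit (the query is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
# Cap the job at ~10 MB; if the query would process more than this,
# it fails up front and nothing is billed.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10_000_000)
client.query("SELECT name, count FROM `PROJECT.DATASET.TABLE`", job_config=job_config).result()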

21. Querying Wildcard Tables

https://www.udemy.com/course/bigquery/learn/lecture/22729991#questions

Additional Link: https://cloud.google.com/bigquery/docs/querying-wildcard-tables

  • Wildcard table queries allow you to query multiple tables using a wildcard.
  • These are the union of all tables that match a wildcard expression.
  • This works as long as you have multiple tables with similar names
    • This also requires that the tables have the same, or compatible schemas
    • se_table_1, se_table_2, se_table_3 …
    • select col1, col2 from `PROJECT.DATASET.se_table_*`

Example from BigQuery Example datasets

NOAA GSOD

  • This dataset contains tables for each year from 1929 through today.
    • gsod1929
    • gsod1930
    • gsod2022
  • We can query these all from the editor
    • Select the table, then click Preview in the bottom pane.
  • Without the ability to use wildcard tables this would be difficult to query
    • select col1, col2 from table_1 union all select col1, col2 from table_2 union all select ...
  • Instead, we use wildcards
SELECT 
  max, 
  ROUND((max-32)*5/9,1) celsius, 
  mo, da, year 
FROM 
  `bigquery-public-data.noaa_gsod.gsod195*` 
WHERE 
  max != 9999.9 #9999.9 represents missing data in these tables
ORDER BY
  max DESC
    • Note this query limited the data to the 1950s
    • Query took 8.1 seconds on 138.8MB data

Using _TABLE_SUFFIX pseudo columns

  • Wildcard table queries include an automatic pseudo column, _TABLE_SUFFIX, holding the value that the wildcard matched.
  • These can then be filtered in the WHERE clause
WHERE
  max != 9999.9
  AND (
    _TABLE_SUFFIX = '1' OR _TABLE_SUFFIX = '2' OR _TABLE_SUFFIX = '3'
  )
ORDER BY
  ...
  • You can also use the BETWEEN operator here
    • `_TABLE_SUFFIX BETWEEN '1' AND '3'`
  • Filtering on _TABLE_SUFFIX reduces the number of tables scanned, reducing costs.
  • Filters on _TABLE_SUFFIX that include subqueries querying datasets cannot be used to limit the number of tables scanned for a wildcard table.

Basic filter

SELECT max FROM `bigquery-public-data.noaa_gsod.gsod19*` WHERE _TABLE_SUFFIX = '49'
  • 3.4MB Data

Using a subquery

SELECT max 
FROM `bigquery-public-data.noaa_gsod.gsod19*` 
WHERE 
  _TABLE_SUFFIX = (SELECT '49' AS a FROM `bigquery-public-data.noaa_gsod.gsod1949` LIMIT 1)
  • 634.2MB Data
  • Because the filter uses a subquery, BigQuery cannot prune tables: the scan reads every table from 1900 – 1999!

This, however, works:

SELECT max 
FROM `bigquery-public-data.noaa_gsod.gsod19*` 
WHERE 
_TABLE_SUFFIX = (SELECT a FROM (SELECT '49' AS a))

(Presumably this works because the subquery is a constant expression that BigQuery can evaluate without scanning any tables, so it can still prune.)

22. Wildcard Table Limitations

https://www.udemy.com/course/bigquery/learn/lecture/22734259#questions

  • These only support native BigQuery storage
    • External tables and views are not allowed
  • Cached results are not supported.
    • You will be billed for each run.
  • Queries that use DML statements cannot use wildcard tables as the target
    • You cannot insert or update a wildcard table.
    • Use the actual table name!

How BQ determines the schema

  • When executing a SQL query that uses a wildcard table, BQ automatically uses the schema of the most recently created table that matches the wildcard as the schema for ALL matched tables
    • Example:
      • Create tables exam1, exam2, exam3
      • Now create table exam4
    • BQ will use the schema from exam4 because it was the most recently created table.
  • If the schema of one of the matched tables is missing a column that the query references, the query will throw an error
    • A clearer way of saying this: if the query requests a column that does not exist in the schema of the most recently created table, the query will fail
      • exam1 = ‘day’, ‘mo’
      • exam4 = ‘day’, ‘year’
      • Query for ‘mo’ will fail, but query for ‘year’ should be OK.
  • Longer table prefixes perform better than shorter ones, since they narrow the set of matching tables up front

Long Prefix: Better performance

SELECT max FROM `PROJECT.DATASET.TABLE_NAME_195*` WHERE _TABLE_SUFFIX BETWEEN '0' AND '1'

Short Prefix: Worse performance

SELECT max FROM `PROJECT.DATASET.TABLE_NAME_*` WHERE _TABLE_SUFFIX BETWEEN '1950' AND '1951'

23. Schedule, Save, Share a Query

https://www.udemy.com/course/bigquery/learn/lecture/22730003?start=0#questions

  1. Type the ‘standard’ query into the Query Editor
    • Can include DDL and DML statements
      • DDL: Data Definition Language
        • CREATE, ALTER, DROP, TRUNCATE
      • DML: Data Manipulation Language
        • INSERT, UPDATE, DELETE, MERGE
  2. Click Schedule Query (a programmatic sketch follows this list)
    • Note: Requires the BigQuery Data Transfer Service, so it must be enabled!
      • Search transfer api
        • Select “Marketplace: BigQuery Transfer API”
        • Click [ Enable ]
    • Enable scheduled queries
    • Update scheduled query
    • Create new scheduled query
      • Details and schedule
        • Name for the schedule query
      • Schedule options
        • Repeats: Hourly, Daily, Weekly, Monthly, Custom (cron style), On demand
          • Shortest allowed period is 15 min.
        • Start date and run time
      • Destination for query results
        • DDL and DML queries do not require this since you’re updating the current table
        • Destination must be in same Region as source table
      • Notification options
        • Emails are sent only to the user that created the schedule.
        • This cannot be changed.
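
The same scheduled query can be created programmatically through the Data Transfer Service. A minimal sketch using the google-cloud-bigquery-datatransfer Python client; the project, dataset, display name, and query are all placeholders:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",   # where the results land
    display_name="Daily top names",        # name shown in the Scheduled Queries UI
    data_source_id="scheduled_query",      # marks this transfer as a scheduled query
    params={
        "query": "SELECT name, count FROM `PROJECT.DATASET.TABLE`",
        "destination_table_name_template": "top_names",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)
transfer_config = transfer_client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print(f"Created scheduled query: {transfer_config.name}")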

Back Fill

  • Used to run specific or missed scheduled queries over a specified date range.
  • Times listed will be in UTC, so you’ll need to enter carefully

Save Results

  • After running the query, you can [Save results]
    • CSV: Google Drive or local
    • JSON: Google Drive or local
    • BQ Table
    • Google Sheets
    • Clipboard

Save Query

  • Personal: visible only to the creator
  • Project: visible to project members with appropriate permissions.

Deleting Data

  • If you don’t delete unused tables, you will keep paying for their storage.
  • Click the dataset, then delete.

24. Schema Auto Detection

https://www.udemy.com/course/bigquery/learn/lecture/22730011#questions

  • Only available for CSV and JSON file formats
    • Other formats (e.g., Avro, Parquet, ORC) are self-describing, so detection isn’t required.
  • When auto detection is enabled
    • Selects a random file in the source
    • Scans up to 100 rows as a sample
    • Attempts to assign data type based on sample values.
  • After detection, it may change a field name to make it compatible with BQ SQL syntax
  • Auto detect recognizes
    • Gzip-compatible file compression
    • Comma, pipe, and tab delimiters
    • Headers from the file, based on comparison with the other rows in the dataset
      • e.g., if the first line contains only strings but the other rows have numeric values…
      • …it will use the first row’s values as the header if it can
    • End of line, unless the “\n” is inside a quoted field
    • Dates must be YYYY-MM-DD or they will be interpreted as strings
    • Timestamps are pretty flexible
      • yyyy-mm-dd hh:mm:ss, yyyy-mm-dd hh:mm:ss.mm, yyyy-mm-dd
    • Fields default to nullable

Always confirm the Auto Detect was accurate and, if not, fix it! (See the sketch below.)
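
A minimal sketch of a load job with auto detection enabled, followed by printing the schema it produced so you can check it; the bucket, file, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    autodetect=True,                         # enable schema auto detection
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                     # treat the first row as a header
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/names.csv",                 # placeholder source file
    "my-project.my_dataset.names_autodetect",   # placeholder destination table
    job_config=job_config,
)
load_job.result()

# Confirm what auto detect actually decided.
table = client.get_table("my-project.my_dataset.names_autodetect")
print([(field.name, field.field_type) for field in table.schema])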

 
