BQFBD – Using BigQuery Dashboard Options

Section 4: Using BigQuery Dashboard options

19. Running queries with various query settings

https://www.udemy.com/course/bigquery/learn/lecture/22729977#questions

Select top 10 male names from the table

select name, count from `PROJECT.DATASET.TABLE` where gender = "M" order by count desc limit 10

Notes:

  • “PROJECT.DATASET.TABLE” is enclosed in backticks (the ` character, on the same key as ~). Copy/paste often converts these to single quotes, so double-check them to make sure your query doesn’t fail.
  • Example table: quantum-balm-340112.troberts_dataset_oo1.names

Query results

  • Query complete (0.3 sec elapsed, 36.5 KB processed)
    • This query results in some billing because data was processed (a dry-run sketch for estimating this up front follows this list).
  • Running the query again will result in 0.0 sec with no bytes processed because the result was cached.
    • No billing on a cached reply!
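
To see that “bytes processed” number without paying for the run, you can dry-run the query first. A minimal sketch using the google-cloud-bigquery Python client, against the example table above:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name, count
    FROM `quantum-balm-340112.troberts_dataset_oo1.names`
    WHERE gender = "M"
    ORDER BY count DESC
    LIMIT 10
"""
# A dry run validates the query and reports the bytes it WOULD process,
# without actually running it or billing anything.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes")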

More Options

Format

  • Reformats your query string into a Python-ish indented layout, with column names and expressions indented beneath the SQL keywords.
SELECT
  name,
  count
FROM
  `PROJECT.DATASET.TABLE`
WHERE
  ...

Query Settings

Query engine

  • BigQuery engine
    • The BigQuery engine expects and processes table data in batches.
    • Used here because the table already exists (batch data, not streaming)
  • Cloud Dataflow engine
    • Used on streaming data, such as data from Pub/Sub.
    • This will be covered in depth later.

Destination

  • By default, BQ saves all query results to a table, temp or permanent
  • Temporary table
    • Randomly named and saved in a special dataset
    • Not available for sharing or querying, and not visible to users
    • You are NOT charged for storing temporary tables
    • Responsible for caching the query results
    • Saved for ~24 hours
  • Permanent tables
    • Can be new or existing tables in any dataset that you have access to.
      • Provide the project name, dataset name, and table name
    • The table is created automatically if it does not already exist
    • Results are stored in the table with the selected Write preference (see the sketch after this list)
      • Write if empty
      • Append to table
      • Overwrite table
    • You WILL be charged for this storage
    • The table being queried must be in the same location as the dataset that contains the destination table.
  • Results size
    • Overrides the maximum bytes the query can return
      • Default ~10GB
    • Checking this box allows any number of bytes to be returned/written to the destination table.
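
The destination table and write preference can also be set per job from the Python client. A minimal sketch; the PROJECT.DATASET names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # Placeholder destination table; it is created if it does not exist.
    destination="PROJECT.DATASET.top_names",
    # Write preference: WRITE_EMPTY ("Write if empty"),
    # WRITE_APPEND ("Append to table"), or WRITE_TRUNCATE ("Overwrite table").
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = 'SELECT name, count FROM `PROJECT.DATASET.TABLE` WHERE gender = "M"'
client.query(sql, job_config=job_config).result()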

Resource Management

  • Job priority
    • Interactive (default)
      • Start running immediately – basically on-demand
    • Batch
      • Start as soon as idle resources are available in the BQ shared resource pool, which usually happens within a few minutes.
      • Batch jobs do not count against concurrent query rate limits.
        • You can submit any number of queries in a single shot; they are queued and run as soon as resources are available in the shared pool.
      • If it hasn’t started within 24 hrs, BQ changes the job priority to interactive.
  • Cache preference
    • Use cached results (default) reuses the results from a previous run of the same query, unless the tables being queried have changed since then.
    • If you don’t want to use cached results, uncheck this box.
    • Both job priority and cache preference can also be set from client code (see the sketch after this list).
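
Both knobs map directly onto query job configuration fields in the Python client. A minimal sketch (the query and table name are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # Batch priority: queued until idle resources free up in the shared pool.
    priority=bigquery.QueryPriority.BATCH,
    # "Use cached results" (the default); set False to force a fresh run.
    use_query_cache=True,
)
job = client.query("SELECT name, count FROM `PROJECT.DATASET.TABLE`", job_config=job_config)
job.result()  # blocks until the batch job eventually runs and finishes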

20. Caching features & limitations

https://www.udemy.com/course/bigquery/learn/lecture/22731425#questions

*This is really a continuation of Lesson 19. Running queries with various query settings.

  • To retrieve cached data, the query must be exactly the same as a previous query
  • Data is NOT cached:
    • When the Destination table is specified to store the results
    • If the tables/views used in the query have changed since the results were cached.
    • If the tables have streaming ingestion.
    • If the query uses non-deterministic functions
      • Examples:
          • CURRENT_TIMESTAMP()
          • NOW()
          • CURRENT_USER()
    • If the query is run against an external data source like Bigtable or Cloud Storage
    • If the result set is greater than the maximum response size
      • 10GB compressed by default.
  • It is possible to restrict queries to return only cached results.
    • If no cached result is found, these queries will fail.
    • Very useful if you have a large team firing the same queries over and over.
    • Not available in the UI yet.
    • Available via the command line or the Python client (see the sketch below).
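
A sketch of the cache-only behavior. On the command line this is the --require_cache flag of bq query; through the API and the Python client, my understanding of the docs is that setting the create disposition to CREATE_NEVER makes the query fail unless it can be served entirely from cache:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # CREATE_NEVER: no results table may be created, so the job errors out
    # unless BigQuery can answer the query from the cache.
    create_disposition=bigquery.CreateDisposition.CREATE_NEVER,
)
job = client.query("SELECT name, count FROM `PROJECT.DATASET.TABLE`", job_config=job_config)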

Additional Settings

  • SQL Dialect
    • https://learnsql.com/blog/what-sql-dialect-to-learn/
    • Standard
      • The default, and the one used for most work
    • Legacy
  • Processing location
    • Where do you want to process the query?
    • Keep Auto-select
      • Data is processed in the same location where it is stored.
      • Selecting a location other than where the data lives will cause your query to fail…
        • Yup, in practice Auto-select is the only choice that works.

Advanced Options

Encryption

  • Google-managed key
  • Customer-managed key

Maximum Bytes billed

  • Specifies the maximum number of bytes the query is allowed to process (and therefore bill)
  • If the query would require more than the specified number, it fails.
    • You will not be charged for failed queries.
  • Test by setting your limit really low. 🙂 (See the sketch below.)
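
The same cap can be set per job from the Python client. A minimal sketch with a deliberately low limit (the query is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
# Cap the job at ~10 MB; if the query would process more than this,
# it fails up front and nothing is billed.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10_000_000)
client.query("SELECT name, count FROM `PROJECT.DATASET.TABLE`", job_config=job_config).result()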

21. Querying Wildcard Tables

https://www.udemy.com/course/bigquery/learn/lecture/22729991#questions

Additional Link: https://cloud.google.com/bigquery/docs/querying-wildcard-tables

  • Wildcard table queries allow you to query multiple tables using a wildcard.
  • These are the union of all tables that match a wildcard expression.
  • This works as long as you have multiple tables with similar names
    • This also requires that the tables have the same, or compatible schemas
    • se_table_1, se_table_2, se_table_3 …
    • select col1, col2 from `PROJECT.DATASET.se_table_*`

Example from BigQuery Example datasets

NOAA GSOD

  • This dataset contains tables for each year from 1929 through today.
    • gsod1929
    • gsod1930
    • gsod2022
  • We can query these all from the editor
    • Select the table, then click Preview in the bottom pane.
  • Without the ability to use wildcard tables this would be difficult to query
    • select col1, col2 from table_1 union all select col1, col2 from table_2 union all select ...
  • Instead, we use wildcards
SELECT 
  max, 
  ROUND((max-32)*5/9,1) celsius, 
  mo, da, year 
FROM 
  `bigquery-public-data.noaa_gsod.gsod195*` 
WHERE 
  max != 9999.9 #9999.9 represents missing data in these tables
ORDER BY
  max DESC
    • Note this query limited the data to the 1950s
    • Query took 8.1 seconds on 138.8MB data

Using _TABLE_SUFFIX pseudo columns

  • Wildcard table queries include an automatic pseudo column, _TABLE_SUFFIX, holding the value that the wildcard matched.
  • These can then be filtered in the WHERE clause
WHERE
  max != 9999.9
  AND (
    _TABLE_SUFFIX = '1' OR _TABLE_SUFFIX = '2' OR _TABLE_SUFFIX = '3'
  )
ORDER BY
  ...
  • You can also use the BETWEEN operator here
    • `_TABLE_SUFFIX BETWEEN '1' AND '3'`
  • Filtering on _TABLE_SUFFIX reduces the number of tables scanned, reducing costs.
  • Filters on _TABLE_SUFFIX that include subqueries querying datasets cannot be used to limit the number of tables scanned for a wildcard table.

Basic filter

SELECT max FROM `bigquery-public-data.noaa_gsod.gsod19*` WHERE _TABLE_SUFFIX = '49'
  • 3.4MB Data

Using a subquery

SELECT max 
FROM `bigquery-public-data.noaa_gsod.gsod19*` 
WHERE 
  _TABLE_SUFFIX = (SELECT '49' AS a FROM `bigquery-public-data.noaa_gsod.gsod1949` LIMIT 1)
  • 634.2MB Data
  • Because the filter uses a subquery, BigQuery cannot prune tables: the scan reads every table from 1900 – 1999!

This, however, works:

SELECT max 
FROM `bigquery-public-data.noaa_gsod.gsod19*` 
WHERE 
_TABLE_SUFFIX = (SELECT a FROM (SELECT '49' AS a))

(Presumably this works because the subquery is a constant expression that BigQuery can evaluate without scanning any tables, so it can still prune.)

22. Wildcard Table Limitations

https://www.udemy.com/course/bigquery/learn/lecture/22734259#questions

  • These only support native BigQuery storage
    • External tables and views are not allowed
  • Cached results are not supported.
    • You will be billed for each run.
  • Queries that use DML statements cannot use wildcard tables as the target
    • You cannot insert or update a wildcard table.
    • Use the actual table name!

How BQ determines the schema

  • When executing a SQL query that uses a wildcard table, BQ automatically uses the schema of the most recently created table that matches the wildcard as the schema for ALL matched tables
    • Example:
      • Create tables exam1, exam2, exam3
      • Now create table exam4
    • BQ will use the schema from exam4 because it was the most recently created table.
  • If the schema of one of the matched tables is missing a column that the query references, the query will throw an error
    • A clearer way of saying this: if the query requests a column that does not exist in the schema of the most recently created table, the query will fail
      • exam1 = ‘day’, ‘mo’
      • exam4 = ‘day’, ‘year’
      • Query for ‘mo’ will fail, but query for ‘year’ should be OK.
  • Longer table prefixes perform better than shorter ones, since they narrow the set of matching tables up front

Long Prefix: Better performance

SELECT max FROM `PROJECT.DATASET.TABLE_NAME_195*` WHERE _TABLE_SUFFIX BETWEEN '0' AND '1'

Short Prefix: Worse performance

SELECT max FROM `PROJECT.DATASET.TABLE_NAME_*` WHERE _TABLE_SUFFIX BETWEEN '1950' AND '1951'

23. Schedule, Save, Share a Query

https://www.udemy.com/course/bigquery/learn/lecture/22730003?start=0#questions

  1. Type the ‘standard’ query into the Query Editor
    • Can include DDL and DML statements
      • DDL: Data Definition Language
        • CREATE, ALTER, DROP, TRUNCATE
      • DML: Data Manipulation Language
        • INSERT, UPDATE, DELETE, MERGE
  2. Click Schedule Query (a programmatic sketch follows this list)
    • Note: Requires the BigQuery Data Transfer Service, so it must be enabled!
      • Search transfer api
        • Select “Marketplace: BigQuery Transfer API”
        • Click [ Enable ]
    • Enable scheduled queries
    • Update scheduled query
    • Create new scheduled query
      • Details and schedule
        • Name for the schedule query
      • Schedule options
        • Repeats: Hourly, Daily, Weekly, Monthly, Custom (cron style), On demand
          • Shortest allowed period is 15 min.
        • Start date and run time
      • Destination for query results
        • DDL and DML queries do not require this since you’re updating the current table
        • Destination must be in same Region as source table
      • Notification options
        • Emails are sent only to the user that created the schedule.
        • This cannot be changed.
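
The same scheduled query can be created programmatically through the Data Transfer Service. A minimal sketch using the google-cloud-bigquery-datatransfer Python client; the project, dataset, display name, and query are all placeholders:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",   # where the results land
    display_name="Daily top names",        # name shown in the Scheduled Queries UI
    data_source_id="scheduled_query",      # marks this transfer as a scheduled query
    params={
        "query": "SELECT name, count FROM `PROJECT.DATASET.TABLE`",
        "destination_table_name_template": "top_names",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)
transfer_config = transfer_client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print(f"Created scheduled query: {transfer_config.name}")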

Back Fill

  • Used to run specific or missed scheduled queries over a specified date range.
  • Times listed will be in UTC, so you’ll need to enter carefully

Save Results

  • After running the query, you can [Save results]
    • CSV: Google Drive or local
    • JSON: Google Drive or local
    • BQ Table
    • Google Sheets
    • Clipboard

Save Query

  • Personal: visible only to the creator
  • Project: visible to project members with appropriate permissions.

Deleting Data

  • If you don’t delete unused tables, you will keep paying for their storage.
  • Click the dataset, then delete.

24. Schema Auto Detection

https://www.udemy.com/course/bigquery/learn/lecture/22730011#questions

  • Only available for CSV and JSON file formats
    • Other formats (e.g., Avro, Parquet, ORC) are self-describing, so detection isn’t required.
  • When auto detection is enabled
    • Selects a random file in the source
    • Scans up to 100 rows as a sample
    • Attempts to assign data type based on sample values.
  • After detection, it may change a field name to make it compatible with BQ SQL syntax
  • Auto detect recognizes
    • Gzip-compatible file compression
    • Comma, pipe, and tab delimiters
    • Headers from the file, based on comparison with the other rows in the dataset
      • e.g., if the first line contains only strings but the other rows have numeric values…
      • …it will use the first row’s values as the header if it can
    • End of line, unless the “\n” is inside a quoted field
    • Dates must be YYYY-MM-DD or they will be interpreted as strings
    • Timestamps are pretty flexible
      • yyyy-mm-dd hh:mm:ss, yyyy-mm-dd hh:mm:ss.mm, yyyy-mm-dd
    • Fields default to nullable

Always confirm the Auto Detect was accurate and, if not, fix it! (See the sketch below.)
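
A minimal sketch of a load job with auto detection enabled, followed by printing the schema it produced so you can check it; the bucket, file, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    autodetect=True,                         # enable schema auto detection
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                     # treat the first row as a header
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/names.csv",                 # placeholder source file
    "my-project.my_dataset.names_autodetect",   # placeholder destination table
    job_config=job_config,
)
load_job.result()

# Confirm what auto detect actually decided.
table = client.get_table("my-project.my_dataset.names_autodetect")
print([(field.name, field.field_type) for field in table.schema])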

 
