Section 4: Using BigQuery Dashboard options
19. Running queries with various query settings
https://www.udemy.com/course/bigquery/learn/lecture/22729977#questions
Select top 10 male names from the table
select name, count from `PROJECT.DATASET.TABLE` where gender = "M" order by count desc limit 10
Notes:
- PROJECT.DATASET.TABLE is enclosed in backticks (the key below the ~). Copy-pasting often converts these to single quotes, so double check them to make sure your query doesn't fail.
- Example table used in the lesson: quantum-balm-340112.troberts_dataset_oo1.names
Query results
- Query complete (0.3 sec elapsed, 36.5 KB processed)
- This query would result in some billing because data was processed.
- Running the query again will result in 0.0 sec with no bytes processed because the result was cached.
- No billing on a cached reply!
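A minimal sketch of checking the same caching behavior from the Python client (using the lesson's example table); job.cache_hit and job.total_bytes_processed report whether the cached result was reused:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name, count
    FROM `quantum-balm-340112.troberts_dataset_oo1.names`
    WHERE gender = "M"
    ORDER BY count DESC
    LIMIT 10
"""

job = client.query(sql)   # first run: processes ~36.5 KB
job.result()              # wait for completion
print(job.cache_hit, job.total_bytes_processed)

job = client.query(sql)   # second run: served from cache, no bytes processed
job.result()
print(job.cache_hit, job.total_bytes_processed)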
More Options
Format
- Reformats your query string: keywords are uppercased and each clause's arguments are indented on their own lines below it.
SELECT
  name,
  count
FROM
  `PROJECT.DATASET.TABLE`
WHERE ...
Query Settings
Query engine
- BigQuery engine
- Expects and processes the table data in batches.
- Appropriate here because the table already exists.
- Cloud Dataflow engine
- Used on streaming data, such as data from Pub/Sub.
- This will be covered in depth later.
Destination
- By default, BQ saves all query results to a table, temp or permanent
- Temporary table
- Randomly named and saved in a special dataset
- Not available for sharing or querying, and not visible to users
- You are NOT charged for storing temporary tables
- Responsible for caching the query results
- Saved for ~24 hours
- Permanent tables
- Can be new or existing tables in any dataset that you have access to.
- Provide the project name, dataset name, and table name (see the sketch after this list)
- The table will automatically be created if it does not already exist
- Results will be stored in the table with the selected Write preference
- Write if empty
- Append to table
- Overwrite table
- You WILL be charged for this storage
- The table being queried must be in the same location as the dataset that contains the destination table.
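A hedged sketch of the same destination settings from the Python client (project, dataset, and table names below are placeholders); QueryJobConfig takes the destination table and one of the three write preferences:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.top_names",  # hypothetical table
    # WRITE_EMPTY = "Write if empty", WRITE_APPEND = "Append to table",
    # WRITE_TRUNCATE = "Overwrite table"
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

client.query(
    "SELECT name, count FROM `PROJECT.DATASET.TABLE`",
    job_config=job_config,
).result()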
- Results size
- Overrides the maximum bytes the query can return
- Default is ~10 GB
- Checking this field allows any number of bytes to be returned/written to the destination table.
Resource Management
- Job priority
- Interactive (default)
- Start running immediately – basically on-demand
- Batch
- Start as soon as idle resources are available in the BQ shared resource pool, which usually happens within a few minutes.
- Do not have concurrent query rate limits.
- You can submit any number of queries in a single shot; they are queued and run as soon as resources are available in the shared pool.
- If it hasn’t started within 24 hrs, BQ changes the job priority to interactive.
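A minimal sketch of submitting a batch-priority job from the Python client (the query is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

job = client.query("SELECT 1", job_config=job_config)  # queued, not started immediately
job.result()  # blocks until the job is scheduled and finishes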
- Cache preference
- Use cached results (default) allows you to reuse the cached results
- This means the query attempts to reuse the results from a previous run of the same query, unless the tables being queried have changed.
- If you don’t want to use cached results, uncheck this box.
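The same checkbox is exposed in the Python client; a minimal sketch, with the query as a placeholder:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(use_query_cache=False)  # always reprocess

client.query(
    "SELECT name FROM `PROJECT.DATASET.TABLE`",
    job_config=job_config,
).result()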
20. Caching features & limitations
https://www.udemy.com/course/bigquery/learn/lecture/22731425#questions
*This is really a continuation of Lesson 19. Running queries with various query settings.
- To retrieve cached data, the query must be exactly the same as a previous query
- Data is NOT cached:
- When the Destination table is specified to store the results
- If the tables/views used in the query have changed since the last cache.
- If the tables have streaming ingestion.
- If the query uses non-deterministic functions
- Examples: CURRENT_TIMESTAMP(), NOW(), CURRENT_USER()
- If the query is run against an external data source like Bigtable or Cloud Storage
- If the result set is greater than the maximum response size
- 10GB compressed by default.
- It is possible to restrict queries to only return cached results.
- If no cached result is found, these queries fail.
- Very useful if you have a large team firing the same queries over and over.
- Not available in the UI yet.
- Available via the command line or the Python client (see the sketch below).
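A hedged Python-client sketch: per the BigQuery docs, setting the create disposition to CREATE_NEVER makes the job fail unless the result can be served from cache (the bq CLI equivalent is the --require_cache flag). The query is a placeholder.

from google.cloud import bigquery
from google.api_core.exceptions import GoogleAPIError

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # CREATE_NEVER: never create a results table, so the job fails
    # unless a cached result already exists.
    create_disposition=bigquery.CreateDisposition.CREATE_NEVER,
)

try:
    client.query(
        "SELECT name FROM `PROJECT.DATASET.TABLE`",
        job_config=job_config,
    ).result()
except GoogleAPIError as exc:
    print("No cached result available:", exc)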
Additional Settings
- SQL Dialect
- https://learnsql.com/blog/what-sql-dialect-to-learn/
- Standard
- The most commonly used dialect
- Legacy
- Processing location
- Where do you want to process the query?
- Keep Auto-select
- Data is processed in the same location where the data is stored.
- Attempting to change this will cause your query to fail…
- Yup, this is a useless option.
Advanced Options
Encryption
- Google-managed key
- Customer-managed key
Maximum Bytes billed
- Specify the maximum number of bytes you are willing to pay for in this query
- If the query would process more than the specified number of bytes, it fails.
- You are not charged for a failed query.
- Test by setting your limit really low. 🙂
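The same guardrail is available in the Python client; a minimal sketch (query and limit are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=10_000_000,  # ~10 MB cap; set it really low to test
)

client.query(
    "SELECT name FROM `PROJECT.DATASET.TABLE`",
    job_config=job_config,
).result()  # fails up front if the query would bill more than the cap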
21. Querying Wildcard Tables
https://www.udemy.com/course/bigquery/learn/lecture/22729991#questions
Additional Link: https://cloud.google.com/bigquery/docs/querying-wildcard-tables
- Wildcard table queries allow you to query multiple tables using a wildcard.
- These are the union of all tables that match a wildcard expression.
- This works as long as you have multiple tables with similar names
- This also requires that the tables have the same, or compatible schemas
- se_table_1, se_table_2, se_table_3 …
select col1, col2 from `PROJECT.DATASET.se_table_*`
Example from BigQuery Example datasets
NOAA GSOD
- This dataset contains tables for each year from 1929 through today.
- gsod1929
- gsod1930
- …
- gsod2022
- We can query these all from the editor
- Select the table, then click Preview in the bottom pane.
- Without the ability to use wildcard tables this would be difficult to query
select col1, col2 from table_1 union all select col1, col2 from table_2 union all select ...
- Instead, we use wildcards
SELECT max, ROUND((max-32)*5/9,1) celsius, mo, da, year
FROM `bigquery-public-data.noaa_gsod.gsod195*`
WHERE max != 9999.9  # 9999.9 represents missing data in these tables
ORDER BY max DESC
- Note this query limits the data to the 1950s
- Query took 8.1 seconds on 138.8MB data
Using _TABLE_SUFFIX pseudo columns
- Wildcard table queries include an automatic _TABLE_SUFFIX pseudo column containing the part of each table name matched by the wildcard.
- It can then be filtered in the WHERE clause:
WHERE
  max != 9999.9
  AND (
    _TABLE_SUFFIX = '1' OR _TABLE_SUFFIX = '2' OR _TABLE_SUFFIX = '3'
  )
ORDER BY
  ...
- You can also use the BETWEEN operator here
- `_TABLE_SUFFIX BETWEEN '1' AND '3'`
- Filtering on _TABLE_SUFFIX reduces the number of tables scanned, reducing costs.
- Filters on _TABLE_SUFFIX that include subqueries querying datasets cannot be used to limit the number of tables scanned for a wildcard table.
Basic filter
SELECT max FROM `bigquery-public-data.noaa_gsod.gsod19*` WHERE _TABLE_SUFFIX = '49'
- 3.4MB Data
Using a subquery
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod19*`
WHERE
_TABLE_SUFFIX = (SELECT '49' AS a FROM `bigquery-public-data.noaa_gsod.gsod1949` LIMIT 1)
- 634.2MB Data
- The subquery defeats suffix pruning, so only the gsod19* prefix limits the scan: every table from the 1900s is read.
This, however, works:
SELECT max FROM `bigquery-public-data.noaa_gsod.gsod19*` WHERE _TABLE_SUFFIX = (SELECT a FROM (SELECT '49' AS a))
- Presumably this works because the subquery does not reference any stored table, so BQ can evaluate it up front and prune down to gsod1949.
22. Wildcard Table Limitations
https://www.udemy.com/course/bigquery/learn/lecture/22734259#questions
- These only support native BigQuery storage
- External tables and views are not allowed
- Cached results are not supported.
- You will be billed for each run.
- Queries that use DML statements cannot use wildcard tables as the target
- You cannot insert or update a wildcard table.
- Use the actual table name!
How BQ determines the schema
- When executing a SQL query that uses a wildcard table, BQ uses the schema of the most recently created matching table as the schema for ALL matching tables
- Example:
- Create tables exam1, exam2, exam3, then create table exam4
- BQ will use the schema from exam4 because it was the most recently created table.
- If the schema of any of the tables does not include the columns in the most recent table, the query will throw an error
- A better way of saying this: if the query requests a column that does not exist in the schema of the most recently created table, the query will fail (see the sketch below)
- exam1 = 'day', 'mo'
- exam4 = 'day', 'year'
- A query for 'mo' will fail, but a query for 'year' should be OK.
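A small Python-client sketch of the rule above, with a hypothetical project and dataset (exam2 and exam3 omitted for brevity):

from google.cloud import bigquery

client = bigquery.Client()
dataset = "my-project.my_dataset"  # hypothetical

# exam1 has (day, mo); exam4, created last, has (day, year).
client.create_table(bigquery.Table(
    f"{dataset}.exam1",
    schema=[bigquery.SchemaField("day", "STRING"),
            bigquery.SchemaField("mo", "STRING")],
))
client.create_table(bigquery.Table(
    f"{dataset}.exam4",
    schema=[bigquery.SchemaField("day", "STRING"),
            bigquery.SchemaField("year", "STRING")],
))

# The wildcard schema comes from exam4, the most recently created table:
client.query(f"SELECT year FROM `{dataset}.exam*`").result()  # OK
# client.query(f"SELECT mo FROM `{dataset}.exam*`").result()  # would fail: exam4 has no 'mo'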
- Long table prefixes perform better than shorter ones
Long Prefix: Better performance
SELECT max FROM `PROJECT.DATASET.TABLE_NAME_195*` WHERE _TABLE_SUFFIX BETWEEN '0' AND '1'
Short Prefix: Worse performance
SELECT max FROM `PROJECT.DATASET.TABLE_NAME_*` WHERE _TABLE_SUFFIX BETWEEN '1950' AND '1951'
23. Schedule, Save, Share a Query
https://www.udemy.com/course/bigquery/learn/lecture/22730003?start=0#questions
- Type the 'standard' query into the Query Editor
- Can include DDL and DML statements
- DDL: Data Definition Language (CREATE, ALTER, DROP, TRUNCATE)
- DML: Data Manipulation Language (SELECT, INSERT, UPDATE, DELETE)
- Click Schedule Query
- Note: Requires the Transfer Service, so it must be enabled!
- Search "transfer api"
- Select "Marketplace: BigQuery Transfer API"
- Click [ Enable ]
- Enable scheduled queries
- Update scheduled query
- Create new scheduled query
- Details and schedule
- Name for the scheduled query
- Schedule options
- Repeats: Hourly, Daily, Weekly, Monthly, Custom (cron style), On demand
- Shortest allowed period is 15 min.
- Start date and run time
- Repeats: Hourly, Daily, Weekly, Monthly, Custom (cron style), On demand
- Destination for query results
- DDL and DML queries do not require this, since they update the current table
- Destination must be in the same region as the source table
- Notification options
- Emails are sent only to the user that created the schedule.
- This cannot be changed.
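For reference, a hedged sketch of creating a scheduled query programmatically with the Data Transfer Service Python client (project, dataset, schedule name, and query are all placeholders):

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")  # hypothetical project

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="daily-top-names",        # hypothetical schedule name
    data_source_id="scheduled_query",
    destination_dataset_id="my_dataset",   # hypothetical dataset
    schedule="every 24 hours",             # shortest allowed is every 15 minutes
    params={
        "query": "SELECT name, count FROM `PROJECT.DATASET.TABLE`",
        "destination_table_name_template": "top_names",
        "write_disposition": "WRITE_TRUNCATE",
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print("Created scheduled query:", transfer_config.name)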
Backfill
- Used to re-run specific or missed scheduled queries for a specified date range.
- Times listed are in UTC, so enter them carefully.
Save Results
- After running the query, you can [Save results]
- CSV: Google Drive or local
- JSON: Google Drive or local
- BQ Table
- Google sheets
- Clipboard
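A rough programmatic equivalent with the Python client (assumes pandas and db-dtypes are installed; the query is a placeholder): results come back as a DataFrame that can be saved anywhere.

from google.cloud import bigquery

client = bigquery.Client()
df = client.query("SELECT name, count FROM `PROJECT.DATASET.TABLE`").to_dataframe()

df.to_csv("results.csv", index=False)          # local CSV
df.to_json("results.json", orient="records")   # local JSON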
Save Query
- Personal saved queries are visible only to the creator
- Project-level saved queries are visible to project members with appropriate permissions.
Deleting Data
- If you don't delete your tables and datasets, you will eventually be charged for the storage.
- Click the dataset, then delete.
24. Schema Auto Detection
https://www.udemy.com/course/bigquery/learn/lecture/22730011#questions
- Only available for CSV and JSON file formats
- Other formats have the schema included, so it’s not required.
- When auto detection is enabled
- Selects a random file in the source
- Scans up to 100 rows as a sample
- Attempts to assign data type based on sample values.
- After detection, it may change a field name to make it compatible with BQ SQL syntax
- Auto detect recognizes
- Gzip compatible file compression
- Comma, pipe, and tab delimiters
- Headers from the file, based on comparing the first line against the other rows in the dataset
- e.g., if the first line contains only strings but the other rows have numeric values
- it will use the header values as column names if it can
- End of line, unless the "\n" is inside a quoted string.
- Dates must be YYYY-MM-DD or it will just interpret as a string
- Timestamps are pretty flexible
- yyyy-mm-dd hh:mm:ss, yyyy-mm-dd hh:mm:ss.mm, yyyy-mm-dd
- Fields default to NULLABLE mode
Always Confirm the Auto Detect was accurate and if not, fix it!
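A minimal load-job sketch with the Python client (bucket path, project, and table names are placeholders); printing the resulting schema is an easy way to confirm what auto detect decided:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    autodetect=True,                       # enable schema auto detection
    source_format=bigquery.SourceFormat.CSV,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/names.csv",            # hypothetical source file
    "my-project.my_dataset.names",         # hypothetical destination table
    job_config=job_config,
)
load_job.result()

# Always confirm the detected schema was accurate:
table = client.get_table("my-project.my_dataset.names")
for field in table.schema:
    print(field.name, field.field_type, field.mode)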