Section 10: Loading and Querying External Data Sources
44. Introduction and Create Cloud Storage Bucket
https://www.udemy.com/course/bigquery/learn/lecture/22899018#questions
Sources
- Querying data does not require that the data is loaded into BQ
- Public datasets
- Shared datasets
- External sources
- You can also use federated queries to query data where it resides
- Other situations require that the data is loaded into BQ
Loading external data into BQ
- File Formats
- CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports
- Each file type has its own set of quotas and limitations (none are perfect)
- If Read speed is good, write speed is likely compromised
- If you want more compression, both read and write will be affected
- Avro is preferred for loading compressed and uncompressed
- Data can be read in parallel, even when data blocks are compressed
- File compression itself is not supported
- Compressed data blocks are
- BQ supports DEFLATE and Snappy codecs for compressed data blocks in Avro files
- Data can be read in parallel, even when data blocks are compressed
- Non-Avro formats
- BQ can load uncompressed files significantly faster than compressed
- These can be read in parallel
- But these can lead to bandwidth limitations and higher storage costs
- BQ can load uncompressed files significantly faster than compressed
- Limitations of file storage formats: Batch file loading data limitations
- Cloud Storage Overview
- Data is stored in Buckets (Object storage?)
- Creating a bucket
- Dashboard > Storage > Create Bucket
- Naming Guidelines
- Names must be globally unique: Same as AWS
- Names are publicly visible
- Put the bucket in the same region as your BQ Dataset
- Access Control: Up to you. Video had default “fine grained”, mine was Uniform. I decided to use Fine grained.
- Dashboard > Storage > Create Bucket
- You can upload single files or complete folders
- URI: gs://BUCKET_NAME/DIR_NAME/FILENAME.EXT
- If data is sorted into multiple files, you can use a wildcard
- gs://BUCKET_NAME/DIR_NAME/FILENAME*.EXT
- Loading from Cloud Storage
- Most common data storage for BQ
- All GCP applications may store data there in stuctured and unstructured formats
- CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports
- Each file type has its own set of quotas and limitations
- All GCP applications may store data there in stuctured and unstructured formats
- Cloud storage provides both options
- Loading into BQ
- Querying directly without loading
- Most common data storage for BQ
45. Create & Query Permanent Tables on Cloud Storage
https://www.udemy.com/course/bigquery/learn/lecture/22899032#questions
- Dataset > Create Table > Google Cloud Storage > Search for file (Bucket > Folder > File).
- Table Type
- Native table: The data will be imported into BQ
- You will have ability to preview the data
- External table: The data will remain in GCP Storage
- Limitations
- You cannot preview
- You will not get an estimated bytes for the query
- etc
- Additional pseudo column: _FILE_NAME
- select *, _FILE_NAME as fn from `proj.ds.table` limit 1
1 Mary F 7065 gs://tas_gcp/names/yob1880.txt
- select *, _FILE_NAME as fn from `proj.ds.table` limit 1
- Limitations
- Native table: The data will be imported into BQ
- Table Type
46. External data source limitations
https://www.udemy.com/course/bigquery/learn/lecture/22899044#questions
- BQ does not guarantee data consistency for external data sources
- Changes to the underlying data while the query is running can result in unexpected behavior
- Query performance for external data sources may not be as high as querying data in native BQ table
- You cannot run BQ job that exports data from an external source
- Workaround: Save query results to an intermediate destination table, the run the export job against that table
- You cannot reference an external data source in a wildcard table query
- Table partitioning and clustering is supported in limited ways
- Hive partitioning layout
- External data sources query results are not cached
- You will be charged for each query, even if it is exactly the same as the previous.
- Querying data in Cloud Bigtable is only available in certain regions and zones
- us-central1: us-central1-a, b, c, f
- europe-west1: europe-west1-b, c, d
- Limited to 4 concurrent queries against a Cloud Bigtable external data source