Section 10: Loading and Querying External Data Sources

44. Introduction and Create Cloud Storage Bucket

https://www.udemy.com/course/bigquery/learn/lecture/22899018#questions

Sources

Querying data does not require that the data is loaded into BQ
- Public datasets
- Shared datasets
- External sources
You can also use federated queries to query data where it resides
Other situations require that the data is loaded into BQ

Loading external data into BQ

File Formats
- CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports
- Each file type has its own set of quotas and limitations (none are perfect)
  - If Read speed is good, write speed is likely compromised
  - If you want more compression, both read and write will be affected
- Avro is preferred for loading compressed and uncompressed
  - Data can be read in parallel, even when data blocks are compressed
    - File compression itself is not supported
    - Compressed data blocks are
  - BQ supports DEFLATE and Snappy codecs for compressed data blocks in Avro files
- Non-Avro formats
  - BQ can load uncompressed files significantly faster than compressed
    - These can be read in parallel
    - But these can lead to bandwidth limitations and higher storage costs
- Limitations of file storage formats: Batch file loading data limitations
Cloud Storage Overview
- Data is stored in Buckets (Object storage?)
- Creating a bucket
  - Dashboard > Storage > Create Bucket
    - Naming Guidelines
    - Names must be globally unique: Same as AWS
    - Names are publicly visible
    - Put the bucket in the same region as your BQ Dataset
    - Access Control: Up to you. Video had default “fine grained”, mine was Uniform. I decided to use Fine grained.
- You can upload single files or complete folders
- URI: gs://BUCKET_NAME/DIR_NAME/FILENAME.EXT
- If data is sorted into multiple files, you can use a wildcard
  - gs://BUCKET_NAME/DIR_NAME/FILENAME*.EXT
Loading from Cloud Storage
- Most common data storage for BQ
  - All GCP applications may store data there in stuctured and unstructured formats
    - CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports
    - Each file type has its own set of quotas and limitations
- Cloud storage provides both options
  - Loading into BQ
  - Querying directly without loading

45. Create & Query Permanent Tables on Cloud Storage

https://www.udemy.com/course/bigquery/learn/lecture/22899032#questions

Dataset > Create Table > Google Cloud Storage > Search for file (Bucket > Folder > File).
- Table Type
  - Native table: The data will be imported into BQ
    - You will have ability to preview the data
  - External table: The data will remain in GCP Storage
    - Limitations
      - You cannot preview
      - You will not get an estimated bytes for the query
      - etc
    - Additional pseudo column: _FILE_NAME
      - select *, _FILE_NAME as fn from `proj.ds.table` limit 1
        
        1 Mary F 7065 gs://tas_gcp/names/yob1880.txt

46. External data source limitations

https://www.udemy.com/course/bigquery/learn/lecture/22899044#questions

BQ does not guarantee data consistency for external data sources
- Changes to the underlying data while the query is running can result in unexpected behavior
Query performance for external data sources may not be as high as querying data in native BQ table
You cannot run BQ job that exports data from an external source
- Workaround: Save query results to an intermediate destination table, the run the export job against that table
You cannot reference an external data source in a wildcard table query
Table partitioning and clustering is supported in limited ways
- Hive partitioning layout
External data sources query results are not cached
- You will be charged for each query, even if it is exactly the same as the previous.
Querying data in Cloud Bigtable is only available in certain regions and zones
- us-central1: us-central1-a, b, c, f
- europe-west1: europe-west1-b, c, d
Limited to 4 concurrent queries against a Cloud Bigtable external data source

BQFBD – Loading and Querying External Data Sources

Section 10: Loading and Querying External Data Sources

44. Introduction and Create Cloud Storage Bucket

Sources

Loading external data into BQ

45. Create & Query Permanent Tables on Cloud Storage

46. External data source limitations

LEAVE A COMMENT Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta