BQFBD – Loading and Querying External Data Sources

  BigQuery, BigQuery for Big Data Engineers

Main Menu

Section 10: Loading and Querying External Data Sources

44. Introduction and Create Cloud Storage Bucket

https://www.udemy.com/course/bigquery/learn/lecture/22899018#questions

Sources

  • Querying data does not require that the data is loaded into BQ
    • Public datasets
    • Shared datasets
    • External sources
  • You can also use federated queries to query data where it resides
  • Other situations require that the data is loaded into BQ

Loading external data into BQ

  • File Formats
    • CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports
    • Each file type has its own set of quotas and limitations (none are perfect)
      • If Read speed is good, write speed is likely compromised
      • If you want more compression, both read and write will be affected
    • Avro is preferred for loading compressed and uncompressed
      • Data can be read in parallel, even when data blocks are compressed
        • File compression itself is not supported
        • Compressed data blocks are
      • BQ supports DEFLATE and Snappy codecs for compressed data blocks in Avro files
    • Non-Avro formats
      • BQ can load uncompressed files significantly faster than compressed
        • These can be read in parallel
        • But these can lead to bandwidth limitations and higher storage costs
    • Limitations of file storage formats: Batch file loading data limitations
  • Cloud Storage Overview
    • Data is stored in Buckets (Object storage?)
    • Creating a bucket
      • Dashboard > Storage > Create Bucket
        • Naming Guidelines
        • Names must be globally unique: Same as AWS
        • Names are publicly visible
        • Put the bucket in the same region as your BQ Dataset
        • Access Control: Up to you.  Video had default “fine grained”, mine was Uniform.  I decided to use Fine grained.
    • You can upload single files or complete folders
    • URI:  gs://BUCKET_NAME/DIR_NAME/FILENAME.EXT
    • If data is sorted into multiple files, you can use a wildcard
      • gs://BUCKET_NAME/DIR_NAME/FILENAME*.EXT
  • Loading from Cloud Storage
    • Most common data storage for BQ
      • All GCP applications may store  data there in stuctured and unstructured formats
        • CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports
        • Each file type has its own set of quotas and limitations
    • Cloud storage provides both options
      • Loading into BQ
      • Querying directly without loading

45. Create & Query Permanent Tables on Cloud Storage

https://www.udemy.com/course/bigquery/learn/lecture/22899032#questions

  • Dataset > Create Table > Google Cloud Storage > Search for file (Bucket > Folder > File).
    • Table Type
      • Native table: The data will be imported into BQ
        • You will have ability to preview the data
      • External table: The data will remain in GCP Storage
        • Limitations
          • You cannot preview
          • You will not get an estimated bytes for the query
          • etc
        • Additional pseudo column: _FILE_NAME
          • select *, _FILE_NAME as fn from `proj.ds.table` limit 1
            • 1 Mary F 7065 gs://tas_gcp/names/yob1880.txt

46. External data source limitations

https://www.udemy.com/course/bigquery/learn/lecture/22899044#questions

  • BQ does not guarantee data consistency for external data sources
    • Changes to the underlying data while the query is running can result in unexpected behavior
  • Query performance for external data sources may not be as high as querying data in native BQ table
  • You cannot run BQ job that exports data from an external source
    • Workaround: Save query results to an intermediate destination table, the run the export job against that table
  • You cannot reference an external data source in a wildcard table query
  • Table partitioning and clustering is supported in limited ways
    • Hive partitioning layout
  • External data sources query results are not cached
    • You will be charged for each query, even if it is exactly the same as the previous.
  • Querying data in Cloud Bigtable is only available in certain regions and zones
    • us-central1: us-central1-a, b, c, f
    • europe-west1: europe-west1-b, c, d
  • Limited to 4 concurrent queries against a Cloud Bigtable external data source

LEAVE A COMMENT