{"id":3774,"date":"2022-02-12T16:16:29","date_gmt":"2022-02-12T16:16:29","guid":{"rendered":"https:\/\/wiki.thomasandsofia.com\/?p=3774"},"modified":"2022-02-13T19:04:01","modified_gmt":"2022-02-13T19:04:01","slug":"bqfbd-loading-and-querying-external-data-sources","status":"publish","type":"post","link":"https:\/\/wiki.thomasandsofia.com\/?p=3774","title":{"rendered":"BQFBD &#8211; Loading and Querying External Data Sources"},"content":{"rendered":"<p><a href=\"\/bigquery-for-big-data\/\">Main Menu<\/a><\/p>\n<h1>Section 10: Loading and Querying External Data Sources<\/h1>\n<h2>44. Introduction and Create Cloud Storage Bucket<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22899018#questions<\/p>\n<h3>Sources<\/h3>\n<ul>\n<li>Querying data does not require that the data is loaded into BQ\n<ul>\n<li>Public datasets<\/li>\n<li>Shared datasets<\/li>\n<li>External sources<\/li>\n<\/ul>\n<\/li>\n<li>You can also use federated queries to query data <strong>where<\/strong> it resides<\/li>\n<li>Other situations require that the data is loaded into BQ<\/li>\n<\/ul>\n<h3>Loading external data into BQ<\/h3>\n<ul>\n<li>File Formats\n<ul>\n<li>CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports<\/li>\n<li>Each file type has its own set of quotas and limitations (none are perfect)\n<ul>\n<li>If Read speed is good, write speed is likely compromised<\/li>\n<li>If you want more compression, both read and write will be affected<\/li>\n<\/ul>\n<\/li>\n<li>Avro is preferred for loading compressed and uncompressed\n<ul>\n<li>Data can be read in parallel, even when data blocks are compressed\n<ul>\n<li>File compression itself is not supported<\/li>\n<li>Compressed data blocks are<\/li>\n<\/ul>\n<\/li>\n<li>BQ supports DEFLATE and Snappy codecs for compressed data blocks in Avro files<\/li>\n<\/ul>\n<\/li>\n<li>Non-Avro formats\n<ul>\n<li>BQ can load uncompressed files significantly faster than compressed\n<ul>\n<li>These can be read in parallel<\/li>\n<li>But these can lead to bandwidth limitations and higher storage costs<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Limitations of file storage formats: <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/batch-loading-data#limitations\">Batch file loading data limitations<\/a><\/li>\n<\/ul>\n<\/li>\n<li>Cloud Storage Overview\n<ul>\n<li>Data is stored in Buckets (Object storage?)<\/li>\n<li>Creating a bucket\n<ul>\n<li>Dashboard &gt; Storage &gt; Create Bucket\n<ul>\n<li><a href=\"https:\/\/cloud.google.com\/storage\/docs\/naming-buckets?_ga=2.19040446.-704302698.1643802767\">Naming Guidelines<\/a><\/li>\n<li>Names must be globally unique: Same as AWS<\/li>\n<li>Names are publicly visible<\/li>\n<li>Put the bucket in the same region as your BQ Dataset<\/li>\n<li>Access Control: Up to you.\u00a0 Video had default &#8220;fine grained&#8221;, mine was Uniform.\u00a0 I decided to use Fine grained.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>You can upload single files or complete folders<\/li>\n<li>URI:\u00a0 gs:\/\/BUCKET_NAME\/DIR_NAME\/FILENAME.EXT<\/li>\n<li>If data is sorted into multiple files, you can use a wildcard\n<ul>\n<li>gs:\/\/BUCKET_NAME\/DIR_NAME\/FILENAME*.EXT<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Loading from Cloud Storage\n<ul>\n<li>Most common data storage for BQ\n<ul>\n<li>All GCP applications may store\u00a0 data there in stuctured and unstructured formats\n<ul>\n<li>CSV, JSON, Avro, Parquet, ORC, Datastore exports or Firestore exports<\/li>\n<li>Each file type has its own set of quotas and limitations<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Cloud storage provides both options\n<ul>\n<li>Loading into BQ<\/li>\n<li>Querying directly without loading<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>45. Create &amp; Query Permanent Tables on Cloud Storage<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22899032#questions<\/p>\n<ul>\n<li>Dataset &gt; Create Table &gt; Google Cloud Storage &gt; Search for file (Bucket &gt; Folder &gt; File).\n<ul>\n<li>Table Type\n<ul>\n<li>Native table: The data will be imported into BQ\n<ul>\n<li>You will have ability to preview the data<\/li>\n<\/ul>\n<\/li>\n<li>External table: The data will remain in GCP Storage\n<ul>\n<li>Limitations\n<ul>\n<li>You cannot preview<\/li>\n<li>You will not get an estimated bytes for the query<\/li>\n<li>etc<\/li>\n<\/ul>\n<\/li>\n<li>Additional pseudo column: _FILE_NAME\n<ul>\n<li>select *, _FILE_NAME as fn from `proj.ds.table` limit 1\n<ul>\n<li><code>1 Mary F 7065 gs:\/\/tas_gcp\/names\/yob1880.txt<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>46. External data source limitations<\/h2>\n<p>https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22899044#questions<\/p>\n<ul>\n<li>BQ does not guarantee data consistency for external data sources\n<ul>\n<li>Changes to the underlying data while the query is running can result in unexpected behavior<\/li>\n<\/ul>\n<\/li>\n<li>Query performance for external data sources may not be as high as querying data in native BQ table<\/li>\n<li>You cannot run BQ job that exports data from an external source\n<ul>\n<li>Workaround: Save query results to an intermediate destination table, the run the export job against that table<\/li>\n<\/ul>\n<\/li>\n<li>You cannot reference an external data source in a wildcard table query<\/li>\n<li>Table partitioning and clustering is supported in limited ways\n<ul>\n<li>Hive partitioning layout<\/li>\n<\/ul>\n<\/li>\n<li>External data sources query results are not cached\n<ul>\n<li>You will be charged for each query, even if it is exactly the same as the previous.<\/li>\n<\/ul>\n<\/li>\n<li>Querying data in Cloud Bigtable is only available in certain regions and zones\n<ul>\n<li>us-central1: us-central1-a, b, c, f<\/li>\n<li>europe-west1: europe-west1-b, c, d<\/li>\n<\/ul>\n<\/li>\n<li>Limited to 4 concurrent queries against a Cloud Bigtable external data source<\/li>\n<li><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Main Menu Section 10: Loading and Querying External Data Sources 44. Introduction and Create Cloud Storage Bucket https:\/\/www.udemy.com\/course\/bigquery\/learn\/lecture\/22899018#questions Sources Querying data does not require that the data is loaded into BQ Public datasets Shared datasets External sources You can also use federated queries to query data where it resides Other situations require that the data ..<\/p>\n<div class=\"clear-fix\"><\/div>\n<p><a href=\"https:\/\/wiki.thomasandsofia.com\/?p=3774\" title=\"read more...\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[77,78],"tags":[],"class_list":["post-3774","post","type-post","status-publish","format-standard","hentry","category-bigquery","category-bigquery-for-big-data-engineers"],"_links":{"self":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3774","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3774"}],"version-history":[{"count":4,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3774\/revisions"}],"predecessor-version":[{"id":3779,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=\/wp\/v2\/posts\/3774\/revisions\/3779"}],"wp:attachment":[{"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3774"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3774"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wiki.thomasandsofia.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3774"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}