I am trying to transfer some files to BigQuery which are stored on my VM instances. Normally we follow a two-step process:
Transfer the files from the VM instance to a Cloud Storage bucket.
Load the data from the Cloud Storage bucket into BigQuery.
Now I want to load files directly from a VM instance into BigQuery. Is there any way to do this?
You can load data directly from a readable data source (such as your local machine) by using:
The Cloud Console or the classic BigQuery web UI
The bq command-line tool's bq load command
The API
The client libraries
Please follow the official documentation for examples of each of these approaches.
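For instance, here is a minimal sketch using the Python client library; the destination table and file names are placeholders you would replace with your own:

from google.cloud import bigquery

# Assumes application default credentials are available on the VM.
client = bigquery.Client()

# Placeholder destination: replace with your dataset and table names.
table_id = "my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,          # let BigQuery infer the schema
    skip_leading_rows=1,      # skip the CSV header row
)

# Load a local file straight from the VM, with no Cloud Storage staging step.
with open("abc.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows)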
Moreover, if you want to keep the idea of sending your files to a Cloud Storage bucket first, you can consider using the Dataflow templates:
Cloud Storage Text to BigQuery (Stream)
Cloud Storage Text to BigQuery (Batch)
which allow you to read text files stored in Cloud Storage, transform them using a JavaScript user-defined function (UDF) that you provide, and write the result to BigQuery. It is an automated solution.
I hope you find the above pieces of information useful.
The solution would be to use the bq command for this.
The command would look like this:
bq load --autodetect --source_format=CSV x.y abc.csv
where x.y is the destination in dataset.table form and --autodetect tells BigQuery to infer the schema from the CSV.
Related
We have a script exporting CSV files from another database and uploading them to a bucket in GCP Cloud Storage. I know there is the possibility to schedule loads into BigQuery using the BigQuery Data Transfer Service, but I am a bit surprised that there doesn't seem to be a solution that triggers automatically when a file upload is finished.
Did I miss something?
You might need to handle that event (google.storage.object.finalize) by your own means.
For example, that event can trigger a Cloud Function (see Google Cloud Storage Triggers), which can do various things, from triggering a load job to implementing complex data processing (cleaning, validation, merging, etc.) while the data from the file is being loaded into the BigQuery table.
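As an illustration, here is a minimal sketch of such a Cloud Function in Python; the destination table is a placeholder, and the job configuration assumes CSV input:

from google.cloud import bigquery

# Placeholder destination: replace with your project, dataset, and table.
TABLE_ID = "my-project.my_dataset.my_table"

def load_to_bigquery(event, context):
    # Background function triggered by google.storage.object.finalize.
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    )

    # Start a load job for the newly finalized object and wait for it.
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()
    print(f"Loaded {uri} into {TABLE_ID}")

You would deploy the function with a google.storage.object.finalize trigger on the bucket that receives the exports.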
I am very new to Google Cloud.
I was querying some public datasets in google BigQuery.
I wanted to know whether there is any way to tell if the data being queried comes from a Google Cloud Storage bucket.
I have tried running T-SQL queries against it, but that didn't work.
Any kind of storage metadata regarding the dataset that I am scanning would be useful.
Is it even possible to know whether the queried dataset is from a Google Cloud Bucket? If yes, how would I find where the bucket is located?
You could scan the relevant INFORMATION_SCHEMA view for tables of type EXTERNAL, which would identify which tables are potentially defined against an external source such as Cloud Storage. However, that view doesn't expose the details of the external definition, so you'd need to fall back to inspecting the tables individually (or via something like tables.get in the API) to get all the details, since you appear to be after specific storage URIs.
As for the location of a bucket, that would need to be interrogated against another source, such as the Cloud Storage client libraries, the Cloud Console, or the gsutil command.
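A rough sketch of that approach with the Python clients; the project and dataset names are placeholders:

from google.cloud import bigquery, storage

bq_client = bigquery.Client()
gcs_client = storage.Client()

# Placeholder project/dataset: replace with the dataset you are inspecting.
query = """
    SELECT table_name
    FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`
    WHERE table_type = 'EXTERNAL'
"""

for row in bq_client.query(query).result():
    table = bq_client.get_table(f"my_project.my_dataset.{row.table_name}")
    config = table.external_data_configuration
    if config and config.source_uris:
        for uri in config.source_uris:
            # URIs look like gs://bucket-name/path/file.csv
            bucket_name = uri.split("/")[2]
            location = gcs_client.get_bucket(bucket_name).location
            print(row.table_name, uri, location)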
We are using MySQL (Cloud SQL) as the metadata repository for Dataproc. This doesn't store any information about GCS files that are not part of Hive external tables.
Can anyone suggest the best way to store all the file/data details in one catalog in Google Cloud?
Google Cloud Data Catalog beta doesn't work with GCS or the Hive Metastore; see this doc:
Tagging Cloud Storage assets (for example, buckets and objects) is unavailable in the Data Catalog beta release.
It does work with BigQuery, though; see this quickstart example.
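For reference, a minimal sketch of looking up a BigQuery table in Data Catalog with the Python client; the project, dataset, and table names are placeholders:

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Placeholder resource name: replace the project, dataset, and table.
resource = (
    "//bigquery.googleapis.com/projects/my-project"
    "/datasets/my_dataset/tables/my_table"
)

# Look up the Data Catalog entry attached to the BigQuery table.
entry = client.lookup_entry(request={"linked_resource": resource})
print(entry.name)
print(entry.type_)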
dvorzhak,
Data Catalog became GA: Data Catalog GA
And they have updated the docs for Filesets:
Data Catalog Filesets
Also, if you want to create Data Catalog assets for each of your Cloud Storage objects, you may use this open source script: datacatalog-util, which has an option to create Entries for your files.
Finally, there's an open source connector script if you want to ingest Hive databases/tables into Data Catalog.
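If you go the Filesets route, here is a rough sketch with the Python Data Catalog client; the project, location, entry group, and file pattern are all placeholders, and the exact fields may differ slightly depending on your client library version:

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Placeholders: replace with your project and location.
parent = datacatalog_v1.DataCatalogClient.common_location_path("my-project", "us-central1")

# An entry group to hold the fileset entries.
entry_group = client.create_entry_group(
    parent=parent,
    entry_group_id="my_entry_group",
    entry_group=datacatalog_v1.EntryGroup(display_name="My GCS filesets"),
)

# A fileset entry pointing at a Cloud Storage file pattern.
entry = datacatalog_v1.Entry()
entry.display_name = "Daily CSV exports"
entry.type_ = datacatalog_v1.EntryType.FILESET
entry.gcs_fileset_spec.file_patterns.append("gs://my-bucket/exports/*.csv")

created = client.create_entry(
    parent=entry_group.name, entry_id="daily_csv_exports", entry=entry
)
print(created.name)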
Though Google Data Catalog is in the beta phase, it currently provides data catalog support for BigQuery and Cloud Pub/Sub, but not for Google Cloud Storage.
Is there any way, using existing components/services, to build a data catalog for assets stored in Google Cloud Storage (buckets, objects, ...), and when could we expect direct support for GCS in Google Data Catalog?
According to Google's documentation:
Tagging Cloud Storage assets (for example, buckets and objects) is unavailable in the Data Catalog beta release.
Full support for Cloud Data Catalog is scheduled for the last quarter of 2019 (from October onward).
Data Catalog became GA: Data Catalog GA
And they have updated the docs for Filesets:
Data Catalog Filesets
Finally, if you want to create Data Catalog assets for each of your Cloud Storage objects, you may use this open source script: datacatalog-util, which has an option to create Entries for your files.
I have a CSV hosted on a server which updates daily. I'd like to set up a transfer to load this into Google Cloud Storage so that I can then query it using BigQuery.
I'm looking at the Storage Transfer Service, and it doesn't seem to have what I need, e.g. it only accepts CSVs, or files from other Google Cloud Storage buckets or Amazon S3 buckets.
Thanks in advance
You can also use a URL to a TSV file, as explained here, and configure the transfer to run daily at the time of your choice.
Alternatively, if that still doesn't fit your needs, you can install gsutil on your remote machine, use the gsutil rsync command, and schedule it to run daily.
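If you would rather script it yourself instead of using gsutil, here is a rough sketch of the same idea with the Cloud Storage Python client (the URL, bucket, and object names are placeholders), which you could then schedule with cron:

import requests
from google.cloud import storage

# Placeholders: replace with the real CSV URL and your bucket/object names.
CSV_URL = "https://example.com/export/daily.csv"
BUCKET_NAME = "my-bucket"
OBJECT_NAME = "exports/daily.csv"

def upload_daily_csv():
    # Download the CSV from the remote server.
    response = requests.get(CSV_URL, timeout=60)
    response.raise_for_status()

    # Upload (overwrite) the object in Cloud Storage.
    bucket = storage.Client().bucket(BUCKET_NAME)
    blob = bucket.blob(OBJECT_NAME)
    blob.upload_from_string(response.content, content_type="text/csv")

if __name__ == "__main__":
    upload_daily_csv()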