I have some questions related to Cloud Composer and BigQuery. We need to create an automated process to export tables from BigQuery to Cloud Storage.
I have 4 options at the moment:
bigquery_to_gcs Operator
BashOperator: Executing the "bq" command provided by the Cloud SDK on Cloud Composer.
Python Function: Create a Python function using the BigQuery API, doing almost the same as bigquery_to_gcs, and execute this function with Airflow.
Dataflow: The job would be executed with Airflow too.
I have some thoughts related to the first 3 options, though. If the table is huge, is there a chance it consumes a big part of Cloud Composer's resources? I've been trying to find out whether the BashOperator and the BigQuery operator consume Cloud Composer resources, keeping in mind that this process will be in production in the future with more DAGs running at the same time. If that's the case, would Dataflow be a more convenient option?
A nice thing about Dataflow is that we can export the table to a single file if we want, which is not possible with the other options when the table is larger than 1 GB.
BigQuery itself has a feature to export data to GCS. This means that if you use any of the things you mentioned (except for the Dataflow job), you will simply trigger an export job that will be performed and managed by BigQuery.
This means that you do not need to worry about consuming cluster resources in Composer. The bigquery_to_gcs operator is simply the controller instructing BigQuery to run the export.
So, from the options you mention: bigquery_to_gcs operator, BashOperator, and Python function will incur a similar low cost. Just use whichever you find easier to manage.
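For illustration, a minimal sketch of the operator approach (the operator comes from the Google provider package; the project, dataset, table, bucket, and DAG names are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

# Composer only schedules this task; the export itself runs inside BigQuery.
with DAG(
    dag_id="export_bq_table_to_gcs",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    export_table = BigQueryToGCSOperator(
        task_id="export_table",
        source_project_dataset_table="my-project.my_dataset.my_table",
        # The wildcard lets BigQuery shard the output when the table exceeds 1 GB.
        destination_cloud_storage_uris=["gs://my-bucket/exports/my_table-*.csv"],
        export_format="CSV",
        compression="GZIP",
    )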
Related
I have some CSV files generated by a Raspberry Pi that need to be pushed into BigQuery tables.
Currently, we have a Python script using bigquery.LoadJobConfig for batch uploads, and I run it manually. The goal is to have streaming data (or a load every 15 minutes) in a simple way.
I explored different solutions:
Using Airflow to run the Python script (high complexity and maintenance)
Dataflow (I am not familiar with it, but if it does the job I will use it)
A scheduled pipeline that runs the script through GitLab CI (cron syntax: */15 * * * *)
Could you please help me and suggest the best way to push CSV files into BigQuery tables in real time or every 15 minutes?
Good news, you have many options! Perhaps the easiest would be to automate the Python script that you currently have, since it does what you need. Assuming you are running it manually on a local machine, you could upload it to a lightweight VM on Google Cloud, then use cron on the VM to automate running it. I have used this approach in the past and it worked well.
Another option would be to deploy your Python code to a Google Cloud Function, a way to let GCP run the code without you having to worry about maintaining the backend resources.
Find out more about Cloud Functions here: https://cloud.google.com/functions
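For illustration, here is a rough sketch of such a function, assuming the Raspberry Pi first uploads each CSV to a GCS bucket and the function is triggered on every new object; the dataset and table names are placeholders:
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """GCS-triggered Cloud Function: load a newly uploaded CSV into BigQuery."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # or pass an explicit schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
    load_job.result()  # wait for the load job to complete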
A third option, depending on where your .csv files are being generated: perhaps you could use the BigQuery Data Transfer Service to handle the imports into BigQuery.
More on that here: https://cloud.google.com/bigquery/docs/dts-introduction
Good luck!
Adding to @Ben's answer, you can also use Cloud Composer to orchestrate this workflow. It is built on Apache Airflow, so you can use Airflow-native tools, such as the powerful Airflow web interface, the command-line tools, the Airflow scheduler, etc., without worrying about your infrastructure and maintenance.
You can implement a DAG to
upload the CSV files from local storage to GCS, then
load from GCS to BigQuery using GCSToBigQueryOperator (as sketched below)
More on Cloud Composer: https://cloud.google.com/composer/docs
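A minimal sketch of such a DAG, assuming the files already land in a GCS bucket; the bucket, object prefix, and table names are placeholders:
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Hypothetical DAG that loads CSVs from GCS into BigQuery every 15 minutes.
with DAG(
    dag_id="gcs_csv_to_bq",
    start_date=datetime(2022, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id="load_csv",
        bucket="my-bucket",
        source_objects=["incoming/*.csv"],
        destination_project_dataset_table="my_project.my_dataset.my_table",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )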
New to Airflow here.
I have Python code that reads a BigQuery table, applies some transformations as a pandas DataFrame, and saves the result as a file.
Using Airflow, I need a DAG that executes my code and saves the output as a file in a Google Cloud Storage bucket.
Airflow is deployed on Composer.
How am I supposed to do that?
If your transformation can be expressed in BigQuery SQL, you can use the BigQuery-to-GCS operator:
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/bigquery_to_gcs/index.html
Examples here:
https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/example_dags/example_bigquery_to_gcs.py
If you need to do a more complex transformation for which there is no external service you can orchestrate, create a custom operator that uses the BigQuery hook and the GCS hook and does what you want to do. It is easier than you think - just take a look at the BigQueryToGCS operator and you will see that it's rather straightforward.
https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py
Airflow is all Python - so it does not really change much whether you compose existing operators in a DAG or write your own operators (and then compose them). It's all Python code. Airflow implemented the Hook abstraction specifically to hide the complexity of communicating with the services, while still allowing you, as the DAG/operator author, to write operator code that uses the hooks and performs some extra operations.
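As a rough sketch of that idea (not an operator from the provider package, just an illustration with placeholder names), a custom operator could read the table into pandas with the BigQuery hook, transform it, and upload the result with the GCS hook:
from airflow.models import BaseOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.hooks.gcs import GCSHook


class BigQueryPandasToGCSOperator(BaseOperator):
    def __init__(self, *, sql, bucket, object_name, gcp_conn_id="google_cloud_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.bucket = bucket
        self.object_name = object_name
        self.gcp_conn_id = gcp_conn_id

    def execute(self, context):
        # Read the query result into a pandas DataFrame via the BigQuery hook.
        bq_hook = BigQueryHook(gcp_conn_id=self.gcp_conn_id, use_legacy_sql=False)
        df = bq_hook.get_pandas_df(self.sql)

        # Put your pandas transformations here, e.g.:
        df = df.dropna()

        # Upload the transformed data as a CSV object to GCS.
        gcs_hook = GCSHook(gcp_conn_id=self.gcp_conn_id)
        gcs_hook.upload(
            bucket_name=self.bucket,
            object_name=self.object_name,
            data=df.to_csv(index=False),
            mime_type="text/csv",
        )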
Actually, my data goes through the following steps:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished and trigger a Dataflow pipeline only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea; from what I saw, this trigger uses the job ID, which apparently cannot be fixed, so whenever a job runs the function would have to be deployed again. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log that indicates that the job finished.
My files are large, so it isn't a good idea to use a Google Cloud Function to wait until the job finishes.
Despite what you mention about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter in addition if needed.
Then create a sink on this advanced filter (for example to a Pub/Sub topic that triggers your Cloud Function) and run your Dataflow job from there.
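As an illustration, a sketch of the Pub/Sub-triggered function that then launches the Dataflow job, assuming the pipeline is packaged as a Dataflow template; the project, bucket, and template paths are placeholders:
import base64
import json

from googleapiclient.discovery import build

def launch_dataflow(event, context):
    # The log sink delivers the matching LogEntry as the Pub/Sub message payload.
    log_entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    print("BigQuery load job finished:", log_entry.get("insertId"))

    # Launch the downstream pipeline from a pre-built Dataflow template.
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://my-bucket/templates/my_template",
        body={"jobName": "post-load-pipeline", "parameters": {}},
    ).execute()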
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. In Composer you define a DAG, and Airflow executes each node of the DAG, checking dependencies to ensure that tasks run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
I read many articles and solutions regarding scheduling queries to external storage from Google BigQuery, but they didn't seem to be that clear.
Note: My company has a subscription only to Google BigQuery and not to the complete cloud services (Google Cloud Platform).
I know how to do it manually but I am looking to automate the process since I need the same data every week.
Any suggestions will be appreciated. Thank you.
Option 1
You can use Apache Airflow, which provides the option to create scheduled tasks on top of BigQuery using the BigQuery operators.
You can find the basic steps required to start setting this up in this link.
Option 2
You can use the Google BigQuery command-line tool to export your data just as you would from the web UI, for example:
bq --location=[LOCATION] extract --destination_format [FORMAT] --compression [COMPRESSION_TYPE] --field_delimiter [DELIMITER] --print_header [BOOLEAN] [PROJECT_ID]:[DATASET].[TABLE] gs://[BUCKET]/[FILENAME]
Once you get this working, you can use any scheduling process of your liking to run this job.
BTW: Airflow has an operator (BashOperator) that enables you to run the command-line tool.
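For example, a rough sketch of a weekly DAG wrapping the bq command in a BashOperator (project, dataset, and bucket names are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical weekly DAG that shells out to the bq CLI to extract a table to GCS.
with DAG(
    dag_id="weekly_bq_extract",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="bq_extract",
        bash_command=(
            "bq --location=US extract --destination_format CSV "
            "--compression GZIP --field_delimiter ',' --print_header True "
            "my-project:my_dataset.my_table gs://my-bucket/my_table-*.csv.gz"
        ),
    )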
Once the file is in GCS, you can use the Box G Suite integration to see and manage your files.
I am new to GAE and I am trying to quickly find a way to retrieve logs from Datastore, clean them to my specs, and then save them to a table to be called on later for a reports view in my app. I was thinking of using Google Dataflow and creating batch jobs (the app is Python/Django), but the documentation does not seem to fit my use case, so maybe Dataflow is not the answer. I could create a Python script with BigQuery and schedule it through cron, but then I would have to contend with errors, and it seems there should be a faster way to solve this problem.
Any help/thoughts/suggestions are always greatly appreciated.
You can use the Dataflow/Beam Python SDK to develop a pipeline that reads entities from Datastore [1], transforms the data, and writes a table to BigQuery [2]. To schedule this job to run regularly you'll have to use a third-party mechanism such as a cron job. Note that Dataflow performs automatic scaling and retries to handle errors, so you are not expected to manually address these complexities.
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/datastore/v1/datastoreio.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
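A rough sketch of such a pipeline, assuming the newer v1new Datastore connector (the v1 module linked in [1] has a similar ReadFromDatastore transform); the project, kind, table, and field names are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

def clean_entity(entity):
    # Keep only the fields needed for the report; adjust to your log schema.
    props = entity.properties
    return {"user": props.get("user"), "action": props.get("action"), "ts": str(props.get("timestamp"))}

def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadLogs" >> ReadFromDatastore(Query(kind="LogEntry", project="my-project"))
         | "Clean" >> beam.Map(clean_entity)
         | "WriteToBQ" >> beam.io.WriteToBigQuery(
               "my-project:reports.cleaned_logs",
               schema="user:STRING,action:STRING,ts:STRING",
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

if __name__ == "__main__":
    run()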