How to export BigQuery partitions to GCS using Airflow - google-cloud-platform

I need to create an Airflow job that exports the partitions of a BigQuery table to GCS for a given range of _PARTITIONDATE. I need each partition to be in a separate file named with the partition date. How can I achieve this?
I have tried using Airflow tasks that use SQL to fetch the _PARTITIONDATE, but can I do it programmatically?

For this, I recommend performing a loop in your DAG definition (the loop lives in your Python code and simply adds many tasks to the DAG; by definition, the DAG itself can't contain loops).
The algorithm should look like this:
For each day in the range:
Query BigQuery for that day and save the result to a temporary table whose name contains the date. Use the BigQueryOperator.
Extract the temporary table to GCS. Use the BigQueryToCloudStorageOperator.
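As an illustration, here is a minimal sketch of that loop, assuming the Airflow 1.x contrib operators named above and hypothetical project, dataset, bucket and date-range values:

from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

START, END = date(2023, 1, 1), date(2023, 1, 7)  # hypothetical _PARTITIONDATE range

with DAG("export_partitions", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    day = START
    while day <= END:
        suffix = day.strftime("%Y%m%d")
        # 1) Save that day's partition into a temporary table named with the date
        extract = BigQueryOperator(
            task_id=f"extract_{suffix}",
            sql=f"SELECT * FROM `my_project.my_dataset.my_table` "
                f"WHERE _PARTITIONDATE = '{day.isoformat()}'",
            destination_dataset_table=f"my_project.my_dataset.tmp_export_{suffix}",
            write_disposition="WRITE_TRUNCATE",
            use_legacy_sql=False,
        )
        # 2) Export that temporary table to GCS, one file set per date
        export = BigQueryToCloudStorageOperator(
            task_id=f"export_{suffix}",
            source_project_dataset_table=f"my_project.my_dataset.tmp_export_{suffix}",
            destination_cloud_storage_uris=[f"gs://my-bucket/exports/{suffix}/part-*.csv"],
            export_format="CSV",
        )
        extract >> export
        day += timedelta(days=1)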

Just follow the link below; it's a guide to exporting BigQuery partitions to GCS using Airflow:
https://m.youtube.com/watch?v=wAyu5BN3VpY&t=28s

Related

How to Transfer a BigQuery view to a Google Cloud Storage bucket as a csv file

I need to export the content of a BigQuery view to a CSV file in GCP, with an Airflow DAG. To export the content of a BQ table, I can use BigQueryToCloudStorageOperator. But in my case I need to use an existing view, and BigQueryToCloudStorageOperator fails with this error, which I see when checking the logs for the failed DAG:
BigQuery job failed: my_view is not allowed for this operation because it is currently a VIEW
So, what options do I have here? I can't use a regular table, so maybe there is another operator that would work with view data stored in BQ instead of a table? Or maybe the same operator would work with some additional options (although I don't see anything useful in the Apache documentation for BigQueryToCloudStorageOperator)?
I think the BigQuery client doesn't offer the possibility to export a view to a GCS file.
It's not perfect, but I propose two solutions.
First solution (more native, with existing operators):
Create a staging table to export to GCS
At the beginning of your DAG, create a task that truncates this staging table
Add a task with a select on your view and an insert into your staging table (insert/select)
Use the bigquery_to_gcs operator to export from your staging table (see the sketch below)
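A rough sketch of that first solution, assuming a hypothetical staging table my_project.my_dataset.my_view_staging and the Airflow 1.x contrib operator names (adapt to the provider operators if you're on Airflow 2):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

with DAG("export_view_to_gcs", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # Truncate the staging table at the beginning of the DAG
    truncate_staging = BigQueryOperator(
        task_id="truncate_staging",
        sql="TRUNCATE TABLE `my_project.my_dataset.my_view_staging`",
        use_legacy_sql=False,
    )
    # Insert/select from the view into the staging table
    load_staging = BigQueryOperator(
        task_id="load_staging",
        sql="INSERT INTO `my_project.my_dataset.my_view_staging` "
            "SELECT * FROM `my_project.my_dataset.my_view`",
        use_legacy_sql=False,
    )
    # Export the staging table (a regular table) to GCS as CSV
    export_to_gcs = BigQueryToCloudStorageOperator(
        task_id="export_to_gcs",
        source_project_dataset_table="my_project.my_dataset.my_view_staging",
        destination_cloud_storage_uris=["gs://my-bucket/exports/my_view-*.csv"],
        export_format="CSV",
    )
    truncate_staging >> load_staging >> export_to_gcs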
Second solution (less native, with the Python clients and a PythonOperator):
Use a PythonOperator
In this operator, use the BigQuery Python client to load the data from your view as dicts, and the Cloud Storage Python client to write a file to GCS from those dicts (see the sketch below)
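And a minimal sketch of the second solution with the Python clients (bucket, table and file names are hypothetical):

import csv
import io
from google.cloud import bigquery, storage

def export_view_to_gcs(**context):
    bq = bigquery.Client()
    rows = bq.query("SELECT * FROM `my_project.my_dataset.my_view`").result()
    # Build a CSV in memory from the query results
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([field.name for field in rows.schema])
    for row in rows:
        writer.writerow(list(row.values()))
    # Upload the CSV to GCS with the storage client
    storage.Client().bucket("my-bucket").blob("exports/my_view.csv") \
        .upload_from_string(buf.getvalue(), content_type="text/csv")

You would then wire it into the DAG with PythonOperator(task_id="export_view", python_callable=export_view_to_gcs).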
I have a preference for the first solution, even if it forces me to create a staging table.
I ended up with a kind of combined solution; part of it is what Mazlum Tosun suggested in his answer. In my DAG I added an extra first step, a DataLakeKubernetesPodOperator, which runs a Python file. That Python file calls SQL files containing simple queries (put in an await asyncio.wait(...) block and executed with bq_execute()): one truncates an existing table (to prepare it for new data), and the other copies (inserts) data from the view into the truncated table (as Mazlum Tosun suggested).
After that step, the rest is the same as before: I use BigQueryToCloudStorageOperator to copy data from the regular table (which now contains the data from the view) to a Google Cloud Storage bucket, and now it works fine.

How to display Airflow DAG status in BigQuery tables

I want to write the final status (success/failure) of an Airflow DAG to a table in BQ.
That table could contain columns like Date-Time, DAG-Name, Status, etc., and it would get populated according to the final status of the DAG.
Please help; how can this be achieved?
There's no native out-of-the-box method to achieve this in Airflow. However, you could implement a function yourself that writes data to BigQuery and run it via a DAG's on_success_callback and on_failure_callback parameters.
Note: BigQuery is not a transactional database and has limits on the number of inserts per day. For a large number of DAG runs, you might want to consider writing results to BigQuery in batches.
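As a rough sketch of that idea (the destination table and its columns are hypothetical, and the streaming insert is only for simplicity; batch the writes for many runs):

from datetime import datetime
from google.cloud import bigquery

STATUS_TABLE = "my_project.monitoring.dag_status"  # hypothetical table

def record_status(status):
    def _callback(context):
        # Stream one row per DAG run; subject to BigQuery streaming-insert quotas
        bigquery.Client().insert_rows_json(STATUS_TABLE, [{
            "run_time": datetime.utcnow().isoformat(),
            "dag_name": context["dag"].dag_id,
            "status": status,
        }])
    return _callback

# Then, in the DAG definition:
# dag = DAG(..., on_success_callback=record_status("success"),
#           on_failure_callback=record_status("failure"))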
If you need the data in real time, I would go with something along the lines of the approach @Bas has suggested, maybe with Firestore or Cloud SQL. However, note his comments on the inserts per day if you go with BigQuery.
If you can wait for the results on a daily basis, you can create a log sink to BigQuery as described here:
https://cloud.google.com/bigquery/docs/reference/auditlogs#stackdriver_logging_exports
In the filter criteria you can either bring in all of the Airflow logs or just the ones from the worker/scheduler.
Example criteria:
resource.type="cloud_composer_environment"
logName="projects/{YOUR-PROJECT}/logs/airflow-worker"
In the log textPayload you will see something like:
Marking task as SUCCESS. dag_id=thing, task_id=stuff, executiondate=20220307T111111, start_date=20220307T114858, end_date=20220307T114859
You can then parse out what you need in BigQuery.
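For example, a hedged sketch of parsing those textPayload lines out of a hypothetical sink table with the BigQuery Python client:

from google.cloud import bigquery

# Hypothetical dataset/table created by the log sink export
QUERY = r"""
SELECT
  timestamp,
  REGEXP_EXTRACT(textPayload, r'dag_id=([^,]+)') AS dag_id,
  REGEXP_EXTRACT(textPayload, r'Marking task as (\w+)') AS status
FROM `my_project.airflow_logs.airflow_worker_*`
WHERE textPayload LIKE 'Marking task as %'
"""

for row in bigquery.Client().query(QUERY).result():
    print(row.timestamp, row.dag_id, row.status)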
To complement the answer of user Bas Harenslak, there are also these options you can explore:
You can make use of the TriggerDagRunOperator. By using it, you can have one DAG (a recap-dag) which will be triggered by your DAGs to populate the record into your destination dataset.
trigger_recap_dag = TriggerDagRunOperator(
    task_id="trigger_recap_dag",
    trigger_dag_id="recap-dag",
    wait_for_completion=False,
    allowed_states=["success"],
    conf={"Time": str(datetime.now()), "DAG": "recap-dag", "Status": "success"},
)
ingestion >> transformation >> save >> send_notification >> trigger_recap_dag
If you see fit, this recap-dag can also be independent, running every hour/day/week of your choosing and checking your DAGs' status.
with DAG(
    'recap-dag',
    schedule_interval='@daily',
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    ...
# Airflow >= 2.0.0
# Inside a PythonOperator callable
from airflow.models import DagRun

def GetRunningDagsInfo():
    dag_runs = DagRun.find(
        dag_id=your_dag_id,
        execution_start_date=your_start_date,
        execution_end_date=your_end_date,
    )
    ...
You can make use of the prior options and come up with a solution like this: after your DAG (or DAGs) completes, it fires the trigger DAG. This recap-dag saves your DAG records into a custom table or file, and then your independent DAG runs, retrieves the datasets that have been created so far, and pushes the data into your BigQuery table.
Another option is to look into your Airflow database to retrieve run information. This is known as Data Profiling. It has been deprecated in recent versions due to security concerns.

Speed up BigQuery query job to import from Cloud SQL

I am performing a query to generate a new BigQuery table of size ~1 TB (a few billion rows), as part of migrating a Cloud SQL table to BigQuery using a federated query. I use the BigQuery Python client to submit the query job; in the query I select everything from the Cloud SQL database table using EXTERNAL_QUERY.
I find that the query can take 6+ hours (and fails with "Operation timed out after 6.0 hour")! Even if it didn't fail, I would like to speed it up, as I may need to perform this migration again.
I see that the PostgreSQL egress is 20 Mb/sec, consistent with a job that would take half a day. Would it help if I considered something more distributed with Dataflow? Or, more simply, extend my Python code using the BigQuery client to generate multiple queries that BigQuery can run asynchronously?
Or is it possible to still use that single query but increase the egress traffic (database configuration)?
I think it is more suitable to use a dump export.
Running a query on a large table is an inefficient job.
I recommend exporting the Cloud SQL data to a CSV file.
BigQuery can import CSV files, so you can use this file to create your new BigQuery table.
I'm not sure how long this job will take, but at least it will not fail.
Refer here for more details about exporting Cloud SQL to a CSV dump.
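As a rough sketch, once the CSV dump sits in GCS (for example, produced with gcloud sql export csv), a load job with the BigQuery Python client could look like this (paths and names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # or pass an explicit schema
    skip_leading_rows=0,    # Cloud SQL CSV exports have no header row
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/cloudsql_export/*.csv",      # hypothetical dump location
    "my_project.my_dataset.migrated_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish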

Spanner to CSV DataFlow

I am trying to copy a table from Spanner to BigQuery. I have created two Dataflow jobs: one that copies from Spanner to a text file, and another that imports the text file into BigQuery.
The table has a column whose value is a JSON string. The issue appears when the Dataflow job runs while importing from the text file into BigQuery. The job throws the error below:
INVALD JSON: :1:38 Expected eof but found, "602...
Is there any way I can exclude this column while copying, or any way I can copy the JSON object as it is? I tried excluding this column in the schema file, but it did not help.
Thank you!
Looking at https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-spanner-to-cloud-storage-text, there are neither options on the BigQuery import job that would allow skipping columns, nor Cloud Spanner options that would skip a column when extracting.
I think your best shot is to write a custom processor that will drop the column, similar to Cleaning data in CSV files using dataflow.
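A minimal Beam (Python) sketch of such a processor, assuming the intermediate export is CSV and the JSON column sits at a known index (both are assumptions; adjust to your export format):

import csv
import io
import apache_beam as beam

DROP_INDEX = 3  # hypothetical position of the JSON column in the exported CSV

def drop_column(line):
    # Parse one CSV line, drop the problematic column, re-serialize the rest
    row = next(csv.reader(io.StringIO(line)))
    kept = [value for i, value in enumerate(row) if i != DROP_INDEX]
    out = io.StringIO()
    csv.writer(out).writerow(kept)
    return out.getvalue().rstrip("\r\n")

with beam.Pipeline() as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/spanner_export/*.csv")
     | "DropColumn" >> beam.Map(drop_column)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/cleaned/part", file_name_suffix=".csv"))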
It's more complicated, but you can also try Dataprep: https://cloud.google.com/dataprep/docs/html/Drop-Transform_57344635. It should be possible to run Dataprep jobs as a Dataflow template.

Process data from BigQuery using Dataflow

I want to retrieve data from BigQuery that arrives every hour, do some processing, and write the newly calculated variables to a new BigQuery table. The thing is that I've never worked with GCP before, and I have to now for my job.
I already have my code in Python to process the data, but it works only with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query, follow these steps:
In the BigQuery console, write your query
After writing the correct query, click Schedule query and then Create new scheduled query
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the time and date when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL, not scripting, in your transformations.
Click Schedule
After that, your query will start being executed according to your schedule and destination table configuration.
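The same scheduled query can also be created programmatically with the BigQuery Data Transfer Service Python client; here is a hedged sketch with hypothetical project, dataset and table names:

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",        # hypothetical destination dataset
    display_name="hourly_processing",
    data_source_id="scheduled_query",
    schedule="every 2 hours",
    params={
        "query": "SELECT * FROM `my_project.my_dataset.raw_events`",
        "destination_table_name_template": "processed_events",
        "write_disposition": "WRITE_APPEND",
    },
)

config = client.create_transfer_config(
    parent="projects/my-project/locations/us",
    transfer_config=transfer_config,
)
print("Created scheduled query:", config.name)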
According to Google's recommendations, when your data is in BigQuery and you want to transform it and store it back in BigQuery, it's always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, use SQL directly, you can create a User Defined Function (UDF) in BigQuery in JavaScript.
EDIT
If you have no information about when the data is updated in BigQuery, Dataflow won't help you with this. Dataflow can process data in real time only if that data arrives through Pub/Sub. If not, it's not magic!
Because you don't have the information of when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.