Process data from BigQuery using Dataflow - google-cloud-platform

I want to retrieve data from BigQuery that arrives every hour, do some processing, and push the newly calculated variables into a new BigQuery table. The thing is that I've never worked with GCP before, and now I have to for my job.
I already have my code in Python to process the data, but it only works with a "static" dataset.

As your source and sink are both BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query, follow these steps:
In the BigQuery console, write your query
Once the query is correct, click Schedule query and then Create new scheduled query
Pay attention to these two fields:
Schedule options: there are pre-configured schedules such as daily, monthly, etc. If you need to run the query every two hours, for example, set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the date and time when your query should start running.
Destination for query results: here you set the dataset and table where your query's results will be saved. Keep in mind that this option is not available if you use scripting; in other words, your transformations must be plain SQL, not scripts.
Click Schedule
After that, your query will run according to your schedule and destination table configuration.
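If you later prefer to create the same scheduled query programmatically instead of through the console, a minimal sketch with the Python BigQuery Data Transfer client looks roughly like this (project, dataset, table, schedule, and query text are all placeholders):

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Hypothetical project/dataset/table names - replace with your own.
parent = transfer_client.common_project_path("my-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="Hourly transformation",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT id, SUM(value) AS total FROM `my-project.my_dataset.source_table` GROUP BY id",
        "destination_table_name_template": "calculated_variables",
        "write_disposition": "WRITE_APPEND",
    },
    schedule="every 2 hours",  # same syntax as the Custom schedule field in the console
)

transfer_config = transfer_client.create_transfer_config(
    parent=parent,
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")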

According to Google's recommendations, when your data are already in BigQuery and the transformed results are going back into BigQuery, it is always quicker and cheaper to do the work in BigQuery itself, provided you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, express everything directly in SQL, you can create a user-defined function (UDF) in BigQuery written in JavaScript.
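For illustration, a hypothetical persistent JavaScript UDF created through the Python BigQuery client might look like this (function name, dataset, and logic are made up):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical UDF: custom logic expressed in JavaScript, afterwards callable
# from any SQL query or scheduled query in the project.
client.query("""
CREATE OR REPLACE FUNCTION `my-project.my_dataset.clamp_score`(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS r'''
  if (x === null) return null;
  return Math.min(Math.max(x, 0.0), 1.0);
''';
""").result()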
EDIT
If you have no signal telling you when the data are loaded into BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if that data arrives through Pub/Sub; otherwise, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.

Related

Run update on multiple tables in BigQuery

I have a lake dataset which takes data from an OLTP system. Given the nature of the transactions, we get a lot of updates the next day, so to keep track of the latest record we use active_flag = '1'.
We also created an update script which retires old records by setting active_flag = '0'.
Now the main question: how can I execute an update statement while changing the table name automatically (programmatically)?
I know we have the option of using Cloud Functions, but they time out after 9 minutes and I have at least 350 tables to update.
Has anyone faced this situation before?
You can easily do this with Cloud Workflows.
There you set up templated calls to BigQuery as substeps, pass in a list of tables, loop through the items, and invoke the BigQuery step for each item/table.
I wrote an article with samples that you can adapt: Automate the execution of BigQuery queries with Cloud Workflows
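For context, the operation each Workflows substep ends up running is just the same UPDATE templated per table. A rough sketch of that loop with the Python BigQuery client (dataset, table list, and the retirement condition are all hypothetical) looks like this; Workflows issues the equivalent BigQuery jobs as substeps and isn't subject to the 9-minute Cloud Functions limit:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table list; in practice it could come from
# INFORMATION_SCHEMA.TABLES or a configuration file.
tables = ["orders", "customers", "payments"]

for table in tables:
    # Placeholder retirement condition - substitute your own logic for
    # deciding which records are superseded.
    sql = f"""
        UPDATE `my-project.lake_dataset.{table}`
        SET active_flag = '0'
        WHERE active_flag = '1'
          AND load_date < CURRENT_DATE()
    """
    client.query(sql).result()
    print(f"Retired old records in {table}")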

How to monitor if a BigQuery table contains current data and send an alert if not?

I have a BigQuery table and an external data import process that should add entries every day. I need to verify that the table contains current data (with a timestamp of today). Writing the SQL-query is not a problem.
My question is how best to set up such monitoring in GCP. Can Stackdriver execute custom BigQuery SQL? Or would a Cloud Function be more suitable? An App Engine application with a cron job? What's the best practice?
Not sure what the best practice is here, but one simple solution is a BigQuery scheduled query: schedule the query, make it fail if something is wrong using the ERROR() function, and configure the scheduled query to notify you (it sends an email) when it fails.
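For example, a check along these lines (table and column names are made up) can be run through the Python client, or its SQL pasted directly into a scheduled query:

from google.cloud import bigquery

client = bigquery.Client()

# ERROR() aborts the query when no rows from today exist; as the SQL of a
# scheduled query, that failure triggers the scheduled query's email notification.
freshness_check = """
SELECT
  IF(COUNT(*) = 0,
     ERROR('No rows with a timestamp of today in my_dataset.my_table'),
     COUNT(*)) AS todays_rows
FROM `my-project.my_dataset.my_table`          -- hypothetical table
WHERE DATE(event_timestamp) = CURRENT_DATE()   -- hypothetical timestamp column
"""

print(list(client.query(freshness_check).result()))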

How best cache bigquery table for fast lookup of individual row?

I have a raw data table in BigQuery with hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations, which results in a table in the ballpark of 33 million rows (6 GB) that is expected to grow slowly to roughly double its current size.
I need a way to do quick, single-row lookups by id against that aggregate table from a separate event-driven pipeline, i.e. a process is notified that person A just took an action; what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups, so I need to offload the data to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple strategies:
1) Schedule a dump of the agg table to GCS. Kick off a Dataflow job to stream the contents of the GCS dump to Pub/Sub. Create a serverless function that listens to the Pub/Sub topic and inserts rows into Firestore.
2) A long-running script on Compute Engine which streams the table directly from BQ and runs the inserts. (Seems slower than strategy 1.)
3) Schedule a dump of the agg table to GCS. Format it in such a way that it can be imported directly into Firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of Dataflow job that performs lookups directly against the BigQuery table? I haven't played with this approach before. No idea how costly / performant it is.
5) some other option I've not considered?
The ideal solution would give me millisecond access to an agg row, so that I can append the data to the real-time event.
Is there a clear winner here in the strategy I should pursue?
Remember that you could also CLUSTER your table by id, making your lookup queries much faster and cheaper in data scanned. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
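For example, a clustered rebuild of the aggregate table plus a parameterized point lookup might look like this with the Python client (project, dataset, table, and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Rebuild the aggregate table clustered by the lookup key.
client.query("""
CREATE OR REPLACE TABLE `my-project.my_dataset.person_agg`
CLUSTER BY person_id AS
SELECT person_id,
       COUNT(*)      AS event_count,
       MAX(event_ts) AS last_event_ts
FROM `my-project.my_dataset.raw_events`
GROUP BY person_id
""").result()

# Point lookups now only scan the blocks containing the requested id.
job = client.query(
    "SELECT * FROM `my-project.my_dataset.person_agg` WHERE person_id = @id",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("id", "STRING", "person-123")]
    ),
)
for row in job.result():
    print(dict(row))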
You could also set up exports from BigQuery to CloudSQL, for subsecond results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, now BigQuery can read straight out of CloudSQL if you'd like it to be your source of truth for "hot-data":
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229

How to monitor the number of records loaded into a BQ table while using BigQuery streaming?

We are trying to insert data into BigQuery (streaming) using Dataflow. Is there a way we can keep a check on the number of records inserted into BigQuery? We need this data for reconciliation purposes.
Add a step to your Dataflow pipeline which calls the BigQuery Tables.get API, or run this query before and after the flow (both are equally good):
select row_count, table_id from `dataset.__TABLES__` where table_id = 'audit'
The query returns the current row count for the table, so comparing the result before and after the load gives you the number of records inserted.
You may also be able to examine the "Elements added" counter by clicking on the step that writes to BigQuery in the Dataflow UI.
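For the Tables.get option, a minimal sketch with the Python BigQuery client (the table name is a placeholder); note that num_rows reflects rows committed to storage and may not yet include rows still in the streaming buffer:

from google.cloud import bigquery

client = bigquery.Client()

def row_count(table_ref: str) -> int:
    # get_table() calls Tables.get under the hood; num_rows is the committed
    # row count and can lag behind rows sitting in the streaming buffer.
    return client.get_table(table_ref).num_rows

before = row_count("my-project.dataset.audit")   # hypothetical table
# ... run the Dataflow pipeline here ...
after = row_count("my-project.dataset.audit")
print(f"Records added by this run: {after - before}")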

Read spanner data from a table which is simultaneously being written

I'm copying Spanner data to BigQuery through a Dataflow job. The job is scheduled to run every 15 minutes. The problem is, if the data is read from a Spanner table which is also being written at the same time, some of the records get missed while copying to BigQuery.
I'm using readOnlyTransaction() while reading Spanner data. Is there any other precaution that I must take while doing this activity?
It is recommended to use Cloud Spanner commit timestamps to populate columns like update_date. Commit timestamps allow applications to determine the exact ordering of mutations.
By using commit timestamps for update_date and specifying an exact-timestamp read, the Dataflow job will be able to find all records written/committed since the previous run.
https://cloud.google.com/spanner/docs/commit-timestamp
https://cloud.google.com/spanner/docs/timestamp-bounds
if the data is read from a Spanner table which is also being written at the same time, some of the records get missed while copying to BigQuery
This is how transactions work. They present a 'snapshot view' of the database at the time the transaction was created, so any rows written after this snapshot is taken will not be included.
As @rose-liu mentioned, using commit timestamps on your rows and keeping track of the timestamp when you last exported (available from the ReadOnlyTransaction object) will allow you to accurately select 'new/updated rows since the last export'.
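A rough sketch of that pattern with the Python Spanner client (instance, database, table, and column names are hypothetical):

import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")  # hypothetical

# Timestamp recorded at the end of the previous export run.
last_export_ts = datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc)

# database.snapshot() gives a read-only transaction; a read_timestamp argument
# can also be passed here for an exact-timestamp read.
with database.snapshot() as snapshot:
    # update_date is assumed to be a commit-timestamp column
    # (allow_commit_timestamp = true), so this returns every row
    # committed since the previous run.
    results = snapshot.execute_sql(
        "SELECT id, update_date FROM my_table WHERE update_date > @last_export",
        params={"last_export": last_export_ts},
        param_types={"last_export": spanner.param_types.TIMESTAMP},
    )
    for row in results:
        print(row)

# Persist the timestamp this run used so the next run can start from it.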