Google Dataflow design - google-cloud-platform

We need your guidance on the Dataflow design for the scenario below.
Requirement:
We need to build a Dataflow job that reads from an MS SQL database and writes to BigQuery.
We need the Dataflow job to take as input “the list of table names” (source and target table names) that it should read from and write to.
Question:
On a daily schedule, would it be possible for a single Dataflow job to take the list of tables (e.g. 50 table names) as input and copy data from source to target, or should this be designed as 50 independent Dataflow jobs?
Would Dataflow automatically adjust the number of workers without bringing down the source MS SQL server?
Key Info:
Source: MS SQL database
Target: BigQuery
No. of tables: 50
Schedule: Every day, say 8 AM
Write Disposition: Write Truncate (or Write Append)

You have to create a Dataflow template to be able to trigger it on a schedule. In that template, you have to define an input parameter in which you can pass your table list.
Then, in the same Dataflow job, you can have 50 independent pipeline branches, each reading one table and sinking the data into BigQuery. You can't run 50 Dataflow jobs in parallel because of quotas (a limit of 25 concurrent jobs per project). In addition, it would be less cost-efficient.
Indeed, Dataflow is able to run different branches in parallel on the same workers (in different threads) and to scale the cluster size up and down according to the workload.
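For illustration, here is a minimal sketch of such a job in Python (Apache Beam). It assumes a Flex Template so that the table list is a normal option available when the pipeline graph is built; the JDBC driver, connection string, credentials and table names are placeholders, and depending on your Beam version you may need to supply the SQL Server JDBC driver jar (for example through ReadFromJdbc's classpath argument).

# Minimal sketch (not production code): one Dataflow job copying several
# MS SQL Server tables to BigQuery. All connection details are placeholders.
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions


class CopyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Comma-separated "source:target" pairs, e.g.
        # "dbo.orders:my_dataset.orders,dbo.items:my_dataset.items"
        parser.add_argument('--table_list', required=True)


def run():
    options = CopyOptions()
    pairs = [t.split(':') for t in options.table_list.split(',')]

    with beam.Pipeline(options=options) as p:
        for source, target in pairs:
            # One independent branch per table, all inside the same job.
            (p
             | f'Read {source}' >> ReadFromJdbc(
                 table_name=source,
                 driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
                 jdbc_url='jdbc:sqlserver://HOST;databaseName=DB',  # placeholder
                 username='USER',       # placeholder
                 password='PASSWORD')   # placeholder
             # Convert the schema'd rows to dicts for the BigQuery sink.
             | f'ToDict {source}' >> beam.Map(lambda row: row._asdict())
             | f'Write {target}' >> beam.io.WriteToBigQuery(
                 target,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                 # Assumes the target tables already exist in BigQuery.
                 create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == '__main__':
    run()

To keep the source SQL Server healthy, you can also cap the autoscaling with --max_num_workers so that Dataflow never opens more JDBC connections than the database can handle.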

Related

Process data from BigQuery using Dataflow

I want to retrieve data from BigQuery that arrives every hour, do some processing, and push the newly calculated variables into a new BigQuery table. The thing is that I've never worked with GCP before and now I have to for my job.
I already have my code in Python to process the data, but it only works with a "static" dataset.
As your source and sink are both BigQuery, I would recommend you do your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query, follow these steps:
In the BigQuery console, write your query
Once the query is correct, click Schedule query and then Create new scheduled query
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the time and date when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL, not scripting, in your transformations.
Click on Schedule
After that, your query will be executed according to your schedule and destination table configuration.
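The same scheduled query can also be created programmatically. A rough sketch with the BigQuery Data Transfer Service Python client follows; the project, dataset, table and query are placeholders.

# Rough sketch: creating the scheduled query with the BigQuery Data Transfer
# Service client instead of the console. All names below are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",     # where the results are saved
    display_name="hourly processing",
    data_source_id="scheduled_query",
    schedule="every 2 hours",                # same syntax as the Custom schedule field
    params={
        "query": "SELECT * FROM `my-project.my_dataset.raw_table`",
        "destination_table_name_template": "processed_table",
        "write_disposition": "WRITE_TRUNCATE",
    },
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)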
As per Google's recommendation, when your data is already in BigQuery and you want to transform it and store it back in BigQuery, it's always quicker and cheaper to do this in BigQuery, provided you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, express everything directly in SQL, you can create User Defined Functions (UDFs) in BigQuery in JavaScript.
EDIT
If you have no information about when the data is updated in BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if that data arrives through Pub/Sub. If not, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries are the right solution if you do your processing in BigQuery.

Use case for Dataflow (small SQL queries)

We're using Cloud Functions to transform our data in BigQuery:
- all data is in BigQuery
- to transform the data, we only use SQL queries in BigQuery (see the sketch after this list)
- each query runs once a day
- our biggest SQL query runs for about 2 to 3 minutes, but most queries run for less than 30 seconds
- we have about 50 queries executed once a day, and this number is increasing
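For illustration, here is a minimal sketch of what one of these functions looks like; the project, dataset, table and query are placeholders, not our real ones.

# Minimal sketch of one of our Cloud Functions: it just runs a SQL statement
# in BigQuery and writes the result to a destination table. It is triggered
# once a day (e.g. by Cloud Scheduler over HTTP). All names are placeholders.
from google.cloud import bigquery


def transform_daily(request):
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination="my-project.reporting.daily_summary",  # placeholder table
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    query = """
        SELECT customer_id, SUM(amount) AS total_amount
        FROM `my-project.raw.transactions`
        WHERE DATE(created_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
        GROUP BY customer_id
    """
    job = client.query(query, job_config=job_config)
    job.result()  # wait for completion; most of our queries finish in < 30 s
    return "ok"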
We tried at first to do the same thing (SQL queries in BigQuery) with Dataflow, but:
- it took about 10 to 15 minutes just to start Dataflow
- it is more complicated to code than our Cloud Functions
- at that time, Dataflow SQL was not implemented
Every time we talk with someone using GCP (users, trainers or auditors), they recommend using Dataflow.
So did we miss something "magic" with Dataflow for our use case? Is there a way to make it start in seconds rather than minutes?
Also, if we use streaming in Dataflow, how are the costs calculated? I understand that in batch we pay for what we use, but what about streaming? Is it billed as a full-time running service?
Thanks for your help
For the first part, BigQuery vs. Dataflow, I discussed this with Google a few weeks ago and their advice is clear:
When you can express your transformation in SQL, and you can reach your data with BigQuery (external tables), it's always quicker and cheaper with BigQuery, even if the query is complex.
For all other use cases, Dataflow is the recommended option:
For real-time processing (when you truly need real time, with metrics computed on the fly using windowing)
When you need to call external APIs (ML, external services, ...)
When you need to sink into something other than BigQuery (Firestore, Bigtable, Cloud SQL, ...) or read from a source not reachable by BigQuery
And yes, Dataflow takes about 3 minutes to start and another 3 minutes to stop. That's long, and you pay for that idle time.
For batch, as for streaming, you simply pay for the number (and size) of the Compute Engine instances used by your pipeline. Dataflow scales automatically within the boundaries that you provide. Streaming pipelines don't scale to 0: even if there are no messages in Pub/Sub, you still have at least 1 VM up, and you pay for it.
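For reference, those scaling boundaries are just pipeline options. A minimal sketch in Python, with project, region and bucket as placeholders:

# Minimal sketch: capping Dataflow autoscaling through pipeline options.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                  # placeholder
    "--region=europe-west1",                 # placeholder
    "--temp_location=gs://my-bucket/tmp",    # placeholder
    "--autoscaling_algorithm=THROUGHPUT_BASED",
    "--max_num_workers=5",                   # Dataflow scales between 1 and 5 workers
])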

How best to cache a BigQuery table for fast lookup of individual rows?

I have a raw data table in BigQuery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations, which results in a table in the ballpark of 33 million rows (6 GB) that can be expected to grow slowly to approximately double its current size.
I need a way to do quick single-row lookups by id against that aggregate table from a separate, event-driven pipeline. I.e., a process is notified that person A just took an action; what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offload it to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple of strategies:
1) Schedule a dump of the agg table to GCS. Kick off a Dataflow job to stream the contents of the GCS dump to Pub/Sub. Create a serverless function to listen to the Pub/Sub topic and insert rows into Firestore.
2) A long-running script on Compute Engine that just streams the table directly from BQ and runs the inserts. (Seems slower than strategy 1.)
3) Schedule a dump of the agg table to GCS. Format it in such a way that it can be directly imported into Firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of Dataflow job that performs lookups directly against the BigQuery table? I haven't played with this approach before. No idea how costly / performant it would be.
5) Some other option I've not considered?
The ideal solution would give me millisecond access to an agg row so I can append that data to the real-time event.
Is there a clear winner here in the strategy I should pursue?
Remember that you could also CLUSTER your table by id, making your lookup queries much faster and cheaper in terms of data scanned. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
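For example, a clustered copy of the aggregate table can be built with a simple DDL statement run from Python; the project, dataset, table and column names are placeholders.

# Sketch: rebuild the aggregate table clustered by the lookup id, so that
# point lookups scan far less data. All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE OR REPLACE TABLE `my-project.agg.person_history_clustered`
    CLUSTER BY person_id AS
    SELECT * FROM `my-project.agg.person_history`
""").result()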
You could also set up exports from BigQuery to CloudSQL, for subsecond results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, now BigQuery can read straight out of CloudSQL if you'd like it to be your source of truth for "hot-data":
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229

Google Dataflow job monitor

I am writing an app to monitor and view Google dataflow jobs.
To get the metadata about google dataflow jobs, I am exploring the REST APIs listed here :
https://developers.google.com/apis-explorer/#search/dataflow/dataflow/v1b3/
I was wondering if there are any APIs that could do the following :
1) Get the job details if we provide a list of job Ids (there is an API for one individual job ID, but I wanted the same for a list of Ids)
2) Search or filter jobs on the basis of job name, or, for that matter, filter jobs by any criteria other than the job state.
3) Get log messages associated with a Dataflow job
4) Get the records of "all" jobs, from the beginning of time. The current APIs seem to return only jobs from the last 30 days.
Any help would be greatly appreciated. Thank You
There is additional documentation about the Dataflow REST API at: https://cloud.google.com/dataflow/docs/reference/rest/
Addressing each of your questions separately:
1) Get the job details if we provide a list of job Ids (there is an API for one individual job ID, but I wanted the same for a list of Ids)
No, there is no batch method for a list of jobs. You'll need to query them individually with projects.jobs.get.
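If it helps, here is a small sketch of that per-job loop using the Google API client library; the project ID and job IDs are placeholders.

# Sketch: fetch details for a list of job IDs one by one via projects.jobs.get.
# Project ID and job IDs are placeholders.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
project_id = "my-project"
job_ids = ["2017-01-01_00_00_00-1234567890123456789",
           "2017-01-02_00_00_00-9876543210987654321"]

jobs = [
    dataflow.projects().jobs().get(projectId=project_id, jobId=job_id).execute()
    for job_id in job_ids
]
for job in jobs:
    print(job["id"], job["name"], job["currentState"])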
2) Search or filter jobs on the basis of job name, or, for that matter, filter jobs by any criteria other than the job state.
The only other filter currently available is location.
3)Get log messages associated with a dataflow job
In Dataflow there are two types of log messages:
"Job Logs" are generated by the Dataflow service and provide high-level information about the overall job execution. These are available via the projects.jobs.messages.list API.
There are also "Worker Logs" written by the SDK and user code running in the pipeline. These are generated on the distributed VMs associated with a pipeline and ingested into Stackdriver. They can be queried via the Stackdriver Logging entries.list API by including in your filter:
resource.type="dataflow_step"
resource.labels.job_id="<YOUR JOB ID>"
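For example, those worker logs can be pulled with the Cloud Logging Python client using that filter; the job ID is a placeholder.

# Sketch: list the worker log entries of one Dataflow job. Job ID is a placeholder.
from google.cloud import logging

client = logging.Client()
log_filter = (
    'resource.type="dataflow_step" '
    'resource.labels.job_id="2017-01-01_00_00_00-1234567890123456789"'
)
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)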
4) Get the records of "all" jobs, from the beginning of time. The current APIs seem to return only jobs from the last 30 days.
Dataflow jobs are only retained by the service for 30 days. Older jobs are deleted and thus not available in the UI or APIs.
In our case, we implemented such functionality by tracking the job stages and using schedulers/cron jobs to report the details of the running jobs to a file. This file, stored in a bucket, is watched by a job of ours that feeds all the statuses back to our application.

Does BigQuery maintain concurrency?

Let's say I have multiple jobs that are updating/loading the same table. Under the semaphore concept, if one process is loading data into the table, the other processes wait until the resource for that table is free. I would like to know: is there any such semaphore concept for loading data into a BigQuery table using Dataflow? If yes, how should I handle this scenario for BigQuery table loads using Dataflow?
I don't believe that Dataflow has any knowledge of the table's activity; it just sends the requested update as a job to BigQuery.
BigQuery receives the job and then sends it to a queue for the given table, so the whole "semaphore concept" is handled internally by BigQuery for that table.
So, for example, imagine that you are running three queries in parallel that update a table: two of them run via Dataflow and the other via a script.
All three go to the same queue, and BigQuery processes them one by one (each one after the previous has completed), in the order in which they arrived.