Does BigQuery maintain concurrency? - google-cloud-platform

Let's say I have multiple jobs that update/load the same table. Under the semaphore concept, if one process is loading data into the table, the other processes wait until the resource for that table is free. I would like to know: is there any such semaphore concept for loading data into a BigQuery table using Dataflow? If yes, how should such a scenario be handled for BigQuery table loads using Dataflow?

I don't believe Dataflow has any knowledge of table activity; it just sends the requested update to BigQuery as a job.
BigQuery receives the job and places it in a queue for the given table, so the whole "semaphore concept" is handled internally by BigQuery for that table.
For example, imagine you run three queries in parallel that update the same table, two via Dataflow and one via a script.
All three go into the same queue, and BigQuery processes them one by one (each after the previous one completes), in the order they arrived.
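For illustration, here is a minimal sketch of that behaviour using the google-cloud-bigquery Python client rather than Dataflow (the table name and GCS URIs are placeholders, and the target table is assumed to already exist): two load jobs are submitted against the same table at the same time, nothing on the caller's side coordinates them, and BigQuery applies them one after the other.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder target table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Submit two load jobs against the same table "in parallel"; the client only
# hands them to BigQuery, which queues and serializes the writes internally.
job_a = client.load_table_from_uri("gs://my-bucket/batch_a.csv", table_id, job_config=job_config)
job_b = client.load_table_from_uri("gs://my-bucket/batch_b.csv", table_id, job_config=job_config)

# Both complete; no explicit semaphore is needed on the caller's side.
job_a.result()
job_b.result()
print("rows now in table:", client.get_table(table_id).num_rows)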

Google Dataflow design

We need your guidance on the dataflow design for the below scenario.
Requirement:
We need to build a Dataflow job to read from an MS SQL database and write to BigQuery.
We need the Dataflow job to take as input "the list of table names" (source and target table names) to read from and write to.
Question:
On a daily schedule, would it be possible for a single Dataflow job to take the list of tables (e.g. 50 table names) as input and copy the data from source to target, or should this be designed as 50 independent Dataflow jobs?
Would Dataflow automatically adjust the number of workers without bringing down the source MS SQL server?
Key Info:
Source: MS SQL database
Target: Bigquery
No of Tables: 50
Schedule: Every day, say 8 am
Write Disposition: Write Truncate (or Write Append)
You have to create a Dataflow template to be able to trigger it on a schedule. In that template, you have to define an input variable in which you can put your table list.
Then, in the same Dataflow job, you can have 50 independent pipelines, each reading one table and sinking the data into BigQuery. You can't run 50 Dataflow jobs in parallel because of quotas (a limit of 25 concurrent jobs per project). In addition, it would be less cost efficient.
Indeed, Dataflow can run different pipelines in parallel on the same worker (in different threads) and scale the cluster size up and down according to the workload requirements.
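A rough Apache Beam (Python) sketch of that shape, assuming a Flex-Template-style job where the table list arrives as a plain pipeline option at launch time; the JDBC connection settings, project, and dataset names below are placeholders:

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions


class CopyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Comma-separated list of source tables, supplied when the template is launched.
        parser.add_argument("--tables", default="table_a,table_b")


options = PipelineOptions()
tables = options.view_as(CopyOptions).tables.split(",")

with beam.Pipeline(options=options) as p:
    for table in tables:
        # One independent branch per table, all inside the same Dataflow job.
        (
            p
            | f"read_{table}" >> ReadFromJdbc(
                table_name=table,
                driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
                jdbc_url="jdbc:sqlserver://10.0.0.5:1433;databaseName=sourcedb",  # placeholder
                username="sql_user",
                password="sql_password",
            )
            # The JDBC read yields NamedTuple-style rows; convert them to dicts
            # for the BigQuery sink.
            | f"to_dict_{table}" >> beam.Map(lambda row: row._asdict())
            | f"write_{table}" >> beam.io.WriteToBigQuery(
                table=f"my-project:my_dataset.{table}",  # placeholder target
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )

When launched, the 50 branches fan out across the autoscaled workers, so you have one job to monitor instead of 50.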

Run update on multiple tables in BigQuery

I have a lake dataset that takes data from an OLTP system. Given the nature of the transactions, we get a lot of updates the next day, so to keep track of the latest record we use active_flag = '1'.
We also created an update script that retires old records by setting active_flag = '0'.
Now the main question: how can I execute an update statement while changing the table name automatically (programmatically)?
I know we have the option of using Cloud Functions, but it times out after 9 minutes and I have at least 350 tables to update.
Has anyone faced this situation before?
You can easily do this with Cloud Workflows.
There you set up the templated call to BigQuery as a substep, then you pass in a list of tables, loop through the items, and invoke the BigQuery step for each item/table.
I wrote an article with samples that you can adapt: Automate the execution of BigQuery queries with Cloud Workflows
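For comparison only (this is not Cloud Workflows syntax), here is the same loop-over-tables idea sketched in Python with the BigQuery client; the table list and the retirement criteria (id, load_time) are made-up placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder list; in practice it could come from INFORMATION_SCHEMA.TABLES
# or be passed in as a parameter, exactly as in the Workflows version.
tables = ["my_dataset.orders", "my_dataset.customers"]  # ... up to 350 tables

for table in tables:
    # Same UPDATE statement, different table name on each iteration.
    sql = f"""
        UPDATE `{table}` t
        SET active_flag = '0'
        WHERE t.active_flag = '1'
          AND EXISTS (
            SELECT 1
            FROM `{table}` newer
            WHERE newer.id = t.id
              AND newer.load_time > t.load_time
          )
    """
    client.query(sql).result()  # runs the updates one after the other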

Process data from BigQuery using Dataflow

I want to retrieve data from BigQuery that arrives every hour, do some processing, and write the newly calculated variables into a new BigQuery table. The thing is that I've never worked with GCP before, and now I have to for my job.
I already have my code in Python to process the data, but it works only with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query follow the steps:
In BigQuery Console, write your query
After writing the correct query, click Schedule query and then Create new scheduled query
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the time and date when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that, your query will start being executed according to your schedule and destination table configuration.
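If you'd rather set the same schedule up programmatically than through the console, a hedged sketch with the BigQuery Data Transfer Service Python client looks like this (project, dataset, query, and schedule are placeholders):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-project")  # placeholder project

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",  # dataset where results are saved
    display_name="hourly transform",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT id, SUM(amount) AS total FROM `my_dataset.raw_events` GROUP BY id",
        "destination_table_name_template": "aggregated_events",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 2 hours",  # same format as the Custom schedule option
)

transfer_config = client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print("created scheduled query:", transfer_config.name)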
According to Google's recommendation, when your data are in BigQuery and you want to transform them and store them back in BigQuery, it's always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, express the processing directly in SQL, you can create a User Defined Function (UDF) in BigQuery in JavaScript.
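To illustrate that last option, the sketch below (placeholder dataset, table, and logic) creates a persistent JavaScript UDF once with the Python client and then calls it from plain SQL, which a scheduled query can also do:

from google.cloud import bigquery

client = bigquery.Client()

# One-off: create a persistent JavaScript UDF (names and logic are placeholders).
client.query("""
CREATE OR REPLACE FUNCTION `my_dataset.normalize_score`(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS '''
  return x > 100 ? 100 : x;
''';
""").result()

# The scheduled query itself can then stay plain SQL (no scripting) and
# still use the JavaScript logic.
for row in client.query("""
SELECT id, `my_dataset.normalize_score`(score) AS score
FROM `my_dataset.raw_scores`
""").result():
    print(row.id, row.score)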
EDIT
If you have no information about when the data are loaded into BigQuery, Dataflow won't help you here. Dataflow can process real-time data only if those data arrive through Pub/Sub. If not, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.

How best to cache a BigQuery table for fast lookup of individual rows?

I have a raw data table in BigQuery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations that result in a table in the ballpark of 33 million rows (6 GB), which may be expected to grow slowly to approximately double its current size.
I need a way to get quick single-row lookups by id against that aggregate table from a separate event-driven pipeline. I.e. a process is notified that person A just took an action; what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offload it to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple strategies:
1) Schedule a dump of the agg table to GCS. Kick off a Dataflow job to stream the contents of the GCS dump to Pub/Sub. Create a serverless function to listen to the Pub/Sub topic and insert rows into Firestore.
2) A long-running script on Compute Engine which just streams the table directly from BQ and runs inserts. (Seems slower than strategy 1)
3) Schedule a dump of the agg table to GCS. Format it in such a way that it can be directly imported into Firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of dataflow job that performs lookups directly against the bigquery table? Not played with this approach before. No idea how costly / performant.
5) Some other option I've not considered?
The ideal solution would allow me quick access in milliseconds to an agg row which would allow me to append data to the real time event.
Is there a clear best winner here in the strategy I should pursue?
Remember that you could also CLUSTER your table by id - making your lookup queries much faster and less data-consuming. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
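As a hedged example of what that looks like with the Python client (project, table, and column names are placeholders), the aggregate table can be rebuilt clustered on the lookup id, and point lookups then scan only the relevant blocks:

from google.cloud import bigquery

client = bigquery.Client()

# Rebuild the aggregate table clustered by the lookup id (placeholder names).
client.query("""
CREATE OR REPLACE TABLE `my_dataset.agg_by_person`
CLUSTER BY person_id AS
SELECT person_id, COUNT(*) AS actions, MAX(event_ts) AS last_action
FROM `my_dataset.raw_events`
GROUP BY person_id
""").result()

# A point lookup now only reads the blocks for that cluster.
rows = client.query(
    "SELECT * FROM `my_dataset.agg_by_person` WHERE person_id = @pid",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("pid", "STRING", "person_123")]
    ),
).result()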
You could also set up exports from BigQuery to CloudSQL, for subsecond results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, now BigQuery can read straight out of CloudSQL if you'd like it to be your source of truth for "hot-data":
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
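A hedged sketch of one way to wire such an export up (bucket, instance, database, and table names are placeholders): extract the aggregate table to CSV in GCS with the Python client, then import those files into Cloud SQL:

from google.cloud import bigquery

client = bigquery.Client()

# Export the aggregate table to GCS as CSV (placeholder names throughout).
extract_job = client.extract_table(
    "my-project.my_dataset.agg_by_person",
    "gs://my-export-bucket/agg_by_person/part-*.csv",
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.CSV,
        print_header=False,
    ),
)
extract_job.result()

# The CSV files can then be loaded into a Cloud SQL (MySQL) table for
# sub-second primary-key lookups, e.g.:
#   gcloud sql import csv my-instance \
#       gs://my-export-bucket/agg_by_person/part-000000000000.csv \
#       --database=serving --table=agg_by_person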

Redshift performance: SQL queries vs table normalization

I'm working on building a Redshift database by listening to events from different sources and pumping that data into a Redshift cluster.
The idea is to use Kinesis Firehose to pump data to Redshift using the COPY command. But I have a dilemma here: I wish to first query some information from Redshift using a select query such as the one below:
select A, B, C from redshift__table where D='x' and E = 'y';
After getting the required information from redshift, I will combine that information with my event notification data and issue a request to kinesis. Kinesis will then do its job and issue the required COPY command.
Now my question is: is it a good idea to repeatedly query Redshift, say every second, since that is the expected interval at which I will get event notifications?
Now let me describe an alternate scenario:
If I normalize my table and separate out some fields into a separate table, then I will have to perform fewer Redshift queries with the normalized design (maybe once every 30 seconds).
But the downside of this approach is that once I have the data in Redshift, I will have to carry out table joins while performing real-time analytics on my Redshift data.
So I wish to know on a high level which approach would be better:
Have a single flat table but query it before issuing a request to Kinesis on an event notification. There won't be any table joins while performing analytics.
Have 2 tables and query redshift less often. But perform a table join while displaying results using BI/analytical tools.
Which of these 2 do you think is a better option? Let us assume that I will use appropriate sort keys/distribution keys in either case.
I'd definitely go with your second option, which involves JOINing with queries. That's what Amazon Redshift is good at doing (especially if you have your SORTKEY and DISTKEY set correctly).
Let the streaming data come into Redshift in the most efficient manner possible, then join when doing queries. You'll have a lot fewer queries that way.
Alternatively, you could run a regular job (e.g. hourly) to batch process the data into a wide table. It depends how quickly you'll need to query the data after loading.
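A rough sketch of that hourly batch job, using psycopg2 against the cluster; the endpoint, credentials, and table/column names are all placeholders:

import os
import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password=os.environ["REDSHIFT_PASSWORD"],
)

# Rebuild the wide table from the normalized event and lookup tables.
FLATTEN_SQL = """
TRUNCATE wide_events;
INSERT INTO wide_events (event_id, a, b, c, d, e)
SELECT e.id, e.a, e.b, e.c, l.d, l.e
FROM events e
JOIN lookups l ON l.event_id = e.id;
"""

with conn.cursor() as cur:
    cur.execute(FLATTEN_SQL)
conn.commit()
conn.close()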