We have a scenario where we need BigQueryIO.Read later in the pipeline (not at PBegin). Is there a way to implement this?
We are trying to run a sequence of steps where we load a PCollection to a BigQuery table and then, after the load, fetch data from that table (with some filters) for the next steps. We are able to do this with multiple pipelines, each having BigQueryIO.Read at its start. However, it would be easier for our batch control if we could have it in a single Dataflow pipeline (loading the entire BigQuery tables up front and working completely off that PCollection is expensive).
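One rough sketch of a possible workaround, assuming the Beam Python SDK and hypothetical project, table, and column names: instead of a mid-pipeline BigQueryIO.Read, query BigQuery from inside a DoFn with the client library, making sure the input PCollection of filter values is only emitted after the load has completed.

import apache_beam as beam
from google.cloud import bigquery

class ReadFilteredRows(beam.DoFn):
    """Runs a filtered BigQuery query from inside the pipeline (not at PBegin)."""

    def setup(self):
        self.client = bigquery.Client()

    def process(self, filter_value):
        # Hypothetical table and column; the filter value comes from the upstream step.
        query = (
            "SELECT * FROM `my_project.my_dataset.loaded_table` "
            "WHERE category = @category"
        )
        job_config = bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("category", "STRING", filter_value)
            ]
        )
        for row in self.client.query(query, job_config=job_config):
            yield dict(row)

# Usage: filtered = filter_values | beam.ParDo(ReadFilteredRows())
# where filter_values is only produced once the load into loaded_table is done.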
Related
I have a GCP DataFlow pipeline configured with a select SQL query that selects specific rows from a Postgres table and then inserts these rows automatically into the BigQuery dataset. This pipeline is configured to run daily at 12am UTC.
When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows again into the BigQuery table, hence resulting in data duplication.
I wanted to know if there is a way to truncate the BigQuery dataset table before the pipeline runs. It seems like a common problem, so I'm looking for an easy solution that doesn't require a custom DataFlow template.
BigQueryIO has an option called WriteDisposition, where you can use WRITE_TRUNCATE.
According to the BigQueryIO documentation, WRITE_TRUNCATE means:
Specifies that write should replace a table.
The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
If your use case cannot afford the table being unavailable during the operation, a common pattern is writing the data to a secondary / staging table and then using an atomic operation on BigQuery to replace the original table (e.g., using CREATE OR REPLACE TABLE).
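For illustration, here is a minimal sketch of that staging pattern using the Beam Python SDK and the BigQuery client library; the project, dataset, table names, and schema are placeholders, and beam.Create stands in for the real Postgres read.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery

def run():
    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        (
            pipeline
            # Stand-in for the real Postgres/JDBC read.
            | "ReadSource" >> beam.Create([{"id": 1, "name": "example"}])
            # WRITE_TRUNCATE replaces whatever was in the staging table.
            | "WriteStaging" >> beam.io.WriteToBigQuery(
                "my_project:my_dataset.daily_rows_staging",
                schema="id:INTEGER,name:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

    # Once the load has finished, atomically replace the serving table so
    # readers never see it half-written.
    bigquery.Client().query(
        "CREATE OR REPLACE TABLE `my_project.my_dataset.daily_rows` AS "
        "SELECT * FROM `my_project.my_dataset.daily_rows_staging`"
    ).result()

if __name__ == "__main__":
    run()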
I have a JDBC connection to an RDS instance and a crawler set up to populate the Data Catalog.
What's the best practice when setting up scheduled runs in order to avoid duplicates and still make the runs as efficient as possible? The ETL job writes its output to S3. The data will then be visualized in QuickSight using Athena or possibly a direct S3 connection; I'm not sure which one is preferable. In the ETL job script (PySpark), different tables are joined and new columns are calculated before the final data frame/dynamic frame is stored in S3.
First job run: The data looks something like the first screenshot (in real life, with a lot more columns and rows).
Second job run: After some time, when the job is scheduled to run again, the data has changed too (the changes are marked with red boxes in the second screenshot).
Upcoming job runs: After some more time has passed, the job is scheduled to run again, some more changes can be seen, and so on.
What is the recommended setup for an ETL job like this?
Bookmarks: As far as I understand, bookmarks will produce multiple files in S3, which in turn creates duplicates that would have to be cleaned up with another script.
Overwrite: Using the 'overwrite' option for the data frame
df.repartition(1).write.mode('overwrite').parquet("s3a://target/name")
Today I've been using the overwrite method, but it gave me some issues: at some point, when I needed to change the ETL job script and the update changed the data stored in S3 too much, my QuickSight dashboards crashed and could not be repointed to the new data set (built on the new data frame stored in S3), which meant I had to rebuild the dashboards all over again.
Please give me your best tips and tricks for smoothly performing ETL jobs on randomly updating tables in AWS Glue!
We need your guidance on the dataflow design for the below scenario.
Requirement:
We need to build a Dataflow job that reads from an MS SQL database and writes to BigQuery.
We need the Dataflow job to take "the list of table names" (source and target table names) as input, specifying which tables to read data from and which to write it to.
Question:
On a daily schedule, would it be possible for a single Dataflow job to take the list of tables (i.e. 50 table names) as input and copy data from source to target, or should this be designed as 50 independent Dataflow jobs?
Would Dataflow automatically adjust the number of workers without bringing down the source MS SQL server?
Key Info:
Source: MS SQL database
Target: Bigquery
No of Table: 50
Schedule: Every day, say 8 am
Write Disposition: Write Truncate (or Write Append)
You have to create a Dataflow template to be able to trigger it on a schedule. In that template, you have to define an input parameter that carries your table list.
Then, in the same Dataflow job, you can have 50 independent pipelines (branches), each reading one table and sinking the data into BigQuery. You can't run 50 Dataflow jobs in parallel because of quotas (a limit of 25 concurrent jobs per project). In addition, it would be less cost-efficient.
Indeed, Dataflow is able to run different branches in parallel on the same worker (in different threads) and to scale the cluster size up and down according to the workload.
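A rough sketch of that single-job, multi-branch layout, assuming the Beam Python SDK with its cross-language JDBC connector (apache_beam.io.jdbc.ReadFromJdbc); the JDBC URL, credentials, project, dataset and the --tables option are all placeholders.

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

class CopyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Comma-separated table list supplied when the template is triggered.
        parser.add_argument("--tables", default="customers,orders")

def run():
    options = CopyOptions()
    with beam.Pipeline(options=options) as pipeline:
        for table in options.tables.split(","):
            # One independent branch per table, all inside the same job.
            (
                pipeline
                | f"Read_{table}" >> ReadFromJdbc(
                    table_name=table,
                    driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
                    jdbc_url="jdbc:sqlserver://<host>:1433;databaseName=<db>",
                    username="<user>",
                    password="<password>",
                )
                | f"ToDict_{table}" >> beam.Map(lambda row: row._asdict())
                | f"Write_{table}" >> beam.io.WriteToBigQuery(
                    f"my_project:my_dataset.{table}",
                    # Assumes the target tables already exist in BigQuery.
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                )
            )

if __name__ == "__main__":
    run()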
I want to retrieve data from BigQuery that arrives every hour, do some processing, and write the newly calculated variables to a new BigQuery table. The thing is that I've never worked with GCP before, and now I have to for my job.
I already have my code in Python to process the data, but it only works with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query, follow these steps:
In the BigQuery console, write your query
Once the query returns what you expect, click Schedule query and then Create new scheduled query
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the time and date when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
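If you prefer to set this up in code rather than in the console, here is a minimal sketch using the BigQuery Data Transfer client library; the project, dataset, table, and query are placeholders.

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",         # where the results are saved
    display_name="hourly_calculated_variables",  # hypothetical name
    data_source_id="scheduled_query",
    schedule="every 2 hours",                    # same syntax as the Custom schedule in the UI
    params={
        "query": "SELECT id, value * 2 AS doubled FROM `my_project.my_dataset.source_table`",
        "destination_table_name_template": "calculated_variables",
        "write_disposition": "WRITE_APPEND",
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my_project"),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")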
According to Google's recommendations, when your data are already in BigQuery and you want to transform them and store the result back in BigQuery, it's always quicker and cheaper to do it in BigQuery, as long as you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, express the logic directly in SQL, you can create a User Defined Function (UDF) in BigQuery, written in JavaScript.
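For example, a transformation with a JavaScript UDF could be run from Python in one multi-statement query; the table names and the function body here are only placeholders.

from google.cloud import bigquery

sql = """
CREATE TEMP FUNCTION normalize_score(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS '''
  return x / 100.0;
''';

CREATE OR REPLACE TABLE `my_project.my_dataset.calculated_variables` AS
SELECT id, normalize_score(raw_score) AS score
FROM `my_project.my_dataset.source_table`;
"""

# Runs the UDF and the transformation entirely inside BigQuery.
bigquery.Client().query(sql).result()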
EDIT
If you have no information about when the data are loaded into BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if the data arrive through Pub/Sub. Otherwise, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.
My team is attempting to use Redshift to consolidate information from several different databases. In our first attempt to implement this solution, we used Kinesis Firehose to write records of POSTs to our APIs to S3, then issued a COPY command to load the inserted data into the correct tables in Redshift. However, this only allowed us to insert new data and did not let us transform data, update rows when they were altered, or delete rows.
What is the best way to maintain an updated data warehouse in Redshift without using batch transformation? Ideally, we would like updates to occur "automatically" (< 5min) whenever data is altered in our local databases.
Neither Firehose nor Redshift has triggers; however, you could potentially use Lambda with Firehose to pre-process the data before it gets inserted, as described here: https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda
In your case, you could extend it by triggering Lambda from S3 as Firehose creates new files; the Lambda would then execute the COPY and/or the SQL updates.
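A minimal sketch of that idea, assuming a Lambda triggered by S3 object-created events and the Redshift Data API; the cluster, IAM role, schema, and table names are placeholders and error handling is omitted.

import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Load the new Firehose file; any follow-up SQL updates could run here too.
        copy_sql = (
            f"COPY staging.api_posts FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
            "FORMAT AS JSON 'auto';"
        )
        redshift_data.execute_statement(
            ClusterIdentifier="my-cluster",
            Database="warehouse",
            DbUser="etl_user",
            Sql=copy_sql,
        )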
Another alternative is just writing your own KCL client that implements what Firehose does, and then executing the required updates after COPYing micro-batches (500-1000 rows).
I've done such an implementation (we needed to update old records based on new records) and it works alright from a consistency point of view, though I'd advise against such an architecture in general due to Redshift's poor performance with regard to updates. Based on my experience, the key rule is that Redshift data is append-only, and it is often faster to use filters to exclude unnecessary rows at query time (with optional regular pruning, e.g. daily) than to delete/update those rows in real time.
Yet another alternative is to have Firehose dump data into staging table(s), and then have scheduled jobs take whatever is in those tables, do the processing, move the data, and rotate the tables.
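A minimal sketch of that staging approach, again using the Redshift Data API (the schema and table names are placeholders): a scheduled job merges the staged rows into the target and then clears the staging table.

import boto3

redshift_data = boto3.client("redshift-data")

# batch_execute_statement runs the statements as a single transaction.
MERGE_SQL = [
    "DELETE FROM analytics.events USING staging.events "
    "WHERE analytics.events.id = staging.events.id;",
    "INSERT INTO analytics.events SELECT * FROM staging.events;",
    "DELETE FROM staging.events;",
]

def run_merge():
    redshift_data.batch_execute_statement(
        ClusterIdentifier="my-cluster",
        Database="warehouse",
        DbUser="etl_user",
        Sqls=MERGE_SQL,
    )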
As a general reference architecture for real-time inserts into Redshift, take a look at this: https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
This has been implemented multiple times, and works well.