BigQueryIO - only first day table can be created, despite having CreateDisposition.CREATE_IF_NEEDED - google-cloud-platform

I have a dataflow job processing data from pub/sub defined like this:
read from pub/sub -> process (my function) -> group into day windows -> write to BQ
I'm using Write.Method.FILE_LOADS because of bounded input.
My job works fine, processing lots of GBs of data but it fails and tries to retry forever when it gets to create another table. The job is meant to run continuously and create day tables on its own, it does fine on the first few ones but then gives me indefinitely:
Processing stuck in step write-bq/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 05h30m00s without outputting or completing in state finish
Before this happens it also throws:
Load job <job_id> failed, will retry: {"errorResult":{"message":"Not found: Table <name_of_table> was not found in location US","reason":"notFound"}
It is indeed a right error because this table doesn't exists. Problem is that the job should create it on its own because of defined option CreateDisposition.CREATE_IF_NEEDED.
The number of day tables that it creates correctly without a problem depens on number of workers. It seems that when some worker creates one table its CreateDisposition changes to CREATE_NEVER causing the problem, but it's only my guess.
The similar problem was reported here but without any definite answer:
https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609
ProcessElement definition here seems to give some clues but I cannot really say how it works with multiple workers: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L138
I use 2.15.0 Apache SDK.

I encountered the same issue, which is still not fixed in BEAM 2.27.0 of january 2021. Therefore I had to develop a workaround: a custom PTransform which checks if the target table exist before the the BigQueryIO stage. It uses the bigquery java client for this and a Guava cache, as well as a windowing strategy (fixed, check every 15s) to sustain a heavy traffic of about 5000 elements per second. Here is the code: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1

There was a bug in the past that caused this error, but that particular one was fixed in commit https://github.com/apache/beam/commit/d6b4dcec5f297f5c1bd08f345f0e1e5c756775c2#diff-3f40fd931c8b8b972772724369cea310 Can you check to see if the version of Beam you are running includes this commit?

Related

Writing SSM ParameterStore update once when many are trying from Lambda?

We've inherited an application that runs on Lambda. On initialization, this App reads a configuration stored on ParameterStore and if a certain value is out of date, then that value is updated and the app can continue.
The problem is that many users (about 7 concurrent at any one time) can run this app, and they could hit the same update nearly at the same time and therefore cause ParameterStore to throw a PutParameter throttle-back error.
To avoid that and to update the ParameterStore value only once per day rather than on every run, we've devised a simple solution, and we're wanting to get some advice on whether it makes sense or not...
Here are the steps we're thinking:
Random sleep between 1 and 10 seconds ("queue" up calls to minimize clashes -- perhaps this isn't necessary??)
Does an S3 object exist for today: obj_MMDDYYYY (signals that the PStore update has been done today)
If YES then skip updating ParameterStore value
If NO then create S3 obj_MMDDYYYY (today's date), run the Update ParameterStore value
Any advice appreciated.
Thanks!

GCP Dataflow running streaming inserts into BigQuery: GC Thrashing

I am using Apache Beam 2.13.0 with GCP Dataflow runner.
I have a problem with streaming ingest to BigQuery from a batch pipeline:
PCollection<BigQueryInsertError> stageOneErrors =
destinationTableSelected
.apply("Write BQ Attempt 1",
BigQueryIO.<KV<TableDestination, TableRow>>write()
.withMethod(STREAMING_INSERTS)
.to(new KVTableDestination())
.withFormatFunction(new KVTableRow())
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND))
.getFailedInsertsWithErr();
The error:
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
Memory is used/total/max = 15914/18766/18766 MB,
GC last/max = 99.17/99.17 %, #pushbacks=0, gc thrashing=true.
Heap dump not written.
Same code working in the streaming mode correctly (if the with explicit method setting omitted).
The code works on reasonably small datasets (less than 2 million records). Fails on 2,5 million plus.
On the surface it appears to be a similar problem to the one described here: Shutting down JVM after 8 consecutive periods of measured GC thrashing
Creating a separate question to add additional details.
Is there anything I could do to fix this? Looks like the issue is within the BigQueryIO component itself - GroupBy key fails.
The problem with transforms that contain GroupByKey is that it will wait until all the data for the current window has been received before grouping.
In Streaming mode, this is normally fine as the incoming elements are windowed into separate windows, so the GroupByKey only operates on a small(ish) chunk of data.
In Batch mode, however, the current window is the Global Window, meaning that GroupByKey will wait for the entire input dataset to be read and received before the grouping starts to be performed. If the input dataset is large, then your worker will run out of memory, which explains what you are seeing here.
This brings up the question: Why are you using BigQuery Streaming insert when processing Batch data? Streaming inserts are relatively expensive (compared to bulk which is free!) and have smaller quota/limits than Bulk import: even if you work around the issues you are seeing, there may be more issues yet to be discovered in Bigquery itself..
After extensive discussions with the support and the developers it has been communicated that using BigQuery streaming ingress from a batch pipeline is discouraged and currently (as of 2.13.0) not supported.

How to use Apache beam to process Historic Time series data?

I have the Apache Beam model to process multiple time series in real time. Deployed on GCP DataFlow, it combines multiple time series into windows, and calculates the aggregate etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache beam?
I understand that I need to use the windowing property of Apache Beam to calculate the aggregates etc, but it should accept data from 2 years back onwards
Effectively, I need data as would have been available had I deployed the same pipeline 2 years. This is needed for testing/model training purposes
That sounds like a perfect use case of Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as events have timestamps. Without additional context I think you will need to have an explicit step in your pipeline to assign custom timestamps (from 2017) that you will need to extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might need to have to configure allowed timestamp skew if you have the timestamp ordering issues.
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam

First-run of queries are extremely slow

Our Redshift queries are extremely slow during their first execution. Subsequent executions are much faster (e.g., 45 seconds -> 2 seconds). After investigating this problem, the query compilation appears to be the culprit. This is a known issue and is even referenced on the AWS Query Planning And Execution Workflow and Factors Affecting Query Performance pages. Amazon itself is quite tight lipped about how the query cache works (tl;dr it's a magic black box that you shouldn't worry about).
One of the things that we tried was increasing the number of nodes we had, however we didn't expect it to solve anything seeing as how query compilation is a single-node operation anyway. It did not solve anything but it was a fun diversion for a bit.
As noted, this is a known issue, however, anywhere it is discussed online, the only takeaway is either "this is just something you have to live with using Redshift" or "here's a super kludgy workaround that only works part of the time because we don't know how the query cache works".
Is there anything we can do to speed up the compilation process or otherwise deal with this? So far about the best solution that's been found is "pre-run every query you might expect to run in a given day on a schedule" which is....not great, especially given how little we know about how the query cache works.
there are 3 things to consider
The first run of any query causes the query to be "compiled" by
redshift . this can take 2-20 seconds depending on how big it is.
subsequent executions of the same query use the same compiled code,
even if the where clause parameters change there is no re-compile.
Data is measured as marked as "hot" when a query has been run
against it, and is cached in redshift memory. you cannot (reliably) manually
clear this in any way EXCEPT a restart of the cluster.
Redshift will "results cache", depending on your redshift parameters
(enabled by default) redshift will quickly return the same result
for the exact same query, if the underlying data has not changed. if
your query includes current_timestamp or similar, then this will
stop if from caching. This can be turned off with SET enable_result_cache_for_session TO OFF;.
Considering your issue, you may need to run some example queries to pre compile or redesign your queries ( i guess you have some dynamic query building going on that changes the shape of the query a lot).
In my experience, more nodes will increase the compile time. this process happens on the master node not the data nodes, and is made more complex by having more data nodes to consider.
The query is probably not actually running a second time -- rather, Redshift is just returning the same result for the same query.
This can be tested by turning off the cache. Run this command:
SET enable_result_cache_for_session TO OFF;
Then, run the query twice. It should take the same time for each execution.
The result cache is great for repeated queries. Rather than being disappointed that the first execution is 'slow', be happy that subsequent cached queries are 'fast'!

Use a long running database migration script

I'm trialing FluentMigrator as a way of keeping my database schema up to date with minimum effort.
For the release I'm currently building, I need to run a database script to make a simple change to a large number of rows of existing data (around 2% of 21,000,000 rows need to be updated).
There's too much data for to be updated in a single transaction (the transaction log gets full and the script aborts), so I use a WHILE loop to iterate through the table, updating 10,000 rows at a time, each batch in a separate transacticon. This works, and takes around 15 minutes to run to completion.
Now I have the script complete, I'm trying to integrate it into FluentMigrator.
FluentMigrator seems to run all the migrations for a single batch in one transaction.
How do I get FM to run each migration in a separate transaction?
Can I tell FM to not use a transaction for a specific migration?
This is not possible as of now.
There are ongoing discussions and some work already in progress.
Check it out here : https://github.com/schambers/fluentmigrator/pull/178
But your use case will surely help in pushing the things in the right direction.
You are welcome to take part to the discussion!
Maybe someone will find a temporary workaround?