GCP Dataflow running streaming inserts into BigQuery: GC Thrashing - google-cloud-platform

I am using Apache Beam 2.13.0 with GCP Dataflow runner.
I have a problem with streaming ingest to BigQuery from a batch pipeline:
PCollection<BigQueryInsertError> stageOneErrors =
destinationTableSelected
.apply("Write BQ Attempt 1",
BigQueryIO.<KV<TableDestination, TableRow>>write()
.withMethod(STREAMING_INSERTS)
.to(new KVTableDestination())
.withFormatFunction(new KVTableRow())
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND))
.getFailedInsertsWithErr();
The error:
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
Memory is used/total/max = 15914/18766/18766 MB,
GC last/max = 99.17/99.17 %, #pushbacks=0, gc thrashing=true.
Heap dump not written.
Same code working in the streaming mode correctly (if the with explicit method setting omitted).
The code works on reasonably small datasets (less than 2 million records). Fails on 2,5 million plus.
On the surface it appears to be a similar problem to the one described here: Shutting down JVM after 8 consecutive periods of measured GC thrashing
Creating a separate question to add additional details.
Is there anything I could do to fix this? Looks like the issue is within the BigQueryIO component itself - GroupBy key fails.

The problem with transforms that contain GroupByKey is that it will wait until all the data for the current window has been received before grouping.
In Streaming mode, this is normally fine as the incoming elements are windowed into separate windows, so the GroupByKey only operates on a small(ish) chunk of data.
In Batch mode, however, the current window is the Global Window, meaning that GroupByKey will wait for the entire input dataset to be read and received before the grouping starts to be performed. If the input dataset is large, then your worker will run out of memory, which explains what you are seeing here.
This brings up the question: Why are you using BigQuery Streaming insert when processing Batch data? Streaming inserts are relatively expensive (compared to bulk which is free!) and have smaller quota/limits than Bulk import: even if you work around the issues you are seeing, there may be more issues yet to be discovered in Bigquery itself..

After extensive discussions with the support and the developers it has been communicated that using BigQuery streaming ingress from a batch pipeline is discouraged and currently (as of 2.13.0) not supported.

Related

BigQueryIO - only first day table can be created, despite having CreateDisposition.CREATE_IF_NEEDED

I have a dataflow job processing data from pub/sub defined like this:
read from pub/sub -> process (my function) -> group into day windows -> write to BQ
I'm using Write.Method.FILE_LOADS because of bounded input.
My job works fine, processing lots of GBs of data but it fails and tries to retry forever when it gets to create another table. The job is meant to run continuously and create day tables on its own, it does fine on the first few ones but then gives me indefinitely:
Processing stuck in step write-bq/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 05h30m00s without outputting or completing in state finish
Before this happens it also throws:
Load job <job_id> failed, will retry: {"errorResult":{"message":"Not found: Table <name_of_table> was not found in location US","reason":"notFound"}
It is indeed a right error because this table doesn't exists. Problem is that the job should create it on its own because of defined option CreateDisposition.CREATE_IF_NEEDED.
The number of day tables that it creates correctly without a problem depens on number of workers. It seems that when some worker creates one table its CreateDisposition changes to CREATE_NEVER causing the problem, but it's only my guess.
The similar problem was reported here but without any definite answer:
https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609
ProcessElement definition here seems to give some clues but I cannot really say how it works with multiple workers: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L138
I use 2.15.0 Apache SDK.
I encountered the same issue, which is still not fixed in BEAM 2.27.0 of january 2021. Therefore I had to develop a workaround: a custom PTransform which checks if the target table exist before the the BigQueryIO stage. It uses the bigquery java client for this and a Guava cache, as well as a windowing strategy (fixed, check every 15s) to sustain a heavy traffic of about 5000 elements per second. Here is the code: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1
There was a bug in the past that caused this error, but that particular one was fixed in commit https://github.com/apache/beam/commit/d6b4dcec5f297f5c1bd08f345f0e1e5c756775c2#diff-3f40fd931c8b8b972772724369cea310 Can you check to see if the version of Beam you are running includes this commit?

How to use Apache beam to process Historic Time series data?

I have the Apache Beam model to process multiple time series in real time. Deployed on GCP DataFlow, it combines multiple time series into windows, and calculates the aggregate etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache beam?
I understand that I need to use the windowing property of Apache Beam to calculate the aggregates etc, but it should accept data from 2 years back onwards
Effectively, I need data as would have been available had I deployed the same pipeline 2 years. This is needed for testing/model training purposes
That sounds like a perfect use case of Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as events have timestamps. Without additional context I think you will need to have an explicit step in your pipeline to assign custom timestamps (from 2017) that you will need to extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might need to have to configure allowed timestamp skew if you have the timestamp ordering issues.
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam

Delays when streaming data from the Google Search console API to BigQuery

So I've been trying to stream data from Google Search Console API to BigQuery in real time.
The data are retrieved from GSC API and streamed to the BigQuery stream buffer. However, I experience high latency before the streaming buffer can be flushed (up to 2 hours or more). So, the data stays in the streaming buffer but is not in the table.
The data are also not visible in the preview and the table size is 0B with 0 rows (actually after waiting for >1day I still see 0B even though there are more than 0 rows).
Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from it and appears in the streaming buffer (I only saw this once). -> This was explained by the second bullet in shollyman's answer.
What I want is to have the data in the table in real time. According to the documentation this seems possible but doesn't work in my case (2h of delay as stated above).
Here's the code responsible for that part:
for row in response['rows']:
keys = ','.join(row['keys'])
# Data Manipulation Languate (DML) Insert one row each time to BigQuery
row_to_stream = {'keys':keys, 'f1':row['f1'], 'f2':row['f2'], 'ctr':row['ctr'], 'position':row['position']}
insert_all_data = {
"kind": "bigquery#tableDataInsertAllRequest",
"skipInvaliedRows": True,
"ignoreUnknownValues": True,
'rows':[{
'insertId': str(uuid.uuid4()),
'json': row_to_stream,
}]
}
build('bigquery', 'v2', cache_discovery=False).tabledata().insertAll(
projectId=projectid,
datasetId=dataset_id,
tableId=tableid,
body=insert_all_data).execute(num_retries=5)
I've seen questions that seem very similar to mine on here but I haven't really found an answer. I therefore have 2 questions.
1. What could cause this issue?
Also, I'm new to GCP and I've seen other options (at least they seemed like options to me) for real time streaming of data to BigQuery (e.g., using PubSub and a few projects around real time Twitter data analysis).
2. How do you pick the best option for a particular task?
By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table, that should show you the updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the preview button may not show results until some data is extracted from the streaming buffer to managed storage.
I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.
Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple pub/sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from pub/sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration.

Dataflow doesn't update GroupByKey's "Output collections" field

Dataflow does not update GroupByKey's Output Collections even though I can see the output data in cloud storage (the next transform in the pipeline writes the output to GCS). None of the transforms after GroupByKey show input/output collections either. I have also tried to implement data-driven triggering using AfterPane.elementCountAtLeast(3000) but then too the Output Collections does not get updated after 3000 elements have been input.
The problem is that the Estimated Size keeps on increasing and I am afraid that it will eventually lead to more workers costing me more money. I have been waiting for more than an hour but it still doesn't get updated. I have an unbounded PCollection and I have set the windowing and triggering as shown below
input
.apply(
Window
.into[String](FixedWindows.of(Duration.standardMinutes(windowSize)))
.withAllowedLateness(Duration.standardMinutes(windowSize))
.discardingFiredPanes()
.triggering(AfterPane.elementCountAtLeast(3000)))
What might be the issue?

Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’

ETL developer reports they have been trying to run our weekly and daily processes on ADW consistently. While for the most part they are executing without exception, I am now getting this error:
“Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.”
Is there a limit on TEMPDB space associated with the DWU setting?
The database is limited to 100TB (per the portal) and not full.
Azure SQL Data Warehouse does allocate space for a tempdb, at around 399 GB per 100 DWU. Reference here.
What DWU are you using at the moment? Consider temporarily raising your DWU aka service objective or refactoring your job to be less dependent on tempdb. Lower it when your batch process is finished.
It might also be worth checking your workload for anything like cartesian products, excessive sorting, over-dependency on temp tables etc to see if any optimisation can be done.
Have a look at the Explain Plans for your code, and see whether you have a lot more data movement going on than you expect. If you find that one query does moved a lot more into Q tables, you can probably tune it to avoid the data movement (which may mean redesigning tables to distribute in a different key).