Dataflow doesn't update GroupByKey's "Output collections" field - google-cloud-platform

Dataflow does not update GroupByKey's Output Collections even though I can see the output data in cloud storage (the next transform in the pipeline writes the output to GCS). None of the transforms after GroupByKey show input/output collections either. I have also tried to implement data-driven triggering using AfterPane.elementCountAtLeast(3000) but then too the Output Collections does not get updated after 3000 elements have been input.
The problem is that the Estimated Size keeps on increasing and I am afraid that it will eventually lead to more workers costing me more money. I have been waiting for more than an hour but it still doesn't get updated. I have an unbounded PCollection and I have set the windowing and triggering as shown below
input
.apply(
Window
.into[String](FixedWindows.of(Duration.standardMinutes(windowSize)))
.withAllowedLateness(Duration.standardMinutes(windowSize))
.discardingFiredPanes()
.triggering(AfterPane.elementCountAtLeast(3000)))
What might be the issue?

Related

Proper conversion of AWS Log Insights to Metrics for visualization and monitoring

TL;DR;
What is the proper way to create a metric so that it generates reliable information about the log insights?
What is desired
The current Log insights can be seen similar to the following
However, it becomes easier to analyse these logs using the metrics (mostly because you can have multiple sources of data in the same plot and even perform math operations between them).
Solution according to docs
Allegedly, a log can be converted to a metric filter following a guide like this. However, this approach does not seem to work entirely right (I guess because of the time frames that have to be imposed in the metric plots), providing incorrect information, for example:
Issue with solution
In the previous image I've created a dashboard containing the metric count (the number 7), corresponding to the sum of events each 5 minutes. Also I've added a preview of the log insight corresponding to the information used to create the event.
However, as it can be seen, the number of logs is 4, but the event count displays 7. Changing the time frame in the metric generates other types of issues (e.g., selecting a very small time frame like 1 sec won't retrieve any data, or a slightly smaller time frame will now provide another wrong number: 3, when there are 4 logs, for example).
P.S.
I've also tried converting the log insights to metrics using this lambda function as suggested by Danil Smirnov to no avail, as it seems to generate the same issues.

GCP Dataflow running streaming inserts into BigQuery: GC Thrashing

I am using Apache Beam 2.13.0 with GCP Dataflow runner.
I have a problem with streaming ingest to BigQuery from a batch pipeline:
PCollection<BigQueryInsertError> stageOneErrors =
destinationTableSelected
.apply("Write BQ Attempt 1",
BigQueryIO.<KV<TableDestination, TableRow>>write()
.withMethod(STREAMING_INSERTS)
.to(new KVTableDestination())
.withFormatFunction(new KVTableRow())
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND))
.getFailedInsertsWithErr();
The error:
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
Memory is used/total/max = 15914/18766/18766 MB,
GC last/max = 99.17/99.17 %, #pushbacks=0, gc thrashing=true.
Heap dump not written.
Same code working in the streaming mode correctly (if the with explicit method setting omitted).
The code works on reasonably small datasets (less than 2 million records). Fails on 2,5 million plus.
On the surface it appears to be a similar problem to the one described here: Shutting down JVM after 8 consecutive periods of measured GC thrashing
Creating a separate question to add additional details.
Is there anything I could do to fix this? Looks like the issue is within the BigQueryIO component itself - GroupBy key fails.
The problem with transforms that contain GroupByKey is that it will wait until all the data for the current window has been received before grouping.
In Streaming mode, this is normally fine as the incoming elements are windowed into separate windows, so the GroupByKey only operates on a small(ish) chunk of data.
In Batch mode, however, the current window is the Global Window, meaning that GroupByKey will wait for the entire input dataset to be read and received before the grouping starts to be performed. If the input dataset is large, then your worker will run out of memory, which explains what you are seeing here.
This brings up the question: Why are you using BigQuery Streaming insert when processing Batch data? Streaming inserts are relatively expensive (compared to bulk which is free!) and have smaller quota/limits than Bulk import: even if you work around the issues you are seeing, there may be more issues yet to be discovered in Bigquery itself..
After extensive discussions with the support and the developers it has been communicated that using BigQuery streaming ingress from a batch pipeline is discouraged and currently (as of 2.13.0) not supported.

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using Python SDK for Apache Beam to run a feature extraction pipeline on Google DataFlow. I need to run multiple transformations all of which expect items to be grouped by key.
Based on the answer to this question, DataFlow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the result PCollection to other transformations (see sample code below).
I wonder if this is supposed to work efficiently in DataFlow. If not, what is a recommended workaround in Python SDK? Is there an efficient way to have multiple Map or Write transformations taking results of the same GroupBy? In my case, I observe DataFlow scale to the max number of workers at 5% utilization and make no progress at the steps following the GroupBy as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
# Group by key once.
items_by_key = raw_items | GroupByKey()
# Write groupped items to a file.
(items_by_key | FlatMap(format_item) | WriteToText(path))
# Run another transformation over the same group.
features = (items_by_key | Map(extract_features))
Feeding output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any one of the downstream steps are high fanout, consider adding a Reshuffle step after those steps which will allow Dataflow to further parallelize execution.
For example,
pipeline | Create([<list of globs>]) | ParDo(ExpandGlobDoFn()) | Reshuffle() | ParDo(MyreadDoFn()) | Reshuffle() | ParDo(MyProcessDoFn())
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing reading of files from a given glob (2) parallelizing processing of elements from a given file.
Based on my experience in troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least this was my experience with Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline was not able to make any significant progress. I tried adding reshuffles, using a combiner where possible and disabling the experimental shuffling service.
Nothing helped until I split the pipeline into two ones. The first pipeline computes the GroupBy and stores it in a file (I need to ingest it "as is" to the DB). The second reads the file with a GroupBy output, reads additional inputs and run further transformations. The result - all transformations successfully finished under 2 hours. I think if I just duplicated the GroupBy in my original pipeline, I would have probably achieved the same results.
I wonder if this is a bug in the DataFlow execution engine or the Python SDK, or it works as intended. If it is by design, then at least it should be documented, and the pipeline like this should not be accepted when submitted, or there should be a warning.
You can spot this issue by looking at the 2 branches coming out of the "Group keywords" step. It looks like the solution is to rerun GroupBy for each branch separately.

Delays when streaming data from the Google Search console API to BigQuery

So I've been trying to stream data from Google Search Console API to BigQuery in real time.
The data are retrieved from GSC API and streamed to the BigQuery stream buffer. However, I experience high latency before the streaming buffer can be flushed (up to 2 hours or more). So, the data stays in the streaming buffer but is not in the table.
The data are also not visible in the preview and the table size is 0B with 0 rows (actually after waiting for >1day I still see 0B even though there are more than 0 rows).
Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from it and appears in the streaming buffer (I only saw this once). -> This was explained by the second bullet in shollyman's answer.
What I want is to have the data in the table in real time. According to the documentation this seems possible but doesn't work in my case (2h of delay as stated above).
Here's the code responsible for that part:
for row in response['rows']:
keys = ','.join(row['keys'])
# Data Manipulation Languate (DML) Insert one row each time to BigQuery
row_to_stream = {'keys':keys, 'f1':row['f1'], 'f2':row['f2'], 'ctr':row['ctr'], 'position':row['position']}
insert_all_data = {
"kind": "bigquery#tableDataInsertAllRequest",
"skipInvaliedRows": True,
"ignoreUnknownValues": True,
'rows':[{
'insertId': str(uuid.uuid4()),
'json': row_to_stream,
}]
}
build('bigquery', 'v2', cache_discovery=False).tabledata().insertAll(
projectId=projectid,
datasetId=dataset_id,
tableId=tableid,
body=insert_all_data).execute(num_retries=5)
I've seen questions that seem very similar to mine on here but I haven't really found an answer. I therefore have 2 questions.
1. What could cause this issue?
Also, I'm new to GCP and I've seen other options (at least they seemed like options to me) for real time streaming of data to BigQuery (e.g., using PubSub and a few projects around real time Twitter data analysis).
2. How do you pick the best option for a particular task?
By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table, that should show you the updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the preview button may not show results until some data is extracted from the streaming buffer to managed storage.
I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.
Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple pub/sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from pub/sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration.

Dataflow pipeline waits for elements from all streams before performing GroupBy

We are running a Dataflow job that handles multiple input streams. Some of them are high traffic and some of them rarely get messages through. We are joining all streams with a "shared" stream that contains information relevant to all elements. This is a simplified example of the pipeline:
I noticed that the job will not produce any output, until both streams contain some traffic.
For example, let's suppose that Stream 1 gets a steady flow of traffic, whereas Stream 2 does not produce any messages for a period of time. For this time, the job's DAG will show elements being accumulated in the GroupByKey step but nothing will be propagated beyond it. I can also see the Flatten PCollections step showing input elements for the left side of the graph but not the right one. This creates a problem when dealing with high traffic and low traffic streams in the same job, since it will cause output to be delayed for as much as it takes for Stream 2 to pick up messages.
I am not sure if the observation is correct, but I wanted to ask if this is how Flatten/GroupByKey works in general and if so, if the issue we're seeing can be avoided through an alternative way of constructing the pipeline.
(Example JobID: 2017-02-10_06_48_01-14191266875301315728)
As described in the documentation of group-by-key the default behavior is to wait for all data within the window to have arrived -- this is necessary to ensure correctness of down-stream results.
Depending on what you are trying to do, you may be able to use triggers to cause the aggregates to be output earlier.
You may also be able to use the slow-stream as a side-input to the processing of the fast-stream.
If you're still stuck, it would help if you could describe in more detail the contents of the streams and how you're trying to use them, since more detailed answers depend on the goal.