Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK) - google-cloud-platform

I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key.
Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to the other transformations (see the sample code below).
I wonder if this is supposed to work efficiently in Dataflow. If not, what is the recommended workaround in the Python SDK? Is there an efficient way to have multiple Map or Write transforms consume the results of the same GroupBy? In my case, I observe Dataflow scale to the maximum number of workers at 5% utilization and make no progress at the steps following the GroupBy, as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
# Group by key once.
items_by_key = raw_items | GroupByKey()
# Write grouped items to a file.
(items_by_key | FlatMap(format_item) | WriteToText(path))
# Run another transformation over the same group.
features = (items_by_key | Map(extract_features))

Feeding the output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any one of the downstream steps is high-fanout, consider adding a Reshuffle step after those steps, which will allow Dataflow to further parallelize execution.
For example,
pipeline | Create([<list of globs>]) | ParDo(ExpandGlobDoFn()) | Reshuffle() | ParDo(MyReadDoFn()) | Reshuffle() | ParDo(MyProcessDoFn())
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing the reading of files from a given glob and (2) parallelizing the processing of elements from a given file.
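A minimal runnable sketch of this pattern in the Python SDK, for illustration; the DoFn bodies are assumptions, not the actual implementations behind the names above:
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ExpandGlobDoFn(beam.DoFn):
    def process(self, glob):
        # Emit one element per file matched by the glob.
        for metadata in FileSystems.match([glob])[0].metadata_list:
            yield metadata.path

class MyReadDoFn(beam.DoFn):
    def process(self, path):
        # Emit one element per line of the file.
        with FileSystems.open(path) as f:
            for line in f:
                yield line

class MyProcessDoFn(beam.DoFn):
    def process(self, element):
        # Stand-in for the real per-element processing.
        yield len(element)

with beam.Pipeline() as p:
    (p
     | beam.Create(['gs://my-bucket/input-*'])  # hypothetical glob
     | beam.ParDo(ExpandGlobDoFn())
     | beam.Reshuffle()  # parallelize reading of the matched files
     | beam.ParDo(MyReadDoFn())
     | beam.Reshuffle()  # parallelize processing of individual lines
     | beam.ParDo(MyProcessDoFn()))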

Based on my experience in troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least this was my experience with Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline was unable to make any significant progress. I tried adding reshuffles, using a combiner where possible, and disabling the experimental shuffle service.
Nothing helped until I split the pipeline in two. The first pipeline computes the GroupBy and stores it in a file (I need to ingest it "as is" into the DB). The second pipeline reads the file with the GroupBy output, reads additional inputs, and runs the further transformations. The result: all transformations finished successfully in under 2 hours. I think that if I had simply duplicated the GroupBy in my original pipeline, I would probably have achieved the same result.
I wonder whether this is a bug in the Dataflow execution engine or in the Python SDK, or whether it works as intended. If it is by design, it should at least be documented, and a pipeline like this should either be rejected at submission time or trigger a warning.
You can spot this issue by looking at the 2 branches coming out of the "Group keywords" step. It looks like the solution is to rerun GroupBy for each branch separately, as sketched below.
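A minimal sketch of that workaround, assuming the same raw_items PCollection as in the question's sample code (the step labels are illustrative):
# Duplicate the GroupByKey so each branch gets its own grouping step.
items_for_write = raw_items | 'GroupForWrite' >> GroupByKey()
(items_for_write | FlatMap(format_item) | WriteToText(path))
items_for_features = raw_items | 'GroupForFeatures' >> GroupByKey()
features = items_for_features | Map(extract_features)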

Related

GCP Dataflow running streaming inserts into BigQuery: GC Thrashing

I am using Apache Beam 2.13.0 with GCP Dataflow runner.
I have a problem with streaming ingest to BigQuery from a batch pipeline:
PCollection<BigQueryInsertError> stageOneErrors =
    destinationTableSelected
        .apply("Write BQ Attempt 1",
            BigQueryIO.<KV<TableDestination, TableRow>>write()
                .withMethod(STREAMING_INSERTS)
                .to(new KVTableDestination())
                .withFormatFunction(new KVTableRow())
                .withExtendedErrorInfo()
                .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND))
        .getFailedInsertsWithErr();
The error:
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
Memory is used/total/max = 15914/18766/18766 MB,
GC last/max = 99.17/99.17 %, #pushbacks=0, gc thrashing=true.
Heap dump not written.
The same code works correctly in streaming mode (if the explicit method setting is omitted).
The code works on reasonably small datasets (fewer than 2 million records) but fails at 2.5 million and above.
On the surface it appears to be a similar problem to the one described here: Shutting down JVM after 8 consecutive periods of measured GC thrashing
Creating a separate question to add additional details.
Is there anything I could do to fix this? It looks like the issue is within the BigQueryIO component itself - the GroupByKey it uses internally fails.
The problem with transforms that contain a GroupByKey is that they wait until all the data for the current window has been received before grouping.
In Streaming mode, this is normally fine as the incoming elements are windowed into separate windows, so the GroupByKey only operates on a small(ish) chunk of data.
In Batch mode, however, the current window is the Global Window, meaning that GroupByKey will wait for the entire input dataset to be read and received before the grouping starts to be performed. If the input dataset is large, then your worker will run out of memory, which explains what you are seeing here.
This brings up the question: why are you using BigQuery streaming inserts when processing batch data? Streaming inserts are relatively expensive (compared to bulk loads, which are free!) and have smaller quotas/limits than bulk import: even if you work around the issues you are seeing, there may be more issues yet to be discovered in BigQuery itself.
After extensive discussions with support and the developers, it was communicated that using BigQuery streaming ingress from a batch pipeline is discouraged and, as of 2.13.0, not supported.
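For what it's worth, in the Beam Python SDK the insert method can be selected explicitly on WriteToBigQuery; a minimal sketch of forcing bulk load jobs instead of streaming inserts (the table, schema, and rows are hypothetical):
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([{'id': 1, 'name': 'a'}])  # toy rows
     | beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',  # hypothetical table
         schema='id:INTEGER,name:STRING',
         method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # bulk load jobs, not streaming inserts
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))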

Google Cloud Dataflow: Dataflow programming model taking same computational time as it would take on a regular VM machine?

I am trying Google Cloud's Dataflow service, which is useful for reducing computation time. My code has the following programming model for the Dataflow pipeline:
start = (p | "read" >> beam.io.ReadFromText("gcs path"))
end = start | "data_generation" >> beam.ParDo(PerformFunction)
What I am doing:
PerformFunction is a regular Python function which contains a series of functions for data-generation purposes. My problem is that when I run this function on a regular n1-standard-16 VM on a single processor, it takes around 1 hour to complete the whole process.
Why I opted Dataflow:
I then decided to go for Dataflow, where a ParDo transform performs multi-threading of the given function, with the obvious goal of reducing the computation time from 1 hour to something less.
The Problem:
After running a Dataflow job with the above-mentioned programming model, I realized that Dataflow was still taking around 1 hour to complete the entire process, reported as wall-time in the GCP Dataflow UI. I then logged in to the worker machine, checked the resource utilization with htop, and found that the machine was only utilizing one processor, at 60% average usage.
Expected Results or Suggestions:
1. Can multiprocessing be done in the Dataflow worker Cluster?
2. Is my programming model very limited and wrong?
3. The ParDo function does not seem to reduce the computation time as expected. What do you think I am doing wrong here?
PS - Owing to some protocols, I cannot share the code. Thank you for understanding. Also, please correct me if I have misunderstood Dataflow at some point.
Apache Beam and Dataflow are able to parallelize your computation based on the input that comes into it.
If you have a single computation to apply, and this computation takes one hour, then Beam will not be able to speed it up. Beam can help you if you need to apply the same computation multiple times to different elements (or data points).
You should also consider things such as the overhead of running the computation in a distributed fashion (data copying, network calls, etc.).
So, to be able to answer your question: how many individual "data points" (how many lines) are there in your GCS file? Is it possible to parallelize the computation over each one? How long does it take to process each one?
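To make that concrete, here is a minimal sketch assuming the expensive work can be decomposed into independent per-line pieces (the processing body and paths are placeholders):
import apache_beam as beam

def process_one_record(line):
    # Placeholder for the real data-generation logic; the key point is
    # that each line is an independent element Beam can distribute.
    return len(line)

with beam.Pipeline() as p:
    (p
     | "read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")  # one element per line
     | "rebalance" >> beam.Reshuffle()  # break fusion so elements spread across workers
     | "data_generation" >> beam.Map(process_one_record)
     | "write" >> beam.io.WriteToText("gs://my-bucket/output"))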

How to use Apache beam to process Historic Time series data?

I have an Apache Beam model to process multiple time series in real time. Deployed on GCP Dataflow, it combines multiple time series into windows and calculates aggregates, etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache Beam?
I understand that I need to use the windowing property of Apache Beam to calculate the aggregates, etc., but it should accept data from 2 years back onwards.
Effectively, I need the data as it would have been available had I deployed the same pipeline 2 years ago. This is needed for testing/model-training purposes.
That sounds like a perfect use case for Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as the events have timestamps. Without additional context, I think you will need an explicit step in your pipeline to assign custom timestamps (from 2017) that you extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might need to configure the allowed timestamp skew if you run into timestamp-ordering issues.
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam
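In the Beam Python SDK, the timestamp-assignment step mentioned above can be sketched with TimestampedValue (the record layout here is an assumption about the data):
import apache_beam as beam
from apache_beam import window

def add_event_timestamp(record):
    # Assumes each record carries its own event time as a Unix timestamp,
    # e.g. 1483228800 is 2017-01-01 00:00:00 UTC.
    return window.TimestampedValue(record, record['event_time'])

with beam.Pipeline() as p:
    (p
     | beam.Create([{'value': 42.0, 'event_time': 1483228800}])
     | beam.Map(add_event_timestamp)  # elements now live at their 2017 event times
     | beam.WindowInto(window.FixedWindows(3600))  # then window/aggregate as in the live pipeline
     | beam.Map(print))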

Dataflow pipeline custom transform performance decreases as more data is passed through transform

I have a Dataflow pipeline (Java SDK 1.9.0) which reads data from GCS, does a transform, then outputs into a GroupBy transform.
I have been running this pipeline in production for a couple of months with no problems. But today the pipeline started taking twice the time to run. I read 600 million lines into the pipeline and pass them through transform step X. Previously, step X would process rows at 200k/sec, but now I notice that after 575 million rows have passed through step X, the performance drops dramatically to about 5k/sec.
I added logging in step X to see if my code takes more time towards the end of the pipeline, when more than 575 million rows have passed through the transform, but I see times consistent with what they were when the pipeline was processing at 200k/sec.
This job has failed due to an unusually slow BigQuery import that took over an hour, and a service-side bug that can fail a job if it is too slow in some cases: we're investigating the bug.
Try running your job with --experiments=enable_custom_bigquery_sink or updating your SDK to 2.0+ - this should solve the issue.
I also noticed that the pipeline structure of this job seems to be a lot more complex than it should be (it uses a very large number of similar pipeline steps, many of which I believe could have been condensed into a single step).
I would recommend posting a question on StackOverflow describing roughly what you're trying to accomplish and ask what is a simpler way to express it with Beam/Dataflow.
In particular, I suspect that the TextIO.readAll() transform will greatly help you (available in Beam at HEAD or in 2.2 which should be released soon) - see https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java - as well as, potentially, the DynamicDestinations support in BigQuery (available since 2.0).
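For readers on the Python SDK, the analogous pattern to TextIO.readAll() - reading a PCollection of file patterns rather than a single pattern fixed at construction time - is ReadAllFromText; a minimal sketch (the bucket is hypothetical):
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    (p
     | beam.Create(['gs://my-bucket/part-*.txt'])  # file patterns as pipeline data
     | ReadAllFromText()  # expands and reads the patterns in parallel
     | beam.Map(print))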

What does stage mean in the spark logs?

When I run a job using Spark, I get the following logs:
[Stage 0:> (0 + 32) / 32]
Here, 32 corresponds to the number of partitions of the RDD that I have asked for.
However, I am not getting why there are multiple stages and what exactly happens in each stage.
Each stage apparently takes a lot of time. Is it possible to get it done in fewer stages?
A stage in Spark represents a segment of the DAG computation that can be completed without shuffling data between nodes. A stage breaks on an operation that requires a shuffle of data, which is why you'll see it named by that operation in the Spark UI. If you're using Spark 1.4+, then you can even visualize this in the UI in the DAG visualization section:
Notice that the split occurs at reduceByKey, which requires a shuffle to complete the full execution.
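As an illustration, here is a minimal PySpark sketch (the input path is hypothetical); this word count runs as two stages because reduceByKey forces a shuffle:
from pyspark import SparkContext

sc = SparkContext("local[*]", "stage-demo")

counts = (sc.textFile("input.txt")            # Stage 0: read plus narrow transformations
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)  # shuffle boundary -> new stage
            .collect())                       # Stage 1: post-shuffle aggregation
print(counts)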