SplittableDoFn when using BigQueryIO - google-cloud-platform

When reading large tables from BigQuery, I find that sometimes only one worker is active, and Dataflow then actively shuts the other workers down (it only starts ramping back up once the large PCollection needs processing, which loses time).
So I wonder:
1. Will SplittableDoFn (SDF) alleviate this problem when applied to BigQueryIO?
2. Will SDFs increase the use of num_workers (and stop them from being shut down)?
3. Are SDFs available in Python yet, and in Java, are they available beyond just FileIO?
The real objective here is to reduce total processing time: quicker creation of the PCollection by using more workers, and faster execution of the DAG as Dataflow then scales up from --num_workers to --max_workers.
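For context, the read and the worker bounds are configured roughly like this (a simplified sketch; the table reference and worker counts are placeholders, and the Java options numWorkers/maxNumWorkers correspond to --num_workers/--max_workers):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadLargeTable {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setNumWorkers(10);     // initial workers (--num_workers)
    options.setMaxNumWorkers(50);  // autoscaling ceiling (--max_workers)

    Pipeline p = Pipeline.create(options);
    p.apply("ReadLargeTable",
        BigQueryIO.readTableRows().from("my-project:my_dataset.large_table")); // placeholder table
    // ... rest of the DAG ...
    p.run();
  }
}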

Related

How do parallel multi-instance loops work in Camunda 7.16.6?

I'm using the camunda-engine 7.16.6.
I have a process with a multi-instance loop like this one that repeats in parallel 1000 times.
This loop is executed in parallel. My assumption was that n Camunda executors would now start their work, so executor #1 executes Task 2, then Task 3, then Task 4, while executor #2 and all the others do the same. So after a short while, at least some of the 1000 iterations should have finished all three tasks in the loop.
However, what I have observed so far is that Task 2 gets executed 1000 times, and only when that is finished does Task 3 get executed 1000 times, and so on.
I also noticed that Camunda itself takes a lot of time, outside of the tasks.
Is my observation correct, and is this behavior documented somewhere? Can this behavior be changed?
I've run some tests and can explain the behavior:
The order of tasks and the overall time to finish are influenced by whether or not there are transaction boundaries (async after, the red bars in the screenshot).
This is described briefly here.
By setting the asyncBefore='true' attribute we introduce an additional save point at which the process state will be persisted and committed to the database. A separate job executor thread will continue the process asynchronously by using a separate database transaction. In case this transaction fails the service task will be retried and eventually marked as failed - in order to be dealt with by a human operator.
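For illustration, this is roughly how such a save point can be declared with the fluent model builder (a sketch only; the process id, task id and delegate class are made up):

import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

// Sketch: a service task with an explicit save point (asyncBefore) in front of it.
BpmnModelInstance model = Bpmn.createExecutableProcess("demoProcess")
    .startEvent()
    .serviceTask("task2")
        .camundaClass("com.example.Task2Delegate") // placeholder delegate
        .camundaAsyncBefore()                      // transaction boundary: persist, then continue via the job executor
    .endEvent()
    .done();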
repeat 1000 times, parallel, no transaction
One job executor rushes through the process; the order is 1, [2,3,4|2,3,4|...], 5. Not really parallel. But this is as documented here:
The Job Executor makes sure that jobs from a single process instance are never executed concurrently.
It can be turned off if you are an expert and know what you are doing (and have understood this section).
Overall this took around 5 seconds.
repeat 1000 times, parallel, with transaction
Here, due to the transactions, there will be 1000 waiting jobs for Task 7, and each finished Task 7 creates another job for Task 8. Since jobs are executed in the order they appear in the database (see here), the order is 6,[7,7,7...8,8,8...9,9,9...],10.
The transaction handling, which includes maintaining the variables, has a huge impact on the runtime; with transactions in parallel mode it runs for 06:33 minutes.
If you turn off the exclusive-flag it takes around 4:30 minutes, but at the cost of thousands of OptimisticLockingExceptions.
AFAIK the recommended approach to gain true parallelism would be to move Task 7, Task 8 and Task 9 to a separate process and spawn 1000 instances of that process, as sketched below.
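A rough sketch of that approach (the process key and variable name are invented for the example):

import org.camunda.bpm.engine.RuntimeService;
import org.camunda.bpm.engine.variable.Variables;

// Sketch: start 1000 independent process instances instead of one parallel multi-instance loop.
public void spawnWorkers(RuntimeService runtimeService) {
  for (int i = 0; i < 1000; i++) {
    runtimeService.startProcessInstanceByKey(
        "workerProcess",                                        // separate process containing Task 7, 8, 9
        Variables.createVariables().putValue("itemIndex", i));  // pass per-instance data as a variable
  }
}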
You can influence the order of execution if you tweak the job executor settings & priority (see here), but that seems to require the exclusive flag, too. If you do that, the order will be 6,[7,7,7|8,9,8,9 (in random order),...],10.
repeat 1000 times, sequential, no transaction
The order is 11,[12,13,14|12,13,14,...],15.
This takes only 2 seconds.
repeat 1000 times, sequential, with transaction
The order is as expected 16,[17,18,19|17,18,19|...],20
Due to the transactions this takes 02:45 minutes.
I heard from colleagues that one should use parallel only if it involves long-running/blocking tasks like a human task: in sequential mode there would only be one human task, and after that one is done, another will be created; in parallel mode, you have 1000 human tasks at once, which is more likely the desired behavior.
Parallel performance seems to be improved in Camunda 8

How is Flink checkpointing time related to buffer alignment size or alignment time?

My streaming Flink job has a checkpointing time of 2-3 seconds (15-20% of the time) and 3-4 minutes (8-12% of the time), and about 2 minutes on average. We have two stateful operators. The first is the Kafka consumer used as the source (FlinkKafkaConsumer010) and the other is the HDFS sink (CustomBucketingSink). Together these produce a state of around 1-1.5 GB for savepoints and 800 MB-6 GB (3 GB on average) for checkpoints. We have a 30-second tumbling processing window. The checkpointing interval and the minimum pause between two checkpoints are both 3 minutes. My job consumes around 3 million records per minute on average and around 20 million records per minute at peak times. There is more than enough CPU and memory for Flink.
Now here are my doubts:
1) Even when some checkpoints have a much smaller state (70-80% smaller) than others, they sometimes take minutes (15-20% of the time), whereas the others take 5-10 seconds.
2) The buffer alignment size sometimes increases to 7-8 GB compared to the 800 MB-1 GB average, but checkpointing time is not affected by this. I would expect it to take longer, since it has to wait for the checkpoint barriers.
3) Will checkpointing time be affected if we increase the tumbling window size? My assumption is that it should affect neither savepoint time nor checkpoint time.
4) A few of the subtasks that sink into HDFS take 2-3 minutes (5-10% of the time). So while 98% of the subtasks complete in 30-50 seconds, 1-2 subtasks (95% of the time it's only one) take 2-3 minutes, which delays the whole checkpoint. The problem is not with the node these subtasks run on, because it sometimes happens on one node and sometimes on another.
5) We are getting an exception once every 6-8 hours, which restarts the job: TimerException{java.nio.channels.ClosedByInterruptException} at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:288)
6) How can the alignment buffer time be minimized?
7) Savepoint time increases or decreases with the input rate or state size, but checkpointing time does not behave the same way. Checkpointing time sometimes shows an inverse relation with state size, or one could say it is not affected by state size at all.
8) Whenever we restart the job, all subtasks take a uniform time on all nodes for 2-3 days, but afterwards 1-2 subtasks take 2-3 minutes compared to the others, which take 15-30 seconds. I might be wrong about this behaviour, but as far as I have observed, this is also the case.
Note that windows are stateful, and unless you are doing incremental aggregation, longer windows have more state, which will in turn affect checkpoint sizes and durations.
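For example, with incremental aggregation the window state is just one small accumulator per key instead of every buffered record; a minimal sketch, assuming a placeholder Event type with a getKey() accessor:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Sketch: count events per key in 30s tumbling windows, keeping only a Long accumulator as window state.
public static DataStream<Long> countPerKey(DataStream<Event> events) {
  return events
      .keyBy(Event::getKey)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
      .aggregate(new AggregateFunction<Event, Long, Long>() {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(Event value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
      });
}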
It would be helpful to know which state backend you are using, and whether or not you are using incremental checkpointing.
I would start by trying to find the cause of the slow sink subtask(s) causing the backpressure, which is in turn causing the painful checkpointing. Could be data skew, or resource starvation, for example. Some common causes include insufficient CPU, network, or disk bandwidth, or AWS (or other API) rate limits. It may seem that you have plenty of CPU, for example, but one hot key can put way too much load on one thread, and thereby hold back the entire cluster.
If you find a way to correct the imbalance at the sink, then the checkpoint alignment problems should calm down. (Note that if you can tolerate duplicate results, you could disable checkpoint barrier alignment by choosing CheckpointingMode.AT_LEAST_ONCE.)
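For reference, that mode is chosen when checkpointing is enabled, roughly like this (the interval is just an example value):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: at-least-once checkpoints skip barrier alignment, at the cost of possible duplicates on recovery.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(180_000L, CheckpointingMode.AT_LEAST_ONCE); // 3-minute interval
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(180_000L);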

Dataflow job stuck and not reading messages from PubSub

I have a Dataflow job which reads JSON from 3 PubSub topics, flattens them into one, applies some transformations, and saves to BigQuery.
I'm using a GlobalWindow with the following configuration.
.apply(Window.<PubsubMessage>into(new GlobalWindows())
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterFirst.of(
            AfterPane.elementCountAtLeast(20000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
    .discardingFiredPanes());
The job is running with following configuration
Max Workers : 20
Disk Size: 10GB
Machine Type : n1-standard-4
Autoscaling Algo: Throughput Based
The problem I'm facing is that after processing a few messages (approx. 80k), the job stops reading messages from PubSub. There is a backlog of close to 10 million messages in one of those topics, yet the Dataflow job is not reading the messages or autoscaling.
I also checked the CPU usage of each worker, and that is hovering in the single digits after the initial burst.
I've tried changing the machine type and max worker configuration, but nothing seems to work.
How should I approach this problem?
I suspect the windowing function is the culprit. GlobalWindow isn't suited to streaming jobs (which I assume this job is, due to the use of PubSub), because it won't fire the window until all elements are present, which never happens in a streaming context.
In your situation, it looks like the window will fire early once, when it hits either that element count or duration, but after that the window will get stuck waiting for all the elements to finally arrive. A quick fix to check if this is the case is to wrap the early firings in a Repeatedly.forever trigger, like so:
.withEarlyFirings(
    Repeatedly.forever(
        AfterFirst.of(
            AfterPane.elementCountAtLeast(20000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
This should allow the early firing to fire repeatedly, preventing the window from getting stuck.
However, for a more permanent solution I recommend moving away from GlobalWindow in streaming pipelines. Using fixed-time windows with early firings based on element count would give you the same behavior, but without the risk of getting stuck.
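For example, something along these lines (the window length here is arbitrary):

// Sketch: fixed 5-minute windows with early firings instead of a GlobalWindow.
.apply(Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(5)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterPane.elementCountAtLeast(20000)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes());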

Flink RocksDB Performance issues

I have a Flink job (Scala) that is basically reading from a Kafka topic (1.0), aggregating data (1-minute event-time tumbling window, using a fold function, which I know is deprecated but is easier to implement than an aggregate function), and writing the result to 2 different Kafka topics.
The question is: when I'm using the FS state backend, everything runs smoothly and checkpoints take 1-2 seconds, with an average state size of 200 MB - that is, until the state size increases (while closing a gap, for example).
I figured I would try RocksDB (over HDFS) for checkpoints - but the throughput is SIGNIFICANTLY lower than with the FS state backend. As I understand it, Flink does not need to serialize/deserialize on every state access when using the FS state backend, because the state is kept in memory (heap), while RocksDB DOES, and I guess that is what accounts for the slowdown (and the backpressure, and checkpoints taking MUCH longer, sometimes timing out after 10 minutes).
Still, there are times when the state cannot fit in memory, and I am basically trying to figure out how to make the RocksDB state backend perform "better".
Is it because of the deprecated fold function? Do I need to fine-tune some parameters that are not easily searchable in the documentation? Any tips?
Each state backend holds the working state somewhere, and then durably persists its checkpoints in a distributed filesystem. The RocksDB state backend holds its working state on disk, and this can be a local disk, hopefully faster than hdfs.
Try setting state.backend.rocksdb.localdir (see https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/state_backends.html#rocksdb-state-backend-config-options) to somewhere on the fastest local filesystem on each taskmanager.
Turning on incremental checkpointing could also make a large difference.
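A sketch of both settings in code (the paths are placeholders; the same can be configured in flink-conf.yaml):

import java.io.IOException;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: checkpoints go to HDFS (incrementally), working state lives on a fast local disk.
public static void configureStateBackend(StreamExecutionEnvironment env) throws IOException {
  RocksDBStateBackend backend =
      new RocksDBStateBackend("hdfs:///flink/checkpoints", true); // true = incremental checkpointing
  backend.setDbStoragePath("/mnt/local-ssd/flink/rocksdb");       // state.backend.rocksdb.localdir
  env.setStateBackend(backend);
}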
Also see Tuning RocksDB.

Elastic Beanstalk high CPU load after a week of running

I am running a single-instance worker on AWS Beanstalk. It is a single-container Docker that runs some processes once every business day. Mostly, the processes sync a large number of small files from S3 and analyze those.
The setup runs fine for about a week, and then CPU load starts growing linearly in time, as in this screenshot.
The CPU load stays at a considerable level, slowing down my scheduled processes. At the same time, my top-resource tracking running inside the container (privileged Docker mode to enable it):
echo "%CPU %MEM ARGS $(date)" && ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 | tail
shows nearly no CPU load (which changes only during the time that my daily process runs, seemingly accurately reflecting system load at those times).
What am I missing here in terms of the origin of this "background" system load? Wondering if anybody has seen similar behavior, and/or could suggest additional diagnostics from inside the running container.
So far I have been re-starting the setup every week to remove the "background" load, but that is sub-optimal since the first run after each restart has to collect over 1 million small files from S3 (while subsequent daily runs add only a few thousand files per day).
The profile is a bit odd, especially the linear growth. It is almost as if something is accumulating and taking progressively longer to process.
I don't have enough information to point at a specific issue. A few things that you could check:
Are you collecting files anywhere, whether intentionally or in a cache or transfer folder? It could be that the system is running background processes (AV, index, defrag, dedupe, etc) and the "large number of small files" are accumulating to become something that needs to be paged or handled inefficiently.
Does any part of your process use a weekly naming convention or housekeeping process? Might you be getting conflicts, or accumulating workload as the week rolls over, i.e. the 2nd week is actually processing both the 1st and 2nd week's data but never completing, so that each following day it gets progressively worse? I saw something similar where an inappropriate bubble-sort process never completed (it never reached its completion condition because the slow but steady inflow of data kept resetting it), and the demand from the process got progressively higher as the array grew.
Do you have any logging on a weekly rollover cycle?
Are there any other key performance metrics following the trend? (network, disk IO, memory, paging, etc.)
Do consider whether it is a false positive: if it is genuinely high CPU, there should be other metrics mirroring the CPU behaviour (cache use, disk IO, S3 transfer statistics/logging).
RL