Flink RocksDB performance issues - HDFS

I have a Flink job (Scala) that basically reads from a Kafka topic (1.0), aggregates data (1-minute event-time tumbling window, using a fold function, which I know is deprecated but is easier to implement than an aggregate function), and writes the result to 2 different Kafka topics.
The question is: when I'm using the FS state backend, everything runs smoothly and checkpoints take 1-2 seconds, with an average state size of 200 MB - that is, until the state size increases (while closing a gap, for example).
I figured I would try RocksDB (over HDFS) for checkpoints, but the throughput is SIGNIFICANTLY lower than with the FS state backend. As I understand it, with the FS state backend Flink does not need to serialize/deserialize on every state access, because the state is kept in memory (heap), while RocksDB DOES, and I guess that accounts for the slowdown (and the backpressure, and checkpoints taking MUCH longer, sometimes timing out after 10 minutes).
Still, there are times when the state cannot fit in memory, and I am trying to figure out how to make the RocksDB state backend perform "better".
Is it because of the deprecated fold function? Do I need to fine-tune some parameters that are not easily found in the documentation? Any tips? (A sketch of a fold-to-aggregate rewrite follows below.)
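For reference, a minimal sketch of replacing the deprecated fold with an AggregateFunction; the Event and Summary types and field names below are made-up placeholders, not the job's actual types:

    import org.apache.flink.api.common.functions.AggregateFunction
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    // Placeholder element and result types, for illustration only.
    case class Event(key: String, value: Long)
    case class Summary(key: String, count: Long, sum: Long)

    class SummaryAggregate extends AggregateFunction[Event, Summary, Summary] {
      override def createAccumulator(): Summary = Summary("", 0L, 0L)

      // Incrementally fold each element into the accumulator, as fold did.
      override def add(e: Event, acc: Summary): Summary =
        Summary(e.key, acc.count + 1, acc.sum + e.value)

      override def getResult(acc: Summary): Summary = acc

      override def merge(a: Summary, b: Summary): Summary =
        Summary(a.key, a.count + b.count, a.sum + b.sum)
    }

    // Wiring it into the (hypothetical) keyed stream:
    // events
    //   .keyBy(_.key)
    //   .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    //   .aggregate(new SummaryAggregate)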

Each state backend holds its working state somewhere, and then durably persists its checkpoints in a distributed filesystem. The RocksDB state backend holds its working state on disk, and this can be a local disk, hopefully faster than HDFS.
Try setting state.backend.rocksdb.localdir (see https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/state_backends.html#rocksdb-state-backend-config-options) to somewhere on the fastest local filesystem on each taskmanager.
Turning on incremental checkpointing could also make a large difference (see the sketch after this answer).
Also see Tuning RocksDB.
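As a rough sketch of wiring up both suggestions programmatically (the paths here are illustrative; use whatever checkpoint URI and fast local disk each taskmanager actually has):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.runtime.state.StateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Checkpoints still go to HDFS; the second argument enables incremental checkpoints.
    val rocksDb = new RocksDBStateBackend("hdfs:///flink/checkpoints", true)

    // Keep RocksDB's working state on the fastest local disk available on each taskmanager.
    rocksDb.setDbStoragePath("/mnt/local-ssd/flink/rocksdb")

    env.setStateBackend(rocksDb: StateBackend)

The same two settings can also be made cluster-wide in flink-conf.yaml via state.backend.rocksdb.localdir and state.backend.incremental.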

Related

Accessing Data From Another Application

I currently have a server (called worldserver) that, when it starts, pulls data from multiple sources (mainly a database), and every 5 minutes saves the modified data back into the database.
Since it is still in development, I am wondering: can I somehow load all the data the worldserver needs at startup into a separate background application, and have the main server load its data from there?
Here's my logic:
Have two different applications: one that only stores data (the background application), and another that pulls data from it and updates the database every 5 minutes (the worldserver).
I want to eliminate the data reads between the worldserver and the database, and make it read data from the background application instead (which realistically never needs to crash, as all it does is store data and gather dust in the machine's RAM).
The reason I am considering this is that RAM is the fastest storage on a PC, and this could speed up the server's loading time significantly.
Is such a thing possible? If it IS in fact possible, is it recommended? I want maximum optimization and as much protection against data loss as possible.
I tried searching online for how to do this, but I came across a Stack Overflow thread saying that one process cannot read another process's data because the operating system restricts it.

SplittableDoFn when using BigQueryIO

When reading large tables from BigQuery, I find that sometimes only one worker is active, and Dataflow then actively kills the other workers (and only starts ramping up again once the large PCollection requires processing, losing time).
So I wonder:
1. Will SplittableDoFn (SDF) alleviate this problem when applied to BigQueryIO?
2. Will SDFs increase the use of num_workers (and stop them from being shut down)?
3. Are SDFs available in Python (yet), and in Java, are they available beyond just FileIO?
The real objective here is to reduce total processing time (quicker creation of the PCollection by using more workers, and faster execution of the DAG as Dataflow then scales up from --num_workers to --max_workers).

Elastic Beanstalk high CPU load after a week of running

I am running a single-instance worker on AWS Elastic Beanstalk. It is a single-container Docker environment that runs some processes once every business day. Mostly, the processes sync a large number of small files from S3 and analyze them.
The setup runs fine for about a week, and then the CPU load starts growing linearly over time, as in this screenshot.
The CPU load stays at a considerable level, slowing down my scheduled processes. At the same time, my top-resource tracking command running inside the container (privileged Docker mode is enabled to allow it):
echo "%CPU %MEM ARGS $(date)" && ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 | tail
shows nearly no CPU load (it changes only while my daily process runs, and seems to reflect system load accurately at those times).
What am I missing here about the origin of this "background" system load? Has anybody seen similar behavior, and/or could you suggest additional diagnostics to run from inside the container?
So far I have been restarting the setup every week to remove the "background" load, but that is sub-optimal, since the first run after each restart has to collect over 1 million small files from S3 (while subsequent daily runs add only a few thousand files per day).
The profile is a bit odd, especially the linear growth. It's almost as if something is accumulating and taking progressively longer to process.
I don't have enough information to point at a specific issue, but here are a few things you could check:
Are you collecting files anywhere, whether intentionally or in a cache or transfer folder? The system may be running background processes (antivirus, indexing, defragmentation, deduplication, etc.), and the "large number of small files" may be accumulating into something that has to be paged or handled inefficiently.
Does any part of your process use a weekly naming convention or housekeeping process? Might you be getting conflicts, or accumulating workload as the week rolls over, i.e. the 2nd week is actually processing both the 1st and 2nd weeks' data but never completing, so that each subsequent day gets progressively worse? I saw something similar where an inappropriate bubble-sort process never reached its completion condition (the slow but steady inflow of data kept resetting it), and the demand of the process grew progressively as the array got larger.
Do you have any logging on a weekly rollover cycle?
Are any other key performance metrics following the trend (network, disk I/O, memory, paging, etc.)?
Also consider whether it is a false positive. If the CPU really is busy, other metrics should mirror the CPU behaviour: cache use, disk I/O, S3 transfer statistics/logging.
RL

Berkeley DB: DbEnv::lsn_reset takes a very long time

I'm using Berkeley DB with a relatively large database file (2.1 GiB, using the btree format, in case it matters). During application shutdown, DbEnv::lsn_reset is called in order to "flush" everything before exiting the application. For this large database, the routine takes a very long time for me -- at least 10 minutes or so, during which there is heavy disk access.
Is this normal, or the result of using Berkeley DB in some wrong way? Is there anything that can be done to make this faster? In particular, which parameters could be tweaked to improve performance here?
DbEnv::lsn_reset() is probably not what you want. That function rewrites every single page in the database so that you can close the databases and open them in a different environment. It is going to write out at least 2.1 GiB, and fairly slowly.
If you're just shutting the application down to start it back up sometime later, you may simply want to call DbEnv::txn_checkpoint() to flush the database log and insert a checkpoint record. Even that isn't strictly required: as long as you have the logs committed to stable storage, you can simply exit your application. (A short sketch follows after the link below.)
http://docs.oracle.com/cd/E17276_01/html/api_reference/CXX/txncheckpoint.html
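For illustration, here is what a shutdown path along those lines might roughly look like, assuming the Berkeley DB Java binding (com.sleepycat.db) rather than the C++ API the question uses; in C++ the corresponding call is DbEnv::txn_checkpoint() as mentioned above:

    import com.sleepycat.db.{CheckpointConfig, Environment}

    // `env` is the application's already-open transactional Environment.
    def shutDown(env: Environment): Unit = {
      val cfg = new CheckpointConfig()
      cfg.setForce(true)    // checkpoint even if little has changed since the last one
      env.checkpoint(cfg)   // flush the log and write a checkpoint record
      env.close()
    }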

Neo4j 1.9 throws "java.net.ConnectException: Connection refused" with multi-threaded neocons client

G'day,
I've written a little program in Clojure that uses neocons to jam a bunch of data into Neo4j v1.9.4, and after getting it working I have been tinkering with performance.
On large data sets the bottleneck is inserting relationships into Neo4j, which isn't all that surprising given that they have to be created one at a time. So my thought was to sprinkle some pmap magic on it, to see if some naive parallelism helped.
Unexpectedly (at least to me), that resulted in neocons throwing a "java.net.ConnectException: Connection refused" exception, which seems odd given that the client will default to 10 threads (pmap creates no more than numberOfProcessors + 2 threads), while Neo4j will default to 80 threads (numberOfProcessors * 10, at least if I'm reading the docs right). Last time I checked, 10 was less than 80, so Neo4j should have... <takes off shoes>... lots of threads to spare.
The line of code in question is here - the only change that was made was to switch the "map" call to a "pmap" call.
Any ideas / suggestions?
Thanks in advance,
Peter
Peter,
I would recommend using batch mode for the creation of relationships too. I saw that you already use batch creation for the nodes.
Make sure that your batch sizes are roughly between 20k and 50k elements to be most efficient.
Otherwise you end up with two issues:
using one transaction per relationship really drains your resources, because each commit does a synchronized, forced write to the transaction log
creating a relationship locks both its start and end nodes, so you'll get a lot of locks that other threads wait for, and you can easily end up stalling all your server threads while they wait for locks to be released on their start or end nodes
You should be able to see these locked threads by issuing a kill -3 (or jstack <pid>) against the Neo4j server.
Batching those relationship creations and grouping them by subgraph, so that there is as little overlap as possible between the batches, should help a lot (see the sketch after this answer).
Not related to your issue, but still worth investigating later:
I'm not sure what neocons uses under the hood, but you might fare better with the transactional endpoint and Cypher in Neo4j 2.0.
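As a rough, library-agnostic sketch of the batching-and-grouping idea above (the RelSpec type and batch size are made up for illustration; the actual submission would go through neocons' batch support or Neo4j's REST endpoint):

    // Placeholder description of one relationship to create.
    case class RelSpec(startId: Long, endId: Long, relType: String)

    // Reorder the relationships so that those sharing a start node end up in the
    // same batch (less lock overlap between concurrent batches), then chunk them
    // into batches of roughly 20k-50k elements.
    def batches(rels: Seq[RelSpec], batchSize: Int = 20000): Iterator[Seq[RelSpec]] =
      rels.groupBy(_.startId).values.flatten.toSeq.grouped(batchSize)

    // Each batch is then submitted as a single request, one at a time or with
    // limited parallelism, instead of one HTTP call per relationship.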