Using tbb::flow::graph with an embarrassingly parallel portion - c++

I'm new to using tbb::flow and am looking to create a graph that has an embarrassingly parallel portion. The idea is that a message comes in to a node that does some pre-processing and then formulates a set of tasks that can be executed in parallel. The data is then aggregated in a multifunction_node that sends the results out to a couple of places.
                        +----------------+                  +--------+
                       /| parallel nodes |\                /| Output |
       +-------------+/ +----------------+ \+------------+/ +--------+
msg -> | pre-process |--| parallel nodes |--| aggregator |
       +-------------+\ +----------------+ /+------------+\ +--------+
                       \| parallel nodes |/                \| Output |
                        +----------------+                  +--------+
Now the aggregator can't send out its work until all of the work is done, so it would need to keep track of the number of answers expected. Can I do this with a tbb::flow::graph, or should I create a function_node that has a parallel_for embedded in it? Any other ideas or options?
If I can do it with tbb::flow, what node types and queueing strategies should I use?
Another way of thinking about this: it's a MapReduce kind of operation with a little preprocessing, where the results are sent to a few different places in slightly different forms.
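For what it's worth, here is a minimal, untested sketch of one way this could be wired up with tbb::flow. Msg, Task, Result, and NUM_TASKS are placeholder assumptions, not part of the question: a serial multifunction_node fans each message out into tasks tagged with the group size, an unlimited-concurrency function_node runs them in parallel, and a serial multifunction_node aggregates and only fires its two output ports once the whole group has arrived.

#include <tbb/flow_graph.h>
#include <tuple>
#include <vector>

using namespace tbb::flow;

struct Msg    {};
struct Task   { int group_size; };
struct Result { int group_size; };

int main() {
    graph g;
    const int NUM_TASKS = 3;  // assumed fan-out per message

    // Serial pre-processing: one Msg in, NUM_TASKS Tasks out.
    // Tagging each Task with the group size tells the aggregator
    // how many results to wait for.
    using pre_t = multifunction_node<Msg, std::tuple<Task>>;
    pre_t preprocess(g, serial,
        [=](const Msg&, pre_t::output_ports_type& out) {
            for (int i = 0; i < NUM_TASKS; ++i)
                std::get<0>(out).try_put(Task{NUM_TASKS});
        });

    // Embarrassingly parallel portion: 'unlimited' lets TBB run as
    // many bodies concurrently as it has worker threads for.
    function_node<Task, Result> worker(g, unlimited,
        [](const Task& t) -> Result { return Result{t.group_size}; });

    // Serial aggregator: buffers results, then emits on both output
    // ports once the expected count is reached. (Real code would key
    // the buffer by a message id if several Msgs can be in flight.)
    using agg_t = multifunction_node<Result, std::tuple<int, int>>;
    agg_t aggregator(g, serial,
        [](const Result& r, agg_t::output_ports_type& out) {
            static std::vector<Result> pending;  // safe: node is serial
            pending.push_back(r);
            if ((int)pending.size() == r.group_size) {
                std::get<0>(out).try_put(0);  // e.g. result form A
                std::get<1>(out).try_put(1);  // e.g. result form B
                pending.clear();
            }
        });

    make_edge(output_port<0>(preprocess), worker);
    make_edge(worker, aggregator);
    // Attach output_port<0>/<1>(aggregator) to the two Output nodes here.

    preprocess.try_put(Msg{});
    g.wait_for_all();
}

With this shape no explicit queueing strategy is needed: the aggregator's serial concurrency makes the node queue its inputs internally, which handles the counting problem without locks.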

What do the options of the "read_format" attribute of the "perf_event_attr" structure really do?

I'm currently using the perf_event_open syscall (on Linux systems), and I'm trying to understand one of its configuration parameters, which is given through the struct perf_event_attr structure.
It's the read_format option.
As anyone can see on the man page of this syscall, this parameter is related to the output of the call.
But I don't understand what each possible flag actually does.
Especially these two possibilities:
PERF_FORMAT_TOTAL_TIME_ENABLED
PERF_FORMAT_TOTAL_TIME_RUNNING
Can anyone with that information give me a straight answer?
OK. I've looked a little further, and I think I have found an answer.
PERF_FORMAT_TOTAL_TIME_ENABLED: it seems that the "enabled time" refers to the difference between the time the event stops being observed and the time the event was registered as "to be observed".
PERF_FORMAT_TOTAL_TIME_RUNNING: it seems that the "running time" refers to the sum of the time the event was actually being observed by the kernel. It is smaller than or equal to the enabled time.
For example:
You tell your kernel that you want to observe the event X at 1:13:05 PM. The kernel creates a "probe" on X and starts recording its activity.
Then, for whatever reason, you pause the recording at 1:14:05 PM.
Then, you resume the recording at 1:15:05 PM.
Finally, you stop the recording at 1:15:35 PM.
You have 00:02:30 of enabled time (1:15:35 PM - 1:13:05 PM = 00:02:30)
and 00:01:30 of running time (1:14:05 PM - 1:13:05 PM + 1:15:35 PM - 1:15:05 PM = 00:01:30).
The read_format attribute can take both values combined as a bitmask. In C++, it looks like this:
event_configuration.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING;
where event_configuration is an instance of struct perf_event_attr.
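To make the two fields concrete, here is a minimal, untested sketch (Linux only; the hardware-instructions counter and the rest of the setup are my assumptions, not from the question) that opens a counter with both flags set and reads the values back. For a single, ungrouped event, the man page documents the read() layout as value, time_enabled, time_running:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Layout returned by read() for one event with both time flags set
// (see "Reading results" in man 2 perf_event_open).
struct ReadResult {
    std::uint64_t value;         // the counter itself
    std::uint64_t time_enabled;  // nanoseconds, TOTAL_TIME_ENABLED
    std::uint64_t time_running;  // nanoseconds, TOTAL_TIME_RUNNING
};

int main() {
    perf_event_attr event_configuration;
    std::memset(&event_configuration, 0, sizeof(event_configuration));
    event_configuration.size = sizeof(event_configuration);
    event_configuration.type = PERF_TYPE_HARDWARE;
    event_configuration.config = PERF_COUNT_HW_INSTRUCTIONS;
    event_configuration.disabled = 1;
    event_configuration.read_format =
        PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING;

    // perf_event_open has no glibc wrapper, so call it via syscall().
    int fd = syscall(SYS_perf_event_open, &event_configuration,
                     0 /*this process*/, -1 /*any CPU*/,
                     -1 /*no group*/, 0 /*flags*/);
    if (fd < 0) return 1;

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... the code being measured ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ReadResult r;
    if (read(fd, &r, sizeof(r)) == (ssize_t)sizeof(r)) {
        // If the kernel multiplexed the counter, time_running is less
        // than time_enabled; a common estimate of the "true" count is
        // value * time_enabled / time_running.
        std::printf("value=%llu enabled=%llu ns running=%llu ns\n",
                    (unsigned long long)r.value,
                    (unsigned long long)r.time_enabled,
                    (unsigned long long)r.time_running);
    }
    close(fd);
    return 0;
}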

Parallelize map() operation on single Observable and receive results out of order

Given an Observable<Input> and a mapping function Function<Input, Output> that is expensive but takes variable time, is there a way to call the mapping function in parallel on multiple inputs, and receive the outputs in the order they're produced?
I've tried using observeOn() with a multi-threaded Scheduler:
PublishSubject<Input> inputs = PublishSubject.create();
Function<Input, Output> mf = ...
Observer<Output> myObserver1 = ...
Observer<Output> myObserver2 = ...
// Note: same results with newFixedThreadPool(2)
Executor exec = Executors.newWorkStealingPool();
// Use ConnectableObservable to make sure mf is called only once
// no matter how many downstream observers
ConnectableObservable<Output> outputs = inputs
    .observeOn(Schedulers.from(exec))
    .map(mf)
    .publish();
outputs.subscribe(myObserver1);
outputs.subscribe(myObserver2);
outputs.connect();
inputs.onNext(slowInput); // `mf.apply()` takes a long time to complete on this input
inputs.onNext(fastInput); // `mf.apply()` takes a short time to complete on this input
but in testing, mf.apply(fastInput) is never called until after mf.apply(slowInput) completes.
If I play some tricks in my test with CountDownLatch to ensure mf.apply(slowInput) can't complete until after mf.apply(fastInput), the program deadlocks.
Is there some simple operator I should be using here, or is getting Observables out of order just against the grain of RxJava, and I should be using a different technology?
ETA: I looked at using ParallelFlowable (converting it back to a plain Flowable with .sequential() before subscribing myObserver1/2, or rather mySubscriber1/2), but then I get extra mf.apply() calls, one per input per Subscriber. There's ConnectableFlowable, but I'm not having much luck figuring out how to mix it with .parallel().
I guess the observeOn operator does not support concurrent execution on its own. So how about using flatMap? Assume the mf function takes a long time.
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
or
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .map(mf))
    .subscribeOn(Schedulers.from(exec))
    .publish();
Edit 2019-12-30
If you want to run tasks concurrently but need to keep the order, use the concatMapEager operator instead of flatMap.
ConnectableObservable<Output> outputs = inputs
    .concatMapEager(it -> Observable.just(it) // here
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
Doesn't sound possible to me, unless Rx has some very specialised operator to do so. If you're using flatMap to do the mapping, then the elements will arrive out-of-order. Or you could use concatMap but then you'll lose the parallel mapping that you want.
Edit: As mentioned by another poster, concatMapEager should work for this. Parallel subscription and in-order results.

TopologyTestDriver with streaming groupByKey.windowedBy.reduce not working like kafka server [duplicate]

I'm trying to play with Kafka Streams to aggregate some attributes of People.
I have a Kafka Streams test like this:
val factory = new ConsumerRecordFactory[Array[Byte], Character]("input",
  new ByteArraySerializer(), new CharacterSerializer())
var i = 0
while (i != 5) {
  testDriver.pipeInput(factory.create("input", Character(123, 12), 15 * 10000L))
  i += 1
}
val output = testDriver.readOutput....
I'm trying to group the values by key like this:
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
  .filter((key, _) => key == null)
  .mapValues(character => PersonInfos(character.id, character.id2, character.age)) // case class
  .groupBy((_, value) => CharacterInfos(value.id, value.id2)) // case class
  .count().toStream.print(Printed.toSysOut[CharacterInfos, Long])
When I run the code, I get this:
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why am I getting 5 rows instead of just one line with CharacterInfos and the count?
Doesn't groupBy just change the key?
If you use the TopologyTestDriver, caching is effectively disabled and thus every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior, which makes it very hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- which intermediate results you will get is not defined (i.e., non-deterministic); compare Michael Noll's answer.
For your unit test, it should not really matter, and you can either test for all output records (i.e., all intermediate results), or put all output records into a key-value Map and only test for the last emitted record per key (if you don't care about the intermediate results).
Furthermore, you could use the suppress() operator to get fine-grained control over which output messages you get. suppress(), in contrast to caching, is fully deterministic, and thus writing a unit test works well. However, note that suppress() is event-time driven, and thus, if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams emits a new output record, by default, as soon as a new input record is received.
When you are aggregating (here: counting) the input data, the aggregation result is updated (and thus a new output record produced) as soon as new input is received for the aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs by configuring the size of the so-called record caches as well as the commit.interval.ms parameter. See Memory Management. However, how much reduction you will see depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: it could be 90% in the first hour of data, 76% in the second hour of data, etc.). That is, the reduction process is deterministic, but the resulting amount of reduction is difficult to predict from the outside.
Note: When doing windowed aggregations (like windowed counts) you can also use the suppress() API so that the number of intermediate updates is not only reduced, but there will only ever be a single output per window. However, in your use case/code the aggregation is not windowed, so you cannot use the suppress API.
To help you understand why the setup is this way: you must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record is received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- that is the correct result to the best of its knowledge at this point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where suppress is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the suppress() API allows you to make a trade-off between better latency with multiple outputs per window (the default behavior, suppress disabled) and longer latency with only a single output per window (suppress enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.

kyoto cabinet scan_parallel not really parallel?

I just spent a day creating an abstraction layer over KyotoDB to remove global locks from my code. I was busy porting my algorithms to this new abstraction layer when I discovered that scan_parallel isn't really parallel. It only maxes out one core. For jollies I stuck a billion-int-countdown spin loop in my code (empty stubs as I port) to try to simulate some processing time. Still only one core maxed. Do I need to move to Berkeley DB or LevelDB? I thought KyotoDB was meant for internet-scale problems :/. I must be doing something wrong or missing some gotchas.
top and iostat never went above 100% / 25% (for iostat, one CPU maxed = 1/number of cores * 100) :/ on a quad-core i5.
The source DB is a 10 GB corpus of protocol-buffer-encoded data (TreeDB) with the following tuning flags (picked up from the documentation).
index_db.tune_options(TreeDB::TLINEAR | TreeDB::TCOMPRESS); // linear collision chaining + record compression
index_db.tune_buckets(1LL * 1000);                          // 1,000 hash buckets
index_db.tune_defrag(8);                                    // auto-defragmentation unit step of 8
index_db.tune_page(32768);                                  // 32 KB page size
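For reference, a minimal sketch of how I'd expect a parallel scan to be driven. This is from memory of the Kyoto Cabinet API, so treat the details, especially the thread-count argument of scan_parallel, as assumptions to check against your headers:

#include <kctreedb.h>
#include <iostream>

using namespace kyotocabinet;

// Read-only visitor; visit_full() is invoked concurrently from the
// scanner threads, so anything it touches must be thread-safe.
class ScanVisitor : public DB::Visitor {
    const char* visit_full(const char* kbuf, size_t ksiz,
                           const char* vbuf, size_t vsiz,
                           size_t* sp) override {
        // ... per-record processing goes here ...
        return NOP;  // leave the record unchanged
    }
};

int main() {
    TreeDB index_db;
    if (!index_db.open("index.kct", TreeDB::OREADER)) return 1;
    ScanVisitor visitor;
    // Second argument is the number of scanner threads (assumed 4 here);
    // passing 1, or a build without pthread support, would serialize
    // the whole scan.
    if (!index_db.scan_parallel(&visitor, 4)) {
        std::cerr << "scan failed: " << index_db.error().name() << std::endl;
    }
    index_db.close();
    return 0;
}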
edit
Do not remove the IR tag. Please think before you wave around the detag bat.
This IS an IR-related question: it's about creating GINORMOUS (40 GB+) inverted files ONLINE. Inverted indices are the basis of IR data access methods, and inverted index creation has a unique transactional profile. By removing the IR tag you rob me of the wisdom of IR researchers who have used a database library to create such large database files.

processing large data that is sequential with tbb

I'm working on a C++ app to process large amounts of quote data, e.g. (MSFT, AMZN, etc.), with TBB, and was wondering how I should structure it. I've been looking at parallel_for, pipeline, and concurrent_queue.
The process would basically parse the data, process it, and output to a file. Parsing and processing can be done in parallel, but the output should be in order for each symbol.
E.g. input:
- Msg #1 - AMZN #1
- Msg #2 - AMZN #2
- Msg #3 - IBM #1
- Msg #4 - AMZN #3
- Msg #5 - CSCO #1
- Msg #6 - IBM #2
I would like to use a lock-free solution, or at least minimal locking, but it seems like I have to keep a concurrent_queue to preserve the order.
Any ideas would be helpful
Thanks,
David
If you use the pipeline pattern (tbb::pipeline class or tbb::parallel_pipeline() function), you can use ordered filters to ensure the output will appear in exactly the same order as the input was received. And you will not need any locks in your code for ordering.
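A minimal sketch of that shape (the Quote type, file names, and the parsing/processing stubs are placeholders; the classic tbb::pipeline-era API is assumed, oneTBB moves these names into tbb::filter_mode): a serial_in_order input filter, a parallel parse/process filter, and a serial_in_order output filter that restores the input order.

#include <tbb/pipeline.h>  // classic TBB; <tbb/parallel_pipeline.h> in oneTBB
#include <fstream>
#include <string>

struct Quote { std::string symbol; std::string payload; };

int main() {
    std::ifstream in("quotes.txt");
    std::ofstream out("processed.txt");

    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/16,
        // Input filter: serial_in_order, reads one raw message at a time.
        tbb::make_filter<void, std::string>(
            tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> std::string {
                std::string line;
                if (!std::getline(in, line)) { fc.stop(); return {}; }
                return line;
            }) &
        // Parse + process in parallel; many tokens can be in flight at once.
        tbb::make_filter<std::string, Quote>(
            tbb::filter::parallel,
            [](const std::string& raw) -> Quote {
                Quote q;
                // q = parse(raw); process(q);  // your parsing/processing here
                q.payload = raw;
                return q;
            }) &
        // Output filter: serial_in_order re-establishes the original input
        // order, so per-symbol order is preserved with no locks in user code.
        tbb::make_filter<Quote, void>(
            tbb::filter::serial_in_order,
            [&](const Quote& q) { out << q.payload << '\n'; }));
}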
Does your quote data have either a timestamp or a sequence number?
If not, add a sequence number from the producer thread and sort the data based on the sequence number after parsing it. The re-sorting can then be done either in a batch or just before the files are written.
You can create an output structure (hash or list) where the key is the position of the element in the display order (1st, 2nd, ...) and the value is the data to be displayed. Then, when all the elements are ready, you can output the structure in the desired order, as in the sketch below.
This way you don't care about which thread finishes first.
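A minimal sketch of that idea (the sequence numbers, string payloads, and std::cout sink are my assumptions): completed items are buffered in a std::map keyed by their input position, and everything contiguous with the last emitted position is flushed in order.

#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Reorders items completed out of order back into input order.
// Guarded by a single mutex; workers call push(), output happens inline.
class OrderedSink {
public:
    void push(std::uint64_t seq, const std::string& item) {
        std::lock_guard<std::mutex> lock(m_);
        pending_[seq] = item;
        // Flush every item that is now contiguous with what was written.
        while (!pending_.empty() && pending_.begin()->first == next_) {
            std::cout << pending_.begin()->second << '\n';
            pending_.erase(pending_.begin());
            ++next_;
        }
    }
private:
    std::mutex m_;
    std::map<std::uint64_t, std::string> pending_;
    std::uint64_t next_ = 0;  // next sequence number to emit
};

int main() {
    OrderedSink sink;
    // Simulate workers finishing out of order.
    sink.push(1, "AMZN #2");  // buffered: 0 not seen yet
    sink.push(0, "AMZN #1");  // flushes 0, then 1
    sink.push(2, "IBM #1");   // flushes 2
}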