When using groupBy in a stream flow definition with some maximum substream capacity n:
source.groupBy(Int.MaxValue, _.key).to(Sink.actorRef)
If I hook up the resulting subflows to, say, an Actor sink, and purposefully cause the subflows to terminate on some message, will that free up the groupBy capacity? Will it go from n to n-1 and back to n if a subflow is ended by the sink? Is this a viable way to set up a dynamic-ish graph?
Regarding how groupBy works in general: yes, the maxSubstreams capacity is dynamic, i.e. it represents the maximum number of active substreams.
The GroupBy stage keeps a reference to each subflow in its internal state, and this reference is removed whenever that specific subflow completes.
As for your specific example, I don't think there is a way to make sure that "a subflow is ended by the sink": by using to(Sink.actorRef) after a groupBy, all the subflows are going to feed one single actor.
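For illustration, here is a minimal sketch of the mechanism described above, with made-up keys, a capacity of 2, and a per-substream Sink.foreach instead of Sink.actorRef; each substream is completed from inside the substream with take, which (per the above) removes it from GroupBy's state and frees a slot. It assumes Akka 2.6+, where the implicit ActorSystem provides the materializer:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("groupby-demo")

// Three distinct keys flow through a groupBy with capacity 2; take(3) completes
// each substream after three elements, so a slot is free again before the next
// key shows up.
Source(List("a", "a", "a", "b", "b", "b", "c", "c", "c"))
  .groupBy(maxSubstreams = 2, elem => elem)
  .take(3)
  .to(Sink.foreach(elem => println(s"substream element: $elem")))
  .run()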
I have a Map state which iterates over my array. Inside the Map state, there is a Lambda task and a Wait task. The Wait task waits a long time, and I only need to wait between iterations. So I would like to skip the wait on the last iteration, because there is no need for it.
The items and their number are different every time.
However, the Map context has only the $$.Map.Item.Index and $$.Map.Item.Value variables. I couldn't find any mention of a variable with the total number of items, for example.
How can I achieve that?
My understanding: Each execution has an arbitrary number of Map items that should run serially (i.e. MaxConcurrency:1). You want no delay after the last item. For instance, an execution with 4 items [A, B, C, D] should run in the following order: Lambda(A), Wait, Lambda(B), Wait, Lambda(C), Wait, Lambda(D). A single-item execution [A] should run Lambda(A) only.
Here is one way to do it:
[Update - late 2022: Using the new ArrayLength and MathAdd intrinsic functions we can calculate the last item index without a lambda - see "LastItemIndex.$" below]
Insert a Lambda Task before the Map State. The Lambda counts the items and outputs the last item's index to $.itemsCounter.lastItemIndex.
Add the item's index and the last item's index to each Map iteration's payload with Parameters on the Map State:
// add to the Map State definition - overrides what each iteration receives
"Parameters": {
"Index.$": "$$.Map.Item.Index",
"Data.$": "$$.Map.Item.Value", // by default, each iteration just gets this
"LastItemIndex.$": "States.MathAdd(States.ArrayLength($$.Execution.Input.Items), -1)", // 3 for {"Items": [A,B,C,D]}
},
Add a ShouldWait? Choice State inside the Map, between the Lambda Task and the Wait State. Proceed to Wait unless $.Index equals $.LastItemIndex (a sketch follows).
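For reference, here is a sketch of that Choice State. The Done and Wait state names are placeholders for whatever your Map iterator actually uses:
"ShouldWait?": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.Index",
      "NumericEqualsPath": "$.LastItemIndex",
      "Next": "Done" // skip the Wait on the last item
    }
  ],
  "Default": "Wait"
}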
Given an Observable<Input> and a mapping function Function<Input, Output> that is expensive but takes variable time, is there a way to call the mapping function in parallel on multiple inputs, and receive the outputs in the order they're produced?
I've tried using observeOn() with a multi-threaded Scheduler:
PublishSubject<Input> inputs = PublishSubject.create();
Function<Input, Output> mf = ...
Observer<Output> myObserver1 = ...
Observer<Output> myObserver2 = ...
// Note: same results with newFixedThreadPool(2)
Executor exec = Executors.newWorkStealingPool();
// Use ConnectableObservable to make sure mf is called only once
// no matter how many downstream observers
ConnectableObservable<Output> outputs = inputs
    .observeOn(Schedulers.from(exec))
    .map(mf)
    .publish();
outputs.subscribe(myObserver1);
outputs.subscribe(myObserver2);
outputs.connect();
inputs.onNext(slowInput); // `mf.apply()` takes a long time to complete on this input
inputs.onNext(fastInput); // `mf.apply()` takes a short time to complete on this input
but in testing, mf.apply(fastInput) is never called till after mf.apply(slowInput) completes.
If I play some tricks in my test with CountDownLatch to ensure mf.apply(slowInput) can't complete until after mf.apply(fastInput), the program deadlocks.
Is there some simple operator I should be using here, or is getting Observables out of order just against the grain of RxJava, and I should be using a different technology?
ETA: I looked at using ParallelFlowable (converting it back to a plain Flowable with .sequential() before subscribing myObserver1/2, or rather mySubscriber1/2), but then I get extra mf.apply() calls, one per input per Subscriber. There's ConnectableFlowable, but I'm not having much luck figuring out how to mix it with .parallel().
I guess the observeOn operator does not support concurrent execution on its own. So, how about using flatMap? Assume the mf function takes a long time.
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
or
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .map(mf)
        .subscribeOn(Schedulers.from(exec)))
    .publish();
Edit 2019-12-30
If you want to run tasks concurrently but keep the order, use the concatMapEager operator instead of flatMap.
ConnectableObservable<Output> outputs = inputs
    .concatMapEager(it -> Observable.just(it) // here
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
Doesn't sound possible to me, unless Rx has some very specialised operator to do so. If you're using flatMap to do the mapping, then the elements will arrive out-of-order. Or you could use concatMap but then you'll lose the parallel mapping that you want.
Edit: As mentioned by another poster, concatMapEager should work for this. Parallel subscription and in-order results.
I'm trying to play with Kafka Streams to aggregate some attributes of People.
I have a Kafka Streams test like this:
val factory = new ConsumerRecordFactory[Array[Byte], Character](
  "input", new ByteArraySerializer(), new CharacterSerializer())

var i = 0
while (i != 5) {
  testDriver.pipeInput(
    factory.create("input", Character(123, 12), 15 * 10000L))
  i += 1
}

val output = testDriver.readOutput....
I'm trying to group the values by key like this:
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
  .filter((key, _) => key == null)
  .mapValues(character => PersonInfos(character.id, character.id2, character.age)) // case class
  .groupBy((_, value) => CharacterInfos(value.id, value.id2)) // case class
  .count().toStream.print(Printed.toSysOut[CharacterInfos, Long])
When I run the code, I get this:
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why am I getting 5 rows instead of just one line with CharacterInfos and the count?
Doesn't groupBy just change the key?
If you use the TopologyTestDriver, caching is effectively disabled and thus every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior, which makes it very hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- which intermediate results you will get is not defined (i.e., non-deterministic); compare Michael Noll's answer.
For your unit test, it should actually not really matter: you can either test for all output records (i.e., all intermediate results), or put all output records into a key-value Map and only test for the last emitted record per key if you don't care about the intermediate results. A sketch of the latter follows.
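Here is a rough sketch of the last-record-per-key approach. It assumes the counts are routed to an output topic (called "output" here, e.g. via .to("output") instead of .print(...)) and that a CharacterInfosDeserializer exists for the key; both names are placeholders:

import org.apache.kafka.common.serialization.LongDeserializer
import scala.collection.mutable

// Drain every record the driver produced and keep only the last value per key.
val lastPerKey = mutable.Map.empty[CharacterInfos, java.lang.Long]
var record = testDriver.readOutput("output", new CharacterInfosDeserializer(), new LongDeserializer())
while (record != null) {
  lastPerKey.put(record.key(), record.value())
  record = testDriver.readOutput("output", new CharacterInfosDeserializer(), new LongDeserializer())
}
// Only the final count matters here, not the intermediate updates.
assert(lastPerKey(CharacterInfos(123, 12)).longValue == 5L)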
Furthermore, you could use the suppress() operator to get fine-grained control over what output messages you get. suppress(), in contrast to caching, is fully deterministic, and thus writing a unit test works well. However, note that suppress() is event-time driven, and thus, if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams by default emits a new output record as soon as a new input record is received.
When you are aggregating (here: counting) the input data, the aggregation result is updated (and thus a new output record produced) as soon as new input is received for the aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs by configuring the size of the so-called record caches and by setting the commit.interval.ms parameter; see Memory Management. (A configuration sketch follows.) However, how much reduction you will see depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: could be 90% in the first hour of data, 76% in the second hour of data, etc.). That is, the reduction process is deterministic, but the resulting amount of reduction is difficult to predict from the outside.
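For reference, a sketch of how those two knobs are typically set when configuring the application (the values are illustrative, not recommendations):

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
// Larger record cache: more deduplication of intermediate results
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, (10 * 1024 * 1024).toString)
// Longer commit interval: caches are flushed (and results emitted) less often
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "1000")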
Note: When doing windowed aggregations (like windowed counts) you can also use the Suppress() API so that the number of intermediate updates is not only reduced, but there will only ever be a single output per window. However, in your use case/code the aggregation is not windowed, so you cannot use the Suppress API.
To help you understand why the setup is this way: you must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record was received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- it's the correct result to the best of its knowledge at this point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where Suppress is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the Suppress() API lets you trade better latency with multiple outputs per window (default behavior, Suppress disabled) against longer latency with only a single output per window (Suppress enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.
Let's say I don't have access to the set of producers that commit to the partition of interest, but I do have control over a bunch of C++ consumers.
Since I'm running benchmarks over a complex program, I'd like to know the spread between the offset my consumers are fetching and the latest offset stored in the partition.
e.g., >> reading message #1234 of 5678 total in partition 0 of topic foo
I misunderstood the purpose of RdKafka::Consumer::outq_len() and RdKafka::Topic::OFFSET_END, because they always seem to be equal to 0 and -1, respectively.
How can I obtain the 5678 value from my example?
You need to subscribe to librdkafka's statistics to get an updated view of your consumer's lag.
Register an Event callback class and regularly call poll() on your handle; check for EVENT_STATS, then parse the corresponding JSON message and look for lo_offset, hi_offset and consumer_lag. A minimal sketch follows.
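The sketch below shows that setup with librdkafka's C++ API; the broker address, group id, topic name and stats interval are placeholders, and the JSON parsing is left out:

#include <librdkafka/rdkafkacpp.h>
#include <iostream>
#include <string>

class StatsCb : public RdKafka::EventCb {
 public:
  void event_cb(RdKafka::Event &event) override {
    if (event.type() == RdKafka::Event::EVENT_STATS) {
      // event.str() is a JSON document; each partition object contains
      // "lo_offset", "hi_offset" and "consumer_lag".
      std::cout << event.str() << std::endl;  // parse with your JSON library of choice
    }
  }
};

int main() {
  std::string errstr;
  StatsCb stats_cb;

  RdKafka::Conf *conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
  conf->set("bootstrap.servers", "localhost:9092", errstr);
  conf->set("group.id", "benchmark-consumers", errstr);
  conf->set("statistics.interval.ms", "5000", errstr);  // emit stats every 5 s
  conf->set("event_cb", &stats_cb, errstr);

  RdKafka::KafkaConsumer *consumer = RdKafka::KafkaConsumer::create(conf, errstr);
  consumer->subscribe({"foo"});

  // Stats events are delivered from inside consume(), so keep polling.
  while (true) {
    RdKafka::Message *msg = consumer->consume(1000);
    delete msg;
  }
}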
I create a ring of processes in Erlang and wish to measure the time it takes for the first message to pass through the network, as well as the entire message series; each time the first node gets the message back, it sends another one.
Right now, in the first node, I have the following code:
receive
    stop ->
        io:format("all processes stopped!~n"),
        true;
    start ->
        statistics(runtime),
        Son ! {number, 1},
        msg(PID, Son, M, 1);
    {_, M} ->
        {Time1, _} = statistics(runtime),
        io:format("The last message has arrived after ~p! ~n", [Time1*1000]),
        Son ! stop;
Of course, I start the statistics when sending the first message.
As you can see, I use Time_Since_Last_Call for the first message loop and wish to use Total_Run_Time for the entire run; the problem is that Total_Run_Time accumulates from the first time I start the statistics.
My second thought was to use another process with two receive loops, getting the times for each, adding them up and printing, but I'm sure Erlang can do better than this.
I guess the best way to solve this is to somehow flush Total_Run_Time, but I couldn't find how this could be done. Any ideas how this can be tackled?
One way to measure round-trip times would be to send a timestamp along with each message. When the first node receives the message back, it can then measure the round-trip time by calculating Total_Run_Time - Timestamp.
To calculate the total run time, I would remember the first timestamp in the process state (or the process dictionary) and compute the total when stopping the test.
Besides, given that you mention the network, are you sure that CPU time (which is what statistics(runtime) measures) is what you're after? Perhaps wall-clock time would be more appropriate.
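A small sketch of that idea (function names and message shape are made up, not your exact code): stamp each message with its send time and remember the start time, so both the round-trip time and the total time come from elapsed-time timestamps instead of the accumulating statistics(runtime) counter:

%% Start the run: remember the start time and stamp the first message.
start_timing(Son) ->
    Start = erlang:monotonic_time(millisecond),
    Son ! {number, 1, Start},
    timing_loop(Son, Start).

%% When a message comes back, compute the round-trip and total elapsed time.
timing_loop(Son, Start) ->
    receive
        {_Tag, _M, SentAt} ->
            Now = erlang:monotonic_time(millisecond),
            io:format("round trip: ~p ms, total: ~p ms~n",
                      [Now - SentAt, Now - Start]),
            Son ! stop
    end.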