Flow problems in akka stream

I'm pretty new to Akka Streams. I've been working with Rx for a while, so I know the operators fairly well, but I can't figure out why my pipeline does not emit any values.
Here is my code:
@Test
def mainFlow(): Unit = {
  val increase = Flow[Int]
    .map(value => value * 10)
  val filterFlow = Flow[Int]
    .filter(value => value > 50)
    .take(2)
  Source(0 to 10)
    .via(increase)
    .via(filterFlow)
    .to(Sink.foreach(value => println(s"Item emitted:$value")))
    .run()
}
The first Flow transforms the values emitted by the Source by multiplying them by 10, and the second Flow keeps only the items greater than 50 and then takes just 2 of them, so I was expecting the Sink to receive 60 and 70.
But nothing is emitted.
Any idea why?

Your flow is correctly built, and emits those 2 elements you mentioned.
I believe the problem is with your test. Namely, the flow runs asynchronously and your test is a plain Unit procedure. Therefore, the test will not wait until the flow is run.
You will need to introduce some synchronization in your test to perform your assertions. One way to do it is to use the ScalaFutures trait from ScalaTest, which offers you a futureValue method.
val increase = Flow[Int]
  .map(value => value * 10)
val filterFlow = Flow[Int]
  .filter(value => value > 50)
  .take(2)
Source(0 to 10)
  .via(increase)
  .via(filterFlow)
  .runForeach(value => println(s"Item emitted:$value"))
  .futureValue
Note that .to(Sink.foreach{...}).run() does not expose the Future[Done] you need to synchronize on. Your code needs to change to .toMat(Sink.foreach{...})(Keep.right).run(), which can be abbreviated to .runForeach(...).
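For completeness, here is what a full test might look like as a rough sketch, assuming ScalaTest 3.1+ mixed in with the ScalaFutures trait and Akka 2.6 (where an implicit ActorSystem provides the materializer); the class and test names are illustrative:
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import org.scalatest.concurrent.ScalaFutures
import org.scalatest.funsuite.AnyFunSuite

class MainFlowSpec extends AnyFunSuite with ScalaFutures {
  implicit val system: ActorSystem = ActorSystem("MainFlowSpec")

  test("emits 60 and 70") {
    val increase   = Flow[Int].map(_ * 10)
    val filterFlow = Flow[Int].filter(_ > 50).take(2)

    val result = Source(0 to 10)
      .via(increase)
      .via(filterFlow)
      .runWith(Sink.seq) // materializes a Future[Seq[Int]]
      .futureValue       // ScalaFutures blocks the test until the stream completes

    assert(result == Seq(60, 70))
  }
}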

Because what you are saying is the following:
For the numbers 1..10, multiply them by 10, but only ever produce the first 2 elements, then keep all of those elements which are greater than 50, and then print them.
Additionally, your test does not wait for the completion of the RunnableFlow, which typically means that your program will exit before the stream has a chance to run (Akka Streams run asynchronously).
Note that for your example there's no reason to use the GraphDSL; your code is identical to:
Source(1 to 10).map(_ * 10).take(2).filter(_ > 50).runForeach(println)
But since it isn't really doing anything "meaningfully async" I think you'd be much better off with:
(1 to 10).map(_ * 10).take(2).filter(_ > 50).foreach(println)
But then again, with the current state of the code, it is equivalent to the following expression:
()

Related

Explicit throughput limiting on part of an akka stream

I have a flow in our system which reads some elements from SQS (using alpakka) and does some preprocessing (~10 stages, normally < 1 minute in total). Then, the prepared element is sent to the main processing (a single stage, taking a few minutes). The whole thing runs on AWS/K8S and we'd like to scale out when the SQS queue grows above a certain threshold. The issue is that the SQS queue takes a long time to blow up, since there are a lot of elements "idling" in-process, having done their preprocessing but waiting for the main thing.
We can’t externalize the preprocessing stuff to a separate queue since their outcome can’t survive a de/serialization roundtrip. Also, this service and the “main” processor are deeply coupled (this service runs as main’s sidecar) and can’t be scaled independently.
The preprocessing stages are technically .mapAsyncUnordered, but the whole thing is already very slim (stream stages and SQS batches/buffers).
We tried lowering the interstage buffer (akka.stream.materializer.max-input-buffer-size), but that only gives some indirect benefit, no direct control (and is too internal to be mucking with, for my taste anyway).
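For reference, the same buffer knob can also be applied per stage via attributes instead of the global setting; a minimal sketch (the preprocess function is a hypothetical stand-in for one of the real stages):
import akka.NotUsed
import akka.stream.Attributes
import akka.stream.scaladsl.Flow
import scala.concurrent.Future

// hypothetical stand-in for one of the real preprocessing stages
def preprocess(msg: String): Future[String] = Future.successful(msg)

// shrink the input buffer of this stage only, instead of lowering
// akka.stream.materializer.max-input-buffer-size globally
val slimPreprocessing: Flow[String, String, NotUsed] =
  Flow[String]
    .mapAsyncUnordered(4)(preprocess)
    .addAttributes(Attributes.inputBuffer(initial = 1, max = 1))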
I tried implementing a “gate” wrapper which would limit the amount of elements allowed inside some arbitrary Flow, looking something like:
class LimitingGate[T, U](originalFlow: Flow[T, U, _], maxInFlight: Int) {
  private def in: InputGate[T] = ???
  private def out: OutputGate[U] = ???

  def gatedFlow: Flow[T, U, NotUsed] = Flow[T].via(in).via(originalFlow).via(out)
}
And using callbacks between the in/out gates for throttling.
The implementation partially works (stream termination is giving me a hard time), but it feels like the wrong way to go about achieving the actual goal.
Any ideas / comments / enlightening questions are appreciated
Thanks!
Try something along these lines (I'm only compiling it in my head):
import akka.actor.ActorRef
import akka.stream.{Materializer, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Source}

def inflightLimit[A, B, M](n: Int, source: Source[A, M])(businessFlow: Flow[A, B, _])(implicit materializer: Materializer): Source[B, M] = {
  require(n > 0) // alternatively, could just result in a Source.empty...

  val actorSource = Source.actorRef[Unit](
    completionMatcher = PartialFunction.empty,
    failureMatcher = PartialFunction.empty,
    bufferSize = 2 * n,
    overflowStrategy = OverflowStrategy.dropHead // shouldn't matter, but if the buffer fills, the effective limit will be reduced
  )
  val (flowControl, unitSource) = actorSource.preMaterialize()

  source
    .statefulMapConcat { () =>
      var firstElem: Boolean = true
      { a =>
        if (firstElem) {
          (0 until n).foreach(_ => flowControl.tell((), ActorRef.noSender)) // prime the pump on stream materialization
          firstElem = false
        }
        List(a)
      }
    }
    .zip(unitSource)
    .map(_._1)
    .via(businessFlow)
    .wireTap { _ => flowControl.tell((), ActorRef.noSender) } // wireTap is Akka Streams 2.6, but can easily be replaced by a map stage which sends () to flowControl and passes the input through
}
Basically:
actorSource will emit a Unit ((), i.e. meaningless) element for every () it receives
statefulMapConcat will cause n messages to be sent to the actorSource only when the stream first starts (thus allowing n elements from the source through)
zip will pass on a pair of the input from source and a () only when actorSource and source both have an element available
for every element which exits businessFlow, a message will be sent to the actorSource, which will allow another element from the source through
Some things to note:
this will not in any way limit buffering within source
businessFlow cannot drop elements: after n elements are dropped, the stream will no longer process elements but won't fail; if dropping elements is required, you may be able to inline businessFlow and have the stages which drop elements send a message to flowControl when they drop an element; there are other ways to address this as well
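For illustration, a hypothetical call site for the sketch above (the element types and the flow body are made up, and an implicit Materializer is assumed to be in scope):
val businessFlow: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2) // stand-in for the real processing

inflightLimit(10, Source(1 to 1000))(businessFlow) // at most 10 elements inside businessFlow at a time
  .runWith(Sink.ignore)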

Parallelize map() operation on single Observable and receive results out of order

Given an Observable<Input> and a mapping function Function<Input, Output> that is expensive but takes variable time, is there a way to call the mapping function in parallel on multiple inputs, and receive the outputs in the order they're produced?
I've tried using observeOn() with a multi-threaded Scheduler:
PublishSubject<Input> inputs = PublishSubject.create();
Function<Input, Output> mf = ...
Observer<Output> myObserver1 = ...
Observer<Output> myObserver2 = ...
// Note: same results with newFixedThreadPool(2)
Executor exec = Executors.newWorkStealingPool();
// Use ConnectableObservable to make sure mf is called only once
// no matter how many downstream observers
ConnectableObservable<Output> outputs = inputs
    .observeOn(Schedulers.from(exec))
    .map(mf)
    .publish();
outputs.subscribe(myObserver1);
outputs.subscribe(myObserver2);
outputs.connect();
inputs.onNext(slowInput); // `mf.apply()` takes a long time to complete on this input
inputs.onNext(fastInput); // `mf.apply()` takes a short time to complete on this input
but in testing, mf.apply(fastInput) is never called until after mf.apply(slowInput) completes.
If I play some tricks in my test with CountDownLatch to ensure mf.apply(slowInput) can't complete until after mf.apply(fastInput), the program deadlocks.
Is there some simple operator I should be using here, or is getting Observables out of order just against the grain of RxJava, and I should be using a different technology?
ETA: I looked at using ParallelFlowable (converting it back to a plain Flowable with .sequential() before subscribing myObserver1/2, or rather mySubscriber1/2), but then I get extra mf.apply() calls, one per input per Subscriber. There's ConnectableFlowable, but I'm not having much luck figuring out how to mix it with .parallel().
I guess the observeOn operator does not support concurrent execution on its own. So, how about using flatMap? Assume the mf function takes a long time.
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
or
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .map(mf))
    .subscribeOn(Schedulers.from(exec))
    .publish();
Edit 2019-12-30
If you want to run the tasks concurrently but still keep the order, use the concatMapEager operator instead of flatMap.
ConnectableObservable<Output> outputs = inputs
    .concatMapEager(it -> Observable.just(it) // here
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
Doesn't sound possible to me, unless Rx has some very specialised operator to do so. If you're using flatMap to do the mapping, then the elements will arrive out-of-order. Or you could use concatMap but then you'll lose the parallel mapping that you want.
Edit: As mentioned by another poster, concatMapEager should work for this. Parallel subscription and in-order results.

TopologyTestDriver with streaming groupByKey.windowedBy.reduce not working like kafka server [duplicate]

I'm trying to play with Kafka Streams to aggregate some attributes of People.
I have a Kafka Streams test like this:
val factory = new ConsumerRecordFactory[Array[Byte], Character]("input", new ByteArraySerializer(), new CharacterSerializer())
var i = 0
while (i != 5) {
  testDriver.pipeInput(
    factory.create("input",
      Character(123, 12), 15 * 10000L))
  i += 1
}
val output = testDriver.readOutput....
I'm trying to group the values by key like this:
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
  .filter((key, _) => key == null)
  .mapValues(character => PersonInfos(character.id, character.id2, character.age)) // case class
  .groupBy((_, value) => CharacterInfos(value.id, value.id2)) // case class
  .count().toStream.print(Printed.toSysOut[CharacterInfos, Long])
When I run the code, I get this:
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why am I getting 5 rows instead of just one line with CharacterInfos and the count?
Doesn't groupBy just change the key?
If you use the TopologyTestDriver, caching is effectively disabled and thus every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior, which makes it very hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- which intermediate results you will get is not defined (ie, non-deterministic); compare Michael Noll's answer.
For your unit test, it should actually not really matter, and you can either test for all output records (ie, all intermediate results), or put all output records into a key-value Map and only test for the last emitted record per key (if you don't care about the intermediate results).
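For illustration, a rough sketch of that "last emitted record per key" idea against the old TopologyTestDriver API used in the question (the output topic name and the key deserializer are assumptions):
import org.apache.kafka.common.serialization.LongDeserializer
import scala.collection.mutable

val longDeser = new LongDeserializer()
val charInfosDeser = new CharacterInfosDeserializer() // hypothetical deserializer for the case-class key

val lastPerKey = mutable.Map.empty[CharacterInfos, java.lang.Long]
var record = testDriver.readOutput("output", charInfosDeser, longDeser)
while (record != null) {
  lastPerKey(record.key()) = record.value() // overwrite: only the latest count per key survives
  record = testDriver.readOutput("output", charInfosDeser, longDeser)
}
assert(lastPerKey(CharacterInfos(123, 12)) == 5L)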
Furthermore, you could use suppress() operator to get fine grained control over what output messages you get. suppress()—in contrast to caching—is fully deterministic and thus writing a unit test works well. However, note that suppress() is event-time driven, and thus, if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing, this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams by default emits a new output record as soon as a new input record is received.
When you are aggregating (here: counting) the input data, the aggregation result is updated (and thus a new output record produced) as soon as new input is received for the aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs by configuring the size of the so-called record caches as well as the commit.interval.ms parameter. See Memory Management. However, how much reduction you will be seeing depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: could be 90% in the first hour of data, 76% in the second hour of data, etc.). That is, the reduction process is deterministic, but the resulting reduction amount is difficult to predict from the outside.
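For reference, a quick sketch of those two settings (the values are purely illustrative):
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
// a larger record cache absorbs more intermediate updates before they are forwarded downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, (10 * 1024 * 1024).toString) // 10 MB
// cached results are flushed/forwarded at the latest on every commit
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "30000") // 30 s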
Note: When doing windowed aggregations (like windowed counts) you can also use the Suppress() API so that the number of intermediate updates is not only reduced, but there will only ever be a single output per window. However, in your use case/code the aggregation is not windowed, so you cannot use the Suppress API.
To help you understand why the setup is this way: You must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record was received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- it's the correct result to the best of its knowledge at this point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where Suppress is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the Suppress() API allows you to make a trade-off decision between better latency but with multiple outputs per window (default behavior, Suppress disabled) and longer latency but you'll get only a single output per window (Suppress enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.
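For completeness, a rough sketch of the windowed count + suppress pattern described above in the Scala DSL (topic and variable names are illustrative, the exact imports and serde setup vary by Kafka Streams version, and this does not apply to the non-windowed code in the question):
import java.time.Duration
import org.apache.kafka.streams.kstream.{Printed, Suppressed, TimeWindows, Windowed}
import org.apache.kafka.streams.kstream.Suppressed.BufferConfig
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

val builder = new StreamsBuilder()
builder.stream[String, String]("events")
  .groupByKey
  .windowedBy(TimeWindows.of(Duration.ofHours(1)))
  .count()
  .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded())) // one final result per 1h window
  .toStream
  .print(Printed.toSysOut[Windowed[String], Long])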

Akka HTTP Stream listener stops processing databytes after a while

I have an app which has 3 HTTP listeners like this one:
val futureResponse1: Future[HttpResponse] =
Http().singleRequest(HttpRequest(uri = someUrl))
Each of the 3 is listening to a non-stop stream (each to a different one), and handles it with a simple flow that starts with grouping and then does some relatively fast, non-blocking processing:
futureResponse1.flatMap { response =>
  response.status match {
    case StatusCodes.OK =>
      val source: Source[ByteString, Any] = response.entity.dataBytes
      source.
        grouped(100).
        map(doSomethingFast).
        runWith(Sink.ignore)
    case notOK => system.log.info("failed opening, status: " + notOK.toString())
  }
...
I get no exceptions or warnings, but after a while (could be 15-25 minutes) the listeners just suddenly stop, one after the other (not together).
Maybe it's the grouped stage that is the problem? Or maybe the connection/stream just stops? Or the dispatcher they share is getting starved / something is not getting released.
Any ideas why that may be happening please?
==== update ====
@Ramon J Romero y Vigil
I changed my run to have only 1 stream instead of 3, and I removed the grouped stage. It still happens after a few minutes. I suspect that the stream is closing based on a timeout. All I do is get chunks and consume them.
==== update ====
found the reason, see below.
That was the reason:
EntityStreamSizeException: actual entity size (None) exceeded content length limit (8388608 bytes)! You can configure this by setting akka.http.[server|client].parsing.max-content-length or calling HttpEntity.withSizeLimit before materializing the dataBytes stream.
For anyone seeking the solution in the case of continuous response stream, you can get the source this way, using withoutSizeLimit:
val source: Source[ByteString, Any] = response.entity.withoutSizeLimit().dataBytes
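For context, this is roughly how the handler from the question looks with that change applied (same names as in the question; as an extra tweak, the non-OK branch drains the entity so that both branches return a Future):
futureResponse1.flatMap { response =>
  response.status match {
    case StatusCodes.OK =>
      response.entity.withoutSizeLimit().dataBytes // lift the default 8 MB entity size limit
        .grouped(100)
        .map(doSomethingFast)
        .runWith(Sink.ignore)
    case notOK =>
      system.log.info("failed opening, status: " + notOK.toString())
      response.discardEntityBytes().future // drain and ignore the entity
  }
}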

How to limit an Akka Stream to execute and send down one message only once per second?

I have an Akka Stream and I want the stream to send messages down stream approximately every second.
I tried two ways to solve this problem. The first way was to make the producer at the start of the stream only send messages once every second, when a Continue message comes into this actor.
// When receiving a Continue message in an ActorPublisher
// do work then...
if (totalDemand > 0) {
  import scala.concurrent.duration._
  context.system.scheduler.scheduleOnce(1.second, self, Continue)
}
This works for a short while, but then a flood of Continue messages appears in the ActorPublisher actor. I assume (a guess, not sure) they come from downstream via back-pressure requesting messages, as the downstream can consume fast but the upstream is not producing at a fast rate. So this method failed.
The other way I tried was via backpressure control: I used a MaxInFlightRequestStrategy on the ActorSubscriber at the end of the stream to limit the number of messages to 1 per second. This works, but messages come in at approximately three or so at a time, not just one at a time. It seems the backpressure control doesn't immediately change the rate of incoming messages, OR messages were already queued in the stream and waiting to be processed.
So the problem is: how can I have an Akka Stream which processes only one message per second?
I discovered that MaxInFlightRequestStrategy is a valid way to do it, but I should set the batch size to 1; its batch size defaults to 5, which was causing the problem I saw. It's also an over-complicated way to solve the problem, now that I am looking at the submitted answer here.
You can either put your elements through a throttling flow, which will backpressure a fast source, or you can use a combination of tick and zip.
The first solution would be like this:
val veryFastSource =
  Source.fromIterator(() => Iterator.continually(Random.nextLong() % 10000))

val throttlingFlow = Flow[Long].throttle(
  // how many elements do you allow
  elements = 1,
  // in what unit of time
  per = 1.second,
  maximumBurst = 0,
  // you can also set this to Enforcing, but then your
  // stream will collapse if exceeding the number of elements / s
  mode = ThrottleMode.Shaping
)

veryFastSource.via(throttlingFlow).runWith(Sink.foreach(println))
The second solution would be like this:
val veryFastSource =
  Source.fromIterator(() => Iterator.continually(Random.nextLong() % 10000))

val tickingSource = Source.tick(1.second, 1.second, 0)

veryFastSource.zip(tickingSource).map(_._1).runWith(Sink.foreach(println))