I have a test case for a Flux from Project Reactor roughly like this:
testMultipleChunks(StepVerifier.FirstStep<Chunk> verifier, Chunk chunk1, Chunk chunk2) {
verifier.then(() -> {
worker.add(chunk1);
worker.add(chunk2);
worker.update();
})
.expectNext(chunk1, chunk2)
.verifyTimeout(Duration.ofSeconds(5));
}
Thing is, my worker is encouraged to parallelize the work, which means the order of the output is undefined. chunk2, chunk1 would be equally valid.
How can I make assertions on the output in an order-independent way?
Properties I care about:
every element in the expected set is present
there are no unexpected elements
there are no extra (duplicate) events
I tried this:
testMultipleChunks(StepVerifier.FirstStep<Chunk> verifier, Chunk chunk1, Chunk chunk2) {
Set<Chunk> expectedOutput = Set.of(chunk1, chunk2);
verifier.then(() -> {
worker.add(chunk1);
worker.add(chunk2);
worker.update();
})
.recordWith(HashSet::new)
.expectNextCount(expectedOutput.size())
.expectRecordedMatches(expectedOutput::equals)
.verifyTimeout(Duration.ofSeconds(5));
}
While I think that makes the assertions I want, it took a terrible dive in readability. A clear one-line, one-method assertion was replaced with four lines with a lot of extra punctuation.
expectedRecordedMatches is also horribly uninformative when it fails, saying only “expected collection predicate match” without giving any information about what the expectation is or how close the result was.
What's a clearer way to write this test?
StepVerifier is not a good fit for that because it verifies each signal as it gets emitted, and materialize an expected order for asynchronous signals, by design.
It is especially tricky because (it seems) your publisher under test doesn't clearly complete.
If it was completing after N elements (N being the expected amount here), I'd change the publisher passed to StepVerifier.create from flux to flux.collectList(). That way, you get a List view of the onNext and you can assert the list as you see fit (eg. using AssertJ, which I recommend).
One alternative in recent versions of Reactor is the TestSubscriber, which can be used to drive request and cancel() without any particular opinion on blocking or on when to perform the assertions. Instead, it internally stores the events it sees (onNext go into a List, onComplete and onError are stored as a terminal Signal...) and you can access these for arbitrary assertions.
Related
When writing concurrent code, it's fairly common to want to spin off a separate (green or OS) thread and then ask the code in that thread to react to various thread-safe messages. Raku supports this pattern in a number of ways.
For example, many of the Channel examples in the docs show code that's similar to the code below (which prints one through ten across two threads).
my $channel = Channel.new;
start { react whenever $channel { say $_ }}
for ^10 { $channel.send($_) }
sleep 1
However, if we switch from the single-consumer world of Channels to the multi-consumer world of live Supplys, the equivalent code no longer works.
my Supplier $supplier .= new;
start { react whenever $supplier { say $_ }}
for ^10 { $supplier.emit($_) }
sleep 1;
This code prints nothing. As I understand it, this is because the react block was not listening when the values were emited – it doesn't take long to start a thread and react to events, but it takes even less time to emit ten values. And, logically enough, moving the sleep 1 line above the for loop causes the values to print again.
And that's all fair enough – after all, the reason to use a live Supply rather than an on-demand one is because you want the live semantics. That is, you want to only react to future events, not to past ones.
But my question is whether there's a way to ask a react block in a thread I've started whether it's ready and/or to wait for it to be ready before sending data. (awaiting the start block waits until the thread is done, rather than until it's ready, so that doesn't help here).
I'm also open to answers saying that I'm approaching this incorrectly/there's an X-Y problem – it's entirely possible that I'm straining against the direction the language is trying to push me or that live Supplys aren't the correct concurrency abstraction here.
For this specific case (which is a relatively common one), the answer would be to use a Supplier::Preserving:
my Supplier::Preserving $supplier .= new;
start { react whenever $supplier { say $_ }}
for ^10 { $supplier.emit($_) }
sleep 1;
Which retains sent values until $supplier is first tapped, and then emits them.
An alternative, more general, solution is to use a Promise:
my Supplier $supplier .= new;
# A Promise used just for synchronization
my Promise $ready .= new;
start react {
# Set up the subscriptions...
whenever $supplier { say $_ }
# ...and then signal that they are ready.
$ready.keep;
}
# Wait for the subscriptions to be set up...
await $ready;
# ...and off we go.
for ^10 { $supplier.emit($_) }
sleep 1;
The whenevers in a react block set up subscriptions as they are encountered, so by the time the Promise is kept, all of the subscriptions will have been made. (Further, although not important here, no messages are processed until the body of the react block has finished setting everything up.)
Finally I'll note that while Supplier is often reached for, many times one would be better off writing a supply block that emits the values. The example in the question is (quite reasonably enough) abstracted from a concrete application, but it's almost always worth asking, "can I do what I want by writing a supply block" before reaching for a Supplier or Supplier::Preserving. If you really do need to broadcast values or need to distribute asynchronous inputs to multiple places, there's a solid case for Supplier; if it's just a single stream of values to be produced once tapped, there probably isn't.
I'm having trouble getting my head around the purpose of supply {…} blocks/the on-demand supplies that they create.
Live supplies (that is, the types that come from a Supplier and get new values whenever that Supplier emits a value) make sense to me – they're a version of asynchronous streams that I can use to broadcast a message from one or more senders to one or more receivers. It's easy to see use cases for responding to a live stream of messages: I might want to take an action every time I get a UI event from a GUI interface, or every time a chat application broadcasts that it has received a new message.
But on-demand supplies don't make a similar amount of sense. The docs say that
An on-demand broadcast is like Netflix: everyone who starts streaming a movie (taps a supply), always starts it from the beginning (gets all the values), regardless of how many people are watching it right now.
Ok, fair enough. But why/when would I want those semantics?
The examples also leave me scratching my head a bit. The Concurancy page currently provides three examples of a supply block, but two of them just emit the values from a for loop. The third is a bit more detailed:
my $bread-supplier = Supplier.new;
my $vegetable-supplier = Supplier.new;
my $supply = supply {
whenever $bread-supplier.Supply {
emit("We've got bread: " ~ $_);
};
whenever $vegetable-supplier.Supply {
emit("We've got a vegetable: " ~ $_);
};
}
$supply.tap( -> $v { say "$v" });
$vegetable-supplier.emit("Radish"); # OUTPUT: «We've got a vegetable: Radish»
$bread-supplier.emit("Thick sliced"); # OUTPUT: «We've got bread: Thick sliced»
$vegetable-supplier.emit("Lettuce"); # OUTPUT: «We've got a vegetable: Lettuce»
There, the supply block is doing something. Specifically, it's reacting to the input of two different (live) Suppliers and then merging them into a single Supply. That does seem fairly useful.
… except that if I want to transform the output of two Suppliers and merge their output into a single combined stream, I can just use
my $supply = Supply.merge:
$bread-supplier.Supply.map( { "We've got bread: $_" }),
$vegetable-supplier.Supply.map({ "We've got a vegetable: $_" });
And, indeed, if I replace the supply block in that example with the map/merge above, I get exactly the same output. Further, neither the supply block version nor the map/merge version produce any output if the tap is moved below the calls to .emit, which shows that the "on-demand" aspect of supply blocks doesn't really come into play here.
At a more general level, I don't believe the Raku (or Cro) docs provide any examples of a supply block that isn't either in some way transforming the output of a live Supply or emitting values based on a for loop or Supply.interval. None of those seem like especially compelling use cases, other than as a different way to transform Supplys.
Given all of the above, I'm tempted to mostly write off the supply block as a construct that isn't all that useful, other than as a possible alternate syntax for certain Supply combinators. However, I have it on fairly good authority that
while Supplier is often reached for, many times one would be better off writing a supply block that emits the values.
Given that, I'm willing to hazard a pretty confident guess that I'm missing something about supply blocks. I'd appreciate any insight into what that might be.
Given you mentioned Supply.merge, let's start with that. Imagine it wasn't in the Raku standard library, and we had to implement it. What would we have to take care of in order to reach a correct implementation? At least:
Produce a Supply result that, when tapped, will...
Tap (that is, subscribe to) all of the input supplies.
When one of the input supplies emits a value, emit it to our tapper...
...but make sure we follow the serial supply rule, which is that we only emit one message at a time; it's possible that two of our input supplies will emit values at the same time from different threads, so this isn't an automatic property.
When all of our supplies have sent their done event, send the done event also.
If any of the input supplies we tapped sends a quit event, relay it, and also close the taps of all of the other input supplies.
Make very sure we don't have any odd races that will lead to breaking the supply grammar emit* [done|quit].
When a tap on the resulting Supply we produce is closed, be sure to close the tap on all (still active) input supplies we tapped.
Good luck!
So how does the standard library do it? Like this:
method merge(*#s) {
#s.unshift(self) if self.DEFINITE; # add if instance method
# [I elided optimizations for when there are 0 or 1 things to merge]
supply {
for #s {
whenever $_ -> \value { emit(value) }
}
}
}
The point of supply blocks is to greatly ease correctly implementing reusable operations over one or more Supplys. The key risks it aims to remove are:
Not correctly handling concurrently arriving messages in the case that we have tapped more than one Supply, potentially leading us to corrupt state (since many supply combinators we might wish to write will have state too; merge is so simple as not to). A supply block promises us that we'll only be processing one message at a time, removing that danger.
Losing track of subscriptions, and thus leaking resources, which will become a problem in any longer-running program.
The second is easy to overlook, especially when working in a garbage-collected language like Raku. Indeed, if I start iterating some Seq and then stop doing so before reaching the end of it, the iterator becomes unreachable and the GC eats it in a while. If I'm iterating over lines of a file and there's an implicit file handle there, I risk the file not being closed in a very timely way and might run out of handles if I'm unlucky, but at least there's some path to it getting closed and the resources released.
Not so with reactive programming: the references point from producer to consumer, so if a consumer "stops caring" but hasn't closed the tap, then the producer will retain its reference to the consumer (thus causing a memory leak) and keep sending it messages (thus doing throwaway work). This can eventually bring down an application. The Cro chat example that was linked is an example:
my $chat = Supplier.new;
get -> 'chat' {
web-socket -> $incoming {
supply {
whenever $incoming -> $message {
$chat.emit(await $message.body-text);
}
whenever $chat -> $text {
emit $text;
}
}
}
}
What happens when a WebSocket client disconnects? The tap on the Supply we returned using the supply block is closed, causing an implicit close of the taps of the incoming WebSocket messages and also of $chat. Without this, the subscriber list of the $chat Supplier would grow without bound, and in turn keep alive an object graph of some size for each previous connection too.
Thus, even in this case where a live Supply is very directly involved, we'll often have subscriptions to it that come and go over time. On-demand supplies are primarily about resource acquisition and release; sometimes, that resource will be a subscription to a live Supply.
A fair question is if we could have written this example without a supply block. And yes, we can; this probably works:
my $chat = Supplier.new;
get -> 'chat' {
web-socket -> $incoming {
my $emit-and-discard = $incoming.map(-> $message {
$chat.emit(await $message.body-text);
Supply.from-list()
}).flat;
Supply.merge($chat, $emit-and-discard)
}
}
Noting it's some effort in Supply-space to map into nothing. I personally find that less readable - and this didn't even avoid a supply block, it's just hidden inside the implementation of merge. Trickier still are cases where the number of supplies that are tapped changes over time, such as in recursive file watching where new directories to watch may appear. I don't really know how'd I'd express that in terms of combinators that appear in the standard library.
I spent some time teaching reactive programming (not with Raku, but with .Net). Things were easy with one asynchronous stream, but got more difficult when we started getting to cases with multiple of them. Some things fit naturally into combinators like "merge" or "zip" or "combine latest". Others can be bashed into those kinds of shapes with enough creativity - but it often felt contorted to me rather than expressive. And what happens when the problem can't be expressed in the combinators? In Raku terms, one creates output Suppliers, taps input supplies, writes logic that emits things from the inputs into the outputs, and so forth. Subscription management, error propagation, completion propagation, and concurrency control have to be taken care of each time - and it's oh so easy to mess it up.
Of course, the existence of supply blocks doesn't stop being taking the fragile path in Raku too. This is what I meant when I said:
while Supplier is often reached for, many times one would be better off writing a supply block that emits the values
I wasn't thinking here about the publish/subscribe case, where we really do want to broadcast values and are at the entrypoint to a reactive chain. I was thinking about the cases where we tap one or more Supply, take the values, do something, and then emit things into another Supplier. Here is an example where I migrated such code towards a supply block; here is another example that came a little later on in the same codebase. Hopefully these examples clear up what I had in mind.
Please, what is the difference between those two approaches defining a Sink[RandomCdr,Future[Done]
Flow[RandomCdr]
.grouped(bulkSize)
.flatMapConcat{ (bulk : Seq[RandomCdr]) =>
Source.fromFuture(collection.flatMap(_.insert[RandomCdr](false)(randomCdrWriter,ec).many(bulk)(ec))(ec))
}
.toMat(Sink.ignore)(Keep.right)
Flow[RandomCdr]
.grouped(bulkSize)
.map((bulk : Seq[RandomCdr]) => collection.flatMap(_.insert[RandomCdr](false)(randomCdrWriter,ec).many(bulk)(ec))(ec))
.toMat(Sink.ignore)(Keep.right)
The function collection.flatMap(_.insert[RandomCdr](false)(randomCdrWriter,ec).many(bulk)(ec))(ec) that returns a Future[T] is the reactivemongo driver
First snippet
Here each incoming bulk will be transformed into a Future, and said Future will be run within the execution context you provide. Only at this point, the next bulk will be processed by generating another Future, and so on.
Basically the futures are run in sequence. This is similar in behaviour to
Flow[RandomCdr]
.grouped(bulkSize)
.mapAsync(parallelism = 1){ (bulk : Seq[RandomCdr]) =>
collection.flatMap(_.insert[RandomCdr](false)(randomCdrWriter,ec).many(bulk)(ec))(ec)
}
.toMat(Sink.ignore)(Keep.right)
Second snippet
Here each incoming bulk will be transformed into a Future, which will be run within the execution context you provide. The Future will be then immediately passed to the Sink.ignore and its reference will be thrown away.
With this approach there is no control around how many Futures will be run at the same time. For this reason this approach is not recommended.
If you're looking for improved parallelism, consider using mapAsync as shown above, and tweak the parallelism parameter.
I made a class that has an asynchronous OpenWebPage() function. Once you call OpenWebPage(someUrl), a handler gets called - OnPageLoad(reply). I have been using a global variable called lastAction to take care of stuff once a page is loaded - handler checks what is the lastAction and calls an appropriate function. For example:
this->lastAction == "homepage";
this->OpenWebPage("http://www.hardwarebase.net");
void OnPageLoad(reply)
{
if(this->lastAction == "homepage")
{
this->lastAction = "login";
this->Login(); // POSTs a form and OnPageLoad gets called again
}
else if(this->lastAction == "login")
{
this->PostLogin(); // Checks did we log in properly, sets lastAction as new topic and goes to new topic URL
}
else if(this->lastAction == "new topic")
{
this->WriteTopic(); // Does some more stuff ... you get the point
}
}
Now, this is rather hard to write and keep track of when we have a large number of "actions". When I was doing stuff in Python (synchronously) it was much easier, like:
OpenWebPage("http://hardwarebase.net") // Stores the loaded page HTML in self.page
OpenWebpage("http://hardwarebase.net/login", {"user": username, "pw": password}) // POSTs a form
if(self.page == ...): // now do some more checks etc.
// do something more
Imagine now that I have a queue class which holds the actions: homepage, login, new topic. How am I supposed to execute all those actions (in proper order, one after one!) via the asynchronous callback? The first example is totally hard-coded obviously.
I hope you understand my question, because frankly I fear this is the worst question ever written :x
P.S. All this is done in Qt.
You are inviting all manner of bugs if you try and use a single member variable to maintain state for an arbitrary number of asynchronous operations, which is what you describe above. There is no way for you to determine the order that the OpenWebPage calls complete, so there's also no way to associate the value of lastAction at any given time with any specific operation.
There are a number of ways to solve this, e.g.:
Encapsulate web page loading in an immutable class that processes one page per instance
Return an object from OpenWebPage which tracks progress and stores the operation's state
Fire a signal when an operation completes and attach the operation's context to the signal
You need to add "return" statement in the end of every "if" branch: in your code, all "if" branches are executed in the first OnPageLoad call.
Generally, asynchronous state mamangment is always more complicated that synchronous. Consider replacing lastAction type with enumeration. Also, if OnPageLoad thread context is arbitrary, you need to synchronize access to global variables.
I am using the parallel data structures in my .NET 4 application and I have a ConcurrentQueue that gets added to while I am processing through it.
I want to do something like:
personqueue.AsParallel().WithDegreeOfParallelism(20).ForAll(i => ... );
as I make database calls to save the data, so I am limiting the number of concurrent threads.
But, I expect that the ForAll isn't going to dequeue, and I am concerned about just doing
ForAll(i => {
personqueue.personqueue.TryDequeue(...);
...
});
as there is no guarantee that I am popping off the correct one.
So, how can I iterate through the collection and dequeue, in a parallel fashion.
Or, would it be better to use PLINQ to do this processing, in parallel?
Well I'm not 100% sure what you try to archive here. Are you trying to just dequeue all items until nothing is left? Or just dequeue lots of items in one go?
The first probably unexpected behavior starts with this statement:
theQueue.AsParallel()
For a ConcurrentQueue, you get a 'Snapshot'-Enumerator. So when you iterate over a concurrent stack, you only iterate over the snapshot, no the 'live' queue.
In general I think it's not a good idea to iterate over something you're changing during the iteration.
So another solution would look like this:
// this way it's more clear, that we only deque for theQueue.Count items
// However after this, the queue is probably not empty
// or maybe the queue is also empty earlier
Parallel.For(0, theQueue.Count,
new ParallelOptions() {MaxDegreeOfParallelism = 20},
() => {
theQueue.TryDequeue(); //and stuff
});
This avoids manipulation something while iterating over it. However, after that statement, the queue can still contain data, which was added during the for-loop.
To get the queue empty for moment in time you probably need a little more work. Here's an really ugly solution. While the queue has still items, create new tasks. Each task start do dequeue from the queue as long as it can. At the end, we wait for all tasks to end. To limit the parallelism, we never create more than 20-tasks.
// Probably a kitty died because of this ugly code ;)
// However, this code tries to get the queue empty in a very aggressive way
Action consumeFromQueue = () =>
{
while (tt.TryDequeue())
{
; // do your stuff
}
};
var allRunningTasks = new Task[MaxParallism];
for(int i=0;i<MaxParallism && tt.Count>0;i++)
{
allRunningTasks[i] = Task.Factory.StartNew(consumeFromQueue);
}
Task.WaitAll(allRunningTasks);
If you are aiming at a high throughout real site and you don't have to do immediate DB updates , you'll be much better of going for very conservative solution rather than extra layers libraries.
Make fixed size array (guestimate size - say 1000 items or N seconds worth of requests) and interlocked index so that requests just put data into slots and return. When one block gets filled (keep checking the count), make another one and spawn async delegate to process and send to SQL the block that just got filled. Depending on the structure of your data that delegate can pack all data into comma-separated arrays, maybe even a simple XML (got to test perf of that one of course) and send them to SQL sproc which should give it's best to process them record by record - never holding a big lock. It if gets heavy, you can split your block into several smaller blocks. The key thing is that you minimized the number of requests to SQL, always kept one degree of separation and didn't even have to pay the price for a thread pool - you probably won't need to use more that 2 async threads at all.
That's going to be a lot faster that fiddling with Parallel-s.