Why can `futures::channel::mpsc` notify just one sender? - concurrency

I'm reading the futures-preview 0.3 sources to find out how to do "notify any" correctly. In mpsc::channel (which is bounded), multiple senders may wait for a receipt (when the buffer is full).
Looking into the implementation of next_message and unpark_one, the receiver seems to notify only one sender per receipt.
I doubt this works in the presence of select!, because select! may lead to a false notification. However, I couldn't produce a problem case.
Here's my attempt to confuse mpsc:
[package]
name = "futures-mpsc-test"
version = "0.1.0"
edition = "2018"
[dependencies]
futures-preview = { version = "0.3.0-alpha.9", features = ["tokio-compat"] }
tokio = "0.1.11"
and this:
#![feature(async_await, await_macro, futures_api, pin)]
use std::collections::HashSet;
use futures::prelude::*;
use futures::channel::mpsc::{channel, Sender};
use futures::channel::oneshot;
use futures::select;
async fn main2() {
    let channel_len = 1;
    let num_false_wait = 1000;
    let num_expected_messages = 100;
    let (mut send, mut recv) = channel(channel_len);
    // One extra capacity per sender. Fill the extras.
    await!(send.send(-2)).unwrap();
    // Fill buffers
    for _ in 0..channel_len {
        await!(send.send(-1)).unwrap();
    }
    // False waits. Should resolve and produce false waiters.
    for _ in 0..num_false_wait {
        await!(false_wait(&send));
    }
    // True messages.
    {
        let mut send = send.clone();
        await!(send.send(-2)).unwrap();
        tokio::spawn(async move {
            for i in 0..num_expected_messages {
                await!(send.send(i)).unwrap();
            }
            Ok(())
        }.boxed().compat());
    }
    // Drain receiver until all true messages are received.
    let mut expects = (0..num_expected_messages).collect::<HashSet<_>>();
    while !expects.is_empty() {
        let i = await!(recv.next()).unwrap();
        expects.remove(&i);
        eprintln!("Received: {}", i);
    }
}

// If `send` is full, it will produce false waits.
async fn false_wait(send: &Sender<i32>) {
    let (wait_send, wait_recv) = oneshot::channel();
    let mut send = send.clone();
    await!(send.send(-2)).unwrap();
    tokio::spawn(async move {
        let mut sending = send.send(-3);
        let mut fallback = future::ready(());
        select! {
            sending => {
                sending.unwrap();
            },
            fallback => {
                eprintln!("future::ready is selected");
            },
        };
        wait_send.send(()).unwrap();
        Ok(())
    }.boxed().compat());
    await!(wait_recv).unwrap();
}

fn main() {
    tokio::run(async {
        await!(main2());
        Ok(())
    }.boxed().compat());
}
I expect this to happen:
- The buffer is filled by -1. Therefore later senders are blocked.
- There are both "true waiters" and "false waiters". False waiters already exited, because the other arm of select! immediately completes.
- In each call to await!(recv.next()), at most one waiting sender is notified. If a false waiter is notified, no one can push to the buffer, even if the buffer has a vacant room.
- If all elements are drained without true notification, the entire system is stuck.
Despite my expectation, the main2 async function successfully completed. Why?

Further investigation of the futures source code solved my problem. In the end, I cannot confuse the mpsc channel in this way.
The point is that the capacity of an mpsc channel is flexible and can grow beyond what was initially specified. This behavior is mentioned in the docs:
The channel's capacity is equal to buffer + num-senders. In other words, each sender gets a guaranteed slot in the channel capacity, and on top of that there are buffer "first come, first serve" slots available to all senders.
I had actually read this before experimenting, but I couldn't figure out its importance at that time.
Problem with fixed buffer
Think of a typical bounded queue implementation, where the size of the queue cannot grow beyond the initially specified bound. The spec is this:
- When the queue is empty, receivers block.
- When the queue is full (that is, the size is hitting the bound), senders block.
In this situation, if the queue is full, multiple senders are waiting for a single resource (a free slot in the queue).
In multithreaded programming, this is accomplished by primitives like notify_one. However, in futures, this is fallible: unlike in multithreaded programming, the notified task doesn't necessarily make use of the resource, because the task may already have given up acquiring it (due to constructs like select! or Deadline). Then the spec is simply broken (the queue isn't full, but all alive senders are blocked).
mpsc is flexible
As pointed out above, the buffer size for futures::channel::mpsc::channel isn't strict. The spec is summarized as:
- When message_queue.len() == 0, receivers block.
- When message_queue.len() >= buffer, senders may block.
- When message_queue.len() >= buffer + num_senders, senders block.
Here, num_senders is basically the number of clones of Sender, but more than that in some cases. More precisely, num_senders is the number of SenderTasks.
So, how do we avoid sharing the resource? We have additional state for that:
- Each sender (an instance of SenderTask) has an is_parked boolean state.
- The channel has another queue called parked_queue, a queue of Arc references to SenderTasks.
The channel maintains the following invariants:
- message_queue.len() <= buffer + num_parked_senders. Note that we don't know the value of num_parked_senders.
- parked_queue.len() == max(0, message_queue.len() - buffer)
- Each parked sender has at least one entry in parked_queue.
This is accomplished by the following algorithm:
For receiving:
- The receiver pops off a SenderTask from parked_queue and, if that sender is parked, unparks it.
For sending:
- The sender always waits for its is_parked flag to be false. If message_queue.len() < buffer, then parked_queue.len() == 0, so all senders are unparked; therefore we can guarantee progress in this case.
- Once is_parked is false, it pushes the message onto the queue unconditionally.
- After that, if message_queue.len() <= buffer, nothing further needs to be done.
- If message_queue.len() > buffer, the sender marks itself parked (is_parked = true) and pushes itself onto parked_queue.
You can easily check that these invariants are maintained by the algorithm above.
Surprisingly, the senders no longer wait for a shared resource. Instead, each sender waits on its own is_parked state. Even if a sending task is dropped before completion, its entry just remains in parked_queue for a while and doesn't block anything. How clever!
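To make the bookkeeping concrete, here is a schematic, single-threaded model of the scheme just described. It is not the crate's actual code (which is lock-free and task-based); the names message_queue, parked_queue and is_parked simply mirror the prose above, expressed in C++ for illustration:
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>

struct SenderTask { bool is_parked = false; };

struct Channel {
    size_t buffer;
    std::deque<int> message_queue;
    std::deque<std::shared_ptr<SenderTask>> parked_queue;

    explicit Channel(size_t buffer) : buffer(buffer) {}

    // Sending: a sender first waits until its own is_parked flag is false,
    // then pushes unconditionally; if the push makes the queue exceed
    // `buffer`, the sender parks itself and joins parked_queue.
    bool try_send(const std::shared_ptr<SenderTask>& sender, int msg) {
        if (sender->is_parked) return false;          // would be Poll::Pending
        message_queue.push_back(msg);
        if (message_queue.size() > buffer) {
            sender->is_parked = true;
            parked_queue.push_back(sender);
        }
        return true;
    }

    // Receiving: pop one message, then pop one entry off parked_queue and
    // unpark it; this is the "notify one sender per receipt" step.
    bool try_recv(int* out) {
        if (message_queue.empty()) return false;
        *out = message_queue.front();
        message_queue.pop_front();
        if (!parked_queue.empty()) {
            parked_queue.front()->is_parked = false;
            parked_queue.pop_front();
        }
        return true;
    }
};

int main() {
    Channel ch(1);
    auto a = std::make_shared<SenderTask>();
    auto b = std::make_shared<SenderTask>();
    assert(ch.try_send(a, 1));           // fits within the buffer
    assert(ch.try_send(b, 2));           // exceeds the buffer: b parks itself
    assert(b->is_parked);
    int v = 0;
    assert(ch.try_recv(&v) && v == 1);   // one receipt unparks one sender (b)
    assert(!b->is_parked);
    assert(ch.try_send(b, 3));           // b can make progress again
}
A sender that gives up (for example because a select! arm dropped its send future) simply sits in parked_queue until a later receipt pops it off; it never holds a slot that another sender is waiting for.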

I doubt this works in the presence of select!, because select! may lead to a false notification.
No, you can't "confuse" an mpsc channel using select!:
- select! does not trigger any mpsc-related notification; it just returns the result of the future that finishes first.
- When the message queue is full, it is await!(recv.next()) that notifies one producer that a slot in the bounded channel is now available.
In other words: there are no "true waiters" and "false waiters": when the channel's message queue is full, the producers block and wait for the receiver side to consume the enqueued messages.

Related

Explicit throughput limiting on part of an Akka stream

I have a flow in our system which reads some elements from SQS (using alpakka) and does some preprocessing (~10 stages, normally < 1 minute in total). Then, the prepared element is sent to the main processing (single stage, taking a few minutes). The whole thing runs on AWS/K8S and we’d like to scale out when the SQS queue grows above a certain threshold. The issue is, the SQS queue takes a long time to blow up, since there are a lot of elements “idling” in-process, having done their preprocessing but waiting for the main thing.
We can’t externalize the preprocessing stuff to a separate queue since their outcome can’t survive a de/serialization roundtrip. Also, this service and the “main” processor are deeply coupled (this service runs as main’s sidecar) and can’t be scaled independently.
The preprocessing stages are technically .mapAsyncUnordered, but the whole thing is already very slim (stream stages and SQS batches/buffers).
We tried lowering the interstage buffer (akka.stream.materializer.max-input-buffer-size), but that only gives some indirect benefit, no direct control (and is too internal to be mucking with, for my taste anyway).
I tried implementing a “gate” wrapper which would limit the number of elements allowed inside some arbitrary Flow, looking something like:
class LimitingGate[T, U](originalFlow: Flow[T, U], maxInFlight: Int) {
  private def in: InputGate[T] = ???
  private def out: OutputGate[U] = ???

  def gatedFlow: Flow[T, U, NotUsed] = Flow[T].via(in).via(originalFlow).via(out)
}
And using callbacks between the in/out gates for throttling.
The implementation partially works (stream termination is giving me a hard time), but it feels like the wrong way to go about achieving the actual goal.
Any ideas / comments / enlightening questions are appreciated
Thanks!
Try something along these lines (I'm only compiling it in my head):
def inflightLimit[T, B, M](n: Int, source: Source[T, M])(businessFlow: Flow[T, B, _])(implicit materializer: Materializer): Source[B, M] = {
  require(n > 0) // alternatively, could just result in a Source.empty...

  val actorSource = Source.actorRef[Unit](
    completionMatcher = PartialFunction.empty,
    failureMatcher = PartialFunction.empty,
    bufferSize = 2 * n,
    overflowStrategy = OverflowStrategy.dropHead // shouldn't matter, but if the buffer fills, the effective limit will be reduced
  )

  val (flowControl, unitSource) = actorSource.preMaterialize()

  source.statefulMapConcat { () =>
    var firstElem: Boolean = true

    { a =>
      if (firstElem) {
        (0 until n).foreach(_ => flowControl.tell(())) // prime the pump on stream materialization
        firstElem = false
      }
      List(a)
    }
  }
    .zip(unitSource)
    .map(_._1)
    .via(businessFlow)
    .wireTap { _ => flowControl.tell(()) } // wireTap is Akka Streams 2.6, but can be easily replaced by a map stage which sends () to flowControl and passes through the input
}
Basically:
- actorSource will emit a Unit ((), i.e. meaningless) element for every () it receives
- statefulMapConcat will cause n messages to be sent to the actorSource only when the stream first starts (thus allowing n elements from the source through)
- zip will pass on a pair of the input from source and a () only when actorSource and source both have an element available
- for every element which exits businessFlow, a message will be sent to the actorSource, which will allow another element from the source through
Some things to note:
- this will not in any way limit buffering within source
- businessFlow cannot drop elements: after n elements are dropped the stream will no longer process elements but won't fail; if dropping elements is required, you may be able to inline businessFlow and have the stages which drop elements send a message to flowControl when they drop an element; there are other ways to address this as well

epoll_wait returns EPOLLOUT even with the EPOLLET flag

I am using Linux epoll in edge-triggered mode.
Each time a new connection comes in, I add its file descriptor to epoll with EPOLLIN|EPOLLOUT|EPOLLET. My first question is: what's the right way to check which kind of event(s) occurred for each ready file descriptor after epoll_wait returns? I mean, I see some example code, e.g. from https://github.com/yedf/handy/blob/master/raw-examples/epoll-et.cc line 124, doing it like this:
for (int i = 0; i < n; i++) {
    //...
    if (events & (EPOLLIN | EPOLLERR)) {
        if (fd == lfd) {
            handleAccept(efd, fd);
        } else {
            handleRead(efd, fd);
        }
    } else if (events & EPOLLOUT) {
        if (output_log)
            printf("handling epollout\n");
        handleWrite(efd, fd);
    } else {
        exit_if(1, "unknown event");
    }
}
What caught my attention is that it uses if / else if / else to check which event occurred, which means that if it calls handleRead, it can't also call handleWrite in the same iteration. I think this may lose events in the following situation: both the read and write operations on a socket have hit EAGAIN, and then the remote end both reads and sends some data, so epoll_wait may set both EPOLLIN and EPOLLOUT; but the code can only call handleRead, and the data remaining in the output buffer can't be sent since handleWrite is never called.
So is the above usage wrong?
According to the man 7 epoll Q&A:
If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately?
They will be combined.
If I got it right, several events can occur on a single file descriptor between epoll_wait calls. So I think I should use multiple ifs to check one by one whether readable/writable/error events occurred, instead of if / else if. I went to see how the nginx epoll module does it; at https://github.com/nginx/nginx/blob/953f53921505a884f3912f2d8db5217a71c0479a/src/event/modules/ngx_epoll_module.c#L867 I see the following code:
if (revents & (EPOLLERR|EPOLLHUP)) {
    //...
}

if ((revents & EPOLLIN) && rev->active) {
    //....
    rev->handler(rev);
}

if ((revents & EPOLLOUT) && wev->active) {
    //....
    wev->handler(wev);
}
It seems to follow my idea of checking the EPOLLERR, EPOLLIN and EPOLLOUT events one after another.
Then I did the same kind of thing as nginx does in my application. But what I realized after experimenting is: if I add the file descriptor to epoll with EPOLLIN|EPOLLOUT|EPOLLET and I haven't filled up the output buffer, I always get the EPOLLOUT flag set after epoll_wait returns when some data arrives and the fd becomes readable; therefore a redundant write_handler is called, which is not what I expect.
I did some searching and found that this situation indeed exists and is not caused by any bug in my application. The top-voted answer to "epoll with edge triggered event" says:
On a somewhat related note: if you register for EPOLLIN and EPOLLOUT events and assuming you never fill up the send buffer, you still get the EPOLLOUT flag set in the event returned by epoll_wait each time EPOLLIN is triggered - see https://lkml.org/lkml/2011/11/17/234 for a more detailed explanation.
And the link in this answer says:
It doesn't mean there's an EPOLLOUT "event", it just means a message is triggered (by the socket becoming readable) so you get a status update. In theory the program doesn't need to be told about EPOLLOUT here (it should be assuming the socket is writable already), but it doesn't do any harm.
So far, what I understand about epoll edge-triggered mode is:
- epoll_wait returns when the state of any monitored fd has changed, e.g. from nothing-to-read to readable, or from buffer-full to writable.
- epoll_wait may return one or several events (flags) for each fd in the ready list.
- The flags in the struct epoll_event events field indicate the current state of this fd. Even if we haven't filled up the output buffer, the EPOLLOUT flag will be set when epoll_wait returns because the fd became readable, since the current state of the fd is simply "writable".
Please correct me if I am wrong.
Then my question would be: should I maintain a flag in each connection indicating whether EAGAIN occurred when writing to the output buffer, and if it is not set, not call write_handler/handleWrite in the "if (events & EPOLLOUT)" branch, so that my upper-layer program is not told about EPOLLOUT here?
What a great question (since I had pretty much the same question)! I'll just summarize what I think I know now wrt to your informative question/description and your helpful links and hopefully smarter folk will correct any mistakes.
Yes, the if/else handling of event flags is definitely bogus. For sure, at least two events can arrive at effectively the same time. E.g., both the read and write sides might have become unblocked since you last called epoll_wait(). And, of course, as soon as you accept() the connection, both reading and writing suddenly become possible, so you get an "event" of EPOLLIN|EPOLLOUT.
I really didn't grok that epoll_wait() is always delivering the entire current state, rather than only the parts of the state that changed -- thanks for clearing that up. To be perhaps clearer, epoll_wait() won't return an fd unless something changed on that socket, but if something did change, it returns all the flags representing the current state. So, I found myself staring at a stream of EPOLLIN|EPOLLOUT events wondering why it was claiming there was an "output" event, even though I hadn't written anything yet. Your answer being correct: it's just telling me the output side is still writeable.
"Should I maintain a flag..." Yes, but I would imagine that in all but the most trivial situations you were probably going to end up maintaining at least one bit of "am I currently blocked" state for your readers/writers anyway. For example, if you ever want to process data in an order different than how it arrives (e.g., prioritize responses over requests to make your server more resistant to overload) you instantly have to give up the simplicity of just having the arrival of I/O drive everything. In the particular case of writing, epoll simply doesn't have enough information to notify you at the "right" time. As soon as you accept a connection, there's an event that says "you can write now"--but you probably have nothing to write if you're a server who couldn't possibly have already gotten a request from the client. epoll just can't know whether you have something to write or not, so you were always going to have to either suffer essentially "extraneous" events, or maintain your own state.
In all but the simplest cases, the socket file descriptor ends up being insufficient information for handling I/O events, so you invariably have to associate some data structure with it, or object if you prefer. So, my C++ looks something like:
nAwake = epoll_wait(epollFd, events, 100, milliseconds);
if(nAwake < 0)
{
    perror("epoll_wait failed");
    assert(false);
}

for(int iSocket=0; iSocket < nAwake; ++iSocket)
{
    auto This = static_cast<Eventable*>(events[iSocket].data.ptr);
    auto eventFlags = events[iSocket].events;
    fprintf(stderr, "%s event on socket [%d] -> %s\n",
            This->ClassName(), This->fd, DumpEvent(eventFlags));
    This->Event(eventFlags);
}
Where Eventable is a C++ class (or derivative thereof) that has all the state needed to decide how to handle the flags epoll delivers. (Of course, this is letting the kernel store a pointer to a C++ object, requiring a design that is very clear about pointer ownership/lifetimes.)
And since you're writing low-level code on Linux, you may also care about EPOLLRDHUP. This not-highly-portable flag lets you save one call to read(). If the client (curl seems pretty good at evoking this behavior) closes its write side of the connection (sends a FIN), you normally discover that when epoll tells you EPOLLIN, but read() returns zero bytes. However, Linux maintains an extra bit to indicate your client's write side (your read side) has been closed. So, if you tell epoll you want the EPOLLRDHUP event you can use it to avoid doing a read() whose sole purpose will turn out to be telling you the writer closed their side.
Note that EPOLLIN will still be turned on whenever EPOLLRDHUP is, AFAIK. Even after you do a shutdown(fd, SHUT_RD). Another example of how you will usually be driven to maintain your own idea of the state of the connection. You care more about clients who are kind enough to do half-shutdowns if you are implementing HTTP.
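For reference, a small sketch of registering for EPOLLRDHUP and checking it before bothering with a read(); epfd, connfd and conn_ptr are assumed to already exist in the surrounding code:
#include <sys/epoll.h>
#include <cstdint>

void watch(int epfd, int connfd, void* conn_ptr) {
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
    ev.data.ptr = conn_ptr;
    epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
}

bool peer_closed_write_side(uint32_t revents) {
    // EPOLLIN will normally be set too; checking EPOLLRDHUP first lets the
    // caller skip a read() whose only result would be 0 bytes.
    return (revents & EPOLLRDHUP) != 0;
}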
From man 7 epoll:
When used as an edge-triggered interface, for performance reasons, it is possible to add the file descriptor inside the epoll interface (EPOLL_CTL_ADD) once by specifying (EPOLLIN|EPOLLOUT). This allows you to avoid continuously switching between EPOLLIN and EPOLLOUT calling epoll_ctl(2) with EPOLL_CTL_MOD.
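As a sketch of the difference the man page is pointing at: register_once_et is the register-once, edge-triggered style used in this question, and set_write_interest_lt is the level-triggered toggling it lets you avoid (both function names are just for illustration):
#include <sys/epoll.h>

void register_once_et(int epfd, int fd, void* conn) {
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;  // added once, never modified
    ev.data.ptr = conn;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

// The alternative this avoids: re-arming write interest every time output
// becomes pending, and dropping it again once the buffer is flushed.
void set_write_interest_lt(int epfd, int fd, void* conn, bool want_write) {
    epoll_event ev{};
    ev.events = EPOLLIN | (want_write ? EPOLLOUT : 0u);
    ev.data.ptr = conn;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}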

How do I use select() and gRPC to create a server?

I need to use gRPC but in a single-threaded application (with additional socket channels). Naively, I'm thinking of using select() and depending on which file descriptor pops, calling gRPC to handle the message. My question is, can someone give me a rough (5-10 lines of code) outline skeleton on what I need to call after the select() pops?
Looking at Google's "hello world" example in the synchronous case implies a thread pool (which I can't use), and in the asynchronous case shows the main loop blocking -- which doesn't work for me because I need to handle other socket operations.
You can't do it, at this point (and probably ever).
One of the big weaknesses of event loops, including direct use of select()/poll() style APIs, is that they aren't composable in any natural way short of direct integration between the two.
We could theoretically add such functionality for Linux -- exporting an epoll_fd with a timerfd which becomes readable if it would be productive to call into a completion queue, but doing so would impose substantial constraints and architectural overhead on the rest of the stack just to support this usecase only on Linux. Everywhere else would require a background thread to manage that fd's readability.
This can be done using a gRPC async service along with grpc::Alarm to send any events that come from select or other polling APIs onto the gRPC completion queue. You can see an example using Epoll and gRPC together in this gist. The important functions are these two:
bool grpc_tick(grpc::ServerCompletionQueue& queue) {
    void* tag = nullptr;
    bool ok = false;
    auto next_status = queue.AsyncNext(&tag, &ok, std::chrono::system_clock::now());
    if (next_status == grpc::CompletionQueue::GOT_EVENT) {
        if (ok && tag) {
            static_cast<RequestProcessor*>(tag)->grpc_queue_tick();
        } else {
            std::cerr << "Not OK or bad tag: " << ok << "; " << tag << std::endl;
            return false;
        }
    }
    return next_status != grpc::CompletionQueue::SHUTDOWN;
}

bool tick_loops(int epoll, grpc::ServerCompletionQueue& queue) {
    // Pump epoll events over to gRPC's completion queue.
    epoll_event event{0};
    while (epoll_wait(epoll, &event, /*maxevents=*/1, /*timeout=*/0)) {
        grpc::Alarm alarm;
        alarm.Set(&queue, std::chrono::system_clock::now(), event.data.ptr);
        if (!grpc_tick(queue)) return false;
    }

    // Make sure gRPC gets at least 1 tick.
    return grpc_tick(queue);
}
Here you can see the tick_loops function repeatedly calls epoll_wait until no more events are returned. For each epoll event, a grpc::Alarm is constructed with the deadline set to right now. After that, the gRPC event loop is immediately pumped with grpc_tick.
Note that the grpc::Alarm instance MUST outlive its time on the completion queue. In a real-world application, the alarm should be somehow attached to the tag (event.data.ptr in this example) so it can be cleaned up in the completion callback.
The gRPC event loop is then pumped again to ensure that any non-epoll events are also processed.
Completion queues are thread safe, so you could also put the epoll pump on one thread and the gRPC pump on another. With this setup you would not need to set the polling timeouts for each to 0 as they are in this example. This would reduce CPU usage by limiting dry cycles of the event loop pumps.

How to limit an Akka Stream to execute and send down only one message per second?

I have an Akka Stream and I want the stream to send messages downstream approximately every second.
I tried two ways to solve this problem. The first way was to make the producer at the start of the stream only send messages once every second, when a Continue message comes into this actor.
// When receive a Continue message in a ActorPublisher
// do work then...
if (totalDemand > 0) {
    import scala.concurrent.duration._
    context.system.scheduler.scheduleOnce(1 second, self, Continue)
}
This works for a short while; then a flood of Continue messages appears in the ActorPublisher actor, I assume (guess, but am not sure) from downstream via back-pressure requesting messages, as the downstream can consume fast but the upstream is not producing at a fast rate. So this method failed.
The other way I tried was via backpressure control: I used a MaxInFlightRequestStrategy on the ActorSubscriber at the end of the stream to limit the number of messages to 1 per second. This works, but messages come in at approximately three or so at a time, not one at a time. It seems the backpressure control doesn't immediately change the rate of messages coming in, or messages were already queued in the stream and waiting to be processed.
So the problem is, how can I have an Akka Stream which processes only one message per second?
I discovered that MaxInFlightRequestStrategy is a valid way to do it, but I should set the batch size to 1; its batch size defaults to 5, which was causing the problem I found. Also, it's an over-complicated way to solve the problem now that I am looking at the submitted answer here.
You can either put your elements through a throttling flow, which will back-pressure a fast source, or you can use a combination of tick and zip.
The first solution would be like this:
val veryFastSource =
  Source.fromIterator(() => Iterator.continually(Random.nextLong() % 10000))

val throttlingFlow = Flow[Long].throttle(
  // how many elements do you allow
  elements = 1,
  // in what unit of time
  per = 1.second,
  maximumBurst = 0,
  // you can also set this to Enforcing, but then your
  // stream will collapse if exceeding the number of elements / s
  mode = ThrottleMode.Shaping
)

veryFastSource.via(throttlingFlow).runWith(Sink.foreach(println))
The second solution would be like this:
val veryFastSource =
  Source.fromIterator(() => Iterator.continually(Random.nextLong() % 10000))

val tickingSource = Source.tick(1.second, 1.second, 0)

veryFastSource.zip(tickingSource).map(_._1).runWith(Sink.foreach(println))

How to simulate time delay in a network

Let's say that we need to send the message "Hello World" using the UDP protocol between two PCs, A and B. Computer A will send the message to B with some time delay (constant or time-varying). To simulate this scenario, my first attempt is to use a sleep function, but this solution freezes the entire application. Another solution is to use multiple threads: use sleep() in the thread that is responsible for getting the data, store the data in a global variable, and access this variable from another thread. In this solution, there might be difficulties in synchronizing the threads. To overcome this problem, I write the received data to a txt file and read it from another thread. My question is: what is the proper way to carry out this trivial experiment? I would appreciate it if the answer had some C++ pseudocode.
Edit:
My attempt to solve it is as follows. For the Master side (client):
Master masterObj

int main()
{
    masterObj.initialize();
    masterObj.connect();

    while( masterObj.isConnected() == true ){
        get currentTime and data; // currentTime here is sendTime
        datagram = currentTime + data;
        masterObj.send( datagram );
    }
}
For the Slave side (server), the pseudocode is:
Slave slaveObj

int main()
{
    slaveObj.initialize();
    slaveObj.connect();
    slaveObj.slaveThreadInit();

    while( slaveObj.isConnected() == true ){
        slaveObj.getData();
    }
}

Slave::recieve()
{
    get currentTime and call it recievedTime;
    get datagram from Master;
    this->slaveThread( recievedTime + datagram );
}

Slave::slaveThread( info )
{
    sleep( 1 msec );
    info = recievedTime + datagram;
    get time delay;
    time delay = sendTime - recievedTime;
    extract data from datagram;
    insert data and time delay in txt file ( call it txtSlaveData );
}

Slave::getData()
{
    read from txtSlaveData;
}
As you can see, I'm using an independent thread and, inside it, sleep(). I'm not sure whether this approach is workable.
A simple way to simulate sending UDP datagrams from one computer to another is to send the datagrams through the loopback interface to another - or the same - process on the same computer. That will function exactly like the real thing except for the delay.
You can simulate the delay either when sending or receiving. Once you've implemented it one way, the other should be trivial. I think delaying on the sending side is the more natural option. Here is an approach for the more general problem of simulating network delay. See the last paragraph for a trivial experiment of sending only one datagram.
In case you choose delaying on send, what you could do is, instead of sending, store the datagram in a queue, along with the time it should be sent (target = now + delay).
Then, in another thread, wait for a datagram to become available, then sleep for max(target - now, 0). After sleeping, send the datagram and move on to the next one. Wait if queue is empty.
To simulate jitter, randomize the delay. To let jitter simulation send the datagrams in non-sequential order, use a priority queue, sorted by the target send-time.
Remember to synchronize the access to the queue.
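A minimal sketch of that delayed-send queue, using a priority queue ordered by target send time. send_datagram() is a stand-in for the real UDP send, and the delay values are made up for illustration:
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Delayed {
    Clock::time_point target;   // time at which the datagram should really be sent
    std::string payload;
    bool operator>(const Delayed& o) const { return target > o.target; }
};

std::priority_queue<Delayed, std::vector<Delayed>, std::greater<Delayed>> q;  // earliest target first
std::mutex m;
std::condition_variable cv;

void send_datagram(const std::string& p) { std::cout << "sent: " << p << "\n"; }  // real sendto() goes here

void enqueue(std::string payload, std::chrono::milliseconds delay) {
    std::lock_guard<std::mutex> lk(m);
    q.push({Clock::now() + delay, std::move(payload)});
    cv.notify_one();
}

void sender_loop(int count) {
    std::unique_lock<std::mutex> lk(m);
    while (count > 0) {
        cv.wait(lk, [] { return !q.empty(); });          // wait if the queue is empty
        Delayed next = q.top();
        // Sleep until the earliest target, unless an even earlier datagram arrives meanwhile.
        if (cv.wait_until(lk, next.target, [&] { return q.top().target < next.target; }))
            continue;                                    // re-pick the new earliest entry
        q.pop();
        lk.unlock();
        send_datagram(next.payload);
        lk.lock();
        --count;
    }
}

int main() {
    std::thread sender(sender_loop, 2);
    enqueue("Hello", std::chrono::milliseconds(300));
    enqueue("World", std::chrono::milliseconds(100));    // randomized delays give jitter; this one goes out first
    sender.join();
}
With randomized delays, "World" is sent before "Hello", which is exactly the out-of-order delivery you get from real jitter.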
For a single datagram, you can do something much simpler. Simply start a new thread, sleep for the delay, send, and end the thread. No need for synchronization. Here's C++ code for that:
std::thread([]{
    std::this_thread::sleep_for(delay);
    send("foo");
}).detach();