I'm working on a project where a primary server thread needs to dispatch events to a series of worker threads. The work in the worker threads relies on polling (i.e. epoll or kqueue, depending on the UNIX system in question), and the timeouts on these operations need to be handled. This means that a normal condition variable or semaphore is not viable for this dispatch: the worker would block on either the poll or the condition variable, resulting in unwanted latency in handling either the events coming from polling or the events originating from the server thread.
So, I'm wondering what the best construct is for dispatching such events between threads in a pollable fashion. Essentially, all that needs to be delivered is a pollable "signal" that tells the worker thread that it has more events to fetch. I've looked at using UNIX pipes (unnamed ones, as it's internal to the process), which seems like a decent solution given that a single byte can be written to the pipe and read back out when the queue is cleared -- but I'm wondering if this is the best approach available? Or the fastest?
Alternatively, there is the possibility to use signalfd(2) on Linux, but as this is not available on BSD systems, I'd rather like to avoid this construct. I'm also wondering how great the overhead in using system signals actually is?
Jan Hudec's answer is correct, although I wouldn't recommend using signals for a few reasons:
Older versions of glibc emulated pselect and ppoll in a non-atomic fashion, making them basically worthless. Even when you used the mask correctly, signals could get "lost" between the pthread_sigprocmask and select calls, meaning they don't cause EINTR.
I'm not sure signalfd is any more efficient than the pipe. (Haven't tested it, but I don't have any particular reason to believe it is.)
Signals are generally a pain to get right. I've spent a lot of effort on them (see my sigsafe library) and I'd recommend avoiding them if you can.
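For reference, the masking pattern the first point alludes to looks roughly like the sketch below. It is only an illustration, not a recommendation: the function and handler names are made up, SIGUSR1 is an arbitrary choice, and a threaded program would use pthread_sigmask instead of sigprocmask. The signal is kept blocked except while pselect is waiting, so a signal that arrives early stays pending and still interrupts the wait with EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t got_usr1 = 0;

/* Handler only records that the signal arrived. */
static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Hypothetical demo: returns 1 if pselect() was interrupted by SIGUSR1,
 * 0 if it timed out instead, -1 on a setup error. */
int pselect_wakeup_demo(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Keep SIGUSR1 blocked everywhere except inside pselect(). */
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* The signal arrives before we start waiting: it stays pending. */
    got_usr1 = 0;
    raise(SIGUSR1);

    sigset_t wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };

    /* An atomic pselect() installs wait_mask and waits in one step,
     * so the pending signal interrupts the call instead of being lost. */
    int rc = pselect(0, NULL, NULL, NULL, &timeout, &wait_mask);
    int interrupted = (rc == -1 && errno == EINTR && got_usr1);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return interrupted ? 1 : 0;
}
```

With a non-atomic emulation, the window between unblocking the signal and entering select is exactly where the wake-up can get lost.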
Since you're trying to have asynchronous handling portable to several systems, I'd recommend looking at libevent. It will abstract epoll or kqueue for you, and it will even wake up workers on your behalf when you add a new event. See event.c
static inline int
event_add_internal(struct event *ev, const struct timeval *tv,
    int tv_is_absolute)
{
...
	/* if we are not in the right thread, we need to wake up the loop */
	if (res != -1 && notify && EVBASE_NEED_NOTIFY(base))
		evthread_notify_base(base);
...
}
Also,
The worker thread deals with both socket I/O and asynchronous disk I/O, which means that it is optimally always waiting for the event queuing mechanism (epoll/kqueue).
You're likely to be disappointed here. These event queueing mechanisms don't really support asynchronous disk I/O. See this recent thread for more details.
As far as performance goes, the cost of a system call is huge compared to other operations, so it's the number of system calls that matters. There are two options:
Use the pipes as you wrote. If you have any useful payload for the message, you get one system call to send, one system call to wait and one system call to receive. Try to pass any relevant data down the pipe instead of reading it from a shared structure, to avoid additional overhead from locking.
select and poll have variants that also wait for signals (pselect, ppoll). Linux epoll can do the same using signalfd, so it remains a question whether kqueue can wait for signals, which I don't know. If it can, then you could use them (you are using a different mechanism on Linux and *BSD anyway). It would save you the syscall for reading if you don't have good use for the passed data.
I would expect passing the data over a socket to be more efficient if it allows you to do away with any other locking.
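A minimal sketch of the pipe option follows, assuming a single wake-up byte per notification and the event queue itself living elsewhere. The wake_pipe_* names are hypothetical, and poll stands in for epoll/kqueue; in a real worker the read end would simply be registered with the existing epoll or kqueue set next to the sockets.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Create a pipe with both ends non-blocking. Returns 0 on success. */
int wake_pipe_init(int fds[2])
{
    if (pipe(fds) != 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL, 0);
        if (flags == -1 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) == -1) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    return 0;
}

/* Server thread: call after queuing an event for the worker. */
int wake_pipe_notify(int write_fd)
{
    char byte = 1;
    if (write(write_fd, &byte, 1) == 1)
        return 0;
    /* A full pipe means a wake-up is already pending, which is fine. */
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
}

/* Worker: wait up to timeout_ms for a wake-up.
 * Returns 1 if the pipe is readable, 0 on timeout, -1 on error. */
int wake_pipe_wait(int read_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = read_fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Worker: empty the pipe before fetching the queued events.
 * Returns the number of wake-up bytes that were pending. */
int wake_pipe_drain(int read_fd)
{
    char buf[64];
    int total = 0;
    ssize_t n;
    while ((n = read(read_fd, buf, sizeof buf)) > 0)
        total += (int)n;
    return total;
}
```

Draining before processing the queue means several notifications collapse into one wake-up, which keeps the syscall count down when the server thread is producing events quickly.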
Related
I am working on designing a websocket server which receives a message and saves it to an embedded database. For reading the messages I am using boost asio. To save the messages to the embedded database I see a few options in front of me:
Save the messages synchronously as soon as I receive them over the same thread.
Save the messages asynchronously on a separate thread.
I am pretty sure the second option is what I want. However, I am not sure how to pass messages from the socket thread to the IO thread. I see the following options:
Use one io service per thread and use the post function to communicate between threads. Here I have to worry about lock contention. Should I?
Use Unix domain sockets to pass messages between threads. No lock contention as far as I understand. Here I can probably use the BOOST_ASIO_DISABLE_THREADS macro to get some performance boost.
Also, I believe it would help to have multiple IO threads which would receive messages in a round robin fashion to save to the embedded database.
Which architecture would be the most performant? Are there any other alternatives from the ones I mentioned?
A few things to note:
The messages are exactly 8 bytes in length.
Cannot use an external database. The database must be embedded in the running process.
I am thinking about using RocksDB as the embedded database.
I don't think you want to use a unix socket, which is always going to require a system call and pass data through the kernel. That is generally more suitable as an inter-process mechanism than an inter-thread mechanism.
Unless your database API requires that all calls be made from the same thread (which I doubt) you don't have to use a separate boost::asio::io_service for it. I would instead create an io_service::strand on your existing io_service instance and use the strand::dispatch() member function (instead of io_service::post()) for any blocking database tasks. Using a strand in this manner guarantees that at most one thread may be blocked accessing the database, leaving all the other threads in your io_service instance available to service non-database tasks.
Why might this be better than using a separate io_service instance? One advantage is that having a single instance with one set of threads is slightly simpler to code and maintain. Another minor advantage is that using strand::dispatch() will execute in the current thread if it can (i.e. if no task is already running in the strand), which may avoid a context switch.
For the ultimate optimization I would agree that using a specialized queue whose enqueue operation cannot make a system call could be fastest. But given that you have network i/o by producers and disk i/o by consumers, I don't see how the implementation of the queue is going to be your bottleneck.
After benchmarking/profiling I found the Facebook Folly implementation of an MPMC queue to be the fastest by at least a 50% margin. If I use the non-blocking write method, then the socket thread has almost no overhead and the IO threads remain busy. The number of system calls is also much lower than with other queue implementations.
The SPSC queue with a condition variable in Boost is slower. I am not sure why that is. It might have something to do with the adaptive spin that the Folly queue uses.
Also, message passing (UDP domain sockets in this case) turned out to be orders of magnitude slower, especially for larger messages. This might have something to do with the data being copied twice.
You probably only need one io_service -- you can create additional threads which will process events occurring within the io_service by providing boost::asio::io_service::run as the thread function. This should scale well for receiving 8-byte messages from clients over the network socket.
For storing the messages in the database, it depends on the database & interface. If it's multi-threaded, then you might as well just send each message to the DB from the thread that received it. Otherwise, I'd probably set up a boost::lockfree::queue where a single reader thread pulls items off and sends them to the database, and the io_service threads append new messages to the queue when they arrive.
Is that the most efficient approach? I dunno. It's definitely simple, and gives you a baseline that you can profile if it's not fast enough for your situation. But I would recommend against designing something more complicated at first: you don't know whether you'll need it at all, and unless you know a lot about your system, it's practically impossible to say whether a complicated approach would perform any better than the simple one.
void Consumer( lockfree::queue<uint64_t> &message_queue ) {
    // Connect to database...
    while (!Finished) {
        // add_to_database is a functor that takes a message.
        // It's OK to consume_all() even if there's nothing in the queue.
        message_queue.consume_all( add_to_database );
        // Use a timed wait to avoid missing a signal.
        cond_var.wait_for( ... );
    }
}

void Producer( lockfree::queue<uint64_t> &message_queue ) {
    while (!Finished) {
        uint64_t m = receive_from_network( );
        message_queue.push( m );
        cond_var.notify_all( );
    }
}
Assuming that the constraint of using C++11 is not too hard in your situation, I would try to use std::async to make an asynchronous call to the embedded DB.
I'm making a messaging service that needs to use both socket io and shared memory. The routine will be the same regardless of where the input comes from, with the only difference being local messages will be passed via shared memory and non-local messages over a socket. Both events will have to unblock the same pselect call.
At this point I think the best option might be to send a signal whenever a message is written to shared memory and use it to interrupt a pselect call but I'm not quite sure how this would be done or even if it's the best route.
I'm not used to using signals. What's the best way to accomplish this?
I would consider using a pipe (see pipe(2)) or an AF_UNIX local unix(7) socket(2) (as commented by caf), at least to transmit control information about the shared memory for synchronization (i.e. to tell when it has changed, that is, when a message has been sent through shared memory, etc.). Then you can still multiplex with e.g. poll(2) (or ppoll(2) or pselect(2) etc.).
I don't think that synchronization using signals is the right approach: signals are difficult to get right (so coding is tricky) and they are not more efficient than exchanging a few bytes on some pipe.
Did you consider using MPI?
If you only want to signal between processes rather than pass data, then an eventfd (see eventfd(2), Linux only) will allow you to use select() with less overhead than a pipe. As with a pipe solution, the processes need to share the descriptor, which normally means a parent/child relationship (a descriptor can also be passed over an AF_UNIX socket).
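A rough sketch of the eventfd variant, assuming Linux; the event_* helper names are made up for illustration, and poll is used in place of select only to keep the example short. Each write adds to a kernel-side counter, and a single read returns the accumulated count and resets it, so many signals cost the waiter one read.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a non-blocking eventfd with its counter at zero. */
int event_open(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Sender: add 1 to the counter, which makes the descriptor readable. */
int event_signal(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Waiter: returns 1 if a signal is pending, 0 on timeout, -1 on error. */
int event_wait(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Waiter: fetch and reset the counter. Returns how many signals were
 * pending, or 0 if there were none (the descriptor is non-blocking). */
uint64_t event_collect(int efd)
{
    uint64_t count = 0;
    if (read(efd, &count, sizeof count) != (ssize_t)sizeof count)
        return 0;
    return count;
}
```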
If you want to use signals, use sigqueue to send them - you can send an integer payload with this, for example an offset into your shared memory.
Make sure to register your signal handler with sigaction, set SA_SIGINFO in sa_flags, and use the sa_sigaction callback: the siginfo_t->si_int member (si_value.sival_int in POSIX terms) will contain that payload.
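As a sketch of what that looks like, the following sends an integer to the current process and reads it back in the handler. The function name, the choice of SIGUSR1, and sending to getpid() are assumptions made to keep the example self-contained; a real sender would target the receiver's PID. It assumes a single-threaded process and non-negative payloads, since -1 is used for errors.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t payload_seen = 0;
static volatile sig_atomic_t last_payload = 0;

/* Three-argument handler, selected by SA_SIGINFO. */
static void on_message(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    last_payload = info->si_value.sival_int;  /* payload from sigqueue() */
    payload_seen = 1;
}

/* Hypothetical demo: queue `offset` to ourselves and return the value
 * the handler received, or -1 on failure. */
int sigqueue_roundtrip(int offset)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_message;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Block the signal so it is only delivered inside sigsuspend(). */
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    payload_seen = 0;
    union sigval value;
    value.sival_int = offset;
    if (sigqueue(getpid(), SIGUSR1, value) != 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    /* Atomically unblock SIGUSR1 and sleep until the handler has run. */
    sigset_t wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    while (!payload_seen)
        sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return (int)last_payload;
}
```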
In general, I'm not sure I can recommend using this mechanism instead of a unix pipe or eventfd, because I'm not sure whether signal delivery is really tuned for speed as you hope: benchmark to be sure.
PS. performance aside, one reason signals feel a bit icky is that you lose the opportunity to have a "well-known" rendezvous like an inet or unix port, and instead have to go looking for a PID. Also, you have to be very careful about masking to make sure the signal is delivered where you want.
PPS. You raise or send a signal - you don't throw it. That's for exceptions.
I did some additional looking and came across signalfd(2). I believe this will be the best solution - very similar to Basile Starynkevitch's suggestion but without the overhead of standard pipes and done within the kernel rather than userspace.
pipe+select+queue+lock, nothing else.
I'm writing a POSIX-compatible multi-threaded server in C/C++ that must be able to accept, read from, and write to a large number of connections asynchronously. The server has several worker threads which perform tasks and occasionally (and unpredictably) queue data to be written to the sockets. Data is also occasionally (and unpredictably) written to the sockets by the clients, so the server must also read asynchronously. One obvious way of doing this is to give each connection a thread which reads and writes from/to its socket; this is ugly, though, since each connection may persist for a long time and the server may thus have to hold hundreds or thousands of threads just to keep track of connections.
A better approach would be to have a single thread that handled all communications using the select()/pselect() functions. I.e., a single thread waits on any socket to be readable, then spawns a job to process the input that will be handled by a pool of other threads whenever input is available. Whenever the other worker threads produce output for a connection, it gets queued, and the communication thread waits for that socket to be writable before writing it.
The problem with this is that the communication thread may be waiting in the select() or pselect() function when output is queued by the worker threads of the server. It's possible that, if no input arrives for several seconds or minutes, a queued chunk of output will just wait for the communication thread to be done select()ing. This shouldn't happen, however--data should be written as soon as possible.
Right now I see a couple of solutions to this that are thread-safe. One is to have the communication thread busy-wait on input and update the list of sockets it waits on for writing every tenth of a second or so. This isn't optimal since it involves busy-waiting, but it will work. Another option is to use pselect() and send the USR1 signal (or something equivalent) whenever new output has been queued, allowing the communication thread to update the list of sockets it is waiting on for writable status immediately. I prefer the latter here, but still dislike using a signal for something that should be a condition (pthread_cond_t). Yet another option would be to include, in the list of file descriptors on which select() is waiting, a dummy file that we write a single byte to whenever a socket needs to be added to the writable fd_set for select(); this would wake up the communication thread because that particular dummy file would then be readable, thus allowing the communication thread to immediately update its writable fd_set.
I feel intuitively, that the second approach (with the signal) is the 'most correct' way to program the server, but I'm curious if anyone knows either which of the above is the most efficient, generally speaking, whether either of the above will cause race conditions that I'm not aware of, or if anyone knows of a more general solution to this problem. What I really want is a pthread_cond_wait_and_select() function that allows the comm thread to wait on both a change in sockets or a signal from a condition.
Thanks in advance.
This is a fairly common problem.
One often used solution is to have pipes as a communication mechanism from worker threads back to the I/O thread. Having completed its task a worker thread writes the pointer to the result into the pipe. The I/O thread waits on the read end of the pipe along with other sockets and file descriptors and once the pipe is ready for read it wakes up, retrieves the pointer to the result and proceeds with pushing the result into the client connection in non-blocking mode.
Note that since pipe reads and writes of no more than PIPE_BUF bytes are atomic, the pointers get written and read in one shot. One can even have multiple worker threads writing pointers into the same pipe because of the atomicity guarantee.
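A small sketch of passing pointers through a pipe this way. The struct result layout and the function names are hypothetical; the only thing that matters is that a pointer is far smaller than PIPE_BUF, so each write lands in the pipe as one unit even with several workers writing.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <unistd.h>

/* Hypothetical result type produced by a worker thread. */
struct result {
    int connection_id;
    const char *payload;
};

/* A pointer must fit within PIPE_BUF for the write to be atomic. */
_Static_assert(sizeof(struct result *) <= PIPE_BUF,
               "pointer does not fit in an atomic pipe write");

/* Worker side: hand a finished result to the I/O thread.
 * The pointer value itself is what travels through the pipe. */
int send_result(int write_fd, struct result *r)
{
    return write(write_fd, &r, sizeof r) == (ssize_t)sizeof r ? 0 : -1;
}

/* I/O thread side: call once select/poll reports the read end readable.
 * Returns the next result pointer, or NULL on error. */
struct result *receive_result(int read_fd)
{
    struct result *r = NULL;
    if (read(read_fd, &r, sizeof r) != (ssize_t)sizeof r)
        return NULL;
    return r;
}
```

Ownership of the pointed-to object passes to the I/O thread along with the pointer, so the worker should not touch it after send_result returns.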
Unfortunately, the best way to do this is different for each platform. The canonical, portable way to do it is to have your I/O thread block in poll. If you need to get the I/O thread to leave poll, you send a single byte on a pipe that the thread is polling. That will cause the thread to exit from poll immediately.
On Linux, epoll is the best way. On BSD-derived operating systems (including OSX, I think), kqueue. On Solaris, it used to be /dev/poll and there's something else now whose name I forget.
You may just want to consider using a library like libevent or Boost.Asio. They give you the best I/O model on each platform they support.
Your second approach is the cleaner way to go. It's totally normal to have things like select or epoll include custom events in your list. This is what we do on my current project to handle such events. We also use timers (on Linux timerfd_create) for periodic events.
On Linux, eventfd lets you create such arbitrary user events for this purpose -- thus I'd say it is quite accepted practice. If you are limited to POSIX-only functions, well, hmm, perhaps a pipe; I've also seen socketpair used.
Busy-polling is not a good option. First you'll be scanning memory which will be used by other threads, thus causing CPU memory contention. Secondly you'll always have to return to your select call which will create a huge number of system calls and context switches which will hurt overall system performance.
I'm quite bewildered by the use of message queues in realtime OS. The code that was given seems to have message queues used down to the bone: even passing variables to another class object is done through MQ. I always have a concept of MQ used in IPC. Question is: what is a proper use of a message queue?
In realtime OS environments you often face the problem that you have to guarantee execution of code on a fixed schedule. E.g. you may have a function that gets called exactly every 10 milliseconds. Not earlier, not later.
To guarantee such hard timing constraints you have to write code that must not block the time critical code under any circumstances.
The POSIX thread synchronization primitives cannot be used here.
You must never lock a mutex or acquire a semaphore from time critical code because a different process/thread may already have it locked. However, often you are allowed to unblock some other thread from time critical code (e.g. releasing a semaphore is okay).
In such environments message queues are a nice choice to exchange data because they offer a clean way to pass data from one thread to another without ever blocking.
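To make the "without ever blocking" property concrete, here is a generic sketch of a fixed-size single-producer/single-consumer queue using C11 atomics. It is not the queue API of any particular RTOS, and all names and the capacity are invented for the example; the point is only that both operations return immediately with a success flag instead of waiting on a lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_CAPACITY 4u  /* must be a power of two */

struct spsc_queue {
    uint32_t slots[QUEUE_CAPACITY];
    atomic_uint head;  /* next slot to read; advanced only by the consumer */
    atomic_uint tail;  /* next slot to write; advanced only by the producer */
};

void spsc_init(struct spsc_queue *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false instead of waiting when the queue is full. */
bool spsc_try_send(struct spsc_queue *q, uint32_t msg)
{
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAPACITY)
        return false;
    q->slots[tail % QUEUE_CAPACITY] = msg;
    /* Release: the slot contents become visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side (e.g. the time critical code): returns false instead of
 * waiting when the queue is empty. Messages come out in send order. */
bool spsc_try_receive(struct spsc_queue *q, uint32_t *out)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = q->slots[head % QUEUE_CAPACITY];
    /* Release: the slot is free for reuse only after it has been read. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

This only works with exactly one sending thread and one receiving thread; a real RTOS queue typically handles more general cases.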
Using queues to just set variables may sound like overkill, but it is very good software design. If you do it that way you have a well-defined interface to your time critical code.
Also it helps to write deterministic code because you'll never run into the problem of race-conditions. If you set variables via message-queues you can be sure that the time critical code sees the messages in the same order as they have been sent. When mixing direct memory access and messages you can't guarantee this.
Message queues are predominantly used as an IPC mechanism, whenever data needs to be exchanged between two different processes. However, sometimes message queues are also used for switching thread context. For example:
You register some callback with a software layer which sits on top of a driver. The callback is invoked in the context of the driver, i.e. on a thread spawned by the driver. Now you cannot hog this driver thread by doing a lot of processing in it. So one may add the data delivered to the callback to a message queue, which has application threads blocked on it, waiting to process the data.
I don't see why one should use message queues to replace plain function calls.
I am reading data from multiple serial ports. At present I am using a custom signal handler (by setting sa_handler) to compare and wake threads based on file descriptor information. I was looking for a way to have individual threads with unique signal handlers, and in this regard I found that the select system call is to be used.
Now I have following questions:
If I am using a thread (Qt) then where do I put the select system call to monitor the serial port?
Is the select system call thread safe?
Is it CPU intensive because there are many things happening in my app including GUI update?
Please do not mind, if you find these questions ridiculous. I have never used such a mechanism for serial communication.
The POSIX specification (select) is the place to look for the select definition. I personally recommend poll - it has a better interface and can handle any number of descriptors, rather than a system-defined limit.
If I understand correctly you're waking threads based on the state of certain descriptors. A better way would be to have each thread have its own descriptor and call select itself. You see, select does not modify the system state, and as long as you use thread-local variables it'll be safe. However, you will definitely want to ensure you do not close a descriptor that a thread depends on.
Using select/poll with a timeout leaves the "waiting" up to the kernel side, which means the thread is usually put to sleep. While the thread is sleeping it is not using any CPU time. A while/for loop around a select call with a zero timeout, on the other hand, will give you higher CPU usage as you're constantly spinning in the loop.
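As a sketch of what each thread could do with its own descriptor, the helper below sleeps in poll until data arrives or the timeout expires, then reads whatever is available. The function name is hypothetical, and any descriptor (a serial port opened elsewhere, or a pipe when experimenting) can be passed in.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms for data on fd, then read at most len bytes.
 * Returns the number of bytes read, 0 if the timeout expired with no
 * data, or -1 on error or end-of-file. */
long read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    /* The thread sleeps inside the kernel here and uses no CPU time. */
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;  /* timed out: nothing to read yet */

    if (pfd.revents & (POLLIN | POLLHUP)) {
        ssize_t n = read(fd, buf, len);
        return n > 0 ? (long)n : -1;
    }
    return -1;  /* POLLERR or POLLNVAL */
}
```

A thread would call this in a loop with a timeout of, say, a few hundred milliseconds, and use the 0 return as its chance to check a shutdown flag.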
Hope this helps.
EDIT: Also, select/poll can have unpredictable results when working with the same descriptor in multiple threads. The simple reason for this is that the first thread might be woken up because the descriptor is ready for reading, but the second thread has to wait for the next "available for reading" wakeup.
As long as you're not selecting on the same descriptor in multiple threads you should not have a problem.
It is a system call -- it should be thread safe, I think.
I have not done this before, but I would be rather surprised if it were not. How CPU intensive select() is depends, in my opinion, largely on the number of file handles you are waiting for. select() is mostly used to wait for a number (>1) of file handles to become ready.
It should also be mentioned that select() should not be used to busy-poll the file handles -- for performance reasons. Normal usage is: you have your work done and some time can elapse until the next thing happens. Now you suspend yourself with select and let something else run. How this works together with threads, I am not sure! With kernel-level threads only the calling thread should be suspended, but this might be documented. It also could depend (on Linux) on whether you use system threads or user threads: the kernel does not know about user threads and hence will suspend the whole process.