responsively checking two queues without pegging CPU - c++

I have a thread pool system which uses message passing to organize events, and I am also using the Windows API which also does a bit of message passing. So essentially I need to use the functions which check for the presence of messages without blocking. If I block (if I use GetMessage I think it will block) while checking either queue, I may miss any incoming messages on the other queue.
The first solution I know of is to Sleep a couple of milliseconds somewhere during my loop of peeking on both queues.
Another way I can think of is to have an additional thread, so that I have one thread for each queue I am listening to. It would be responsible for nothing other than running the Windows message loop, and would forward any events to my own message queue to be handled there. But this won't work if Windows specifically sends the messages I'm interested in to the original thread.
Are there other good solutions?

Your requirement is a bit unclear, but I can agree that Windows message queues are awkward in that only one thread can wait on them. Windows binds windows to threads, and only the thread that creates a window can interact with it.
If you have user-defined messages that contain work to be processed by your thread pool, I suggest that you do exactly what you propose in your question: use one thread to process all the Windows messages (a GetMessage() loop), requeue any work that turns up to your thread pool input queue, and handle 'normal' Windows messages with the usual Translate/Dispatch mechanism.
If you need more help, could you describe more clearly the flow of Windows messages and/or work objects through your system? It is not obvious where the work for the thread pool comes from and how it is transported. (If forced to use a Windows message queue, I usually PostMessage a reference in wParam/lParam, but what does your system do?)
Rgds,
Martin

Normally, a thread pool would not be involved in the Windows message loop, and blocking indefinitely when there is no work is not only allowable for a worker thread, but even desirable.
The most elegant way of implementing a thread pool that can receive messages via some kind of queue, which automatically keeps all CPU cores busy, and which as a bonus is very efficient, is using a completion port.
Calling CreateIoCompletionPort with INVALID_HANDLE_VALUE as the file handle and NULL as the existing port creates a new completion port and returns its handle. Passing zero as NumberOfConcurrentThreads tells the operating system to keep as many threads running as there are cores available.
Create any number of worker threads (a few more than you have cores) and have every worker call GetQueuedCompletionStatus with an INFINITE timeout on the handle returned by CreateIoCompletionPort. That call associates the worker with the completion port and blocks it indefinitely until a message arrives.
Make a struct which has an OVERLAPPED as the first member, plus any data that you want to hand as a task (some pointers to data, or anything).
For every task, set up one of your message structs, and PostQueuedCompletionStatus to the completion port handle. At application exit, post null. You can use the dwNumberOfBytesTransferred field (and the completion key) to pass some additional info.
Now Windows will wake one thread for every message you posted, choosing among the waiting threads in last-in-first-out order, up to the number of cores available. If one of the workers blocks on IO, Windows will wake another one for another task (keeping the CPU busy as long as there is work to do).
After finishing a task, go back to GetQueuedCompletionStatus.
A way to gracefully terminate all workers is to post a message with "zero bytes transferred": a worker that dequeues it re-posts the same message for the next worker and then exits.
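The re-posted shutdown message does not depend on completion ports. Here is a minimal portable sketch of the same idea, with a std::mutex/std::condition_variable queue standing in for the completion port, so none of this is the IOCP API. The names TaskQueue, worker and run_pool, and the use of an empty std::function as the stop marker, are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Stand-in for the completion port: a blocking FIFO queue of tasks.
// An empty std::function plays the role of the "zero bytes" stop marker.
class TaskQueue {
public:
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        ready_.notify_one();
    }

    // Blocks (without spinning) until a task is available.
    std::function<void()> take() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !tasks_.empty(); });
        std::function<void()> task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> tasks_;
};

void worker(TaskQueue& queue) {
    for (;;) {
        std::function<void()> task = queue.take();
        if (!task) {              // stop marker seen
            queue.post(nullptr);  // re-post it for the next worker
            return;
        }
        task();
    }
}

// Runs task_count trivial tasks on worker_count workers and returns how
// many tasks were executed before the pool shut down.
int run_pool(int worker_count, int task_count) {
    assert(worker_count > 0);
    TaskQueue queue;
    std::atomic<int> done{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < worker_count; ++i)
        workers.emplace_back(worker, std::ref(queue));
    for (int i = 0; i < task_count; ++i)
        queue.post([&done] { ++done; });
    queue.post(nullptr);  // a single stop marker shuts down every worker
    for (std::thread& t : workers) t.join();
    return done.load();
}
```

Because the queue is first-in-first-out, the stop marker is only dequeued after every real task has been handed to some worker, so one marker is enough no matter how many workers there are.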

I am not an expert on windows queues, but I am nearly certain there has to be an asynchronous event driven mechanism for message passing.

Related

boost ASIO and message passing between thread

I am working on designing a websocket server which receives a message and saves it to an embedded database. For reading the messages I am using boost asio. To save the messages to the embedded database I see a few options in front of me:
Save the messages synchronously as soon as I receive them over the same thread.
Save the messages asynchronously on a separate thread.
I am pretty sure the second answer is what I want. However, I am not sure how to pass messages from the socket thread to the IO thread. I see the following options:
Use one io service per thread and use the post function to communicate between threads. Here I have to worry about lock contention. Should I?
Use Linux domain sockets to pass messages between threads. No lock contention as far as I understand. Here I can probably use BOOST_ASIO_DISABLE_THREADS macro to get some performance boost.
Also, I believe it would help to have multiple IO threads which would receive messages in a round robin fashion to save to the embedded database.
Which architecture would be the most performant? Are there any other alternatives from the ones I mentioned?
A few things to note:
The messages are exactly 8 bytes in length.
Cannot use an external database. The database must be embedded in the running process.
I am thinking about using RocksDB as the embedded database.
I don't think you want to use a unix socket, which is always going to require a system call and pass data through the kernel. That is generally more suitable as an inter-process mechanism than an inter-thread mechanism.
Unless your database API requires that all calls be made from the same thread (which I doubt) you don't have to use a separate boost::asio::io_service for it. I would instead create an io_service::strand on your existing io_service instance and use the strand::dispatch() member function (instead of io_service::post()) for any blocking database tasks. Using a strand in this manner guarantees that at most one thread may be blocked accessing the database, leaving all the other threads in your io_service instance available to service non-database tasks.
Why might this be better than using a separate io_service instance? One advantage is that having a single instance with one set of threads is slightly simpler to code and maintain. Another minor advantage is that using strand::dispatch() will execute in the current thread if it can (i.e. if no task is already running in the strand), which may avoid a context switch.
For the ultimate optimization I would agree that using a specialized queue whose enqueue operation cannot make a system call could be fastest. But given that you have network i/o by producers and disk i/o by consumers, I don't see how the implementation of the queue is going to be your bottleneck.
After benchmarking/profiling I found the Facebook folly implementation of MPMC Queue to be the fastest by at least a 50% margin. If I use the non-blocking write method, then the socket thread has almost no overhead and the IO threads remain busy. The number of system calls is also much lower than with other queue implementations.
The SPSC queue with cond variable in boost is slower. I am not sure why that is. It might have something to do with the adaptive spin that folly queue uses.
Also, message passing (UDP domain sockets in this case) turned out to be orders of magnitude slower especially for larger messages. This might have something to do with copying of data twice.
You probably only need one io_service -- you can create additional threads which will process events occurring within the io_service by providing boost::asio::io_service::run as the thread function. This should scale well for receiving 8-byte messages from clients over the network socket.
For storing the messages in the database, it depends on the database & interface. If it's multi-threaded, then you might as well just send each message to the DB from the thread that received it. Otherwise, I'd probably set up a boost::lockfree::queue where a single reader thread pulls items off and sends them to the database, and the io_service threads append new messages to the queue when they arrive.
Is that the most efficient approach? I dunno. It's definitely simple, and gives you a baseline that you can profile if it's not fast enough for your situation. But I would recommend against designing something more complicated at first: you don't know whether you'll need it at all, and unless you know a lot about your system, it's practically impossible to say whether a complicated approach would perform any better than the simple one.
void Consumer( boost::lockfree::queue<uint64_t> &message_queue ) {
    // Connect to database...
    while (!Finished) {  // Finished is a shared stop flag, e.g. a std::atomic<bool>
        // add_to_database is a functor that takes a message.
        message_queue.consume_all( add_to_database );
        // Use a timed wait to avoid missing a signal.
        // It's OK to consume_all() even if there's nothing in the queue.
        cond_var.wait_for( ... );
    }
}
void Producer( boost::lockfree::queue<uint64_t> &message_queue ) {
    while (!Finished) {
        uint64_t m = receive_from_network( );
        message_queue.push( m );
        cond_var.notify_all( );
    }
}
Assuming that the constraint of using C++11 is not too hard in your situation, I would try to use std::async to make an asynchronous call to the embedded DB.
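A rough sketch of what the std::async suggestion could look like. InMemoryDb is a hypothetical stand-in for the embedded database (it is not RocksDB or any real API), and save_async is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the embedded database; put() is made
// thread-safe because several async calls may run at the same time.
class InMemoryDb {
public:
    bool put(std::uint64_t message) {
        std::lock_guard<std::mutex> lock(mutex_);
        rows_.push_back(message);
        return true;
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return rows_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::uint64_t> rows_;
};

// Starts the write on another thread and returns immediately.
// The caller must keep the future: a future obtained from std::async
// blocks in its destructor until the task has finished.
std::future<bool> save_async(InMemoryDb& db, std::uint64_t message) {
    return std::async(std::launch::async,
                      [&db, message] { return db.put(message); });
}
```

The socket thread would typically store the returned futures and collect them later. Discarding the future right away makes the call effectively synchronous.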

Solution for non-blocking timer and server is boost threads?

My project has a queue, a server and a timer. The server receives data and puts it in the queue, and the timer processes the queue. When the queue is processed, external processes are opened with popen, which means that popen will block the timer until a process has ended.
Correct me if I'm wrong, but as both server and timer are linked to the same io_service, if the server receives data, it will block the io_service from proceeding to the next event, and vice versa: the timer blocks the server while a process from the queue is being executed.
I'm thinking of a solution based on boost::thread, but I'm not sure what architecture I should use, as I have never used threads. My options are:
Two threads - one for the timer and one for the server, each one using its own io_service
One thread - one for the timer with its own io_service. the server remains in main process
In both ways the queue (a simple map) must be shared, so I think I'll have some trouble with mutexes and other things
If someone wants to take a look at the code, it is at https://github.com/MendelGusmao/CGI-for-LCD-Smartie
Thanks!
I don't see why you can't have your server listening for connections, processing data, and placing that data in the queue in one thread while your timer takes those items out of the queue in another thread and then spawns processes via popen() to process the queue data. Unless there is a detail here that I've missed, the socket that the server will be listening on (or pipe, FIFO, etc.), is separate from the pipe that will be internally opened by the libc runtime via popen(), so your server and timer threads won't be blocking each other. You'll simply have to make sure that you have enough space in the queue to store the data coming in from the server without overflowing memory (i.e., if this is a high-data-rate application, and data is coming in much faster than it's being processed, you'll eventually run out of memory).
Finally, while guarding a shared queue via mutexes is a good thing, it's actually unnecessary for a single-producer/single-consumer situation like the one you're describing if you decide to use a bounded queue (i.e., a ring buffer) whose read and write indices are updated atomically. If you decide on an unbounded queue, while there are some lockless algorithms out there, they're pretty complex, and so guarding an unbounded queue like std::queue<T> with a mutex is an absolute must.
I have implemented almost the exact thing you have described using Windows threads. I had my consumer wait on an event HANDLE which is fired by the producer when the queue gets too long. There was a timeout on the wait as well, so that if the queue was not filled fast enough the consumer would still wake up and process the queue. It was a Windows service, so the main thread was used for that. And yes, mutexes will be required to access the shared object.
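To make the ring-buffer remark concrete, here is a minimal sketch of a bounded single-producer/single-consumer queue that needs no mutex. SpscRing and its member names are illustrative, not taken from any library, and it is only safe with exactly one pushing thread and one popping thread.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Bounded single-producer/single-consumer queue. One slot is always left
// unused to tell "full" from "empty", so it holds at most N - 1 items.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "need at least one usable slot");

public:
    // Call from the producer thread only. Returns false when full.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // full
        slots_[head] = value;
        // Publish the slot only after it has been written.
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only. Returns false when empty.
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // empty
        out = slots_[tail];
        // Release the slot only after it has been read.
        tail_.store((tail + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write (producer)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer)
};
```

Neither call ever blocks, so the caller decides what to do when the queue is full or empty, for example drop the item, retry later, or sleep on a condition variable.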
So I used two threads (not including the main), 1 mutex, 1 shared object. I think your better option is also two threads as it keeps the logic cleaner. The main thread just starts the two threads and then waits (or can be used for signalling, control, output), and the two other threads are just doing their own jobs.
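The "signal when the queue gets long, but also wake on a timeout" scheme described above can be sketched portably with a condition variable in place of the Windows event HANDLE. BatchQueue, its threshold parameter and the int payload are assumptions for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

class BatchQueue {
public:
    explicit BatchQueue(std::size_t threshold) : threshold_(threshold) {
        assert(threshold > 0);
    }

    // Producer side: wake the consumer only once the backlog is long enough.
    void push(int item) {
        bool wake = false;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(item);
            wake = items_.size() >= threshold_;
        }
        if (wake) full_enough_.notify_one();
    }

    // Consumer side: sleeps until the backlog reaches the threshold or the
    // timeout expires, then returns whatever is queued (possibly nothing).
    std::vector<int> wait_for_batch(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        full_enough_.wait_for(lock, timeout,
                              [this] { return items_.size() >= threshold_; });
        std::vector<int> batch;
        batch.swap(items_);
        return batch;
    }

private:
    const std::size_t threshold_;
    std::mutex mutex_;
    std::condition_variable full_enough_;
    std::vector<int> items_;
};
```

The consumer thread would call wait_for_batch in a loop. The timeout bounds how long a short backlog can sit in the queue, and the threshold keeps the consumer from waking for every single item.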

Waiting on a condition (pthread_cond_wait) and a socket change (select) simultaneously

I'm writing a POSIX compatible multi-threaded server in c/c++ that must be able to accept, read from, and write to a large number of connections asynchronously. The server has several worker threads which perform tasks and occasionally (and unpredictably) queue data to be written to the sockets. Data is also occasionally (and unpredictably) written to the sockets by the clients, so the server must also read asynchronously. One obvious way of doing this is to give each connection a thread which reads and writes from/to its socket; this is ugly, though, since each connection may persist for a long time and the server thus may have to hold hundreds or thousands of threads just to keep track of connections.
A better approach would be to have a single thread that handled all communications using the select()/pselect() functions. I.e., a single thread waits on any socket to be readable, then spawns a job to process the input that will be handled by a pool of other threads whenever input is available. Whenever the other worker threads produce output for a connection, it gets queued, and the communication thread waits for that socket to be writable before writing it.
The problem with this is that the communication thread may be waiting in the select() or pselect() function when output is queued by the worker threads of the server. It's possible that, if no input arrives for several seconds or minutes, a queued chunk of output will just wait for the communication thread to be done select()ing. This shouldn't happen, however--data should be written as soon as possible.
Right now I see a couple of solutions to this that are thread-safe. One is to have the communication thread busy-wait on input and update the list of sockets it waits on for writing every tenth of a second or so. This isn't optimal since it involves busy-waiting, but it will work. Another option is to use pselect() and send the USR1 signal (or something equivalent) whenever new output has been queued, allowing the communication thread to update the list of sockets it is waiting on for writable status immediately. I prefer the latter here, but still dislike using a signal for something that should be a condition (pthread_cond_t). Yet another option would be to include, in the list of file descriptors on which select() is waiting, a dummy file that we write a single byte to whenever a socket needs to be added to the writable fd_set for select(); this would wake up the communication thread because that particular dummy file would then be readable, thus allowing the communication thread to immediately update its writable fd_set.
I feel intuitively, that the second approach (with the signal) is the 'most correct' way to program the server, but I'm curious if anyone knows either which of the above is the most efficient, generally speaking, whether either of the above will cause race conditions that I'm not aware of, or if anyone knows of a more general solution to this problem. What I really want is a pthread_cond_wait_and_select() function that allows the comm thread to wait on both a change in sockets or a signal from a condition.
Thanks in advance.
This is a fairly common problem.
One often used solution is to have pipes as a communication mechanism from worker threads back to the I/O thread. Having completed its task a worker thread writes the pointer to the result into the pipe. The I/O thread waits on the read end of the pipe along with other sockets and file descriptors and once the pipe is ready for read it wakes up, retrieves the pointer to the result and proceeds with pushing the result into the client connection in non-blocking mode.
Note that since pipe reads and writes of less than or equal to PIPE_BUF bytes are atomic, the pointers get written and read in one shot. One can even have multiple worker threads writing pointers into the same pipe because of the atomicity guarantee.
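A minimal sketch of the pointer-through-a-pipe scheme described above, using poll() on the read end. The Result struct and the names post_result and wait_result are made up for illustration. A real I/O thread would put its client sockets into the same pollfd array instead of polling the pipe alone.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <unistd.h>

// Hypothetical work result handed from a worker to the I/O thread.
struct Result {
    int connection_id;
    const char* payload;
};

// Worker side: write the pointer itself into the pipe. A pointer is far
// smaller than PIPE_BUF, so the write is atomic even with many writers.
bool post_result(int write_fd, Result* result) {
    ssize_t n;
    do {
        n = write(write_fd, &result, sizeof result);
    } while (n == -1 && errno == EINTR);
    return n == static_cast<ssize_t>(sizeof result);
}

// I/O thread side: sleep in poll() until the pipe is readable or the
// timeout (in milliseconds, -1 for none) expires, then read one pointer.
// Returns nullptr on timeout or error.
Result* wait_result(int read_fd, int timeout_ms) {
    pollfd fds[1];
    fds[0].fd = read_fd;
    fds[0].events = POLLIN;
    fds[0].revents = 0;

    int ready;
    do {
        ready = poll(fds, 1, timeout_ms);
    } while (ready == -1 && errno == EINTR);
    if (ready <= 0 || !(fds[0].revents & POLLIN))
        return nullptr;

    Result* result = nullptr;
    const ssize_t n = read(read_fd, &result, sizeof result);
    return n == static_cast<ssize_t>(sizeof result) ? result : nullptr;
}
```

The pointed-to object must stay alive until the I/O thread has consumed it, so in practice the worker allocates the result and the I/O thread frees it after sending.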
Unfortunately, the best way to do this is different for each platform. The canonical, portable way to do it is to have your I/O thread block in poll. If you need to get the I/O thread to leave poll, you send a single byte on a pipe that the thread is polling. That will cause the thread to exit from poll immediately.
On Linux, epoll is the best way. On BSD-derived operating systems (including OSX, I think), kqueue. On Solaris, it used to be /dev/poll and there's something else now whose name I forget.
You may just want to consider using a library like libevent or Boost.Asio. They give you the best I/O model on each platform they support.
Your second approach is the cleaner way to go. It's totally normal to have things like select or epoll include custom events in your list. This is what we do on my current project to handle such events. We also use timers (on Linux timerfd_create) for periodic events.
On Linux, eventfd lets you create such arbitrary user events for this purpose, so I'd say it is quite accepted practice. For POSIX-only functions, hmm, perhaps a pipe or a socketpair, which I've also seen used.
Busy-polling is not a good option. First you'll be scanning memory which will be used by other threads, thus causing CPU memory contention. Secondly you'll always have to return to your select call which will create a huge number of system calls and context switches which will hurt overall system performance.

Proper message queue usage in POSIX

I'm quite bewildered by the use of message queues in realtime OS. The code that was given seems to have message queues used down to the bone: even passing variables to another class object is done through MQ. I always have a concept of MQ used in IPC. Question is: what is a proper use of a message queue?
In realtime OS environments you often face the problem that you have to guarantee execution of code at a fixed schedule. E.g. you may have a function that gets called exactly every 10 milliseconds. Not earlier, not later.
To guarantee such hard timing constraints you have to write code that must not block the time critical code under any circumstances.
The POSIX thread synchronization primitives from <pthread.h> cannot be used here.
You must never lock a mutex or acquire a semaphore from time critical code because a different process/thread may already have it locked. However, often you are allowed to unblock some other thread from time critical code (e.g. releasing a semaphore is okay).
In such environments message queues are a nice choice to exchange data because they offer a clean way to pass data from one thread to another without ever blocking.
Using queues to just set variables may sound like overkill, but it is very good software design. If you do it that way you have a well-defined interface to your time critical code.
Also it helps to write deterministic code because you'll never run into the problem of race-conditions. If you set variables via message-queues you can be sure that the time critical code sees the messages in the same order as they have been sent. When mixing direct memory access and messages you can't guarantee this.
Message queues are predominantly used as an IPC mechanism, whenever data needs to be exchanged between two different processes. However, sometimes message queues are also used for switching thread context. For example:
You register some callback with a software layer which sits on top of a driver. The callback is invoked in the context of the driver, on a thread spawned by the driver. Now you cannot hog this driver thread by doing a lot of processing in it. So one may add the data delivered to the callback to a message queue, which has application threads blocked on it to perform the processing on the data.
I don't see why one should use message queues to replace just normal function calls.

send over IP immediately on different thread

This is probably impossible, but I'm going to ask anyway. I have a multi-threaded program (server) that receives a request on a thread dedicated to IP communications and then passes it on to worker threads to do work. I then have to send a reply with the answers back to the client as soon as the work is actually finished, with as little delay as possible. Currently I am using a consumer/producer pattern and placing replies on a queue for the IP thread to take off and send back to my client. This, however, gives me no guarantee about WHEN this is going to happen, as the IP thread might not get scheduled any time soon; I cannot know. This makes my client, which is blocking on this call, think that the request has failed, which is obviously not the point.
Since I am unable to make changes in the client, I need to solve this sending issue on my side. The problem I'm facing is that I do not wish to start sharing my IP object (currently only on 1 thread) with the worker threads, as then things get overly complicated. I wondered if there is some way I can use thread sync mechanisms to ensure that the moment my worker thread is finished, the IP thread will run and send the reply back to the client?
Will manual/autoreset events do this for me or are these not guaranteed to wake up the thread immediately?
If you need it sent immediately, your best bet is to bite the bullet and start sharing the connection object. Lock it before accessing it, of course, and be sure to think about what you'll do if the send buffer is already full (the connection thread will need to deal with sending the portion of the message that didn't fit the first time, or the worker thread will be blocked until the client accepts some of the data you've sent). This may not be too difficult if your clients only have one request running at a time; if that's the case you can simply pass ownership of the client object to the worker thread when it begins processing, and pass it back when you're done.
Another option is using real-time threads. The details will vary between operating systems, but in most cases, if your thread has a high enough priority, it will be scheduled immediately when it becomes ready to run, and will preempt all other threads with lower priority until done. On Linux this can be done with the SCHED_RR scheduling policy, for example. However, this can negatively impact performance in many cases, and it can lock up the system if your thread gets into an infinite loop. It also usually requires administrative rights to use these scheduling classes.
That said, if scheduling takes long enough that the client times out, you might have some other problems with load. You should also really put a number on how fast the response needs to be - there's no end of things you can do if you want to speed up the response, but there'll come a point where it doesn't matter anymore (do you need response in the tens of ms? single-digit ms? hundreds of microseconds? single-digit microseconds?).
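For the SCHED_RR route mentioned above, a small sketch of how a thread could be moved to that policy with pthreads on Linux. The helper name request_round_robin is made up, and picking the lowest real-time priority is just an assumption for the example.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <sched.h>

// Asks the scheduler to run `thread` under SCHED_RR at the lowest
// real-time priority. Returns 0 on success, otherwise an errno-style
// code (typically EPERM when the process lacks the needed privileges).
int request_round_robin(pthread_t thread) {
    sched_param param{};
    param.sched_priority = sched_get_priority_min(SCHED_RR);
    return pthread_setschedparam(thread, SCHED_RR, &param);
}
```

The IP thread could call request_round_robin(pthread_self()) at startup and fall back to normal scheduling if the call is refused. Even the lowest SCHED_RR priority preempts all ordinary (SCHED_OTHER) threads.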
There is no synchronization mechanism that will wake a thread immediately. When a synchronization mechanism for which a thread is waiting is signaled, the thread is placed in a ready queue for its priority class. It can be starved there for several seconds before it's scheduled (Windows does have mechanisms that deal with starvation over 3-4 second intervals).
I think that for out-of-band, critical communications you can have a higher priority thread to which you can enqueue the reply message and wake it up (with a condition variable, MRE or any other synchronization mechanism). If that thread has higher priority than the rest of your application's threads, waking it up will immediately effect a context switch.