Boost: Single Threaded IO Service

In my app I will receive various events that I would like to process asynchronously in a prioritised order.
I could do this with a boost::asio::io_service, but my application is single threaded. I don't want to pay for locks and mallocs you might need for a multi threaded program (the performance cost really is significant to me). I'm basically looking for a boost::asio::io_service that is written for single threaded execution.
I'm fairly sure I could implement this myself using boost::coroutine, but before I do: does a boost::asio::io_service written for single-threaded execution already exist? I scanned the list of Boost libraries and nothing stood out to me.

Be aware that you have to pay for synchronization as soon as you use any non-blocking calls of Asio.
Even though you might use a single thread for scheduling work and processing the resulting callbacks, Asio might still have to spawn additional threads internally for executing asynchronous calls. Those will access the io_service concurrently.
Think of an async_read on a socket: As soon as the received data becomes available, the socket has to notify the io_service. This happens concurrent to your main thread, so additional synchronization is required.
For blocking I/O this problem goes away in theory, but since asynchronous I/O is sort of the whole point of the library, I would not expect to find too many optimizations for this case in the implementation.
As was pointed out in the comments already, the contention on the io_service will be very low with only one main thread, so unless profiling indicates a clear performance bottleneck there, you should not worry about it too much.

I suggest using boost::asio together with boost::coroutine via boost::asio::yield_context, which already couples coroutines to the io_service. If you detect a task with higher priority, you can suspend the current task and start processing the higher-priority one.
The catch is that you have to define check-points in your task's code and call them, so that the task can be suspended when the condition (a higher-priority task has been enqueued) holds.
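
The check-point idea can be illustrated without coroutines. The following is only a minimal sketch using the standard library, not Boost.Asio: PriorityScheduler, long_job and run_priority_demo are hypothetical names invented for this example. A long low-priority job is split into steps and re-posts itself after each step; that re-post is the check-point at which a more urgent task can run first. Everything happens on one thread, so no locks are involved.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded scheduler: highest priority first,
// FIFO among tasks of equal priority. No locking anywhere.
class PriorityScheduler {
public:
    using Task = std::function<void()>;

    void post(int priority, Task task) {
        queue_.push(Entry{priority, next_seq_++, std::move(task)});
    }

    // Runs until the queue is empty. Tasks may call post() while running.
    void run() {
        while (!queue_.empty()) {
            // top() returns a const reference, so copy the task before pop().
            Task task = queue_.top().task;
            queue_.pop();
            task();
        }
    }

private:
    struct Entry {
        int priority;
        std::uint64_t seq;
        Task task;
        bool operator<(const Entry& other) const {
            if (priority != other.priority) return priority < other.priority;
            return seq > other.seq;  // earlier entries win ties
        }
    };
    std::priority_queue<Entry> queue_;
    std::uint64_t next_seq_ = 0;
};

// A long low-priority job split into three steps. Re-posting itself after
// each step is the check-point where more urgent work can cut in.
void long_job(PriorityScheduler& s, std::vector<std::string>& log, int step) {
    log.push_back("low:" + std::to_string(step));
    if (step == 0) {
        // Simulate an urgent event arriving while the job is in progress.
        s.post(10, [&log] { log.push_back("high"); });
    }
    if (step < 2) {
        s.post(1, [&s, &log, step] { long_job(s, log, step + 1); });
    }
}

std::vector<std::string> run_priority_demo() {
    PriorityScheduler s;
    std::vector<std::string> log;
    s.post(1, [&s, &log] { long_job(s, log, 0); });
    s.run();
    return log;
}
```

Calling run_priority_demo() yields the order low:0, high, low:1, low:2: the urgent task runs at the first check-point rather than waiting for the whole job to finish.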

Related

Does Asio's io_context.run() lock the thread into busy waiting?

This seems like a straightforward question, but I can't find any information on it. When calling Asio's io_context.run(), if there is nothing to read or write asynchronously at that moment, does Asio busy-wait on that thread, or does it do something smarter so that the thread can be released and used by other parts of the application or the OS?
I looked into the code, but the answer is not very clear to me. I do see condition variables used in some places, so I can only presume that the run call does not busy-wait unless it has to.
I ask because we would like to maximise thread efficiency, so it was suggested that we place a thread sleep inside a recursive async read handler in case Asio is busy-waiting. We don't get enough network activity to keep a single thread fully occupied.

It's not busy-waiting. This is documented here: The Proactor Design Pattern: Concurrency Without Threads
It describes which underlying APIs are preferred on each platform:
On many platforms, Boost.Asio implements the Proactor design pattern in terms of a Reactor, such as select, epoll or kqueue.
And
On Windows NT, 2000 and XP, Boost.Asio takes advantage of overlapped I/O to provide an efficient implementation of the Proactor design pattern.
Q. "it was suggested to place a thread sleep inside a recursive async read handler in case asio is busy waiting"
Don't do that. Keeping handlers short will allow you to multiplex all IO on a single service. If you do blocking work, consider posting it to a separate thread (pool).

Boost.Asio and message passing between threads

I am working on designing a websocket server which receives a message and saves it to an embedded database. For reading the messages I am using boost asio. To save the messages to the embedded database I see a few options in front of me:
1. Save the messages synchronously, as soon as I receive them, on the same thread.
2. Save the messages asynchronously on a separate thread.
I am pretty sure the second option is what I want. However, I am not sure how to pass messages from the socket thread to the IO thread. I see the following options:
1. Use one io_service per thread and use the post function to communicate between threads. Here I have to worry about lock contention. Should I?
2. Use Linux domain sockets to pass messages between threads. No lock contention as far as I understand. Here I can probably use the BOOST_ASIO_DISABLE_THREADS macro to get some performance boost.
Also, I believe it would help to have multiple IO threads which would receive messages in a round robin fashion to save to the embedded database.
Which architecture would be the most performant? Are there any other alternatives from the ones I mentioned?
A few things to note:
The messages are exactly 8 bytes in length.
Cannot use an external database. The database must be embedded in the running process.
I am thinking about using RocksDB as the embedded database.

I don't think you want to use a unix socket, which is always going to require a system call and pass data through the kernel. That is generally more suitable as an inter-process mechanism than an inter-thread mechanism.
Unless your database API requires that all calls be made from the same thread (which I doubt) you don't have to use a separate boost::asio::io_service for it. I would instead create an io_service::strand on your existing io_service instance and use the strand::dispatch() member function (instead of io_service::post()) for any blocking database tasks. Using a strand in this manner guarantees that at most one thread may be blocked accessing the database, leaving all the other threads in your io_service instance available to service non-database tasks.
Why might this be better than using a separate io_service instance? One advantage is that having a single instance with one set of threads is slightly simpler to code and maintain. Another minor advantage is that using strand::dispatch() will execute in the current thread if it can (i.e. if no task is already running in the strand), which may avoid a context switch.
For the ultimate optimization I would agree that using a specialized queue whose enqueue operation cannot make a system call could be fastest. But given that you have network i/o by producers and disk i/o by consumers, I don't see how the implementation of the queue is going to be your bottleneck.

After benchmarking and profiling, I found the Facebook folly implementation of MPMCQueue to be the fastest by a margin of at least 50%. If I use the non-blocking write method, the socket thread has almost no overhead and the IO threads remain busy. The number of system calls is also much lower than with the other queue implementations.
The SPSC queue with cond variable in boost is slower. I am not sure why that is. It might have something to do with the adaptive spin that folly queue uses.
Also, message passing (UDP domain sockets in this case) turned out to be orders of magnitude slower especially for larger messages. This might have something to do with copying of data twice.

You probably only need one io_service -- you can create additional threads which will process events occurring within the io_service by providing boost::asio::io_service::run as the thread function. This should scale well for receiving 8-byte messages from clients over the network socket.
For storing the messages in the database, it depends on the database & interface. If it's multi-threaded, then you might as well just send each message to the DB from the thread that received it. Otherwise, I'd probably set up a boost::lockfree::queue where a single reader thread pulls items off and sends them to the database, and the io_service threads append new messages to the queue when they arrive.
Is that the most efficient approach? I dunno. It's definitely simple, and gives you a baseline that you can profile if it's not fast enough for your situation. But I would recommend against designing something more complicated at first: you don't know whether you'll need it at all, and unless you know a lot about your system, it's practically impossible to say whether a complicated approach would perform any better than the simple one.

void Consumer( lockfree::queue<uint64_t> &message_queue ) {
    // Connect to database...
    while (!Finished) {
        // add_to_database is a Functor that takes a message
        message_queue.consume_all( add_to_database );
        // Use a timed wait to avoid missing a signal.
        // It's OK to consume_all() even if there's nothing in the queue.
        cond_var.wait_for( ... );
    }
}
void Producer( lockfree::queue<uint64_t> &message_queue ) {
    while (!Finished) {
        uint64_t m = receive_from_network( );
        message_queue.push( m );
        cond_var.notify_all( );
    }
}

Assuming that requiring C++11 is not too hard a constraint in your situation, I would try using std::async to make an asynchronous call to the embedded DB.
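
As a rough illustration of the std::async suggestion, here is a minimal sketch. The function save_batch is a hypothetical stand-in for whatever blocking call the embedded database exposes, and batching several 8-byte messages per call is an assumption made for the example, since starting one asynchronous task per message would likely cost more than the write itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for a blocking batch write to the embedded database.
// Returns the number of messages written.
std::size_t save_batch(std::vector<std::uint64_t> batch) {
    // A real implementation would call the database API here.
    return batch.size();
}

// Starts the write on another thread and returns immediately.
std::future<std::size_t> save_batch_async(std::vector<std::uint64_t> batch) {
    return std::async(std::launch::async, save_batch, std::move(batch));
}

std::size_t run_async_demo() {
    std::future<std::size_t> pending = save_batch_async({1, 2, 3});
    // The calling thread is free to keep reading from the socket here.
    return pending.get();  // blocks only at the point the result is needed
}
```

One thing to keep in mind: the destructor of a future returned by std::async blocks until the task has finished, so discarding the return value turns the call back into a synchronous one.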

Is sync or async better in Boost.Asio when there is a lot of calculation and push/pop on thread-safe containers?

I need advice on boost::asio because I am totally new to it and have a deadline soon. I need to create a TCP server (with a lot of connections), and I used the chat server example from the documentation as a starting point.
When I receive a message I have a lot of calculation to do, and I need to push onto a thread-safe queue (a mutex with a lock guard). Apart from the reading and writing, everything is computed in the main thread (where the callbacks execute?). Should I switch to synchronous I/O with a lot of threads, or is there a rule for making async code with heavy calculations faster?
(I can put the calculation in a new async call, but I wonder whether there is a better solution.)

Just handle the communication asynchronously, on a single thread. This should allow up to ~10k connections per second. Just don't perform anything slow on this thread. Just push onto the queue and yield to the communication service.
Now, start as many threads as can usefully do the CPU-intensive work: usually the number of logical cores, sometimes the number of physical cores, and, if you are saturating the communication throughput (unlikely), perhaps (#cores - 1).
If you anticipate that the IO side will be saturated and you cannot afford to block even on the mutex, use a lock-free queue. In that case, definitely size the pool at (#cores - 1) workers, because the workers would naturally spin in a tight loop waiting for messages on the queue, suffocating the IO thread if you don't take precautions.
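
The layout described in this answer could look roughly like the sketch below. It is not Asio code: WorkQueue and run_worker_demo are hypothetical names, the loop that pushes numbers stands in for the asynchronous communication thread, and squaring each item stands in for the CPU-intensive calculation. It uses the mutex-protected variant of the queue, so idle workers sleep on a condition variable instead of spinning.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical mutex-protected queue shared by the IO thread and the workers.
class WorkQueue {
public:
    void push(std::uint64_t item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(item);
        }
        ready_.notify_one();
    }

    // Tells the workers that no further items will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an item is available.
    // Returns false once the queue is closed and fully drained.
    bool pop(std::uint64_t& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::uint64_t> items_;
    bool closed_ = false;
};

std::uint64_t run_worker_demo(std::uint64_t message_count) {
    // hardware_concurrency() may return 0 when the value is unknown.
    unsigned cores = std::thread::hardware_concurrency();
    unsigned worker_count = cores > 1 ? cores - 1 : 1;  // leave a core for IO

    WorkQueue queue;
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < worker_count; ++i) {
        workers.emplace_back([&queue, &total] {
            std::uint64_t item;
            while (queue.pop(item)) total += item * item;  // stand-in for heavy work
        });
    }

    // Stand-in for the communication thread: it only pushes and moves on.
    for (std::uint64_t m = 1; m <= message_count; ++m) queue.push(m);
    queue.close();
    for (auto& w : workers) w.join();
    return total.load();
}
```

With ten messages, run_worker_demo(10) returns 385, the sum of the squares of 1 through 10, regardless of how many workers the machine ends up with.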

Is there a way to find out, whether a thread is blocked?

I'm writing a thread pool class in C++ which receives tasks to be executed in parallel. I want all cores to be busy, if possible, but sometimes some threads are idle because they are blocked for a time for synchronization purposes. When this happens I would like to start a new thread, so that there are always approximately as many threads awake as there are cpu cores. For this purpose I need a way to find out whether a certain thread is awake or sleeping (blocked). How can I find this out?
I'd prefer to use the C++11 standard library or boost for portability purposes. But if necessary I would also use WinAPI. I'm using Visual Studio 2012 on Windows 7. But really, I'd like to have a portable way of doing this.
Preferably this thread-pool should be able to master cases like
MyThreadPool pool;
for ( int i = 0; i < 100; ++i )
    pool.addTask( &block_until_this_function_has_been_called_a_hundred_times );
pool.join(); // waits until all tasks have been dispatched.
where the function block_until_this_function_has_been_called_a_hundred_times() blocks until 100 threads have called it. At this time all threads should continue running. One requirement for the thread-pool is that it should not deadlock because of a too low number of threads in the pool.

Add a facility to your thread pool for a thread to say "I'm blocked" and then "I'm no longer blocked". Before every significant blocking action (see below for what I mean by that) signal "I'm blocked", and then "I'm no longer blocked" afterwards.
What constitutes a "significant blocking action"? Certainly not a simple mutex lock: mutexes should only be held for a short period of time, so blocking on a mutex is not a big deal. I mean things like:
Waiting for I/O to complete
Waiting for another pool task to complete
Waiting for data on a shared queue
and other similar events.
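
One possible shape for such a facility is sketched below. BlockTracker and BlockedScope are hypothetical names, and the sketch covers only the bookkeeping, not a complete thread pool: tasks wrap each significant blocking action in a BlockedScope, and the pool consults should_add_thread() to decide whether to start another worker.

```cpp
#include <atomic>

// Hypothetical bookkeeping a pool could use to decide when to add a thread.
class BlockTracker {
public:
    BlockTracker(int thread_count, int target_awake)
        : threads_(thread_count), blocked_(0), target_awake_(target_awake) {}

    void enter_blocked() { ++blocked_; }  // "I'm blocked"
    void leave_blocked() { --blocked_; }  // "I'm no longer blocked"
    void thread_added() { ++threads_; }

    // Approximate: the two counters are read separately, not as one snapshot.
    int awake() const { return threads_.load() - blocked_.load(); }

    // True when fewer threads are awake than there are cores to keep busy.
    bool should_add_thread() const { return awake() < target_awake_; }

private:
    std::atomic<int> threads_;
    std::atomic<int> blocked_;
    const int target_awake_;
};

// RAII guard: construct it before a significant blocking action; the
// destructor reports "no longer blocked" even if an exception is thrown.
class BlockedScope {
public:
    explicit BlockedScope(BlockTracker& tracker) : tracker_(tracker) {
        tracker_.enter_blocked();
    }
    ~BlockedScope() { tracker_.leave_blocked(); }
    BlockedScope(const BlockedScope&) = delete;
    BlockedScope& operator=(const BlockedScope&) = delete;

private:
    BlockTracker& tracker_;
};

bool run_tracker_demo() {
    BlockTracker tracker(4, 4);                  // 4 threads on a 4-core machine
    bool before = tracker.should_add_thread();   // all 4 awake
    bool during = false;
    {
        BlockedScope guard(tracker);             // e.g. around a wait on a queue
        during = tracker.should_add_thread();    // only 3 awake now
    }
    bool after = tracker.should_add_thread();    // back to 4 awake
    return !before && during && !after;
}
```

The guard makes it hard to forget the "no longer blocked" half of the pair, which would otherwise cause the pool to keep spawning threads.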

Use Boost.Asio. It has its own thread pool management and scheduling framework. The basic idea is to push tasks to the io_service object using the post() method, and to call run() from as many threads as you have CPU cores. You should create a work object while the calculation is running to keep the threads from exiting when they temporarily run out of jobs.
The important thing about Asio is never to use any blocking calls. For I/O calls, use the asynchronous calls of Asio's own I/O objects. For synchronization, use strand objects instead of mutexes. If you post functions wrapped in a strand to the io_service, Asio ensures that at most one task belonging to that strand runs at any time. If there is a conflict, the task remains in Asio's event queue instead of blocking a worker thread.
There is one drawback to asynchronous programming, though. Code that is scattered across several asynchronous calls is much harder to read than code with a clear control flow. You should be aware of this when designing your program.

Does endless While loop take up CPU resources?

From what I understand, you write your Linux Daemon that listens to a request in an endless loop.
Something like..
int main() {
    while(1) {
        //do something...
    }
}
ref: http://www.thegeekstuff.com/2012/02/c-daemon-process/
I read that sleeping a program makes it go into waiting mode so it doesn't eat up resources.
1. If I want my daemon to check for a request every 1 second, would the following be resource consuming?
int main() {
    while(1) {
        if (request) {
            //do something...
        }
        sleep(1);
    }
}
2. If I were to remove the sleep, does it mean the CPU consumption will go up to 100%?
3. Is it possible to run an endless loop without eating resources? Say, if it does nothing but loop. Or just sleep(1).
Endless loops and CPU resources is a mystery to me.

Is it possible to run an endless loop without eating resources? Say..if it does nothing but just loops itself. Or just sleep(1).
There is a better option.
You can use a semaphore: the loop blocks on it at the beginning of each iteration, and you signal the semaphore whenever you want the loop body to execute.
A thread blocked on a semaphore consumes no CPU time while it waits.
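
A minimal sketch of that loop follows. The Semaphore class is an assumption of this example: it is a small counting semaphore built from a mutex and a condition variable so that it compiles as C++11 (C++20 offers std::counting_semaphore, and POSIX offers sem_wait and sem_post). The second semaphore, request_done, exists only to keep the demonstration deterministic.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Small counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

    // Blocks until the count is positive, then decrements it.
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_ = 0;
};

int run_semaphore_demo(int requests) {
    Semaphore request_ready;
    Semaphore request_done;
    std::atomic<bool> stop{false};
    int handled = 0;  // written only by the daemon, read after join()

    std::thread daemon([&] {
        for (;;) {
            request_ready.acquire();  // blocked here, not spinning
            if (stop.load()) break;
            ++handled;                // stand-in for real request handling
            request_done.release();
        }
    });

    for (int i = 0; i < requests; ++i) {
        request_ready.release();      // a request arrived
        request_done.acquire();       // wait until it has been handled
    }
    stop.store(true);
    request_ready.release();          // wake the loop once more so it can exit
    daemon.join();
    return handled;
}
```

The endless loop is still there, but between requests the thread sits in acquire() and is not scheduled at all.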

The poll and select calls (mentioned by Basile Starynkevitch in a comment) or a semaphore (mentioned by Als in an answer) are the correct ways to wait for requests, depending on circumstances. On operating systems without poll or select, there should be something similar.
Neither sleep, YieldProcessor, nor sched_yield are proper ways to do this, for the following reasons.
YieldProcessor and sched_yield merely move the process to the end of the runnable queue but leave it runnable. The effect is that they allow other processes at the same or higher priority to execute, but, when those processes are done (or if there are none), then the process that called YieldProcessor or sched_yield continues to run. This causes two problems. One is that lower priority processes still will not run. Another is that this causes the processor to be always running, using energy. We would prefer the operating system to recognize when no process needs to be running and to put the processor into a low-power state.
sleep may permit this low-power state, but it plays a guessing game about how long it will be until the next request comes in, it wakes the processor repeatedly when there is no need, and it makes the process less responsive to requests, since the process will continue sleeping until the expiration of the requested time even if there is a request to be serviced.
The poll and select calls are designed for exactly this situation. They tell the operating system that this process wants to service a request coming in on one of its I/O channels but otherwise has no work to do. This allows the operating system to mark the process as not runnable and to put the processor in a low-power state if suitable.
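
For illustration, a minimal POSIX sketch of waiting with poll is shown here. The helper wait_readable and the use of a pipe as the I/O channel are assumptions of this example; a daemon would normally pass its listening socket instead, and a timeout of -1 would make it wait indefinitely.

```cpp
#include <poll.h>
#include <unistd.h>

// Waits up to timeout_ms milliseconds for fd to become readable.
// Returns 1 if readable, 0 on timeout, -1 on error.
int wait_readable(int fd, int timeout_ms) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0) return -1;
    if (rc == 0) return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

bool run_poll_demo() {
    int fds[2];
    if (pipe(fds) != 0) return false;

    // Nothing has been written yet, so the call times out after 10 ms.
    // The process is not runnable during that time.
    bool idle = wait_readable(fds[0], 10) == 0;

    // Simulate a request arriving on the channel.
    const char byte = 'x';
    bool wrote = write(fds[1], &byte, 1) == 1;

    // Data is pending, so poll returns without waiting for the timeout.
    bool ready = wait_readable(fds[0], 1000) == 1;

    close(fds[0]);
    close(fds[1]);
    return idle && wrote && ready;
}
```

Unlike the sleep(1) loop, the second call returns as soon as data arrives, so responsiveness does not depend on the chosen timeout.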
Using a semaphore provides the same behavior, except that the signal to wake the process comes from another process raising the semaphore instead of activity arising in an I/O channel. Semaphores are suitable when the signal to do some work arrives in this way; simply use whichever of poll or a semaphore is more appropriate for your situation.
The criticism that poll, select, or a semaphore causes a kernel-mode call is irrelevant, because the other methods also cause kernel-mode calls. A process cannot sleep on its own; it has to call the operating system to request it. Similarly, YieldProcessor and sched_yield make requests to the operating system.

The short answer is yes -- removing sleep gives 100% CPU -- but the answer does depend on some additional details. It consumes all CPU it can get, unless...
The loop body is trivial, and optimised away.
The loop contains a blocking operation (like a file or network operation). The link you provide suggests to avoid this, but it is often a good idea to block until something relevant happens.
EDIT: For your scenario, I support the suggestion made by @Als.
EDIT 2: I expect this answer has received a -1 because I claim blocking operations can actually be a good idea. [If you -1, you should leave a motivation in a comment so that we all may learn something.]
Current popular thinking is that non-blocking (event-based) IO is good and blocking is bad. This view is oversimplified because it assumes all software that performs IO can improve throughput by using non-blocking operations.
What? Am I really suggesting that using non-blocking IO can actually reduce throughput? Yes it can. When a process serves a single activity it is actually better to use blocking IO because blocking IO only burns resources that have already been paid for in the existence of the process.
In contrast, non-blocking IO can carry a greater fixed overhead than simple blocking IO. If the process isn't able to supply additional IO that can be interleaved, then there is nothing gained by paying for non-blocking setup. (In practice, the greatest cost of inappropriate non-blocking IO is simply the added code complexity. Beyond that, this topic is largely a thought exercise.)
Under blocking IO we rely upon the operating system to schedule those processes that can make progress. That's what the OS is designed to do.
Under non-blocking IO we have greater setup costs but can share the resources of the process and its threads between interleaved work. Non-blocking IO is therefore ideal for any process that serves multiple independent activities, such as a web server. The throughput gained far outweighs the fixed overhead of non-blocking IO.
