High throughput non-blocking server design: Alternatives to busy wait - c++

I have been building a high-throughput server application for multimedia messaging, implemented in C++. Each server can be used in stand-alone mode, or many servers can be joined together to create a DHT-based overlay network; the servers act as super-peers, as in the case of Skype.
The work is in progress. Currently the server can handle around 200,000 messages per second (256-byte messages) and reaches a maximum throughput of around 256 MB/s on my machine (Intel i3 Mobile 2 GHz, Fedora Core 18 (64-bit), 4 GB RAM) for messages of length 4096 bytes. The server has two threads: one handles all I/O (epoll-based, edge-triggered) and the other processes the incoming messages. There is a third thread for overlay management, but it doesn't matter in the current discussion.
The two threads in question share data using two circular buffers. Thread #1 enqueues fresh messages for thread #2 through one circular buffer, while thread #2 returns the processed messages through the other circular buffer. The server is completely lock-free: I am not using any synchronization primitive whatsoever, not even atomic operations.
The circular buffers never overflow, because the messages are pooled (pre-allocated on start). In fact, all vital/frequently-used data structures are pooled to reduce memory fragmentation and to increase cache efficiency. As a result, we know the maximum number of messages that will ever be created when the server starts, so we can pre-calculate the maximum size of the buffers and initialize the circular buffers accordingly.
Now my question: Thread #1 enqueues the serialized messages one at a time (actually pointers to the message objects), while thread #2 takes messages out of the queue in chunks (of 32/64/128) and returns the processed messages in chunks through the second circular buffer. When there are no new messages, thread #2 busy-waits, keeping one of the CPU cores busy all the time. How can I improve the design further? What are the alternatives to the busy-wait strategy? I want to do this elegantly and efficiently. I have considered using semaphores, but I fear that is not the best solution, for the simple reason that I would have to call "sem_post" every time I enqueue a message in thread #1, which might considerably decrease the throughput; the second thread would have to call "sem_wait" an equal number of times to keep the semaphore from overflowing. Also, I fear that a semaphore implementation might be using a mutex internally.
The second good option might be the use of a signal, if I can discover an algorithm that raises the signal only when the second thread has either "emptied the queue and is in the process of calling sigwait" or is "already waiting on sigwait". In short, the signal must be raised a minimum number of times, although it won't hurt if it is raised a few more times than needed. Yes, I did use Google Search, but none of the solutions I found on the Internet were satisfactory. Here are a few considerations:
A. The server must waste a minimum of CPU cycles on system calls, and system calls must be made as rarely as possible.
B. There must be very low overhead and the algorithm must be efficient.
C. No locking what-so-ever.
I want all options to be put on the table.
Here is a link to the site where I have shared information about my server, for a better understanding of its functionality and purpose:
www.wanhive.com

Busy waiting is good if you need to wake up thread #2 as fast as possible. In fact it is the fastest way to notify one processor about changes made by another processor. You need to generate memory fences on both ends (a write fence on one side and a read fence on the other). But this holds true only if both of your threads are executed on dedicated processors. In that case no context switch is needed, just cache-coherency traffic.
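As a rough illustration (not the asker's code), such a hand-off can be written with std::atomic: the release store on the producer side plays the role of the "write fence" and the acquire load on the spinning side plays the role of the "read fence". The Message type and the single-slot mailbox are placeholders:
// Illustrative single-slot, single-producer/single-consumer hand-off.
#include <atomic>
struct Message;   // placeholder type
struct Mailbox {
    Message* slot = nullptr;
    std::atomic<bool> ready{false};
};
// Thread #1 (producer): publish one message.
void publish(Mailbox& m, Message* msg) {
    while (m.ready.load(std::memory_order_acquire))   // previous message not taken yet
        ;                                             // busy wait
    m.slot = msg;                                     // plain write to the slot
    m.ready.store(true, std::memory_order_release);   // "write fence": publish the slot
}
// Thread #2 (consumer): burns a core until a message shows up.
Message* spin_take(Mailbox& m) {
    while (!m.ready.load(std::memory_order_acquire))  // "read fence": observe the publish
        ;                                             // busy wait
    Message* msg = m.slot;
    m.ready.store(false, std::memory_order_release);  // hand the slot back to the producer
    return msg;
}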
Still, some improvements can be made.
If thread #2 is CPU bound in general and busy-waits, it can be penalized by the scheduler (at least on Windows and Linux). The OS scheduler dynamically adjusts thread priorities to improve overall system performance: it reduces the priorities of CPU-bound threads that consume large amounts of CPU time, to prevent thread starvation. You need to manually increase the priority of thread #2 to prevent this.
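A minimal sketch of how that might be done on Linux with pthread_setschedparam; the SCHED_FIFO policy and the priority value are illustrative tuning choices, and real-time policies usually require elevated privileges (e.g. CAP_SYS_NICE):
// Illustrative only: give the busy-waiting worker thread a real-time priority on Linux.
#include <pthread.h>
#include <sched.h>
bool raise_worker_priority(pthread_t worker) {
    sched_param param{};
    param.sched_priority = sched_get_priority_min(SCHED_FIFO) + 1;  // value is a tuning choice
    // Returns 0 on success, an error code (e.g. EPERM without privileges) otherwise.
    return pthread_setschedparam(worker, SCHED_FIFO, &param) == 0;
}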
If you have a multicore or multiprocessor machine, you will end up with undersubscription of processors and your application won't be able to exploit hardware concurrency. You can mitigate this by using several processing threads (copies of thread #2).
Parallelization of the processing step.
There are two options:
1. Your messages are totally ordered and need to be processed in the same order in which they arrived.
2. Your messages can be reordered, so processing can be done in any order.
In the first case you need N cyclic buffers, N processing threads, N output buffers, and one consumer thread. Thread #1 enqueues messages into those cyclic buffers in round-robin order.
// Thread #1 pseudocode
auto message = recv();
auto buffer_index = atomic_increment(&message_counter);
buffer_index = buffer_index % N; // N is the number of processing threads
// buffers is an array of cyclic buffers - Buffer* buffers[N];
Buffer* current_buffer = buffers[buffer_index];
current_buffer->enqueue(message);
Each processing thread consumes messages from one of the buffers and enqueues the result to its dedicated output buffer.
// Thread #i pseudocode
auto message = my_buffer->dequeue();
auto result = process(message);
my_output_buffer->enqueue(result);
Now you need to process all these messages in arrival order. You can do this with a dedicated consumer thread that dequeues processed messages from the output cyclic buffers in round-robin order.
// Consumer thread pseudocode
// out_message_counter is equal to message_counter at start
auto out_buffer_index = atomic_increment(&out_message_counter);
out_buffer_index = out_buffer_index % N;
// out_buffers is the array of output buffers used by the processing threads
auto out_buffer = out_buffers[out_buffer_index];
auto result = out_buffer->dequeue();
send(result); // or whatever you need to do with result
In the second case, when you don't need to preserve message order, you don't need the consumer thread or the output cyclic buffers. You just do whatever you need to do with the result in the processing thread.
N should equal the number of CPUs minus 3 in the first case (the "minus 3" accounts for one I/O thread, one consumer thread, and one DHT thread) and the number of CPUs minus 2 in the second case (one I/O thread and one DHT thread). This is because busy waiting can't be effective if you oversubscribe the processors.

It sounds like you want to coordinate a producer and a consumer connected by some shared state. In Java, at least, one way to avoid busy waiting in such patterns is to use wait and notify. With this approach, thread #2 can go into a blocked state by calling wait if it finds that the queue is empty, avoiding spinning the CPU. Once thread #1 puts something in the queue, it can call notify. A quick search for such mechanisms in C++ yields this:
wait and notify in C/C++ shared memory
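For a rough idea of what that looks like with C++11 primitives, here is a minimal sketch using std::condition_variable. The std::deque is only a stand-in for the asker's circular buffer, and note that this does introduce a mutex, which conflicts with the strict no-locking requirement:
// Illustrative wait/notify in C++11.
#include <condition_variable>
#include <deque>
#include <mutex>
struct Message;   // placeholder type
std::mutex              queue_mutex;
std::condition_variable queue_cv;
std::deque<Message*>    message_queue;   // stand-in for the shared circular buffer
// Thread #1: enqueue and wake the worker if it is sleeping.
void notify_new_message(Message* msg) {
    {
        std::lock_guard<std::mutex> lock(queue_mutex);
        message_queue.push_back(msg);
    }
    queue_cv.notify_one();
}
// Thread #2: sleep instead of spinning while the queue is empty.
Message* wait_for_message() {
    std::unique_lock<std::mutex> lock(queue_mutex);
    queue_cv.wait(lock, [] { return !message_queue.empty(); });  // handles spurious wakeups
    Message* msg = message_queue.front();
    message_queue.pop_front();
    return msg;
}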

You can have thread #2 go to sleep for X milliseconds when the queue is empty.
X can be determined by the queue length you want plus some guard band.
By the way, in user mode (ring 3) you can't use the MONITOR/MWAIT instructions, which would otherwise be ideal for your problem.
Edit
You should definitely give TBB's RWlock a try (there's a free version). Sounds like what you're looking for.
Edit2
Another option is to use condition variables. They involve a mutex and a condition; basically you wait for the condition to become "true". The low-level pthread details can be found here.
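A minimal sketch of the low-level pthread version, where the pending counter is only a stand-in for "number of messages waiting in the circular buffer":
// Illustrative pthread condition-variable wakeup.
#include <pthread.h>
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_cond = PTHREAD_COND_INITIALIZER;
static int pending = 0;
// Producer side: count one more message and wake a sleeping worker.
void signal_work(void) {
    pthread_mutex_lock(&work_lock);
    ++pending;
    pthread_cond_signal(&work_cond);
    pthread_mutex_unlock(&work_lock);
}
// Consumer side: sleep until there is something to do.
void wait_for_work(void) {
    pthread_mutex_lock(&work_lock);
    while (pending == 0)                            // loop guards against spurious wakeups
        pthread_cond_wait(&work_cond, &work_lock);  // atomically unlocks and sleeps
    --pending;
    pthread_mutex_unlock(&work_lock);
}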

Related

Multirate threads

I recently ran into a requirement for a multithreaded application whose threads run at different rates.
The questions then become (since I am still learning multithreading):
A scenario is given to put things into perspective:
Say 1st thread runs at 100 Hz "real time"
2nd runs at 10 Hz
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread? Is the common practice to just read whatever is available from the first thread, or does there need to be some kind of decimation to reduce the rate?
Does myData need to be some kind of Singleton with a locking mechanism? Note that myData isn't truly shared, but rather updated by the first thread and used by the second thread.
How about the opposite case, when the data produced in one thread needs to be consumed at a higher rate in a different thread?
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::deque or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops data items off the other end. Be sure to serialize all accesses to the FIFO queue (using a mutex or a similar locking mechanism) to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
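A minimal sketch of that "queue of length one" variant; the SensorData type and the 100 Hz / 10 Hz roles are illustrative, not taken from the question:
// Illustrative "latest value" hand-off protected by a mutex.
#include <mutex>
struct SensorData {
    double value = 0.0;
    long   sample_number = 0;
};
std::mutex data_mutex;
SensorData latest;             // the shared "myData"
bool       has_data = false;
// 100 Hz producer: overwrite whatever was there before.
void publish(const SensorData& d) {
    std::lock_guard<std::mutex> lock(data_mutex);
    latest = d;
    has_data = true;
}
// 10 Hz consumer: read the most recent value; missed updates are simply skipped.
bool read_latest(SensorData& out) {
    std::lock_guard<std::mutex> lock(data_mutex);
    if (!has_data)
        return false;
    out = latest;
    return true;
}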
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
Does myData need to be some kind of Singleton with a locking mechanism?
Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!)
How about the opposite case, when the data produced in one thread needs to be consumed at a higher rate in a different thread?
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up don't matter, because the communications mechanism will do the right thing regardless of when or how often the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
a producer-consumer container that lets the first thread enqueue data and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on the size, after which either data would be lost or the 1st thread would be forced to slow down and wait before enqueuing further values
there are libraries available (e.g. Boost), or if you want to implement it yourself, look up some tutorials/docs on mutexes and condition variables
do something conceptually similar to the above but where the size limit is 1 so there's just the single myData variable rather than a "container" - but all the synchronisation and delay choices remain the same
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer argument to the function(s) run in the threads. Singleton is easily overused and best avoided unless the reasons for it stack up high.

In an LMAX disruptor like pattern, how do you handle a slow consumer?

I have a question about what to do in the case of a slow consumer in an LMAX-disruptor-like ring buffer with multiple producers and a single consumer, running on x86 Linux. With an LMAX-like ring buffer pattern you are constantly overwriting data, but what if the consumer is slow? How do you handle the case where, say, in a 10-slot ring buffer (slots 0-9) your consumer is on slot 5 and your writers are now ready to start writing slot 15, which is also slot 5 in the buffer (i.e. slot 5 = 15 % 10)? What is the typical way to handle this such that writers still produce data in the order it came in and clients receive the data in the same order? That's really my question. Below are some details about my design. It works fine; I just don't currently have a good way to handle this issue. There are multiple threads doing writes and a single thread doing reads. I can't introduce multiple reader threads without changing the existing design, which is beyond the current project scope, but I'm still interested in your thoughts if they involve that as a solution.
Design specifics
I have a ring buffer, and the design currently has multiple producer threads and a single consumer thread. This part of the design already exists and cannot currently change. I am trying to replace the existing queueing system with a lock-free ring buffer. What I have is as follows.
The code runs on x86 Linux; there are multiple writer threads and a single reader thread. The reader and writer sequences start one slot apart and are std::atomic<uint64_t>: the reader starts at slot 0 and the writer at slot 1. Each writer first claims a slot by doing an atomic fetch_add(1, std::memory_order_acq_rel) on the writer sequence, by calling incrementSequence shown below, and afterwards uses a compare-and-swap loop to update the reader sequence to let clients know the slot is available; see updateSequence.
inline data_type incrementSequence() {
    return m_sequence.fetch_add(1, std::memory_order_seq_cst);
}

void updateSequence(data_type aOld, data_type aNew) {
    // Retry until the sequence reaches aNew; on failure compare_exchange_weak
    // reloads the currently observed value into aOld.
    while (!m_sequence.compare_exchange_weak(aOld, aNew,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
        if (sequence() < aNew) {
            continue;
        }
        break;
    }
}

inline data_type sequence() const {
    return m_sequence.load(std::memory_order_acquire);
}
A ring buffer (or a FIFO in general--doesn't have to be implemented as a ring buffer) is intended to smooth out bursts of traffic. Even though producers may produce the data in bursts, the consumers can deal with a steady flow of input.
If you're overflowing the FIFO, it means one of two things:
Your bursts are larger than you planned for. Fix this by increasing the FIFO size (or making its size dynamic).
Your producers are out-running your consumers. Fix this by increasing the resources devoted to consuming the data (probably more threads).
It sounds to me like you're currently hitting the second: your single consumer simply isn't fast enough to keep up with the producers. The only real choices in that case are to speed up consumption by either optimizing the single consumer, or adding more consumers.
It also sounds a bit as if your consumer may be leaving its input data in the FIFO while it processes the data, so that slot in the FIFO remains occupied until the consumer finishes processing that input. If so, you may be able to fix your problem by simply having the consumer remove the input data from the FIFO as soon as it starts processing it. This frees up that slot so the producers can continue placing input into the buffer.
One other point: making the FIFO size dynamic can be something of a problem. The problem is fairly simple: it can cover up the fact that you really have the second problem of not having the resources necessary to process the data on the consumer side.
Assuming both the producers and the consumers are thread pools, the easiest way to balance the system is often to use a fixed-size FIFO. If the producers start to get so far ahead of the consumers that the FIFO overflows, then producers start to block. This lets the consumer thread pool consume more computational resources (e.g., run on more cores) to catch back up with the producers. This does, however, depend on being able to add more consumers, not restricting the system to a single consumer.
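To illustrate that blocking behaviour (this is a mutex-and-condition-variable sketch, not a lock-free disruptor, and the types are made up), a fixed-size queue whose push blocks producers when full and whose pop blocks the consumer when empty could look like this:
// Illustrative bounded blocking queue.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}
    // Producers block here when the queue is full.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push_back(std::move(item));
        not_empty_.notify_one();
    }
    // The consumer blocks here when the queue is empty.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop_front();
        not_full_.notify_one();
        return item;
    }
private:
    std::mutex              mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<T>           queue_;
    const std::size_t       capacity_;
};
With something like BoundedQueue<Entry*> queue(1024), a writer that gets too far ahead simply blocks in push until the reader catches up, so slots are never overwritten before they are consumed.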

Scheduling of Process(s) waiting for Semaphore

It is always said that when the count of a semaphore is 0, the processes requesting the semaphore are blocked and added to a wait queue.
When some process releases the semaphore and the count increases from 0 to 1, a blocked process is activated. This can be any process, randomly picked from the blocked processes.
Now my question is:
If they are added to a queue, why is the activation of blocked processes NOT in FIFO order? I think it would be easier to pick the next process from the queue than to pick a process at random and grant it the semaphore. If there is some idea behind this random logic, please explain. Also, how does the kernel select a process at random from a queue? Getting a random element out of a queue is complex as far as the queue data structure is concerned.
tags: various OSes, as each has a kernel usually written in C++, and a mutex shares a similar concept
A FIFO is the simplest data structure for the waiting list in a system that doesn't support priorities, but it's not the absolute answer otherwise. Depending on the scheduling algorithm chosen, different threads might have different absolute priorities, or some sort of decaying priority might be in effect, in which case the OS might choose the thread which has had the least CPU time in some preceding interval. Since such strategies are widely used (particularly the latter), the usual rule is to consider that you don't know (although with absolute priorities, it will be one of the threads with the highest priority).
When a process is scheduled "at random", it's not that a process is randomly chosen; it's that the selection process is not predictable.
The algorithm used by Windows kernels is that there is a queue of threads (Windows schedules "threads", not "processes") waiting on a semaphore. When the semaphore is released, the kernel schedules the next thread waiting in the queue. However, scheduling the thread does not immediately make that thread start executing; it merely makes the thread able to execute by putting it in the queue of threads waiting to run. The thread will not actually run until a CPU has no threads of higher priority to execute.
While the thread is waiting in the scheduling queue, another thread that is actually executing may wait on the same semaphore. In a traditional queue system, that new thread would have to stop executing and go to the end of the queue waiting in line for that semaphore.
In recent Windows kernels, however, the new thread does not have to stop and wait for that semaphore. If the thread that had been assigned the semaphore is still sitting in the run queue, the semaphore may be reassigned to the new thread, causing the old thread to go back to waiting on the semaphore again.
The advantage of this is that the thread that was about to have to wait in the queue for the semaphore, and then wait in the queue to run, will not have to wait at all. The disadvantage is that you cannot predict which thread will actually get the semaphore next, and it's not fair, so a thread waiting on the semaphore could potentially starve.
It is not that it CAN'T be FIFO; in fact, I'd bet many implementations ARE, for just the reasons that you state. The spec isn't that the process is chosen at random; it is that it isn't specified, so your program shouldn't rely on it being chosen in any particular way. (It COULD be chosen at random; just because it isn't the fastest approach doesn't mean it can't be done.)
All of the other answers here are great descriptions of the basic problem - especially around thread priorities and ready queues. Another thing to consider however is IO. I'm only talking about Windows here, since it is the only platform I know with any authority, but other kernels are likely to have similar issues.
On Windows, when an IO completes, something called a kernel-mode APC (Asynchronous Procedure Call) is queued against the thread which initiated the IO in order to complete it. If the thread happens to be waiting on a scheduler object (such as the semaphore in your example) then the thread is removed from the wait queue for that object which causes the (internal kernel mode) wait to complete with (something like) STATUS_ALERTED. Now, since these kernel-mode APCs are an implementation detail, and you can't see them from user mode, the kernel implementation of WaitForMultipleObjects restarts the wait at that point which causes your thread to get pushed to the back of the queue. From a kernel mode perspective, the queue is still in FIFO order, since the first caller of the underlying wait API is still at the head of the queue, however from your point of view, way up in user mode, you just got pushed to the back of the queue due to something you didn't see and quite possibly had no control over. This makes the queue order appear random from user mode. The implementation is still a simple FIFO, but because of IO it doesn't look like one from a higher level of abstraction.
I'm guessing a bit more here, but I would have thought that unix-like OSes have similar constraints around signal delivery and places where the kernel needs to hijack a process to run in its context.
Now this doesn't always happen, but the documentation has to be conservative, and unless the order is explicitly guaranteed to be FIFO (which, as described above, it can't be, for Windows at least), the ordering is described in the documentation as "random" or "undocumented" or some such, because an effectively unpredictable process controls it. This also gives OS vendors latitude to change the ordering at some later time.
Process scheduling algorithms are very specific to system functionality and operating system design. It will be hard to give a good answer to this question. If I am on a general PC, I want something with good throughput and average wait/response time. If I am on a system where I know the priority of all my jobs and know I absolutely want all my high priority jobs to run first (and don't care about preemption/starvation), then I want a Priority algorithm.
As far as random selection goes, the motivation could be any of various reasons, one being an attempt at good throughput, as mentioned above. However, it would be non-deterministic (hypothetically) and impossible to prove. This property could be an exploitation of probability (random sampling, etc.), but, again, any proof could only be based on empirical data about whether this really works.

Possible frameworks/ideas for thread managment and work allocation in C++

I am developing a C++ application that needs to process a large amount of data. I am not in a position to partition the data so that multiple processes can handle each partition independently. I am hoping to get ideas on frameworks/libraries that can manage threads and allocate work among worker threads.
Thread management should include at least the functionality below.
1. Decide how many worker threads are required. We may need to provide a user-defined function to calculate the number of threads.
2. Create the required number of threads.
3. Kill/stop unnecessary threads to reduce resource wastage.
4. Monitor the health of each worker thread.
Work allocation should include the functionality below.
1. Using a callback, the library should get a piece of work.
2. Allocate the work to an available worker thread.
3. Master/slave configuration or pipeline-of-worker-threads should be possible.
Many thanks in advance.
Your question essentially boils down to "how do I implement a thread pool?"
Writing a good thread pool is tricky. I recommend hunting for a library that already does what you want rather than trying to implement it yourself. Boost has a thread-pool library in the review queue, and both Microsoft's concurrency runtime and Intel's Threading Building Blocks contain thread pools.
With regard to your specific questions, most platforms provide a function to obtain the number of processors. In C++0x this is std::thread::hardware_concurrency(). You can then use this in combination with information about the work to be done to pick a number of worker threads.
Since creating threads is actually quite time consuming on many platforms, and blocked threads do not consume significant resources beyond their stack space and thread info block, I would recommend that you just block worker threads with no work to do on a condition variable or similar synchronization primitive rather than killing them in the first instance. However, if you end up with a large number of idle threads, it may be a signal that your pool has too many threads, and you could reduce the number of waiting threads.
Monitoring the "healthiness" of each thread is tricky, and typically platform dependent. The simplest way is just to check that (a) the thread is still running, and hasn't unexpectedly died, and (b) the thread is processing tasks at an acceptable rate.
The simplest means of allocating work to threads is just to use a single shared job queue: all tasks are added to the queue, and each thread takes a task when it has completed the previous task. A more complex alternative is to have a queue per thread, with a work-stealing scheme that allows a thread to take work from others if it has run out of tasks.
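As a rough sketch of the single-shared-queue scheme (not any particular library's API), here is a minimal C++11 pool where idle workers block on a condition variable instead of being killed; a production pool would add futures, exception handling, and possibly the per-thread queues with work stealing mentioned above:
// Illustrative thread pool with one shared job queue.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
class SimplePool {
public:
    explicit SimplePool(unsigned n = std::thread::hardware_concurrency()) {
        if (n == 0) n = 2;                       // hardware_concurrency may return 0
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();                        // wake one idle worker
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Idle workers sleep here instead of being destroyed.
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }
    std::vector<std::thread>          workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex                        mutex_;
    std::condition_variable           cv_;
    bool                              done_ = false;
};
Usage would be along the lines of: SimplePool pool; pool.submit([] { /* a task */ });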
If your threads can submit tasks to the work queue and wait for the results then you need to have a scheme for ensuring that your worker threads do not all get stalled waiting for tasks that have not yet been scheduled. One option is to spawn a new thread when a task gets blocked, and another is to run the not-yet-scheduled task that is blocking a given thread on that thread directly in a recursive manner. There are advantages and disadvantages with both these schemes, and with other alternatives.

How can I improve my real-time behavior in multi-threaded app using pthreads and condition variables?

I have a multi-threaded application that is using pthreads. I have a mutex lock and condition variables. There are two threads: one thread produces data for the second thread, a worker, which tries to process the produced data in a real-time fashion, such that one chunk is processed as close to the elapsing of a fixed time period as possible.
This works pretty well; however, occasionally, when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again.
I know this because right before the producer releases the condition upon which the worker is waiting, it does a chunk of processing for the worker if it is time to process another chunk; then, immediately upon receiving the condition in the worker thread, it also does a chunk of processing if it is time to process another chunk.
In this latter case, I am seeing that I am often late processing the chunk. I'd like to eliminate this lost efficiency and do what I can to keep the chunks ticking away as close as possible to the desired frequency.
Is there anything I can do to reduce the delay between the producer releasing the condition and the worker detecting that the condition has been released, so that the worker resumes processing sooner? For example, would it help for the producer to call something to force itself to be context-switched out?
The bottom line is that the worker has to wait each time it asks the producer to create work for it, so that the producer can modify the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand-off back to running in parallel occasionally results in a significant delay that I would like to avoid. Please suggest how this might best be accomplished.
I could suggest the following pattern. Generally the same technique could be used, e.g. when prebuffering frames in some real-time renderers or something like that.
First, it's obvious that the approach you describe in your message would only be effective if both of your threads are loaded equally (or almost equally) all the time. If not, multi-threading wouldn't actually bring much benefit in your situation.
Now, let's think about a thread pattern that would be optimal for your problem. Assume we have a yielding thread and a processing thread. The first of them prepares chunks of data to process; the second does the processing and stores the result somewhere (where exactly is not actually important).
The effective way to make these threads work together is a proper yielding mechanism. Your yielding thread should simply add data to some shared buffer and shouldn't actually care what happens with that data. And, well, your buffer could be implemented as a simple FIFO queue. This means that your yielding thread should prepare data to process and make a PUSH call to the queue:
X = PREPARE_DATA()
BUFFER.LOCK()
BUFFER.PUSH(X)
BUFFER.UNLOCK()
Now, the processing thread. Its behaviour should be described this way (you should probably add some artificial delay, like SLEEP(X), between calls to EMPTY):
IF !EMPTY(BUFFER) PROCESS(BUFFER.TOP)
The important point here is what your processing thread should do with the processed data. The obvious approach is to make a POP call after the data is processed, but you will probably want to come up with something better. Anyway, in my variant it would look like:
// After data is processed
BUFFER.LOCK()
BUFFER.POP()
BUFFER.UNLOCK()
Note that locking operations in yielding and processing threads shouldn't actually impact your performance because they are only called once per chunk of data.
Now, the interesting part. As I wrote at the beginning, this approach would only be effective if the threads act somewhat the same in terms of CPU / resource usage. There is a way to make this threading solution effective even if this condition does not constantly hold and depends on other runtime conditions.
This way means creating another thread, called the controller thread. This thread would merely compare the time each thread uses to process one chunk of data and balance the thread priorities accordingly. Actually, we don't even have to "compare the time"; the controller thread could simply work like this:
IF BUFFER.SIZE() > T
DECREASE_PRIORITY(YIELDING_THREAD)
INCREASE_PRIORITY(PROCESSING_THREAD)
Of course, you could implement better heuristics here, but the approach with the controller thread should be clear.