Chronicle Queue: What is the recommended way to multiplex producers to write into a single queue? - concurrency

Let there be 5 producer threads, and 1 queue. It seems like I have 2 options:
create an appender for each producer thread, append concurrently, and let Chronicle Queue deal with synchronization (enable double-buffering?)
synchronize the 5 producer threads first (with a lock-free mechanism, e.g. a disruptor), and create 1 extra thread with 1 appender that writes into the Chronicle Queue
Why this question?
I had the initial impression that writing to a Chronicle Queue is lock-free and should therefore be really fast. But the GitHub documentation mentions multiple times that there is a write lock that serializes concurrent writes. So I wonder whether a lock-free disruptor placed in front of the Chronicle Queue would increase performance?

What you suggest can improve the writer's performance, especially if you have an expensive serialization/marshalling strategy. However, if you are writing to a real disk, you will find that the performance of the drive (even a fast NVMe drive) is possibly your biggest issue, and you might find that the time to read the data back is worse.
Let's say you spend 1 microsecond writing a 512-byte message, and you are writing at 200K messages/s. This means that your 80th percentile will see an extra 1 us waiting for the queue due to contention. However, you will be writing about 360 GB/h, which will very quickly fill a fast NVMe drive. If instead you have a relatively low volume of 20K messages/s, you are only increasing your 98th percentile latency by 1 us.
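For reference, the arithmetic behind those figures: 512 bytes × 200,000 messages/s is about 102 MB/s, and 102 MB/s × 3,600 s is roughly 369 GB written per hour, i.e. the ~360 GB/h above. The percentiles follow from the same numbers: at 1 us per write and 200K messages/s the appender is busy about 20% of the time, so roughly 1 write in 5 (the 80th percentile) sees an extra 1 us of contention; at 20K messages/s it is busy about 2% of the time, so only the 98th percentile is affected.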
In short, if write contention is a problem, your drive is probably a much bigger problem. Adding a disruptor could help the writer, but it will delay the read time of every message.
I recommend building a latency benchmark for a realistic throughput first. You can double buffer the data yourself by writing first to a Wire and only copying the bytes while holding the lock.

Related

In what circumstances are lock-free data structures faster than lock-based ones?

I'm currently reading the book C++ Concurrency in Action by Anthony Williams, and it contains several lock-free data structure implementations. In the introduction to the chapter on lock-free data structures, Anthony writes:
This brings us to another downside of lock-free and wait-free code: although it can increase the potential for concurrency of operations on a data structure and reduce the time an individual thread spends waiting, it may well decrease overall performance.
And indeed I tested all the lock-free stack implementations described in the book against the lock-based implementation from one of the previous chapters, and the performance of the lock-free code always seems to be lower than that of the lock-based stack.
In what circumstances are lock-free data structures more optimal and to be preferred?
One benefit of lock-free structures is that they do not require a context switch. However, in modern systems, uncontended locks are also context-switch free. To benefit (performance-wise) from a lock-free algorithm, several conditions have to be met:
Contention has to be high
There should be enough CPU cores so that a spinning thread can run uninterrupted (ideally, it should be pinned to its own core)
I did a performance study years ago. When the number of threads is small, lock-free data structures and lock-based data structures are comparable. But as the number of threads rises, at some point lock-based data structures exhibit a sharp performance drop, while lock-free data structures scale up to thousands of threads.
It depends on the probability of a collision.
If a collision is very likely, then a mutex is the optimal solution.
For example: 2 threads are constantly pushing data to the end of a container.
With lock-freedom only 1 thread will succeed; the other will need to retry. In this scenario blocking and waiting would be better.
But if you have a large container and the 2 threads access the container in different areas, it is very likely that there will be no collision.
For example: one thread modifies the first element of a container and the other thread the last element.
In this case the probability of a retry is very small, hence lock-freedom would be better here.
Other problems with lock-freedom are spin-locks (heavy memory usage), the overall performance of atomic variables, and certain constraints on variables.
For example, if you have the constraint x == y which needs to hold, you cannot use atomic variables for x and y, because you cannot change both variables at once, whereas a lock would let you update them together and keep the constraint satisfied.
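A minimal sketch of that last point (the names and the invariant x == y are purely illustrative): with two separate atomics, another thread can observe the state between the two stores, whereas a mutex lets you update both values together.

#include <atomic>
#include <mutex>

// Illustrative only: try to maintain the invariant x == y.
std::atomic<int> ax{0}, ay{0};

void atomic_update(int v) {
    ax.store(v);   // another thread can load ax and ay right here...
    ay.store(v);   // ...and observe ax != ay, breaking the invariant
}

int x = 0, y = 0;
std::mutex m;

void locked_update(int v) {
    std::lock_guard<std::mutex> lock(m); // readers also take the lock,
    x = v;                               // so they never see x != y
    y = v;
}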
The only way to know which is better is to profile each. The result will change drastically from use case to use case, from one CPU to another, from one architecture to another, from one year to another. What might be best today might not be best tomorrow. So always measure and keep measuring.
That said, let me give you some of my own thoughts on this:
First: if there is no contention, it shouldn't matter what you do. The no-collision case should always be fast. If it's not, then you need a different implementation tuned to the no-contention case. One implementation might use fewer or faster machine instructions than the other and win, but the difference should be minimal. Test, but I expect near-identical results.
Next, let's look at cases with (high) contention:
Again, you might need an implementation tuned to the contended case. One locking mechanism isn't like another, and the same goes for lock-free methods.
threads <= cores
It's reasonable to assume all threads will be running and doing work. There might be small pauses where a thread gets interrupted, but that's really the exception. This obviously only holds true if yours is the only application doing this; the threads of all CPU-heavy applications add up for this scenario.
Now with a lock, one thread will get the lock and work. Other threads can wait for the lock to become available and react the instant the lock becomes free. They can busy-loop or, for longer durations, sleep; it doesn't matter much. The lock limits you to 1 thread doing work, and you get that with barely any CPU cycles wasted when the lock changes hands.
On the other hand, lock-free data structures all fall into some try-and-repeat loop. They will do work, and at some crucial point they will try to commit that work; if there was contention they will fail and try again, often repeating a lot of expensive operations. The more contention there is, the more wasted work there is. Worse, all that access to the caches and memory will slow down the thread that actually manages to commit its work in the end. So not only are you not getting ahead faster, you are slowing down progress.
I would always go with locks for any workload that takes more CPU cycles than the difference between the lock instruction and the CAS (or similar) instruction a lock-free implementation needs. It really doesn't take much work to get there, leaving only trivial cases for the lock-free approach. The built-in atomic types are such a case, and CPUs often have opcodes to do those atomic operations lock-free in hardware in a single instruction that is (relatively) fast. In the end the lock will use such an instruction itself and can never be faster than such a trivial case.
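To make the two shapes being compared concrete, here is a rough sketch (a hypothetical node-based stack, not a production implementation): the locked push does its work under the mutex, while the lock-free push prepares the node up front and retries the commit with CAS under contention.

#include <atomic>
#include <mutex>

struct Node { int value; Node* next; };

// Lock-based push: one thread at a time, no retries.
std::mutex stack_mutex;
Node* locked_head = nullptr;

void push_locked(int v) {
    Node* n = new Node{v, nullptr};
    std::lock_guard<std::mutex> lock(stack_mutex);
    n->next = locked_head;
    locked_head = n;
}

// Lock-free push: prepare the node, then try to commit with CAS;
// under contention the CAS fails and the loop repeats the commit step.
// (Memory reclamation and ABA concerns are ignored in this sketch.)
std::atomic<Node*> lockfree_head{nullptr};

void push_lockfree(int v) {
    Node* n = new Node{v, lockfree_head.load(std::memory_order_relaxed)};
    while (!lockfree_head.compare_exchange_weak(
               n->next, n,
               std::memory_order_release, std::memory_order_relaxed)) {
        // n->next has been reloaded with the current head; try again.
    }
}

Under contention, the second version keeps looping and re-touching the head's cache line on every attempt, which is exactly the wasted work described above.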
threads >> cores
If you have many more threads than cores, then only a fraction of them can run at any one time. It is likely that a thread which goes to sleep will be holding a lock. All other threads needing the lock will then also have to go to sleep until the lock-holding thread wakes up again. This is probably the worst case for locking data structures: nobody gets to do any work.
Now there are lock implementations (with help from the operating system) where a thread trying to acquire a lock will cause the lock-holding thread to take over until it releases the lock. In such systems the waste is reduced to context switching between the threads.
There is also a problem with locks called the thundering herd problem. If you have 100 threads waiting on a lock and the lock gets freed, then depending on your lock implementation and OS support, 100 threads will wake up. One will get the lock, and 99 will waste time trying to acquire the lock, fail, and go back to sleep. You really don't want a lock implementation that suffers from thundering herds.
Lock-free data structures begin to shine here. If one thread is descheduled, the other threads will continue their work and succeed in committing their results. The descheduled thread will wake up again at some point, fail to commit its work, and retry. The waste is limited to the work the one descheduled thread did.
cores < threads < 2 * cores
There is a grey zone when the number of threads is near the number of cores. The chance that the blocking thread is running remains high, but this is a very chaotic region, and results as to which method is better are rather random there. My conclusion: if you don't have tons of threads, then try really hard to stay <= core count.
Some more thoughts:
Sometimes the work, once started, needs to be done in a specific order. If one thread is descheduled, you can't just skip it. You see this in some data structures where the code will detect a conflict and one thread actually finishes the work a different thread started before it can commit its own results. Now this is really great if the other thread was descheduled, but if it's actually running, it's just wasteful to do the work twice. So data structures with this scheme really aim at the threads >> cores scenario above.
With the amount of mobile computing done today, it becomes more and more important to consider the power usage of your code. There are many ways you can optimize your code to change power usage, but really the only way for your code to use less power is to sleep. Something you hear more and more is "race to sleep". If you can make your code run faster so it can sleep earlier, then you save power. But the emphasis here is on sleeping earlier, or maybe I should say sleeping more. If you have 2 threads running 75% of the time, they might solve your problem in 75 seconds. But if you can solve the same problem with 2 threads running 50% of the time, alternating via a lock, then they take 100 seconds. The first also uses 150% CPU power. For a shorter time, true, but 75 * 150% = 112.5 > 100 * 100%. Power-wise, the slower solution wins. Locks let you sleep, while lock-free trades power for speed.
Keep that in mind and balance your need for speed with the need to recharge your phone or laptop.
The mutex design will very rarely, if ever, outperform the lockless one.
So the follow-up question is: why would anyone ever use a mutex rather than a lockless design?
The problem is that lockless designs can be hard to do and require a significant amount of design work to be reliable, while a mutex is quite trivial in comparison; debugging a lockless design can be even harder. For this reason, people generally prefer to use mutexes first, and then migrate to lock-free later once contention has been proven to be a bottleneck.
I think one thing missing in these answers is the locking period. If your locking period is very short, i.e. after acquiring the lock you perform a task for a very short time (like incrementing a variable), then using a lock-based data structure would bring in unnecessary context switching, CPU scheduling, etc. In this case, lock-free is a good option, as the thread would be spinning for only a very short time.
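As a sketch of that situation (a bare-bones test-and-set spinlock, illustrative rather than production-ready): when the critical section is only a few instructions long, spinning briefly is usually cheaper than parking the thread in the kernel.

#include <atomic>

// Minimal test-and-set spinlock; fine for critical sections of a few
// instructions, a bad idea for anything that can block or run long.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};

SpinLock counter_lock;
long counter = 0;

void increment() {
    counter_lock.lock();   // held for a handful of instructions only
    ++counter;
    counter_lock.unlock();
}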

In an LMAX-disruptor-like pattern, how do you handle a slow consumer?

I have a question on what to do in the case of a slow consumer in an LMAX-disruptor-like ring buffer that has multiple producers and a single consumer running on x86 Linux. With an LMAX-like ring buffer pattern you are constantly overwriting data, but what if the consumer is slow? How do you handle the case where, say, in a 10-slot ring buffer (ring slots 0-9), your consumer is on slot 5 and your writers are now ready to start writing slot 15, which is also slot 5 in the buffer (i.e. slot 5 = 15 % 10)? What is the typical way to handle this such that writers still produce data in the order it came in and clients receive the data in the same order? That's really my question. Below are some details about my design; it works fine, it's just that I currently don't have a good way to handle this issue. There are multiple threads doing writes and a single thread doing reads. I can't introduce multiple reader threads without changing the existing design, which is beyond the current project scope, but I'm still interested in your thoughts if they involve that as a solution.
Design specifics
I have a ring buffer, and the design currently has multiple producer threads and a single consumer thread. This part of the design already exists and cannot currently change. I am trying to replace the existing queueing system with a lock-free ring buffer. What I have is as follows.
The code runs on x86 Linux; there are multiple threads running as writers and there is a single thread for the reader. The reader and writer sequences start one slot apart and are std::atomic<uint64_t>: the reader starts at slot 0 and the writer at slot 1. Each writer first claims a slot by doing an atomic fetch_add(1, std::memory_order::memory_order_acq_rel) on the writer sequence by calling incrementSequence shown below, and afterwards uses a compare-and-swap loop to update the reader sequence to let clients know this slot is available; see updateSequence.
inline data_type incrementSequence() {
    // Claim the next slot by atomically incrementing the writer sequence.
    return m_sequence.fetch_add(1, std::memory_order_seq_cst);
}
void updateSequence(data_type aOld, data_type aNew) {
    // Publish the claimed slot: spin until the reader-visible sequence can be
    // advanced to aNew. Note that compare_exchange_weak reloads aOld with the
    // currently stored value whenever it fails.
    while (!m_sequence.compare_exchange_weak(aOld, aNew,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
        if (sequence() < aNew) {
            continue;   // a predecessor has not published yet; keep retrying
        }
        break;          // the sequence is already at or beyond aNew; nothing to do
    }
}
inline data_type sequence() const {
    return m_sequence.load(std::memory_order_acquire);
}
A ring buffer (or a FIFO in general; it doesn't have to be implemented as a ring buffer) is intended to smooth out bursts of traffic. Even though producers may produce the data in bursts, the consumers can deal with a steady flow of input.
If you're overflowing the FIFO, it means one of two things:
Your bursts are larger than you planned for. Fix this by increasing the FIFO size (or making its size dynamic).
Your producers are out-running your consumers. Fix this by increasing the resources devoted to consuming the data (probably more threads).
It sounds to me like you're currently hitting the second: your single consumer simply isn't fast enough to keep up with the producers. The only real choices in that case are to speed up consumption by either optimizing the single consumer, or adding more consumers.
It also sounds a bit as if your consumer may be leaving its input data in the FIFO while it processes the data, so that spot in the FIFO remains occupied until the consumer finishes processing that input. If so, you may be able to fix your problem by simply having the consumer remove the input data from the FIFO as soon as it starts processing. This frees up that slot so the producers can continue placing input into the buffer.
One other point: making the FIFO size dynamic can be something of a problem. The problem is fairly simple: it can cover up the fact that you really have the second problem of not having the resources necessary to process the data on the consumer side.
Assuming both the producers and the consumers are thread pools, the easiest way to balance the system is often to use a fixed-size FIFO. If the producers start to get so far ahead of the consumers that the FIFO overflows, then producers start to block. This lets the consumer thread pool consume more computational resources (e.g., run on more cores) to catch back up with the producers. This does, however, depend on being able to add more consumers, not restricting the system to a single consumer.
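To illustrate the blocking-producer idea, here is a minimal sketch under simplified assumptions (a single producer and single consumer, fixed-size ring, made-up names): the producer refuses to claim a slot the consumer has not yet freed, so it waits instead of overwriting unread data.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// Illustrative single-producer/single-consumer ring with backpressure:
// the producer waits instead of overwriting slots the consumer hasn't read.
template <typename T, size_t N>
class BoundedRing {
    T slots[N];
    std::atomic<uint64_t> writeSeq{0}; // next slot to write
    std::atomic<uint64_t> readSeq{0};  // next slot to read
public:
    void push(const T& item) {
        uint64_t w = writeSeq.load(std::memory_order_relaxed);
        // Backpressure: wait (here, yield) until the consumer frees a slot.
        while (w - readSeq.load(std::memory_order_acquire) >= N)
            std::this_thread::yield();
        slots[w % N] = item;
        writeSeq.store(w + 1, std::memory_order_release); // publish the slot
    }

    bool pop(T& item) {
        uint64_t r = readSeq.load(std::memory_order_relaxed);
        if (r == writeSeq.load(std::memory_order_acquire))
            return false;                                 // empty
        item = slots[r % N];
        readSeq.store(r + 1, std::memory_order_release);  // free the slot
        return true;
    }
};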

High throughput non-blocking server design: Alternatives to busy wait

I have been building a high-throughput server application for multimedia messaging; the language of implementation is C++. Each server can be used in stand-alone mode, or many servers can be joined together to create a DHT-based overlay network; the servers act as super-peers, as in the case of Skype.
The work is in progress. Currently the server can handle around 200,000 messages per second (256-byte messages) and has a max throughput of around 256 MB/s on my machine (Intel i3 Mobile 2 GHz, Fedora Core 18 (64-bit), 4 GB RAM) for messages of length 4096 bytes. The server has two threads: one thread for handling all I/O (epoll-based, edge-triggered) and another for processing the incoming messages. There is another thread for overlay management, but it doesn't matter in the current discussion.
The two threads in question share data using two circular buffers. Thread #1 enqueues fresh messages for thread #2 using one circular buffer, while thread #2 returns the processed messages through the other circular buffer. The server is completely lock-free. I am not using any synchronization primitive whatsoever, not even atomic operations.
The circular buffers never overflow, because the messages are pooled (pre-allocated on start). In fact, all vital/frequently-used data structures are pooled to reduce memory fragmentation and to increase cache efficiency, so we know the maximum number of messages we are ever going to create when the server starts, which means we can pre-calculate the maximum size of the buffers and initialize the circular buffers accordingly.
Now my question: thread #1 enqueues the serialized messages one message at a time (actually the pointers to message objects), while thread #2 takes messages out of the queue in chunks (chunks of 32/64/128) and returns the processed messages in chunks through the second circular buffer. In case there are no new messages, thread #2 keeps busy waiting, hence keeping one of the CPU cores busy all the time. How can I improve upon the design further? What are the alternatives to the busy-wait strategy? I want to do this elegantly and efficiently. I have considered using semaphores, but I fear that is not the best solution, for the simple reason that I have to call "sem_post" every time I enqueue a message in thread #1, which might considerably decrease the throughput, and the second thread has to call "sem_wait" an equal number of times to keep the semaphore from overflowing. Also, I fear that a semaphore implementation might be using a mutex internally.
The second good option might be the use of a signal, if I can discover an algorithm for raising the signal only if the second thread has either "emptied the queue and is in the process of calling sigwait" or is "already waiting on sigwait"; in short, the signal must be raised a minimum number of times, although it won't hurt if signals are raised a few more times than needed. Yes, I did use Google Search, but none of the solutions I found on the Internet were satisfactory. Here are a few considerations:
A. The server must waste minimum CPU cycles while making system calls, and system calls must be used a minimum number of times.
B. There must be very low overhead and the algorithm must be efficient.
C. No locking what-so-ever.
I want all options to be put on the table.
Here is the link to the site where I have shared info about my server, to better understand the functionality and the purpose:
www.wanhive.com
Busy waiting is good if you need to wake up thread #2 as fast as possible. In fact, this is the fastest way to notify one processor about changes made by another processor. You need to generate memory fences on both ends (a write fence on one side and a read fence on the other). But this statement holds true only if both of your threads are executed on dedicated processors. In that case no context switching is needed, just cache-coherency traffic.
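A minimal sketch of that fence pairing (all names are illustrative, and this is a one-shot handoff rather than a full queue): the producer publishes with a release store and the spinning consumer polls with acquire loads, optionally with a CPU pause hint inside the loop.

#include <atomic>

// Illustrative one-shot handoff between two threads pinned to their own cores.
struct Message { int payload; };

Message shared_msg;
std::atomic<bool> ready{false};

// Thread #1: write the data, then publish it with a release store
// (this is the "write fence" side).
void produce(const Message& m) {
    shared_msg = m;
    ready.store(true, std::memory_order_release);
}

// Thread #2: busy-wait with acquire loads (the "read fence" side);
// once the flag is seen, the write to shared_msg is guaranteed to be visible.
Message consume_busy_wait() {
    while (!ready.load(std::memory_order_acquire)) {
        // A CPU pause hint (e.g. _mm_pause() on x86) can go here to ease
        // pressure on the core's sibling hyper-thread.
    }
    return shared_msg;
}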
There are some improvements that can be made.
If thread #2 is generally CPU-bound and busy-waits, it can be penalized by the scheduler (at least on Windows and Linux). The OS scheduler dynamically adjusts thread priorities to improve overall system performance: it reduces the priorities of CPU-bound threads that consume a large amount of CPU time, to prevent thread starvation. You need to manually increase the priority of thread #2 to prevent this.
If you have a multicore or multiprocessor machine, you will end up with undersubscription of processors and your application won't be able to exploit hardware concurrency. You can mitigate this by using several processing threads (copies of thread #2).
Parallelization of the processing step.
There are two options.
Your messages are totally ordered and need to be processed in the same order as they arrived.
Messages can be reordered, and processing can be done in any order.
In the first case you need N cyclic buffers, N processing threads, N output buffers, and one consumer thread. Thread #1 enqueues messages into those cyclic buffers in round-robin order.
// Thread #1 pseudocode
auto message = recv()
auto buffer_index = atomic_increment(&message_counter);
buffer_index = buffer_index % N; // N is the number of threads
// buffers is an array of cyclic buffers - Buffer* buffers[N];
Buffer* current_buffer = buffers[buffer_index];
current_buffer->enqueue(message);
Each thread consumes messages from one of the buffers and enqueues the result to its dedicated output buffer.
// Thread #i pseudocode
auto message = my_buffer->dequeue();
auto result = process(message);
my_output_buffer->enqueue(result);
Now you need to process all these messages in arrival order. You can do this with a dedicated consumer thread that dequeues processed messages from the output cyclic buffers in round-robin manner.
// Consumer thread pseudocode
// out_message_counter is equal to message_counter at start
auto out_buffer_index = atomic_increment(&out_message_counter);
out_buffer_index = out_buffer_index % N;
// out_buffers is array of output buffers that is used by processing
// threads
auto out_buffer = out_buffers[out_buffer_index];
auto result = out_buffer->dequeue();
send(result); // or whatever you need to do with result
In the second case, when you don't need to preserve message order, you don't need the consumer thread or the output cyclic buffers. You just do whatever you need to do with the result in the processing thread.
N must equal the number of CPUs minus 3 in the first case ("-3" is one I/O thread + one consumer thread + one DHT thread) and the number of CPUs minus 2 in the second case ("-2" is one I/O thread + one DHT thread). This is because busy waiting can't be effective if you have oversubscription of processors.
Sounds like you want to coordinate a producer and a consumer connected by some shared state. At least in Java, for such patterns one way to avoid busy waiting is to use wait and notify. With this approach, thread #2 can go into a blocked state if it finds that the queue is empty by calling wait, and avoid spinning the CPU. Once thread #1 puts some stuff in the queue, it can do a notify. A quick search for such mechanisms in C++ yields this:
wait and notify in C/C++ shared memory
You can have thread #2 go to sleep for X milliseconds when the queue is empty.
X can be determined by the length of the queues you want + some guard band.
BTW, in user mode (ring 3) you can't use the MONITOR/MWAIT instructions, which would be ideal for your problem.
Edit
You should definitely give TBB's RWlock a try (there's a free version). Sounds like what you're looking for.
Edit2
Another option is to use condition variables. They involve a mutex and a condition; basically, you wait on the condition to become "true". The low-level pthread stuff can be found here.
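For completeness, here is a rough sketch of the condition-variable approach applied to the two-thread setup above (the queue type and names are illustrative, and note that it does involve a mutex, which conflicts with requirement C): thread #2 sleeps while the queue is empty and is woken by a notify from thread #1, so no CPU is burned while idle.

#include <condition_variable>
#include <deque>
#include <mutex>

// Illustrative blocking queue: thread #1 calls push(), thread #2 calls pop().
template <typename T>
class BlockingQueue {
    std::deque<T> items;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m);
            items.push_back(std::move(value));
        }
        cv.notify_one();             // wake the consumer if it is sleeping
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !items.empty(); }); // sleeps while empty
        T value = std::move(items.front());
        items.pop_front();
        return value;
    }
};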

Increase io priority on Windows?

Originally my producer function would just write the data; now I have a second thread that is responsible for writing the data. The producer function does a memcpy into a circular buffer and triggers the consumer thread to start writing.
When I use the 2-threaded scheme I get the desired thread isolation, program stability, and the ability to do variable computation before writing, but the I/O performance is 50% worse.
My theory is that there is some kind of priority that can be set per thread that I want to adjust. Is this possible?
I am using 2 SSDs in a RAID 0 data-striping configuration.
What do you mean by "I/O performance is 50% worse"? According to your resource monitor it is as high as it can be: the disk queue is full and disk active time is 100%. If you mean the write-speed jumps, they have nothing to do with any possible thread priority. They are caused by disk head positioning due to file fragmentation, filesystem table modifications, and so on.

boost threadpool with a thread that handles an IO queue

I recently began experimenting with the pseudo-boost threadpool (pseudo because it hasn't been officially accepted yet).
As a simple exercise, I initialized the threadpool with a maximum of two threads.
Each task does two things:
a CPU-intensive calculation
writes out the result to disk
Question
How do I modify the model into a threadpool that does:
a CPU-intensive calculation
and a single I/O thread which listens for completion from the threadpool - takes the resultant memory and simply:
writes out the result to disk
Should I simply have the task communicate with the I/O thread (spawned as a std::thread) through a std::condition_variable (essentially a mutexed queue of calculation results), or is there a way to do it all within the threadpool library?
Or is the gcc 4.6.1 implementation of future and promise mature enough for me to pull this off?
Answer
It looks like a simple mutex queue with a condition variable works fine.
By grouping read access and writes, in addition to using the threadpool, I got the following improvements:
2 core machine: 1h14m down to 33m (46% reduction in runtime)
4 core vm: 40m down to 18m (55% reduction in runtime)
Thanks to Martin James for his thoughtful answer. Before this exercise, I thought that my next computational server should have dual processors and a ton of memory. But now, with so much processing power inherent in the multiple cores and hyperthreading, I realize that the money will probably be better spent dealing with the I/O bottleneck.
As Martin mentioned, having multiple drives or RAID configurations would probably help. I will also look into adjusting I/O buffer settings at the kernel level.
If there is only one local disk, one writer thread on the end of a producer-consumer queue would be my favourite. Seeks, networked-disk delays, and other hiccups will not leave any pooled threads that have finished their calculation stuck trying to write to the disk. Other disk operations (e.g. selecting another location/file/folder) are also easier/quicker if only one thread is accessing it; the queue will take up the slack and allow seamless calculation during the latency.
Writing directly from the calculation task or submitting the result-write as a separate task would work OK, but you would need more threads in the pool to achieve pause-free operation.
Everything changes if there is more than one disk. More than one writer thread would then become a worthwhile proposition because of the increased overall performance. I would then probably go with an array/list of queues/write-threads, one for each disk.
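On the future/promise question raised above, a rough sketch of that route for the same single-writer layout (std::async stands in for the thread pool here, and all names and the file path are illustrative):

#include <fstream>
#include <future>
#include <string>
#include <vector>

// Hypothetical CPU-intensive calculation.
std::string heavy_calculation(int job) {
    return "result of job " + std::to_string(job) + "\n";
}

int main() {
    // Launch the CPU-bound work; a real thread pool would cap concurrency,
    // std::async is just used to keep the sketch self-contained.
    std::vector<std::future<std::string>> results;
    for (int job = 0; job < 100; ++job)
        results.push_back(std::async(std::launch::async, heavy_calculation, job));

    // Single writer: drain the futures in order and do all disk I/O here,
    // so the calculation threads never block on the drive.
    std::ofstream out("results.txt");
    for (auto& f : results)
        out << f.get();   // blocks only until that particular result is ready
}

The loop at the end plays the role of the one writer thread: all disk I/O happens there, in result order, while the calculations run in parallel.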