How to get size of boost SPSC Queue? - c++

We would like to know the number of elements in the queue at a given point in time.
We are pushing and popping objects, and we would like to know the number of objects in the queue's buffer.
Is there any built-in function for this?
Or some other way to get it?
http://www.boost.org/doc/libs/1_53_0/doc/html/boost/lockfree/spsc_queue.html

You can't reliably get the size, because asking for it invites race conditions. For the same reason you won't find an empty() method: by the time the method returns a value it may already be irrelevant, because the size might have changed.
Sometimes lock-free containers provide an "unreliable_size()" method (for statistics/logging purposes).
The special case here is that spsc_queue assumes a single producer and a single consumer, so it does offer:
size_type read_available() const;
number of available elements that can be popped from the spsc_queue
size_type write_available() const;
get write space to write elements
Note that these are only meaningful when called from the respective consumer/producer thread.
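As a minimal sketch (assuming a recent Boost version in which read_available() is provided, and an arbitrarily chosen compile-time capacity), the consumer thread could log an approximate queue depth like this; the value is only a lower bound, since the producer may push more elements immediately afterwards:
#include <boost/lockfree/spsc_queue.hpp>
#include <iostream>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;

// Must be called from the (single) consumer thread for the value to be meaningful.
void log_queue_depth()
{
    std::cout << "items currently poppable: " << queue.read_available() << '\n';
}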

It looks like our operations are limited to the pop() and push() functions, so your software design has to be built around these operations. For example, if you are the consumer, you can only consume items from the queue one at a time, and you have to rely on another channel of communication with the producer (a condition variable or an atomic variable).
atomic<bool> producer_done(false); // the producer sets this to tell the consumer it has finished
spsc_queue<Obj, capacity<1024>> theQ; // assume the producer pushes into this; capacity chosen arbitrarily
Obj tmpObj;
while (!producer_done) {
    if (!theQ.pop(tmpObj)) {
        cerr << "did not get any item from the producer\n";
        // the producer may be too slow; your only choices are to loop and wait,
        // or to use some more elaborate inter-thread communication
        // maybe sleep a little
        this_thread::sleep_for(1s);
    }
    else { // you got an item to work on
        consume(tmpObj);
    }
}
// now you know the single producer is no longer adding items to the queue,
// so drain whatever is left
while (theQ.pop(tmpObj)) {
    consume(tmpObj);
}
This is essentially the coding pattern you can use with the spsc_queue on the consumer side.

Related

Where can we use std::barrier over std::latch?

I recently heard about new C++ standard features, namely:
std::latch
std::barrier
I cannot figure out in which situations they are applicable and useful over one another.
If someone could give an example of how to use each of them wisely, it would be really helpful.
Very short answer
They're really aimed at quite different goals:
Barriers are useful when you have a bunch of threads and you want to synchronise across all of them at once, for example to do something that operates on all of their data at once.
Latches are useful if you have a bunch of work items and you want to know when they've all been handled, and aren't necessarily interested in which thread(s) handled them.
Much longer answer
Barriers and latches are often used when you have a pool of worker threads that do some processing and a queue of work items that is shared between them. It's not the only situation where they're used, but it is a very common one and does help illustrate the differences. Here's some example code that would set up some threads like this:
const size_t worker_count = 7; // or whatever
std::vector<std::thread> workers;
std::vector<Proc> procs(worker_count);
Queue<std::function<void(Proc&)>> queue;
for (size_t i = 0; i < worker_count; ++i) {
    workers.push_back(std::thread(
        [p = &procs[i], &queue]() {
            while (auto fn = queue.pop_back()) {
                fn(*p);
            }
        }
    ));
}
There are two types that I have assumed exist in that example:
Proc: a type specific to your application that contains data and logic necessary to process work items. A reference to one is passed to each callback function that's run in the thread pool.
Queue: a thread-safe blocking queue. There is nothing like this in the C++ standard library (somewhat surprisingly) but there are a lot of open-source libraries containing them e.g. Folly MPMCQueue or moodycamel::ConcurrentQueue, or you can build a less fancy one yourself with std::mutex, std::condition_variable and std::deque (there are many examples of how to do this if you Google for them).
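For illustration, here is a minimal sketch (under the stated assumptions, not production code: no shutdown handling) of the kind of blocking Queue assumed above, built from std::mutex, std::condition_variable and std::deque, with push_back() and a blocking pop_back() matching how it is used in the worker loop:
#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class Queue {
public:
    void push_back(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        cv_.notify_one();
    }

    // Blocks until an item is available, then removes and returns it.
    T pop_back() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.back());
        items_.pop_back();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<T> items_;
};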
Latch
A latch is often used to wait until some work items you push onto the queue have all finished, typically so you can inspect the result.
std::vector<WorkItem> work = get_work();
std::latch latch(work.size());
for (WorkItem& work_item : work) {
    queue.push_back([&work_item, &latch](Proc& proc) {
        proc.do_work(work_item);
        latch.count_down();
    });
}
latch.wait();
// Inspect the completed work
How this works:
The threads will - eventually - pop the work items off of the queue, possibly with multiple threads in the pool handling different work items at the same time.
As each work item is finished, latch.count_down() is called, effectively decrementing an internal counter that started at work.size().
When all work items have finished, that counter reaches zero, at which point latch.wait() returns and the producer thread knows that the work items have all been processed.
Notes:
The latch count is the number of work items that will be processed, not the number of worker threads.
The count_down() method could be called zero times, one time, or multiple times on each thread, and that number could be different for different threads. For example, even if you push 7 messages and have 7 threads, it might be that all 7 items are processed on the same thread (rather than one per thread), and that's fine.
Other unrelated work items could be interleaved with these ones (e.g. because they were pushed onto the queue by other producer threads), and again that's fine.
In principle, it's possible that latch.wait() won't be called until after all of the worker threads have already finished processing all of the work items. (This is the sort of odd condition you need to look out for when writing threaded code.) But that's OK, it's not a race condition: latch.wait() will just immediately return in that case.
An alternative to using a latch is that there's another queue, in addition to the one shown here, that contains the result of the work items. The thread pool callback pushes results on to that queue while the producer thread pops results off of it. Basically, it goes in the opposite direction to the queue in this code. That's a perfectly valid strategy too, in fact if anything it's more common, but there are other situations where the latch is more useful.
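For comparison, here is a hedged sketch of that result-queue alternative, reusing the same assumed Queue type and assuming (purely for illustration) that do_work returns some Result value:
Queue<Result> results; // Result is a hypothetical type holding the outcome of one work item

std::vector<WorkItem> work = get_work();
for (WorkItem& work_item : work) {
    queue.push_back([&work_item, &results](Proc& proc) {
        results.push_back(proc.do_work(work_item));
    });
}
// Pop exactly as many results as work items were pushed; pop_back() blocks
// until a result is available, so no latch is needed.
for (size_t i = 0; i < work.size(); ++i) {
    Result r = results.pop_back();
    // Inspect r
}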
Barrier
A barrier is often used to make all threads wait simultaneously so that the data associated with all of the threads can be operated on simultaneously.
using Fn = std::function<void()>;
Fn completionFn = [&procs]() {
    // Do something with the whole vector of Proc objects
};
auto barrier = std::make_shared<std::barrier<Fn>>(worker_count, completionFn);
auto workerFn = [barrier](Proc&) {
    barrier->arrive_and_wait();
};
for (size_t i = 0; i < worker_count; ++i) {
    queue.push_back(workerFn);
}
How this works:
All of the worker threads will pop one of these workerFn items off of the queue and call barrier->arrive_and_wait().
Once all of them are waiting, one of them will call completionFn() while the others continue to wait.
Once that function completes, they will all return from arrive_and_wait() and be free to pop other, unrelated, work items from the queue.
Notes:
Here the barrier count is the number of worker threads.
It is guaranteed that each thread will pop precisely one workerFn off of the queue and handle it. Once a thread has popped one off of the queue, it will wait in barrier->arrive_and_wait() until all the other copies of workerFn have been popped off by other threads, so there is no chance of it popping another one off.
I used a shared pointer to the barrier so that it will be destroyed automatically once all the work items are done. This wasn't an issue with the latch because there we could just make it a local variable in the producer thread function, because it waits until the worker threads have used the latch (it calls latch.wait()). Here the producer thread doesn't wait for the barrier so we need to manage the memory in a different way.
If you did want the original producer thread to wait until the barrier has finished, that's fine: it can call arrive_and_wait() too, but you will obviously need to pass worker_count + 1 to the barrier's constructor. (And then you wouldn't need to use a shared pointer for the barrier.)
If other work items are being pushed onto the queue at the same time, that's fine too, although it will potentially waste time as some threads will just be sitting there waiting for the barrier to be acquired while other threads are distracted by other work before they acquire the barrier.
!!! DANGER !!!
The last bullet point about other work being pushed onto the queue being "fine" is only the case if that other work doesn't also use a barrier! If you have two different producer threads putting work items with a barrier onto the same queue and those items are interleaved, then some threads will wait on one barrier and others on the other one, and neither will ever reach the required wait count - DEADLOCK. One way to avoid this is to only ever use barriers like this from a single thread, or even to only ever use one barrier in your whole program (this sounds extreme but is actually quite a common strategy, as barriers are often used for one-time initialisation on startup). Another option, if the thread queue you're using supports it, is to atomically push all work items for the barrier onto the queue at once so they're never interleaved with any other work items. (This won't work with the moodycamel queue, which supports pushing multiple items at once but doesn't guarantee that they won't be interleaved with items pushed on by other threads.)
Barrier without completion function
At the point when you asked this question, the proposed experimental API didn't support completion functions. Even the current API at least allows not using them, so I thought I should show an example of how barriers can be used like that too.
auto barrier = std::make_shared<std::barrier<>>(worker_count);
auto workerMainFn = [&procs, barrier](Proc&) {
    barrier->arrive_and_wait();
    // Do something with the whole vector of Proc objects
    barrier->arrive_and_wait();
};
auto workerOtherFn = [barrier](Proc&) {
    barrier->arrive_and_wait(); // Wait for work to start
    barrier->arrive_and_wait(); // Wait for work to finish
};
queue.push_back(std::move(workerMainFn));
for (size_t i = 0; i < worker_count - 1; ++i) {
    queue.push_back(workerOtherFn);
}
How this works:
The key idea is to wait for the barrier twice in each thread, and do the work in between. The first waits have the same purpose as the previous example: they ensure any earlier work items in the queue are finished before starting this work. The second waits ensure that any later items in the queue don't start until this work has finished.
Notes:
The notes are mostly the same as the previous barrier example, but here are some differences:
One difference is that, because the barrier is not tied to the specific completion function, it's more likely that you can share it between multiple uses, like we did in the latch example, avoiding the use of a shared pointer.
This example makes it look like using a barrier without a completion function is much more fiddly, but that's just because this situation isn't well suited to them. Sometimes, all you need is to reach the barrier. For example, whereas we initialised a queue before the threads started, maybe you have a queue for each thread but initialised in the threads' run functions. In that case, maybe the barrier just signifies that the queues have been initialised and are ready for other threads to pass messages to each other. In that case, you can use a barrier with no completion function without needing to wait on it twice like this.
You could actually use a latch for this, calling count_down() and then wait() in place of count_down_and_wait(). But using a barrier makes more sense, both because calling the combined function is a little simpler and because using a barrier communicates your intention better to future readers of the code.
In any case, the "DANGER" warning from before still applies.

Increasing concurrency in producer consumer

I am solving a specific kind of producer-consumer problem, which goes as follows -
There is a buffer of size n. Consumers take items from the buffer, one item at a time (by this I mean that whenever a consumer thread has access to the buffer, it won't take more than one item). Whenever the buffer is empty, a call to the producer must be raised. The producer completely fills the buffer and then blocks itself until a call is made again. I have modelled each producer and consumer as a thread and implemented it this way -
bool buffer[n];

// Producer
while (true) {
    lock(bufferLock);
    wait(producerSemaphore, bufferLock);
    completelyFillBuffer();
    signalAll(consumerSemaphore);
    unlock(bufferLock);
}

// Consumer
while (true) {
    lock(bufferLock);
    if (buffer.isEmpty()) {
        signal(producerSemaphore);
        wait(consumerSemaphore, bufferLock);
    }
    takeItemFromBuffer();
    unlock(bufferLock);
}

takeItemFromBuffer() {
    take any true item and make it false;
}

completelyFillBuffer() {
    make all items true;
}
The problem is that, I am using a single lock for complete buffer. So at any point, only a single consumer can take an item. But when the buffer is of large size, it makes sense to allow more consumers to take items simultaneously. How do I implement this?
I think you are able to safely remove items from the buffer, or rather safely mark them false. Just make this operation atomic. For instance:
// consumer
item = getTheItem(buffer);
if (item != false)
    checkAndChange(item);

checkAndChange(item):
    if (item != false)
        atomicChange(item)
And then you can lock the buffer only for the producer. Another way is to use lock-free structures.
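A minimal sketch of that idea, assuming the buffer is an array of boolean flags (the names N and try_take are invented for illustration): each consumer claims a slot with a compare-and-swap, so no consumer-side lock is needed.
#include <atomic>
#include <cstddef>

constexpr std::size_t N = 1024;
std::atomic<bool> buffer[N]; // true = item present, false = empty slot

// Consumer: try to claim slot i. Returns true if this thread took the item.
bool try_take(std::size_t i)
{
    bool expected = true;
    // Atomically flip true -> false; fails if another consumer got there
    // first or the slot was already empty.
    return buffer[i].compare_exchange_strong(expected, false);
}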
You can't; if you do that then how will you know that multiple consumers do not take the same element? You do need a single lock for the single producer buffer in order to safely remove one item at a time (serially). So, you can't parallelize the fetching of items from that queue/buffer, but you can parallelize the processing of the values.
// consumer
while (true) {
    item = fetchItemFromBuffer();
    process(item);
}

fetchItemFromBuffer() {
    lock(bufferLock);
    while (buffer.isEmpty()) {
        signal(producerSemaphore);
        wait(consumerSemaphore, bufferLock);
    }
    item = buffer.remove(0);
    unlock(bufferLock);
    return item;
}
A relatively simple idea would be to split the buffer into chunks.
For example, let's say you have a buffer of size 1024. You could split it into 64 chunks of size 16 each, or in any other way that suits your needs.
You will then need a mutex for each chunk. Each consumer then decides which element it wants to remove and then proceeds to lock the appropriate chunk. However, it may need to re-select and lock other chunks, if the initially selected chunk only has false values.
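Here is a hedged sketch of the chunking idea (the names Chunk, kChunks, kChunkSize and try_take_from_chunk are invented for illustration): a 1024-slot buffer split into 64 chunks of 16, each guarded by its own mutex, so consumers working on different chunks never block each other.
#include <array>
#include <cstddef>
#include <mutex>
#include <optional>

constexpr std::size_t kChunks = 64;
constexpr std::size_t kChunkSize = 16;

struct Chunk {
    std::mutex mutex;
    std::array<bool, kChunkSize> slots{}; // true = item available
};

std::array<Chunk, kChunks> buffer;

// Consumer: scan one chunk under its own lock; all other chunks stay unlocked.
std::optional<std::size_t> try_take_from_chunk(std::size_t c)
{
    std::lock_guard<std::mutex> lock(buffer[c].mutex);
    for (std::size_t i = 0; i < kChunkSize; ++i) {
        if (buffer[c].slots[i]) {
            buffer[c].slots[i] = false;
            return c * kChunkSize + i; // index of the item taken
        }
    }
    return std::nullopt; // chunk was empty; the caller picks another chunk
}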
Another approach is lock-free programming, but it depends on how far you want to go into this. A good introduction is here: http://preshing.com/20120612/an-introduction-to-lock-free-programming/

concurrent message processing ordered chronologically

I want to optimize a message decoder written in C++ in terms of performance. The decoder is designed completely sequentially. The concept for the actual parallelization is kind of simple:
As soon as new data arrives on a certain socket, tell a thread-pool to run another thread that will decode the received message.
At the end of each thread, a method will be invoked (namely a Qt signal will be emitted) and an object created during processing will be passed.
My problem is: the length and complexity of the processed messages vary, so the order in which threads finish might differ from the order in which the messages were received. In other words, I need to serialize in place without the use of a thread-safe container.
How can I make sure that the threads, as soon as they finish, call the method mentioned above in the correct chronological order without queueing them in a thread-safe container?
My first idea was to create as many mutexes as there are threads in the thread-pool and then use each mutex to send a "finished"-signal from an older thread to a newer thread.
Any comments appreciated!
If you really don't want to use a data structure like a priority_queue or a sequence of pre-reserved buffers, and would rather block your threads instead, you can do the following:
Pair each message with an index that indicates its original position and pass it on to the thread pool.
Use a common (e.g. global, atomic) counter variable that indicates the last processed message.
Let each thread wait until this variable indicates that the previous message has been processed.
Pass on the produced object and increase the counter.
The code would look something like this:
struct MsgIndexed {
    size_t idx;
    Msg msg;
};

// Single thread that receives all messages sequentially
void threadReceive() {
    for (size_t i = 1; true; i++)
    {
        Msg m = readMsg();
        dispatchMsg(MsgIndexed{ i, m });
    }
}

std::atomic<size_t> cnt{ 0 };

// multiple worker threads that work in parallel
void threadWork() {
    while (true) {
        MsgIndexed msg = waitforMsg();
        Obj obj = processMsg(msg.msg);

        // Just for demonstration purposes.
        // You probably don't want to use a spinlock here, but e.g. a condition variable instead.
        while (cnt != (msg.idx - 1u)) { std::this_thread::yield(); }
        forwardObj(obj);
        cnt++;
    }
}
Just be aware that this is a quite inefficient solution, as your worker threads still have to wait around after they are done with their actual work.
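As the comment above suggests, a condition variable avoids the busy-wait. Here is one hedged way to rewrite the hand-off (forwardInOrder is an invented helper; Obj and forwardObj are the same assumed type and function as above):
#include <condition_variable>
#include <mutex>

std::mutex order_mutex;
std::condition_variable order_cv;
size_t next_to_forward = 1; // index of the next message allowed to be forwarded

void forwardInOrder(size_t idx, const Obj& obj)
{
    std::unique_lock<std::mutex> lock(order_mutex);
    // Sleep until it is this message's turn.
    order_cv.wait(lock, [&] { return next_to_forward == idx; });
    forwardObj(obj); // still called strictly in arrival order
    ++next_to_forward;
    order_cv.notify_all(); // wake the thread holding the next index
}
Each worker would call forwardInOrder(msg.idx, obj) instead of spinning on cnt; forwarding is still serialized, which is inherent to the ordering requirement.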

Concurrency with a producer/consumer (sort of) and STL

I have the following situation: I have two threads
thread1, which is a worker thread that executes an algorithm as long as its input list size is > 0
thread2, which is asynchronous (user driven) and can add elements to the input list to be processed
Now, thread1 loop does something similar to the following
list input_list
list buffer_list

if (input_list.size() == 0)
    sleep

while (input_list.size() > 0) {
    for (item in input_list) {
        process(item);
        possibly add items to buffer_list
    }
    input_list = buffer_list (or copy it)
    buffer_list = new list (or empty it)
    sleep X ms (100 < X < 500, still have to decide)
}
Now thread2 will just add elements to buffer_list (which will be the next pass of the algorithm) and possibly manage to wake thread1 if it was stopped.
I'm trying to understand which multithreading issues can occur in this situation, assuming that I'm programming it in C++ with the aid of the STL (no assumption on thread-safety of the implementation), and that I of course have access to the standard library (e.g. mutex).
I would like to avoid any possible delay in thread2, since it's bound to the user interface and delays would be noticeable. I was thinking about using 3 lists to avoid synchronization issues, but I'm not really sure so far. I'm also unsure whether there is a safer container within the STL for this specific situation. I don't want to just place a mutex around everything and lose so much performance.
Any advice would be very appreciated, thanks!
EDIT:
This is what I managed so far; I'm wondering if it's thread safe and efficient enough:
std::set<Item> *extBuffer, *innBuffer, *actBuffer;

void thread1Function()
{
    actBuffer->clear();
    sem_wait(&mutex);
    if (!extBuffer->empty())
        std::swap(actBuffer, extBuffer);
    sem_post(&mutex);
    if (!innBuffer->empty())
    {
        if (actBuffer->empty())
            std::swap(innBuffer, actBuffer);
        else if (!innBuffer->empty())
            actBuffer->insert(innBuffer->begin(), innBuffer->end());
    }
    if (!actBuffer->empty())
    {
        std::set<Item>::iterator it;
        for (it = actBuffer->begin(); it != actBuffer->end(); ++it)
        {
            // process
            // possibly innBuffer->insert(...)
        }
    }
}

void thread2Add(Item item)
{
    sem_wait(&mutex);
    extBuffer->insert(item);
    sem_post(&mutex);
}
Probably I should open another question
If you are worried about thread2 being blocked for a long time because thread1 is holding the lock, then make sure that thread1 guarantees to only take the lock for a really short time.
This can easily be accomplished if you have two instances of the buffer list, so your attempt is already in the right direction.
Each buffer is pointed to by a pointer. One pointer is used to insert items into one list (thread2) and the other pointer is used to process the items in the other list (thread1). The insert operation in thread2 must be surrounded by a lock.
If thread1 is done processing all the items, it only has to swap the pointers (e.g. with std::swap); this is a very quick operation, which must also be surrounded by a lock, but only the swap operation itself. The actual processing of the items is lock-free.
This solution has the following advantages:
The lock in thread1 is always very short, so the amount of time that it may block thread2 is minimal
No constant dynamic allocation of buffers, which is faster and less likely to cause memory leak bugs.
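A minimal sketch of that two-buffer swap (the names produceBuffer, consumeBuffer, addItem and processOnce are invented for illustration; Item is the question's type):
#include <mutex>
#include <set>
#include <utility>

std::mutex bufferMutex;
std::set<Item>* produceBuffer = new std::set<Item>; // thread2 inserts here
std::set<Item>* consumeBuffer = new std::set<Item>; // thread1 processes this one

// thread2: very short critical section, just one insert
void addItem(const Item& item)
{
    std::lock_guard<std::mutex> lock(bufferMutex);
    produceBuffer->insert(item);
}

// thread1: swap the pointers under the lock, then process without holding it
void processOnce()
{
    {
        std::lock_guard<std::mutex> lock(bufferMutex);
        std::swap(produceBuffer, consumeBuffer);
    }
    for (const Item& item : *consumeBuffer)
    {
        // process(item); possibly collect follow-up items for the next pass
    }
    consumeBuffer->clear();
}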
You just need a mutex around inserting, removing, or accessing the size of the container. You could develop a class that encapsulates the container and owns the mutex; this keeps things simple and the class handles all use of the mutex. If you limit what is exposed through the interface and keep the functions small (just the container operations wrapped in the mutex), they will return relatively quickly. You should only need one list in that case.
Depending on the system, if you have semaphores available, you may want to check whether they are more efficient and use them instead of the mutex. The same concept applies, just in a different manner.
You may also want to look into the concept of lock guards, so that if one of the threads dies you do not end up with a deadlock.
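A lock guard in this sense is just an RAII wrapper around the mutex; here is a minimal sketch of the wrapper class suggested above (GuardedList is an invented name), using std::lock_guard so the mutex is released even if an exception is thrown:
#include <list>
#include <mutex>

template <typename T>
class GuardedList {
public:
    void push_back(const T& item) {
        std::lock_guard<std::mutex> guard(mutex_); // released automatically on scope exit
        items_.push_back(item);
    }

    bool pop_front(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

    size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::list<T> items_;
};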

What is better for a message queue? mutex & cond or mutex&semaphore?

I am implementing a C++ message queue based on a std::queue.
As I need poppers to wait on an empty queue, I was considering using a mutex for mutual exclusion and a cond for suspending threads on an empty queue, as glib does with GAsyncQueue.
However, it looks to me like a mutex & semaphore would do the job: the semaphore holds an integer count, and its maximum seems like a pretty high number to reach with pending messages.
A pro of the semaphore is that you don't need to manually re-check the condition each time you return from a wait, as you know for sure that someone inserted something (e.g. when someone inserted 2 items and you are the second thread arriving).
Which one would you choose?
EDIT:
Changed the question in response to #Greg Rogers
A single semaphore does not do the job - you need to be comparing (mutex + semaphore) and (mutex + condition variable).
It is pretty easy to see this by trying to implement it:
void push(T t)
{
    queue.push(t);
    sem.post();
}

T pop()
{
    sem.wait();
    T t = queue.front();
    queue.pop();
    return t;
}
As you can see there is no mutual exclusion when you are actually reading/writing to the queue, even though the signalling (from the semaphore) is there. Multiple threads can call push at the same time and break the queue, or multiple threads could call pop at the same time and break it. Or, a thread could call pop and be removing the first element of the queue while another thread called push.
You should use whichever you think is easier to implement, I doubt performance will vary much if any (it might be interesting to measure though).
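To make the comparison concrete, here is a hedged sketch of the (mutex + semaphore) variant using C++20's std::counting_semaphore (SemQueue is an invented name); the mutex supplies the mutual exclusion that the semaphore-only version above lacks:
#include <mutex>
#include <queue>
#include <semaphore>

template <typename T>
class SemQueue {
public:
    void push(T t) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(t));
        }
        sem_.release(); // signal one waiting popper
    }

    T pop() {
        sem_.acquire(); // wait until at least one item has been pushed
        std::lock_guard<std::mutex> lock(mutex_);
        T t = std::move(queue_.front());
        queue_.pop();
        return t;
    }

private:
    std::mutex mutex_;
    std::counting_semaphore<> sem_{0};
    std::queue<T> queue_;
};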
Personally I use a mutex to serialize access to the list, and wake up the consumer by sending a byte over a socket (produced by socketpair()). That may be somewhat less efficient than a semaphore or condition variable, but it has the advantage of allowing the consumer to block in select()/poll(). That way the consumer can also wait on other things besides the data queue, if it wants to. It also lets you use the exact same queueing code on almost all OS's, since practically every OS supports the BSD sockets API.
Pseudocode follows:
// Called by the producer. Adds a data item to the queue, and sends a byte
// on the socket to notify the consumer, if necessary
void PushToQueue(const DataItem & di)
{
    mutex.Lock();
    bool sendSignal = (queue.size() == 0);
    queue.push_back(di);
    mutex.Unlock();
    if (sendSignal) producerSocket.SendAByteNonBlocking();
}

// Called by the consumer after consumerSocket selects as ready-for-read.
// Returns true if (di) was written to, or false if there wasn't anything to read after all.
// The consumer should call this in a loop until it returns false, and then
// go back to sleep inside select() to wait for further data from the producer.
bool PopFromQueue(DataItem & di)
{
    consumerSocket.ReadAsManyBytesAsPossibleWithoutBlockingAndThrowThemAway();
    mutex.Lock();
    bool ret = (queue.size() > 0);
    if (ret) queue.pop_front(di);
    mutex.Unlock();
    return ret;
}
If you want to allow multiple users to access your queue simultaneously, you should use semaphores.
sema(10) // ten threads/processes have concurrent access
sema_lock(&sema_obj)
    queue
sema_unlock(&sema_obj)
A mutex will "authorize" only one user at a time.
pthread_mutex_lock(&mutex_obj)
    global_data;
pthread_mutex_unlock(&mutex_obj)
That's the main difference, and you should decide which solution fits your requirements.
But I'd choose the mutex approach, because you don't need to specify how many users can grab your resource.