Single Producer Multiple Consumer Circular Buffer - C++

In my current application I am receiving spectral data from a spectrometer. This data is accumulated for one second and then pushed into a circular buffer. For now I have one consumer, which pops entries from the buffer and saves everything to disk. All of that works. Now I need to add a second consumer, which, in parallel to the saving, does some processing on the spectra. So I have two consumers needing the exact same data (note: they only read and don't modify). This doesn't work as-is, because once one consumer pops an entry from the buffer it is gone, so the other would never see it. I guess the simplest solution to this problem is to give every consumer its own circular buffer. Fine, but there is one problem: the data entries are big. One entry has a maximum size of around 80 MB, so to save memory it would be great not to hold the same data there twice. Is there any better solution?
Note: I am using a circular buffer so that its memory use has a fixed upper bound and cannot grow without limit.

Keep two different tail pointers in your buffer, one for each consumer. When the producer updates the queue, use the farthest-behind tail pointer (the one which is lagging) to check whether the buffer is full. Each consumer uses its own tail pointer to check whether the buffer is empty. This way we get a lock-free buffer, and there is no copying around of data.
See the implementation of the LMAX Disruptor for a discussion of the performance improvements possible with this approach.
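To make the idea concrete, here is a minimal sketch of such a buffer, assuming a single producer and a fixed number of consumers known up front (the class and member names are illustrative, not from any library):

#include <array>
#include <atomic>
#include <cstddef>

// Single producer, fixed number of consumers. Each consumer owns its own
// tail index, so the single copy of each entry is shared by all readers.
template <typename T, std::size_t Slots, std::size_t Consumers>
class BroadcastRing
{
public:
    bool push(const T& item)
    {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Full if writing would overwrite the slot of the slowest consumer.
        for (const auto& tail : tails_)
            if (head - tail.load(std::memory_order_acquire) >= Slots)
                return false;
        slots_[head % Slots] = item;
        head_.store(head + 1, std::memory_order_release); // publish the entry
        return true;
    }

    // Peek the oldest unread entry for this consumer without copying;
    // returns nullptr if the ring is currently empty for it.
    const T* front(std::size_t consumer) const
    {
        const std::size_t tail = tails_[consumer].load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return nullptr;
        return &slots_[tail % Slots];
    }

    // Call once the consumer is done with the entry returned by front().
    void pop(std::size_t consumer)
    {
        tails_[consumer].fetch_add(1, std::memory_order_release);
    }

private:
    std::array<T, Slots> slots_{};
    std::atomic<std::size_t> head_{0};
    std::array<std::atomic<std::size_t>, Consumers> tails_{};
};

Because front() hands out a pointer into the buffer, an 80 MB entry is never copied: each consumer processes it in place and then calls pop() to advance its own tail.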

I should hope you're receiving your data directly into the queue and not copying it around much...
Any valid solution that keeps a single copy of the data has to synchronise the consumers so that an entry is popped only once all of them are done with it.
You can keep your circular buffer. You only need a single remover to remove an entry when the readers are done with it. I strongly suggest that this remover be the writer of the data. That way it is the only one with write access to the queue, which simplifies things.
The consumers feed the remover by telling it what they are done with.
Consumers can share their read offsets with the remover: make each offset a std::atomic, have the consumer store to it and the remover load from it.
It should look something like this:
struct Consumer {
    ...
    std::atomic<long> offset{0};
    ...
    Consumer() {
        q.remover->add(this);
    }
    ...
    void run() {
        for (;;) {
            entry& e = q.read(offset.load(std::memory_order_relaxed));
            process(e);
            // Publish the new read offset so the remover can see it.
            offset.store(offset.load(std::memory_order_relaxed) + e.size(),
                         std::memory_order_release);
        }
    }
};
struct Remover {
    ...
    long remove_offset = 0;
    std::list<Consumer*> cons;
    ...
    void remove() {
        // Find the lowest read point across all consumers.
        long cons_offset = std::numeric_limits<long>::max();
        for (auto p : cons)
            cons_offset = std::min(cons_offset,
                                   p->offset.load(std::memory_order_acquire));
        // Remove entries up to that point.
        while (cons_offset > remove_offset) {
            entry& e = q.read(remove_offset);
            remove_offset += e.size();
            q.remove(e.size());
        }
    }
};

Related

lock-free "closable" MPSC queue

A multiple producers, single consumer scenario, except consumption happens once, and after that the queue is "closed" and no more work is allowed. I have an MPSC queue, so I tried to add a lock-free algorithm to "close" the queue. I believe it's correct and it passes my tests. The problem is that when I try to optimise the memory ordering, it stops working (I think work is lost, i.e. enqueued after the queue is closed). This happens even on x64, which has a "kind of" strong memory model, and even with a single producer.
My attempt to fine-tune memory order is commented out:
// thread-safe for multi producers single consumer use
// linked-list based, and so it's growable
MPSC_queue work_queue;
std::atomic<bool> closed{ false };
std::atomic<int32_t> producers_num{ 0 };

bool produce(Work&& work)
{
    bool res = false;
    ++producers_num;
    // producers_num.fetch_add(1, std::memory_order_release);
    if (!closed)
    // if (!closed.load(std::memory_order_acquire))
    {
        work_queue.push(std::move(work));
        res = true;
    }
    --producers_num;
    // producers_num.fetch_sub(1, std::memory_order_release);
    return res;
}

void consume()
{
    closed = true;
    // closed.store(true, std::memory_order_release);
    while (producers_num != 0)
    // while (producers_num.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();

    Work work;
    while (work_queue.pop(work))
        process(work);
}
I also tried std::memory_order_acq_rel for read-modify-write ops on producers_num, doesn't work either.
A bonus question:
This algorithm is used with an MPSC queue, which already does some synchronisation inside. It would be nice to combine them for better performance. Do you know any such algorithm for a "closable" MPSC queue?
I think closed = true; does need to be seq_cst to make sure it's visible to other threads before you check producers_num the first time. Otherwise this ordering is possible:
producer: ++producers_num;
consumer: producers_num == 0
producer: if (!closed) finds it still open
consumer: closed.store(true, release) becomes globally visible.
consumer: work_queue.pop(work) finds the queue empty
producer: work_queue.push(std::move(work)); adds work to the queue after the consumer has stopped looking.
You can still avoid seq_cst if you have the consumer check producers_num == 0 before returning, like

while (producers_num != 0)
// while (producers_num.load(std::memory_order_acquire) != 0)
    std::this_thread::yield();

do {
    Work work;
    while (work_queue.pop(work))
        process(work);
} while (producers_num.load(acquire) != 0);
// safe if pop included a full barrier, I think
I'm not 100% sure I have this right, but I think checking producers_num after a full barrier is sufficient.
However, the producer side does need ++producers_num; to be at least acq_rel, otherwise it can reorder past if (!closed). (An acquire fence after it, before if(!closed) might also work).
Since you only want to use the queue once, it doesn't need to wrap around and can probably be quite a lot simpler. Like an atomic producer-position counter that writers increment to claim a spot, and if they get a position > size then the queue was full. I haven't thought through the full details, though.
That might allow a cleaner solution to the above problem, perhaps by having the consumer look at that write index to see if there were any producers that had claimed a slot.
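A minimal sketch of that claim-a-slot idea, assuming a bounded queue that is used for one production/consumption cycle (the class name SlotQueue and all members are illustrative, not the MPSC_queue from the question):

#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class SlotQueue
{
public:
    explicit SlotQueue(std::size_t capacity)
        : slots_(capacity), ready_(capacity) {}

    // Producers claim a slot by bumping the write index; an index past
    // the end means the queue is full or has been closed.
    bool produce(T value)
    {
        const std::size_t i = write_idx_.fetch_add(1, std::memory_order_relaxed);
        if (i >= slots_.size())
            return false;
        slots_[i] = std::move(value);
        ready_[i].store(true, std::memory_order_release); // publish slot i
        return true;
    }

    // Close the queue, then consume every slot that was claimed.
    template <typename F>
    void consume(F&& process)
    {
        // Claim all remaining capacity so no later producer can get a slot.
        std::size_t end = write_idx_.fetch_add(slots_.size(),
                                               std::memory_order_relaxed);
        if (end > slots_.size())
            end = slots_.size();
        for (std::size_t i = 0; i < end; ++i)
        {
            // A producer that claimed slot i is committed to filling it,
            // so wait until it has published the value.
            while (!ready_[i].load(std::memory_order_acquire))
                std::this_thread::yield();
            process(std::move(slots_[i]));
        }
    }

private:
    std::atomic<std::size_t> write_idx_{0};
    std::vector<T> slots_;
    std::vector<std::atomic<bool>> ready_;
};

The tradeoff versus the producers_num scheme: a producer that has claimed a slot must fill it, so the consumer waits per claimed slot instead of waiting for a producer count to reach zero.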

C++ Threading using 2 Containers

I have the following problem. I use a vector that gets filled with values from a temperature sensor. This function runs in one thread. Then I have another thread responsible for publishing all the values to a database, which runs once every second. The publishing thread locks the vector with a mutex, so the function that fills it with values gets blocked. However, while the publishing thread is using the vector, I want to use another vector to save the temperature values, so that I don't lose any values while the data is being published. How do I get around this problem? I thought about using a pointer to the containers and switching it to the other container once the first gets locked, to keep saving values, but I don't quite know how.
I tried to add a minimal reproducible example; I hope it explains my situation.
void publish(std::vector<temperature> &inputVector)
{
    //this function would publish the values into a database
    //via mqtt and also runs in a thread.
}

int main()
{
    std::vector<temperature> testVector;
    std::vector<temperature> testVector2;
    while (1)
    {
        //I am repeatedly saving values into the vector.
        //I want to do this in a thread, but if the vector is locked by a mutex
        //I want to switch over to the other vector.
        testVector.push_back(testSensor.getValue());
    }
}
Assuming you are using std::mutex, you can use mutex::try_lock on the producer side. Something like this:
while (1)
{
    if (myMutex.try_lock())
    {
        // locking succeeded - move all queued values and push the new value
        std::move(testVector2.begin(), testVector2.end(),
                  std::back_inserter(testVector));
        testVector2.clear();
        testVector.push_back(testSensor.getValue());
        myMutex.unlock();
    }
    else
    {
        // locking failed - queue the value
        testVector2.push_back(testSensor.getValue());
    }
}
Of course publish() needs to lock the mutex, too.
void publish(std::vector<temperature> &inputVector)
{
    std::lock_guard<std::mutex> lock(myMutex);
    //this function would publish the values into a database
    //via mqtt and also runs in a thread.
}
This seems like the perfect opportunity for an additional (shared) buffer or queue, that's protected by the lock.
main would be essentially as it is now, pushing your new values into the shared buffer.
The other thread would, when it can, lock that buffer and take the new values from it. This should be very fast.
Then, it does not need to lock the shared buffer while doing its database things (which take longer), as it's only working on its own vector during that procedure.
Here's some pseudo-code:
std::mutex pendingTempsMutex;
std::vector<temperature> pendingTemps;

void thread2()
{
    std::vector<temperature> temps;
    while (1)
    {
        // Get new temps if we have any
        {
            std::scoped_lock l(pendingTempsMutex);
            temps.swap(pendingTemps);
        }
        if (!temps.empty())
        {
            publish(temps);
            temps.clear(); // don't publish the same values twice
        }
    }
}

void thread1()
{
    while (1)
    {
        std::scoped_lock l(pendingTempsMutex);
        pendingTemps.push_back(testSensor.getValue());
        /*
        Or, if getValue() blocks:
        temperature newValue = testSensor.getValue();
        std::scoped_lock l(pendingTempsMutex);
        pendingTemps.push_back(newValue);
        */
    }
}
Usually you'd use a std::queue for pendingTemps, though. I don't think it really matters in this example, because you're always consuming everything in thread 2, but it's more conventional and can be more efficient in some scenarios. It won't cost you much either, as it's backed by a std::deque. But you can measure/test to see what's best for you.
This solution is pretty much what you already proposed/explored in the question, except that the producer shouldn't be in charge of managing the second vector.
You can improve it by having thread2 wait to be "informed" that there are new values, with a condition variable; otherwise you're going to be doing a lot of busy-waiting. I leave the details as an exercise to the reader ;) but a minimal sketch is below. There should be an example and discussion in your multi-threaded programming book.
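A sketch of that condition-variable variant, reusing the hypothetical temperature, testSensor and publish from above:

#include <condition_variable>
#include <mutex>
#include <vector>

std::mutex pendingTempsMutex;
std::condition_variable pendingTempsCv;
std::vector<temperature> pendingTemps;

void thread1()
{
    while (1)
    {
        temperature newValue = testSensor.getValue();
        {
            std::scoped_lock l(pendingTempsMutex);
            pendingTemps.push_back(newValue);
        }
        pendingTempsCv.notify_one(); // wake the publisher
    }
}

void thread2()
{
    std::vector<temperature> temps;
    while (1)
    {
        {
            std::unique_lock l(pendingTempsMutex);
            pendingTempsCv.wait(l, [] { return !pendingTemps.empty(); });
            temps.swap(pendingTemps);
        }
        publish(temps);
        temps.clear(); // don't republish these on the next round
    }
}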

SPSC lock free queue without atomics

I have below a SPSC queue for my logger.
It is certainly not a general-use SPSC lock-free queue.
However, given a bunch of assumptions about how it will be used, the target architecture, etc., and a number of acceptable tradeoffs, which I detail below, my question is basically: is it safe / does it work?
It will only be used on x86_64 architecture, so writes to uint16_t will be atomic.
Only the producer updates the tail.
Only the consumer updates the head.
If the producer reads an old value of head, it will look like there is less space in the queue than there really is, which is an acceptable limitation in the context in which it is used.
If the consumer reads an old value of tail, it will look like there is less data waiting in the queue than reality, again an acceptable limitation.
The limitations above are acceptable because:
the consumer may not get the latest tail immediately, but eventually the latest tail will arrive, and queued data will be logged.
the producer may not get the latest head immediately, so the queue will look more full than it really is. In our load testing we have found that, given the amount we log vs the size of the queue and the speed at which the logger drains it, this limitation has no effect - there is always space in the queue.
A final point, the use of volatile is necessary to prevent the variable which each thread only reads from being optimised out.
My questions:
Is this logic correct?
Is the queue thread safe?
Is volatile sufficient?
Is volatile necessary?
My queue:
class LogBuffer
{
public:
    bool is_empty() const { return head_ == tail_; }
    bool is_full() const { return uint16_t(tail_ + 1) == head_; }
    LogLine& head() { return log_buffer_[head_]; }
    LogLine& tail() { return log_buffer_[tail_]; }
    void advance_head() { ++head_; }
    void advance_tail() { ++tail_; }
private:
    volatile uint16_t tail_ = 0;     // write position
    LogLine log_buffer_[0xffff + 1]; // relies on the uint16_t overflowing
    volatile uint16_t head_ = 0;     // read position
};
Is this logic correct?
Yes.
Is the queue thread safe?
No.
Is volatile sufficient? Is volatile necessary?
No, to both. Volatile is not a magic keyword that makes any variable threadsafe. You still need to use atomic variables or memory barriers for the indexes to ensure memory ordering is correct when you produce or consume an item.
To be more specific, after you produce or consume an item for your queue you need to issue a memory barrier to guarantee that other threads will see the changes. Many atomic libraries will do this for you when you update an atomic variable.
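For illustration, a sketch of the same buffer with std::atomic indexes and acquire/release ordering (LogLine is shown here as a placeholder type):

#include <atomic>
#include <cstdint>

struct LogLine { char text[256]; }; // placeholder for the real type

class LogBuffer
{
public:
    bool was_empty() const
    {
        return head_.load(std::memory_order_acquire) ==
               tail_.load(std::memory_order_acquire);
    }
    bool was_full() const
    {
        return uint16_t(tail_.load(std::memory_order_acquire) + 1) ==
               head_.load(std::memory_order_acquire);
    }
    LogLine& head() { return log_buffer_[head_.load(std::memory_order_relaxed)]; }
    LogLine& tail() { return log_buffer_[tail_.load(std::memory_order_relaxed)]; }

    // The release store guarantees the LogLine written through tail() is
    // visible to the consumer before the new tail index is.
    void advance_tail()
    {
        tail_.store(uint16_t(tail_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }
    // Likewise, release here tells the producer the slot can be reused.
    void advance_head()
    {
        head_.store(uint16_t(head_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }

private:
    std::atomic<uint16_t> tail_{0};  // write position, producer only
    LogLine log_buffer_[0xffff + 1]; // relies on the uint16_t wrapping
    std::atomic<uint16_t> head_{0};  // read position, consumer only
};

On x86_64 these acquire loads and release stores compile to plain moves, so this costs little more than the volatile version while actually guaranteeing the ordering.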
As an aside, use "was_empty" instead of "is_empty" to be clear about what it does. The result of this call reflects one instant in time and may have changed by the time you act on its value.

Looking for critique of my thread safe, lock-free queue implementation

So, I've written a queue, after a bit of research. It uses a fixed-size buffer, so it's a circular queue. It has to be thread-safe, and I've tried to make it lock-free. I'd like to know what's wrong with it, because these kinds of things are difficult to predict on my own.
Here's the header:
template <class T>
class LockFreeQueue
{
public:
    LockFreeQueue(uint buffersize)
        : buffer(NULL), ifront1(0), ifront2(0), iback1(0), iback2(0), size(buffersize)
    { buffer = new atomic<T>[buffersize]; }

    ~LockFreeQueue(void) { if (buffer) delete[] buffer; }

    bool pop(T* output);
    bool push(T input);

private:
    uint incr(const uint val) { return (val + 1) % size; }

    atomic<T>* buffer;
    atomic<uint> ifront1, ifront2, iback1, iback2;
    uint size;
};
And here's the implementation:
template <class T>
bool LockFreeQueue<T>::pop(T* output)
{
    while (true)
    {
        /* Fetch ifront and store it in i. */
        uint i = ifront1;
        /* If ifront == iback, the queue is empty. */
        if (i == iback2)
            return false;
        /* If i still equals ifront, increment ifront, */
        /* Incrementing ifront1 notifies pop() that it can read the next element. */
        if (ifront1.compare_exchange_weak(i, incr(i)))
        {
            /* then fetch the output. */
            *output = buffer[i];
            /* Incrementing ifront2 notifies push() that it's safe to write. */
            ++ifront2;
            return true;
        }
        /* If i no longer equals ifront, we loop around and try again. */
    }
}

template <class T>
bool LockFreeQueue<T>::push(T input)
{
    while (true)
    {
        /* Fetch iback and store it in i. */
        uint i = iback1;
        /* If ifront == (iback + 1), the queue is full. */
        if (ifront2 == incr(i))
            return false;
        /* If i still equals iback, increment iback, */
        /* Incrementing iback1 notifies push() that it can write a new element. */
        if (iback1.compare_exchange_weak(i, incr(i)))
        {
            /* then store the input. */
            buffer[i] = input;
            /* Incrementing iback2 notifies pop() that it's safe to read. */
            ++iback2;
            return true;
        }
        /* If i no longer equals iback, we loop around and try again. */
    }
}
EDIT: I made some major modifications to the code, based on comments (Thanks KillianDS and n.m.!). Most importantly, ifront and iback are now ifront1, ifront2, iback1, and iback2. push() will now increment iback1, notifying other pushing threads that they can safely write to the next element (as long as it's not full), write the element, then increment iback2. iback2 is all that gets checked by pop(). pop() does the same thing, but with the ifrontn indices.
Now, once again, I fall into the trap of "this SHOULD work...", but I don't know anything about formal proofs or anything like that. At least this time, I can't think of a potential way that it could fail. Any advice is appreciated, except for "stop trying to write lock-free code".
The proper way to approach a lock free data structure is to write a semi formal proof that your design works in pseudo code. You shouldn't be asking "is this lock free code thread safe", but rather "does my proof that this lock free code is thread safe have any errors?"
Only after you have a formal proof that a pseudo code design works do you try to implement it. Often this brings to light issues like garbage collection that have to be handled carefully.
Your code should be the formal proof and pseudo code in comments, with the relatively unimportant implementation interspersed within.
Verifying your code is correct then consists of understanding the pseudo code, checking the proof, then checking for failure for your code to map to your pseudo code and proof.
Directly taking code and trying to check that it is lock free is impractical. The proof is the important thing in correctly designing this kind of thing, the actual code is secondary, as the proof is the hard part.
And after and while you have done all of the above, and have other people validate it, you have to put your code through practical tests to see if you have a blind spot and there is a hole, or don't understand your concurrency primitives, or if your concurrency primitives have bugs in them.
If you aren't interested in writing semi formal proofs to design your code, you shouldn't be hand rolling lock free algorithms and data structures and putting them into place in production code.
Determining if a pile of code "is thread safe" is putting all of the work load on other people. You need to have an argument why your code "is thread safe" arranged in such a way that it is as easy as possible for others to find holes in it. If your argument why your code "is thread safe" is arranged in ways that makes it harder to find holes, your code cannot be presumed to be thread safe, even if nobody can spot a hole in your code.
The code you posted above is a mess. It contains commented-out code, no formal invariants, no proofs that the individual lines maintain those invariants, no strong description of why it is thread safe, and in general does not put forward an attempt to show itself as thread safe in a way that makes it easy to spot flaws. As such, no reasonable reader will consider the code thread safe, even if they cannot find any errors in it.
No, it's not thread safe - consider the following sequence of events:
The first thread completes if (ifront.compare_exchange_weak(i, incr(i))) in pop and is put to sleep by the scheduler.
The second thread calls push size times (just enough for iback to wrap around to the value of i held by the first thread).
The first thread wakes.
In this case the buffer[i] read by pop will contain the last pushed value, which is wrong.
There are some issues when considering wrap-around, but I think the main issue of your code is that it may pop invalid values from the buffer.
Consider this:
ifront = iback = 0
push gets called and the CAS increases iback 0 -> 1. However, the thread is now stalled before buffer[0] is assigned.
ifront = 0, iback = 1
pop is now called. The CAS increases ifront 0 -> 1 and buffer[0] is read before it's assigned.
A stale or invalid value is popped.
PS. Some researchers have therefore asked for a DCAS or TCAS (double and triple CAS).

Multithreaded data processing pipeline in Qt

What would be a good way to solve the following problem in Qt:
I have a sensor class which continuously produces data. On this data, several operations have to be performed one after another, which may take quite long. For this I have some additional classes. Basically, every time a new data item is recorded, the first class should get the data, process it, pass the result to the next, and so on.
sensor --> class 1 --> ... --> last class
I want to put the individual classes of the pipeline into their own threads, so that class 1 may already work on sample n+1 when class 2 is processing sample n...
Also, as the individual steps may differ greatly in their performance (e.g. the sensor is way faster than the rest) and I'm not interested in outdated data, I want class 1 (and everything after it) to always get the newest data from their predecessor, discarding old data. So, no big buffer between the steps of the pipeline.
First I thought about using Qt::QueuedConnections for signals/slots, but I guess that this would introduce a queue full of outdated samples waiting to be processed by the slower parts of the pipeline?
Just build your own one-element "queue" class. It should have:
A piece of data (or pointer to data)
A Boolean "dataReady"
A mutex
A condition variable
The "enqueue" function is just:
lock mutex
Replace data with new data
dataReady = true
signal condition variable
The "dequeue" function is just:
lock mutex
while (!dataReady) cond_wait(condition, mutex)
tmpData = data
data = NULL (or zero)
dataReady = false
unlock mutext
return tmpData
The type of the data can be a template parameter.
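A minimal sketch of that class in standard C++ (QMutex and QWaitCondition would work the same way; the name LatestValue is illustrative):

#include <condition_variable>
#include <mutex>
#include <utility>

template <typename T>
class LatestValue
{
public:
    void enqueue(T value)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            data_ = std::move(value); // overwrite any stale sample
            dataReady_ = true;
        }
        cond_.notify_one();
    }

    T dequeue()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return dataReady_; });
        dataReady_ = false;
        return std::move(data_);
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    T data_{};
    bool dataReady_ = false;
};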
What you are dealing with is the Producer-Consumer pattern. You can find a general overview of it here: http://en.wikipedia.org/wiki/Producer-consumer_problem
You want to use a QMutex to limit access to the data to one thread at a time. Use the QMutexLocker to lock it.
For a VERY simplified example:
QList<quint32> data;
QMutex mutex;

// Consumer thread calls this
quint32 GetData()
{
    quint32 result(-1); // if -1 is a valid value, you may have to return a bool
                        // and pass the value back through a reference parameter
    QMutexLocker lock(&mutex);
    if (data.size())
    {
        result = data.front(); // or back
        data.clear();
    }
    return result;
}

// Producer thread calls this
void SetData(quint32 value)
{
    QMutexLocker lock(&mutex);
    data.push_back(value);
}