As a purely mental exercise I'm trying to get this to work without locks or mutexes. The idea is that when the consumer thread is reading/executing messages, it atomically swaps which std::vector the producer thread uses for writes. Is this possible? I've tried playing around with thread fences to no avail. There's a race condition here somewhere, because it occasionally segfaults. I imagine it's somewhere in the enqueue function. Any ideas?
// should execute functions on the original thread
class message_queue {
public:
    using fn = std::function<void()>;
    using queue = std::vector<fn>;

    message_queue() : write_index(0) {}

    // should only be called from the consumer thread
    void run() {
        // atomically grab the current pending queue and switch writes to the other one;
        // for example, if we're writing to queues[0], we grab a reference to queues[0]
        // and tell the producer to write to queues[1]
        queue& active = queues[write_index.fetch_xor(1)];
        // skip if we don't have any messages
        if (active.size() == 0) return;
        // run all messages/callbacks
        for (auto fn : active) {
            fn();
        }
        // clear the active queue so it can be re-used
        active.clear();
        // flip the active and pending queues back
        write_index.fetch_xor(1);
    }

    void enqueue(fn value) {
        // load the current pending queue and append some work
        queues[write_index.load()].push_back(value);
    }

private:
    queue queues[2];
    std::atomic<bool> is_empty; // unused for now
    std::atomic<int> write_index;
};
int main(int argc, const char* argv[])
{
    message_queue queue{};
    // flag to stop the message loop
    // doesn't actually need to be atomic because it's only read/written on the main thread
    std::atomic<bool> done(false);
    std::thread worker([&queue, &done] {
        int count = 100;
        // send the messages
        while (--count) {
            queue.enqueue([count] {
                // should be executed on the main thread
                std::cout << count << "\n";
            });
        }
        // finally tell the main thread we're done
        queue.enqueue([&] {
            std::cout << "done!\n";
            done = true;
        });
    });
    // run messages until the done flag is set
    while (!done) queue.run();
    worker.join();
}
If I understand your code correctly, there are data races, e.g.:

// producer
int r0 = write_index.load();       // r0 == 0

// consumer
int r1 = write_index.fetch_xor(1); // r1 == 0
queue& active = queues[r1];
active.size();

// producer
queues[r0].push_back(...);

Now both threads access the same queue at the same time. That's a data race, and that means undefined behaviour.
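For contrast, a minimal mutex-based sketch that avoids the race entirely; it is not lock-free, but the critical sections are tiny (a vector swap and a push_back):

#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class message_queue {
public:
    using fn = std::function<void()>;

    // consumer thread: drain everything queued so far
    void run() {
        std::vector<fn> local;
        {
            std::lock_guard<std::mutex> lock(m);
            local.swap(pending); // take the pending work under the lock
        }
        for (auto& f : local)    // run the callbacks outside the lock
            f();
    }

    // producer thread: append one message
    void enqueue(fn value) {
        std::lock_guard<std::mutex> lock(m);
        pending.push_back(std::move(value));
    }

private:
    std::mutex m;
    std::vector<fn> pending;
};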
Your lock-free queue fails to work because you did not start with at least a semi-formal proof of correctness and then turn that proof into an algorithm, with the proof as the primary text, comments connecting the proof to the code, and the code interwoven with both.
Unless you are copy/pasting an implementation from someone who did do that, any attempt to write a lock-free algorithm will fail. If you are copy-pasting someone else's implementation, please provide it.
Lock-free algorithms are not robust unless you have such a proof that they are correct, because the kinds of errors that make them fail are subtle, and extreme care must be taken. Simply "rolling" a lock-free algorithm, even if it produces no apparent problems during testing, is a recipe for unreliable code.
One way to get around writing a formal proof in this kind of situation is to track down someone who has written proven-correct pseudo-code or the like. Sketch out the pseudo-code, together with the proof of correctness, in comments. Then fill in the code in the holes.
In general, proving that an "almost correct" lock-free algorithm is flawed is harder than writing a solid proof that a lock-free algorithm is correct if implemented in a particular way, then implementing it. And if your algorithm is so flawed that the flaws are easy to find, you aren't showing a basic understanding of the problem domain.
In short, by posting "why is my algorithm wrong", you are approaching how to write lock-free algorithms incorrectly. "Where is the flaw in my proof?" or "I proved this pseudo-code correct here, and then I implemented it; why do my tests show deadlocks?" are good lock-free questions. "Here is a bunch of code whose comments merely describe what the next line does, with nothing explaining why that line is there or how it maintains my lock-free invariants" is not a good lock-free question.
Step back. Find some proven-correct algorithms. Learn how the proofs work. Implement some proven-correct algorithms via monkey-see monkey-do. Look at the footnotes to note the issues their proofs overlooked (like ABA issues). After you have a bunch of those under your belt, try a variation: do the proof, check the proof, do the implementation, check the implementation.
Related
Multiple producers, single consumer scenario, except consumption happens only once, after which the queue is "closed" and no more work is allowed. I have an MPSC queue, so I tried to add a lock-free algorithm to "close" the queue. I believe it's correct and it passes my tests. The problem is that when I try to optimise the memory order, it stops working (I think work is lost, e.g. enqueued after the queue is closed), even on x64, which has a "kind of" strong memory model, and even with a single producer.
My attempt to fine-tune memory order is commented out:
// thread-safe for multi-producer single-consumer use
// linked-list based, and so it's growable
MPSC_queue work_queue;
std::atomic<bool> closed{ false };
std::atomic<int32_t> producers_num{ 0 };

bool produce(Work&& work)
{
    bool res = false;
    ++producers_num;
    // producers_num.fetch_add(1, std::memory_order_release);
    if (!closed)
    // if (!closed.load(std::memory_order_acquire))
    {
        work_queue.push(std::move(work));
        res = true;
    }
    --producers_num;
    // producers_num.fetch_sub(1, std::memory_order_release);
    return res;
}

void consume()
{
    closed = true;
    // closed.store(true, std::memory_order_release);
    while (producers_num != 0)
    // while (producers_num.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();

    Work work;
    while (work_queue.pop(work))
        process(work);
}
I also tried std::memory_order_acq_rel for the read-modify-write ops on producers_num; that doesn't work either.
A bonus question:
This algorithm is used with an MPSC queue, which already does some synchronisation inside. It would be nice to combine them for better performance. Do you know of any such algorithm for a "closable" MPSC queue?
I think closed = true; does need to be seq_cst to make sure it's visible to other threads before you check producers_num the first time. Otherwise this ordering is possible:

producer: ++producers_num;
consumer: producers_num == 0
producer: if (!closed) finds it still open
consumer: closed.store(true, release) becomes globally visible.
consumer: work_queue.pop(work) finds the queue empty.
producer: work_queue.push(std::move(work)); adds work to the queue after the consumer has stopped looking.
You can still avoid seq_cst if you have the consumer re-check producers_num == 0 before returning, like

while (producers_num != 0)
// while (producers_num.load(std::memory_order_acquire) != 0)
    std::this_thread::yield();

do {
    Work work;
    while (work_queue.pop(work))
        process(work);
} while (producers_num.load(std::memory_order_acquire) != 0);
// safe if pop included a full barrier, I think

I'm not 100% sure I have this right, but I think checking producers_num after a full barrier is sufficient.
However, the producer side does need ++producers_num; to be at least acq_rel, otherwise it can reorder past if (!closed). (An acquire fence after the increment, before the if (!closed), might also work.)
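Spelled out, the producer with those orderings might look like this (a sketch reusing the question's declarations):

bool produce(Work&& work)
{
    bool res = false;
    // acq_rel: the increment must be visible before the load of closed,
    // and must not reorder past it
    producers_num.fetch_add(1, std::memory_order_acq_rel);
    if (!closed.load(std::memory_order_acquire))
    {
        work_queue.push(std::move(work));
        res = true;
    }
    producers_num.fetch_sub(1, std::memory_order_release);
    return res;
}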
Since you only want to use the queue once, it doesn't need to wrap around and can probably be quite a lot simpler: something like an atomic producer-position counter that writers increment to claim a spot, where getting a position > size means the queue was full. I haven't thought through the full details, though.
That might allow a cleaner solution to the above problem, perhaps by having the consumer look at that write index to see whether any producers had claimed a slot.
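A rough sketch of that idea, purely illustrative since the details above are deliberately left loose:

#include <array>
#include <atomic>
#include <cstddef>
#include <utility>

template <class Work, std::size_t N>
struct single_use_queue {
    std::array<Work, N> slots;             // assumes Work is default-constructible
    std::atomic<std::size_t> claimed{0};   // producer position counter
    std::atomic<std::size_t> committed{0}; // slots whose writes have finished

    bool push(Work w) {
        std::size_t i = claimed.fetch_add(1, std::memory_order_relaxed);
        if (i >= N)
            return false;                  // position past the end: queue was full
        slots[i] = std::move(w);
        committed.fetch_add(1, std::memory_order_release);
        return true;
    }

    // The consumer would wait until committed catches up with min(claimed, N)
    // once producers are done claiming, then read slots [0, committed).
};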
I have the following problem. I use a vector that gets filled with values from a temperature sensor; this function runs in one thread. Another thread is responsible for publishing all the values to a database, and runs once every second. The publishing thread locks the vector with a mutex, so the function that fills it with values gets blocked. However, while the publishing thread is using the vector, I want to save the temperature values to another vector, so that I don't lose any while the data is being published. How do I get around this problem? I thought about using a pointer to the containers and switching it to the other container once the first one gets locked, to keep saving values, but I don't quite know how.
I tried to add a minimal reproducible example; I hope it kind of explains my situation.
void publish(std::vector<temperature>& inputVector)
{
    // this function would publish the values into a database
    // via MQTT, and also runs in a thread
}

int main()
{
    std::vector<temperature> testVector;
    std::vector<temperature> testVector2;
    while (1)
    {
        // I am repeatedly saving values into the vector.
        // I want to do this in a thread, but if the vector is locked by a mutex
        // I want to switch over to the other vector.
        testVector.push_back(testSensor.getValue());
    }
}
Assuming you are using std::mutex, you can use mutex::try_lock on the producer side. Something like this:

std::mutex myMutex; // shared between the producer and publish()

while (1)
{
    if (myMutex.try_lock()) {
        // locking succeeded - move all queued values and push the new value
        std::move(testVector2.begin(), testVector2.end(), std::back_inserter(testVector));
        testVector2.clear();
        testVector.push_back(testSensor.getValue());
        myMutex.unlock();
    } else {
        // locking failed - queue the value
        testVector2.push_back(testSensor.getValue());
    }
}
Of course publish() needs to lock the mutex, too.
void publish(std::vector<temperature>& inputVector)
{
    std::lock_guard<std::mutex> lock(myMutex);
    // this function would publish the values into a database
    // via MQTT, and also runs in a thread
}
This seems like the perfect opportunity for an additional (shared) buffer or queue, that's protected by the lock.
main would be essentially as it is now, pushing your new values into the shared buffer.
The other thread would, when it can, lock that buffer and take the new values from it. This should be very fast.
Then, it does not need to lock the shared buffer while doing its database things (which take longer), as it's only working on its own vector during that procedure.
Here's some pseudo-code:
std::mutex pendingTempsMutex;
std::vector<temperature> pendingTemps;

void thread2()
{
    std::vector<temperature> temps;
    while (1)
    {
        // Get new temps if we have any
        {
            std::scoped_lock l(pendingTempsMutex);
            temps.swap(pendingTemps);
        }
        if (!temps.empty())
        {
            publish(temps);
            temps.clear(); // don't swap already-published values back in
        }
    }
}

void thread1()
{
    while (1)
    {
        std::scoped_lock l(pendingTempsMutex);
        pendingTemps.push_back(testSensor.getValue());
        /*
        Or, if getValue() blocks:
        temperature newValue = testSensor.getValue();
        std::scoped_lock l(pendingTempsMutex);
        pendingTemps.push_back(newValue);
        */
    }
}
Usually you'd use a std::queue for pendingTemps, though. I don't think it really matters in this example, because you're always consuming everything in thread 2, but it's more conventional and can be more efficient in some scenarios. You don't lose much either way, since it's backed by a std::deque. But you can measure/test to see what's best for you.
This solution is pretty much what you already proposed/explored in the question, except that the producer shouldn't be in charge of managing the second vector.
You can improve it by having thread2 wait to be "informed" that there are new values, with a condition variable, otherwise you're going to be doing a lot of busy-waiting. I leave that as an exercise to the reader ;) There should be an example and discussion in your multi-threaded programming book.
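For reference, one way that exercise could look, as a sketch (temperature and testSensor are the question's placeholders):

#include <condition_variable>
#include <mutex>
#include <vector>

std::mutex pendingTempsMutex;
std::condition_variable pendingTempsCv;
std::vector<temperature> pendingTemps;

void thread2()
{
    std::vector<temperature> temps;
    while (1)
    {
        {
            std::unique_lock<std::mutex> l(pendingTempsMutex);
            // sleep until the producer signals that there is data
            pendingTempsCv.wait(l, [] { return !pendingTemps.empty(); });
            temps.swap(pendingTemps);
        }
        publish(temps);
        temps.clear();
    }
}

void thread1()
{
    while (1)
    {
        temperature newValue = testSensor.getValue();
        {
            std::scoped_lock l(pendingTempsMutex);
            pendingTemps.push_back(newValue);
        }
        pendingTempsCv.notify_one(); // wake the publisher
    }
}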
Recently I have often found myself in situations where shared data gets read a lot but written rarely, so I began to wonder whether it's possible to speed up the synchronisation a little bit.
Take the following as an example, in which multiple threads occasionally write the data, a single thread frequently reads the data, and everything is synchronised with a normal mutex.
#include <iostream>
#include <unistd.h>
#include <cstdlib>
#include <unordered_map>
#include <mutex>
#include <thread>
using namespace std;

unordered_map<int, int> someData({{1, 10}});
mutex mu;

void writeData(){
    while (true) {
        {
            lock_guard<mutex> lock(mu);
            int r = rand() % 10;
            someData[1] = r;
            printf("data changed to %d\n", r);
        }
        usleep(rand() % 100000000 + 100000000);
    }
}

void readData(){
    while (true) {
        {
            lock_guard<mutex> lock(mu);
            for (auto& i : someData) {
                printf("%d:%d\n", i.first, i.second);
            }
        }
        usleep(100);
    }
}

int main() {
    thread writeT1(&writeData);
    thread writeT2(&writeData);
    thread readT(&readData);
    readT.join();
}
With the normal locking mechanism, every read requires taking the lock, so I'm thinking of speeding things up to a single atomic read in most cases:
unordered_map<int, int> someData({{1, 10}});
mutex mu;
atomic_int dataVersion{0};

void writeData2(){
    while (true) {
        {
            lock_guard<mutex> lock(mu);
            dataVersion.fetch_add(1, memory_order_acquire);
            int r = rand() % 10;
            someData[1] = r;
            printf("data changed to %d\n", r);
        }
        usleep(rand() % 100000000 + 100000000);
    }
}

void readData2(){
    mu.lock();
    int versionCopy = dataVersion.load();
    auto dataCopy = someData;
    mu.unlock();
    while (true) {
        if (versionCopy != dataVersion.load(memory_order_relaxed)) {
            lock_guard<mutex> lock(mu);
            versionCopy = dataVersion.load(memory_order_relaxed);
            dataCopy = someData;
        }
        else {
            for (auto& i : dataCopy) {
                printf("%d:%d\n", i.first, i.second);
            }
            usleep(100);
        }
    }
}
The unordered_map data type here is just an example; it could be any type, and I'm not looking for a pure lock-free algorithm, as that might be a whole other story. For a normal lock-based sync, in a situation where most operations are reads, is a trick like this logically OK? Are there any established approaches for this?
[edit]
I'm aware of the shared mutex, but it isn't really the situation I was talking about. Firstly, a shared lock is not cheap, probably more expensive than a plain mutex and certainly heavier than atomics; secondly, the example shows a single reading thread, which can't take much advantage of it.
I was interested particularly in the cost of the locking operation itself. Reducing blocking and shrinking the critical section are certainly the first things to look at in a real case, but I wasn't targeting that here.
The unordered_map data type is just an example; I'm not looking for a data structure that better suits a specific task, or for a lock-free algorithm. The data type could be anything.
The sleep times are there to demonstrate that reads happen far more often than writes, to the point where we stop caring about the extra lock and copy in the if block.
Thanks~
You are storing the data in an unordered_map. What guarantees does the unordered_map class make about concurrent access by readers and writers? If it is unhappy with that prospect, the atomics are not your friend.
In most (every?) OS, locking primitives themselves are handled with atomics in the uncontended case, only falling back to the kernel when contended. With that in mind, you are best off minimising the amount of code executed while the lock is held, so your first loop should be:
int r = rand() % 10;
mu.lock();
someData[1] = r;
mu.unlock();
printf("data changed to %d\n", r);
I don't know how you would fix the read side, but if you chose a friendlier data store, you could minimize access to it in the same way.
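For example, one friendlier arrangement for read-mostly data is to publish an immutable snapshot through a shared_ptr, so a reader pays one atomic load and never blocks the writers. A sketch, using the C++11 atomic free functions for shared_ptr (C++20 has std::atomic<std::shared_ptr> for this):

#include <cstdio>
#include <memory>
#include <mutex>
#include <unordered_map>

using Map = std::unordered_map<int, int>;

std::shared_ptr<const Map> snapshot = std::make_shared<const Map>(Map{{1, 10}});
std::mutex writeMu; // serialises writers only

void writeValue(int key, int value) {
    std::lock_guard<std::mutex> lock(writeMu);
    auto next = std::make_shared<Map>(*std::atomic_load(&snapshot)); // copy current data
    (*next)[key] = value;
    std::atomic_store(&snapshot, std::shared_ptr<const Map>(std::move(next)));
}

void readAll() {
    auto local = std::atomic_load(&snapshot); // one atomic load, no lock
    for (auto& kv : *local)
        printf("%d:%d\n", kv.first, kv.second);
}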
I will first try to describe my own understanding of your idea:
Frequent reads, occasional writes.
Locks are expensive... that should be benchmarked. Try std::shared_mutex, or Slim Reader/Writer (SRW) Locks (Windows only), or some other slim implementation; these usually use a cheap, optimistic (atomic/spin-lock) mechanism that has little to no impact when there is no collision (no writer most of the time).
You don't seem to care how old/recent your copy is. That is acceptable for some informative performance counters, but I would think twice about it: it is not something somebody else maintaining your code would expect, or even think about. The consequences can be catastrophic.
You only modify the data under the lock, and you create your copy while holding the lock. That means your approach is safe from a simple thread-synchronisation point of view, except for the above point (readers working with old data; multiple readers can have different copies... is it worth it?).
Anyway, you should really benchmark first, and then try to find an existing solution somebody else has already written (slim rw-locks), before attempting to come up with your own synchronisation mechanism (which is generally very hard to do correctly).
EDIT: I found an article with a concrete shared_mutex implementation using std::atomic:
Code Project: We make a std::shared_mutex 10 times faster
Coliru test here
So, I've written a queue, after a bit of research. It uses a fixed-size buffer, so it's a circular queue. It has to be thread-safe, and I've tried to make it lock-free. I'd like to know what's wrong with it, because these kinds of things are difficult to predict on my own.
Here's the header:
template <class T>
class LockFreeQueue
{
public:
    LockFreeQueue(uint buffersize) : buffer(NULL), ifront1(0), ifront2(0), iback1(0), iback2(0), size(buffersize)
        { buffer = new atomic<T>[buffersize]; }
    ~LockFreeQueue(void) { if (buffer) delete[] buffer; }
    bool pop(T* output);
    bool push(T input);
private:
    uint incr(const uint val)
        { return (val + 1) % size; }
    atomic<T>* buffer;
    atomic<uint> ifront1, ifront2, iback1, iback2;
    uint size;
};
And here's the implementation:
template <class T>
bool LockFreeQueue<T>::pop(T* output)
{
    while (true)
    {
        /* Fetch ifront and store it in i. */
        uint i = ifront1;
        /* If ifront == iback, the queue is empty. */
        if (i == iback2)
            return false;
        /* If i still equals ifront, increment ifront. */
        /* Incrementing ifront1 notifies pop() that it can read the next element. */
        if (ifront1.compare_exchange_weak(i, incr(i)))
        {
            /* Then fetch the output. */
            *output = buffer[i];
            /* Incrementing ifront2 notifies push() that it's safe to write. */
            ++ifront2;
            return true;
        }
        /* If i no longer equals ifront, we loop around and try again. */
    }
}

template <class T>
bool LockFreeQueue<T>::push(T input)
{
    while (true)
    {
        /* Fetch iback and store it in i. */
        uint i = iback1;
        /* If ifront == (iback + 1), the queue is full. */
        if (ifront2 == incr(i))
            return false;
        /* If i still equals iback, increment iback. */
        /* Incrementing iback1 notifies push() that it can write a new element. */
        if (iback1.compare_exchange_weak(i, incr(i)))
        {
            /* Then store the input. */
            buffer[i] = input;
            /* Incrementing iback2 notifies pop() that it's safe to read. */
            ++iback2;
            return true;
        }
        /* If i no longer equals iback, we loop around and try again. */
    }
}
EDIT: I made some major modifications to the code, based on comments (thanks KillianDS and n.m.!). Most importantly, ifront and iback are now ifront1, ifront2, iback1, and iback2. push() will now increment iback1, notifying other pushing threads that they can safely write to the next element (as long as the queue isn't full), write the element, then increment iback2. iback2 is all that gets checked by pop(). pop() does the same thing, but with the ifront indices.
Now, once again, I fall into the trap of "this SHOULD work...", but I don't know anything about formal proofs or anything like that. At least this time, I can't think of a potential way it could fail. Any advice is appreciated, except for "stop trying to write lock-free code".
The proper way to approach a lock free data structure is to write a semi formal proof that your design works in pseudo code. You shouldn't be asking "is this lock free code thread safe", but rather "does my proof that this lock free code is thread safe have any errors?"
Only after you have a formal proof that a pseudo code design works do you try to implement it. Often this brings to light issues like garbage collection that have to be handled carefully.
Your code should be the formal proof and pseudo code in comments, with the relatively unimportant implementation interspersed within.
Verifying your code is correct then consists of understanding the pseudo code, checking the proof, then checking for failure for your code to map to your pseudo code and proof.
Directly taking code and trying to check that it is lock free is impractical. The proof is the important thing in correctly designing this kind of thing, the actual code is secondary, as the proof is the hard part.
And after, and while, you have done all of the above and had other people validate it, you have to put your code through practical tests, to see whether you have a blind spot and there is a hole, or you don't understand your concurrency primitives, or your concurrency primitives themselves have bugs.
If you aren't interested in writing semi-formal proofs to design your code, you shouldn't be hand-rolling lock-free algorithms and data structures and putting them into production code.
Determining whether a pile of code "is thread safe" puts the entire workload on other people. You need an argument for why your code is thread safe, arranged so that it is as easy as possible for others to find holes in it. If your argument is arranged in a way that makes holes harder to find, your code cannot be presumed thread safe, even if nobody can spot a hole in it.
The code you posted above is a mess. It contains commented-out code, no formal invariants, no proofs that the lines of code maintain those invariants, no strong description of why it is thread safe, and in general does not put forward an attempt to show itself as thread safe in a way that makes it easy to spot flaws. As such, no reasonable reader will consider the code thread safe, even if they cannot find any errors in it.
No, it's not thread safe; consider the following sequence of events:
First thread completes if (ifront.compare_exchange_weak(i, incr(i))) in pop and is put to sleep by the scheduler.
Second thread calls push size times (just enough to make ifront equal to the value of i in the first thread).
First thread wakes.
In this case, buffer[i] in pop will contain the last pushed value, which is wrong.
There are some issues when considering wrap-around, but I think the main issue of your code is that it may pop invalid values from the buffer.
Consider this:
ifront = iback = 0
Push gets called and the CAS increases the value of iback 0 -> 1. However, the thread now gets stalled before buffer[0] is assigned.
ifront = 0, iback = 1
Pop is now called. The CAS increases ifront 0 -> 1 and buffer[0] is read before it's assigned.
A stale or invalid value is popped.
PS. Some researchers have therefore asked for a DCAS or TCAS (double and triple CAS).
Can you spot the error in the code? The ticket count ends up going below 0, causing long stalls.
struct SContext {
    volatile unsigned long* mutex;
    volatile long* ticket;
    volatile bool* done;
};

static unsigned int MyThreadFunc(SContext* ctxt) {
    // -- keep going until we signal for the thread to close
    while(*ctxt->done == false) {
        while(*ctxt->ticket) { // while we have tickets waiting
            unsigned int lockedaquired = 0;
            do {
                if(*ctxt->mutex == 0) { // only try if nobody holds the mutex
                    // -- if the compare-and-swap doesn't work, the function returns
                    // -- the value it expects
                    lockedaquired = InterlockedCompareExchange(ctxt->mutex, 1, 0);
                }
            } while(lockedaquired != 0); // loop while we didn't acquire the lock

            // -- enter critical section

            // -- grab a ticket
            if(*ctxt->ticket > 0);
                (*ctxt->ticket)--;

            // -- exit critical section
            *ctxt->mutex = 0; // release lock
        }
    }
    return 0;
}
Calling function waiting for threads to finish
for(unsigned int loops = 0; loops < eLoopCount; ++loops) {
    *ctxt.ticket = eNumThreads; // let the threads start!
    // -- wait for threads to finish
    while(*ctxt.ticket != 0)
        ;
}
done = true;
EDIT:
The answer to this question is simple, and unfortunately, after I spent the time trimming down the example to post a simplified version, I found the answer immediately after posting. Sigh...
I initialize lockedaquired to 0. Then, as an optimization to not take up bus bandwidth, I don't do the CAS if the mutex is taken.
Unfortunately, in the case where the lock is taken, the while loop lets the second thread through!
Sorry for the extra question. I thought I didn't understand Windows' low-level synchronization primitives, but really I just had a simple mistake.
I see another race in your code: one thread can cause *ctxt.ticket to hit 0, allowing the parent loop to go back and re-set *ctxt.ticket = eNumThreads without holding *ctxt.mutex. Some other thread may already hold the mutex at that point (in fact, it probably does) and operate on *ctxt.ticket. For your simplified example this only prevents "batches" from being cleanly separated, but if you had more complex initialization (as in, more complex than a single word write) at the top of the loops loop, you could see strange behavior.
I posted a bug that I thought was a legitimate multithreaded problem, but really it was just bad logic. I solved it as soon as I posted. Here are the problem lines and the answer:
unsigned int lockedaquired = 0;
I initialized lockedaquired to 0, and then added an if statement to skip the expensive CAS operation. This optimization caused the loop to fall through into the critical section without acquiring the lock. Changing the code to
unsigned int lockedaquired = 1;
fixes the problem. There is another hidden problem in the code that I found as well (I really shouldn't code late at night anymore). Anyone notice the semicolon after the if statement in the critical section? Sigh...
if(*ctxt->ticket > 0);
(*ctxt->ticket)--;
That should be
if(*ctxt->ticket > 0)
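Putting both fixes together, the acquire loop and critical section would look something like this sketch:

// start at 1 so we keep spinning until the CAS actually succeeds
unsigned int lockedaquired = 1;
do {
    if(*ctxt->mutex == 0) { // still worth testing before the CAS
        lockedaquired = InterlockedCompareExchange(ctxt->mutex, 1, 0);
    }
} while(lockedaquired != 0);

// -- critical section
if(*ctxt->ticket > 0)   // no stray semicolon; the decrement is guarded
    (*ctxt->ticket)--;

*ctxt->mutex = 0; // release lock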
Also, Ben Jackson pointed out that a thread will probably be inside the critical section when we reset the ticket to eNumThreads. While this is perfectly fine in this sample code, if you were to apply it to a problem where you needed to do more operations, it might not be safe, because the threads aren't running in lockstep. Keep that in mind if you apply this to your code.
A final note: if anyone does decide to use this code for their own implementation of a mutex, please remember that your main driver thread is spinning idle. If you are doing a large operation in the critical section that takes a good deal of time, and your ticket count is high, consider yielding your thread to let other software make use of the CPU while it's waiting. Also, consider using a spin lock if the critical section is large.
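For example, the driver's waiting loop could yield instead of spinning hot (std::this_thread::yield is the portable spelling; SwitchToThread or Sleep(0) are the Windows-native options):

#include <thread>

while(*ctxt.ticket != 0)
    std::this_thread::yield(); // give up the timeslice while we wait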
Thank you