Looking for critique of my thread-safe, lock-free queue implementation - C++

So, I've written a queue, after a bit of research. It uses a fixed-size buffer, so it's a circular queue. It has to be thread-safe, and I've tried to make it lock-free. I'd like to know what's wrong with it, because these kinds of things are difficult to predict on my own.
Here's the header:
#include <atomic>

using std::atomic;
typedef unsigned int uint; // assuming no platform uint typedef is in scope

template <class T>
class LockFreeQueue
{
public:
    LockFreeQueue(uint buffersize)
        : buffer(NULL), ifront1(0), ifront2(0), iback1(0), iback2(0), size(buffersize)
    { buffer = new atomic<T>[buffersize]; }
    ~LockFreeQueue(void) { delete[] buffer; }
    bool pop(T* output);
    bool push(T input);
private:
    uint incr(const uint val) { return (val + 1) % size; }
    atomic<T>* buffer;
    atomic<uint> ifront1, ifront2, iback1, iback2;
    uint size;
};
And here's the implementation:
template <class T>
bool LockFreeQueue<T>::pop(T* output)
{
    while (true)
    {
        /* Fetch ifront and store it in i. */
        uint i = ifront1;
        /* If ifront == iback, the queue is empty. */
        if (i == iback2)
            return false;
        /* If i still equals ifront, increment ifront. */
        /* Incrementing ifront1 notifies other popping threads that they can claim the next element. */
        if (ifront1.compare_exchange_weak(i, incr(i)))
        {
            /* Then fetch the output. */
            *output = buffer[i];
            /* Incrementing ifront2 notifies push() that it's safe to write. */
            ++ifront2;
            return true;
        }
        /* If i no longer equals ifront, loop around and try again. */
    }
}
template <class T>
bool LockFreeQueue<T>::push(T input)
{
    while (true)
    {
        /* Fetch iback and store it in i. */
        uint i = iback1;
        /* If ifront == (iback + 1), the queue is full. */
        if (ifront2 == incr(i))
            return false;
        /* If i still equals iback, increment iback. */
        /* Incrementing iback1 notifies other pushing threads that they can claim the next slot. */
        if (iback1.compare_exchange_weak(i, incr(i)))
        {
            /* Then store the input. */
            buffer[i] = input;
            /* Incrementing iback2 notifies pop() that it's safe to read. */
            ++iback2;
            return true;
        }
        /* If i no longer equals iback, loop around and try again. */
    }
}
EDIT: I made some major modifications to the code, based on comments (Thanks KillianDS and n.m.!). Most importantly, ifront and iback are now ifront1, ifront2, iback1, and iback2. push() will now increment iback1, notifying other pushing threads that they can safely write to the next element (as long as it's not full), write the element, then increment iback2. iback2 is all that gets checked by pop(). pop() does the same thing, but with the ifrontn indices.
Now, once again, I fall into the trap of "this SHOULD work...", but I don't know anything about formal proofs or anything like that. At least this time, I can't think of a potential way that it could fail. Any advice is appreciated, except for "stop trying to write lock-free code".
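For reference, this is roughly the kind of stress test I've been hammering it with (a minimal harness, assuming the header and implementation above are in scope; my real tests use more threads and iterations):

#include <cstdio>
#include <thread>

int main()
{
    LockFreeQueue<int> q(1024);

    // Two producers each push the values 1..100000; the main thread consumes.
    auto producer = [&q] {
        for (int i = 1; i <= 100000; ++i)
            while (!q.push(i))
                std::this_thread::yield(); // queue full, retry
    };
    std::thread p1(producer), p2(producer);

    long long expected = 2LL * 100000 * 100001 / 2; // sum of everything pushed
    long long got = 0;
    int v;
    for (long long n = 0; n < 200000; )
        if (q.pop(&v)) { got += v; ++n; }

    p1.join();
    p2.join();
    // A sum mismatch (or a hang) indicates lost, duplicated, or stale elements.
    std::printf("expected %lld, got %lld\n", expected, got);
}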

The proper way to approach a lock free data structure is to write a semi formal proof that your design works in pseudo code. You shouldn't be asking "is this lock free code thread safe", but rather "does my proof that this lock free code is thread safe have any errors?"
Only after you have a formal proof that a pseudo code design works do you try to implement it. Often this brings to light issues like garbage collection that have to be handled carefully.
Your code should be the formal proof and pseudo code in comments, with the relatively unimportant implementation interspersed within.
Verifying your code is correct then consists of understanding the pseudo code, checking the proof, and then checking for any failure of your code to map to your pseudo code and proof.
Directly taking code and trying to check that it is lock free is impractical. The proof is the important thing in correctly designing this kind of thing, the actual code is secondary, as the proof is the hard part.
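To illustrate (my own sketch, not the OP's code): here is the classic Treiber stack push written proof-first, where the invariant and the argument carry the weight and the code is almost incidental:

#include <atomic>

struct Node { Node* next; int value; };

// Invariant I: head points to null or to a node whose 'next' chain
// reaches null without cycles (i.e. a valid stack).
//
// Claim: push preserves I and linearizes at the successful CAS.
// Argument: n is not reachable from head before the CAS, so writing
// n->next has no visible effect. The CAS succeeds only if head still
// equals the value we linked n to, so afterwards the chain from head
// is n followed by a chain that already satisfied I. Release ordering
// on the successful CAS publishes n's fields to any acquiring reader.
void push(std::atomic<Node*>& head, Node* n)
{
    Node* old = head.load(std::memory_order_relaxed);
    do {
        n->next = old; // re-link on each retry; 'old' is refreshed by the failed CAS
    } while (!head.compare_exchange_weak(old, n,
                 std::memory_order_release,
                 std::memory_order_relaxed));
}

Checking that code then amounts to checking the argument in the comments, and checking that the code does not deviate from it.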
And after (and while) you have done all of the above, and have had other people validate it, you have to put your code through practical tests to see if you have a blind spot and there is a hole, or don't understand your concurrency primitives, or if your concurrency primitives have bugs in them.
If you aren't interested in writing semi formal proofs to design your code, you shouldn't be hand rolling lock free algorithms and data structures and putting them into place in production code.
Determining if a pile of code "is thread safe" is putting all of the work load on other people. You need to have an argument why your code "is thread safe" arranged in such a way that it is as easy as possible for others to find holes in it. If your argument why your code "is thread safe" is arranged in ways that makes it harder to find holes, your code cannot be presumed to be thread safe, even if nobody can spot a hole in your code.
The code you posted above is a mess. It contains commented-out code, no formal invariants, no proofs that the lines of code maintain those invariants, no strong description of why it is thread safe, and in general does not put forward an attempt to show itself as thread safe in a way that makes it easy to spot flaws. As such, no reasonable reader will consider the code thread safe, even if they cannot find any errors in it.

No, it's not thread safe - consider the following sequence of events:
First thread completes if (ifront.compare_exchange_weak(i, incr(i))) in pop and is put to sleep by the scheduler.
Second thread calls push size times (just enough to make iback wrap around to the value of i in the first thread).
First thread wakes.
When pop then reads buffer[i], it will contain the last pushed value, which is wrong.

There are some issues when considering wrap-around, but I think the main issue with your code is that it may pop invalid values from the buffer.
Consider this:
ifront = iback = 0
Push gets called and CAS increases the value of iback 0 -> 1. However, the thread now gets stalled before buffer[0] is assigned.
ifront = 0, iback = 1
Pop is now called. CAS increases ifront 0 -> 1 and buffer[0] is read before it's assigned.
A stale or invalid value is popped.
PS. Some researchers have therefore asked for a DCAS or TCAS (double and triple CAS).
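For contrast, here is a single-producer/single-consumer sketch (my illustration, not a fix for the multi-producer code above) of the rule being violated: the index that the other side reads must only be advanced after the slot it guards has been fully written. With multiple producers you need more machinery, such as per-slot sequence numbers, or the DCAS mentioned above.

#include <atomic>
#include <cstddef>

template <class T, std::size_t N> // N >= 2; capacity is N - 1
class SpscRing {
    T buf_[N];
    std::atomic<std::size_t> head_{0}, tail_{0};
public:
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if ((t + 1) % N == head_.load(std::memory_order_acquire))
            return false;                                    // full
        buf_[t] = v;                                         // write the element first...
        tail_.store((t + 1) % N, std::memory_order_release); // ...then publish the index
        return true;
    }
    bool pop(T* out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                                    // empty
        *out = buf_[h];                                      // only published slots are read
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};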

Related

lock-free "closable" MPSC queue

Multiple producers, single consumer scenario, except consumption happens once; after that the queue is "closed" and no more work is allowed. I have an MPSC queue, so I tried to add a lock-free algorithm to "close" the queue. I believe it's correct and it passes my tests. The problem is that when I try to optimise memory order it stops working (I think work is lost, e.g. enqueued after the queue is closed), even on x64, which has a "kind of" strong memory model, and even with a single producer.
My attempt to fine-tune memory order is commented out:
// thread-safe for multi producers single consumer use
// linked-list based, and so it's growable
MPSC_queue work_queue;
std::atomic<bool> closed{ false };
std::atomic<int32_t> producers_num{ 0 };

bool produce(Work&& work)
{
    bool res = false;
    ++producers_num;
    // producers_num.fetch_add(1, std::memory_order_release);
    if (!closed)
    // if (!closed.load(std::memory_order_acquire))
    {
        work_queue.push(std::move(work));
        res = true;
    }
    --producers_num;
    // producers_num.fetch_sub(1, std::memory_order_release);
    return res;
}

void consume()
{
    closed = true;
    // closed.store(true, std::memory_order_release);
    while (producers_num != 0)
    // while (producers_num.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();

    Work work;
    while (work_queue.pop(work))
        process(work);
}
I also tried std::memory_order_acq_rel for the read-modify-write ops on producers_num; that doesn't work either.
A bonus question:
This algorithm is used with MPSC queue, which already does some synchronisation inside. It would be nice to combine them for better performance. Do you know any such algorithm for "closable" MPSC queue?
I think closed = true; does need to be seq_cst to make sure it's visible to other threads before you check producers_num the first time. Otherwise this ordering is possible:
producer: ++producers_num;
consumer: producers_num == 0
producer: if (!closed) finds it still open
consumer: closed.store(true, release) becomes globally visible.
consumer: work_queue.pop(work) finds the queue empty
producer: work_queue.push(std::move(work)); adds work to the queue after consumer has stopped looking.
You can still avoid seq_cst if you have the consumer check producers_num == 0 before returning, like
while (producers_num != 0)
// while (producers_num.load(std::memory_order_acquire) != 0)
    std::this_thread::yield();

do {
    Work work;
    while (work_queue.pop(work))
        process(work);
} while (producers_num.load(std::memory_order_acquire) != 0);
// safe if pop included a full barrier, I think
I'm not 100% sure I have this right, but I think checking producer_num after a full barrier is sufficient.
However, the producer side does need ++producers_num; to be at least acq_rel, otherwise it can reorder past if (!closed). (An acquire fence after it, before if(!closed) might also work).
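Putting that together, a conservatively ordered version of the two functions might look like this (a sketch against the question's code, deliberately using seq_cst everywhere the store/load pattern needs it; untested):

bool produce(Work&& work)
{
    bool res = false;
    producers_num.fetch_add(1, std::memory_order_seq_cst); // must not reorder with the closed check
    if (!closed.load(std::memory_order_seq_cst))
    {
        work_queue.push(std::move(work));
        res = true;
    }
    producers_num.fetch_sub(1, std::memory_order_release);
    return res;
}

void consume()
{
    closed.store(true, std::memory_order_seq_cst); // globally visible before we sample producers_num
    while (producers_num.load(std::memory_order_seq_cst) != 0)
        std::this_thread::yield();

    Work work;
    while (work_queue.pop(work))
        process(work);
}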
Since you only want to use the queue once, it doesn't need to wrap around and can probably be quite a lot simpler. Like an atomic producer-position counter that writers increment to claim a spot, and if they get a position > size then the queue was full. I haven't thought through the full details, though.
That might allow a cleaner solution to the above problem, perhaps by having the consumer look at that write index to see whether any producers had claimed slots.
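Something like this, perhaps (untested, and only a guess at the details: producers claim slots with fetch_add, and the consumer closes the queue by pushing the claim counter past the capacity so every later claim fails):

#include <atomic>
#include <cstddef>
#include <thread>

template <class Work, std::size_t N>
struct OneShotQueue {
    std::atomic<std::size_t> claimed{0};   // slots handed out to producers
    std::atomic<std::size_t> published{0}; // slots fully written
    Work slots[N];

    bool produce(Work&& w) {
        std::size_t i = claimed.fetch_add(1, std::memory_order_relaxed);
        if (i >= N) return false;          // full, or already closed
        slots[i] = std::move(w);
        published.fetch_add(1, std::memory_order_release);
        return true;
    }

    // Consumer: bump 'claimed' past N so no later claim can win, then
    // wait until every successful claim has been published.
    std::size_t close() {
        std::size_t total = claimed.fetch_add(N, std::memory_order_relaxed);
        if (total > N) total = N;          // claims beyond N had already failed
        while (published.load(std::memory_order_acquire) < total)
            std::this_thread::yield();
        return total;                      // slots[0..total) are now safe to read
    }
};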

SPSC lock free queue without atomics

I have below a SPSC queue for my logger.
It is certainly not a general-use SPSC lock-free queue.
However, given a bunch of assumptions around how it will be used, the target architecture, etc., and a number of acceptable tradeoffs, which I go into in detail below, my question is basically: is it safe / does it work?
It will only be used on x86_64 architecture, so writes to uint16_t will be atomic.
Only the producer updates the tail.
Only the consumer updates the head.
If the producer reads an old value of head, it will look like there is less space in the queue than there really is, which is an acceptable limitation in the context in which it is used.
If the consumer reads an old value of tail, it will look like there is less data waiting in the queue than reality, again an acceptable limitation.
The limitations above are acceptable because:
the consumer may not get the latest tail immediately, but eventually the latest tail will arrive, and queued data will be logged.
the producer may not get the latest head immediately, so the queue will look more full than it really is. In our load testing we have found that, given the amount we log vs the size of the queue, and the speed at which the logger drains the queue, this limitation has no effect - there is always space in the queue.
A final point: the use of volatile is necessary to prevent the variable that each thread only reads from being optimised out.
My questions:
Is this logic correct?
Is the queue thread safe?
Is volatile sufficient?
Is volatile necessary?
My queue:
class LogBuffer
{
public:
    bool is_empty() const { return head_ == tail_; }
    bool is_full() const { return uint16_t(tail_ + 1) == head_; }
    LogLine& head() { return log_buffer_[head_]; }
    LogLine& tail() { return log_buffer_[tail_]; }
    void advance_head() { ++head_; }
    void advance_tail() { ++tail_; }
private:
    volatile uint16_t tail_ = 0;     // write position
    LogLine log_buffer_[0xffff + 1]; // relies on the uint16_t overflowing
    volatile uint16_t head_ = 0;     // read position
};
Is this logic correct?
Yes.
Is the queue thread safe?
No.
Is volatile sufficient? Is volatile necessary?
No, to both. Volatile is not a magic keyword that makes any variable thread-safe. You still need to use atomic variables or memory barriers for the indexes to ensure memory ordering is correct when you produce or consume an item.
To be more specific, after you produce or consume an item for your queue you need to issue a memory barrier to guarantee that other threads will see the changes. Many atomic libraries will do this for you when you update an atomic variable.
As an aside, use "was_empty" instead of "is_empty" to be clear about what it does. The result of this call is one instance in time which may have changed by the time you act on its value.
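To make that concrete, the index handling might look like this with std::atomic in place of volatile (a sketch only; it assumes the LogLine type from the question and keeps the same interface):

#include <atomic>
#include <cstdint>

class LogBuffer
{
public:
    bool is_empty() const {
        return head_.load(std::memory_order_acquire) ==
               tail_.load(std::memory_order_acquire);
    }
    bool is_full() const {
        return uint16_t(tail_.load(std::memory_order_acquire) + 1) ==
               head_.load(std::memory_order_acquire);
    }
    LogLine& head() { return log_buffer_[head_.load(std::memory_order_relaxed)]; }
    LogLine& tail() { return log_buffer_[tail_.load(std::memory_order_relaxed)]; }
    // Release stores: the LogLine written into the slot becomes visible
    // to the other thread before the index that publishes it moves.
    void advance_head() {
        head_.store(uint16_t(head_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }
    void advance_tail() {
        tail_.store(uint16_t(tail_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }
private:
    std::atomic<uint16_t> tail_{0};  // write position, only the producer stores it
    LogLine log_buffer_[0xffff + 1]; // relies on the uint16_t overflowing
    std::atomic<uint16_t> head_{0};  // read position, only the consumer stores it
};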

Lock-free stack - Is this a correct usage of c++11 relaxed atomics? Can it be proven?

I've written a container for a very simple piece of data that needs to be synchronized across threads. I want the top performance. I don't want to use locks.
I want to use "relaxed" atomics. Partly for that little bit of extra oomph, and partly to really understand them.
I've been working on this a lot, and I'm at the point where this code passes all tests I throw at it. That's not quite "proof" though, and so I'm wondering if there's anything I'm missing, or any other ways I can test this?
Here's my premise:
It is only important that a Node be properly pushed and popped, and that the Stack can never be invalidated.
I believe that the order of operations in memory is only important in one place:
Between the compare_exchange operations themselves. This is guaranteed, even with relaxed atomics.
The "ABA" problem is solved by adding identification numbers to the pointers. On 32 bit systems, this requires a double-word compare_exchange, and on 64 bit systems the unused 16 bits of the pointer are filled with id numbers.
Therefore: the stack will always be in a valid state. (right?)
Here's what I'm thinking. "Normally", the way we reason about code that we're reading is to look at the order in which it's written. Memory can be read or written to "out of order", but not in a way that invalidates the correctness of the program.
That changes in a multi-threaded environment. That's what memory fences are for - so that we can still look at the code and be able to reason about how it's going to work.
So if everything can go all out-of-order here, what am I doing with relaxed atomics? Isn't that a bit too far?
I don't think so, but that's why I'm here asking for help.
The compare_exchange operations themselves give a guarantee of sequential consistency with each other.
The only other time there is a read or write to an atomic is to get the head's initial value before a compare_exchange. It is set as part of the initialization of a variable. As far as I can tell, it would be irrelevant whether or not this operation brings back a "proper" value.
Current code:
struct node
{
    node* n_;

#if PROCESSOR_BITS == 64
    inline constexpr node() : n_{ nullptr } { }
    inline constexpr node(node* n) : n_{ n } { }
    inline void tag(const stack_tag_t t) { reinterpret_cast<stack_tag_t*>(this)[3] = t; }
    inline stack_tag_t read_tag() { return reinterpret_cast<stack_tag_t*>(this)[3]; }
    inline void clear_pointer() { tag(0); }
#elif PROCESSOR_BITS == 32
    stack_tag_t t_;
    inline constexpr node() : n_{ nullptr }, t_{ 0 } { }
    inline constexpr node(node* n) : n_{ n }, t_{ 0 } { }
    inline void tag(const stack_tag_t t) { t_ = t; }
    inline stack_tag_t read_tag() { return t_; }
    inline void clear_pointer() { }
#endif

    inline void set(node* n, const stack_tag_t t) { n_ = n; tag(t); }
};

using std::memory_order_relaxed;

class stack
{
public:
    constexpr stack() : head_{} { }

    void push(node* n)
    {
        node next{ n }, head{ head_.load(memory_order_relaxed) };
        do
        {
            n->n_ = head.n_;
            next.tag(head.read_tag() + 1);
        } while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));
    }

    bool pop(node*& n)
    {
        node clean, next, head{ head_.load(memory_order_relaxed) };
        do
        {
            clean.set(head.n_, 0);
            if (!clean.n_)
                return false;
            next.set(clean.n_->n_, head.read_tag() + 1);
        } while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));
        n = clean.n_;
        return true;
    }

protected:
    std::atomic<node> head_;
};
What's different about this question compared to others? Relaxed atomics. They make a big difference to the question.
So, what do you think? Is there anything I'm missing?
push is broken, since you do not update node->_next after a compareAndSwap failure. It's possible that the node you originally stored with node->setNext has been popped from the top of the stack by another thread by the time the next compareAndSwap attempt succeeds. As a result, some thread thinks it has popped a node from the stack, but this thread has put it back. It should be:
void push(Node* node) noexcept
{
    Node* n = _head.next();
    do {
        node->setNext(n);
    } while (!_head.compareAndSwap(n, node));
}
Also, since next and setNext use memory_order_relaxed, there's no guarantee that _head.next() here is returning the node most recently pushed. It's possible to leak nodes from the top of the stack. The same problem obviously exists in pop as well: _head.next() may return a node that was previously, but is no longer, at the top of the stack. If the returned value is nullptr, you may fail to pop when the stack is not actually empty.
pop can also have undefined behavior if two threads try to pop the last node from the stack at the same time. They both see the same value for _head.next(); one thread successfully completes the pop. The other thread enters the while loop - since the observed node pointer is not nullptr - but the compareAndSwap loop soon updates it to nullptr since the stack is now empty. On the next iteration of the loop, that nullptr is dereferenced to get its _next pointer and much hilarity ensues.
pop is also clearly suffering from ABA. Two threads can see the same node at the top of the stack. Say one thread gets to the point of evaluating the _next pointer and then blocks. The other thread successfully pops the node, pushes 5 new nodes, and then pushes that original node again all before the other thread wakes. That other thread's compareAndSwap will succeed - the top-of-stack node is the same - but store the old _next value into _head instead of the new one. The five nodes pushed by the other thread are all leaked. This would be the case with memory_order_seq_cst as well.
Leaving to one side the difficulty of implementing the pop operation, I think memory_order_relaxed is inadequate. Before pushing the node, one assumes that some value(s) will be written into it, which will be read when the node is popped. You need some synchronization mechanism to ensure that the values have actually been written before they are read. memory_order_relaxed is not providing that synchronization... memory_order_acquire/memory_order_release would.
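Concretely, that means changing the orderings on the two CAS loops, something like this (a sketch against the question's types; it addresses only the publication ordering, not the potential dereference of a reclaimed node discussed above):

void push(node* n)
{
    node next{ n }, head{ head_.load(std::memory_order_relaxed) };
    do
    {
        n->n_ = head.n_;
        next.tag(head.read_tag() + 1);
    } while (!head_.compare_exchange_weak(head, next,
                 std::memory_order_release,    // success: publish *n to poppers
                 std::memory_order_relaxed));  // failure: just retry
}

bool pop(node*& n)
{
    node clean, next, head{ head_.load(std::memory_order_acquire) };
    do
    {
        clean.set(head.n_, 0);
        if (!clean.n_)
            return false;
        next.set(clean.n_->n_, head.read_tag() + 1);
    } while (!head_.compare_exchange_weak(head, next,
                 std::memory_order_acquire,    // success: see the pushed fields
                 std::memory_order_acquire));  // failure: refresh 'head' with acquire
    n = clean.n_;
    return true;
}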
This code is completely broken.
The only reason this appears to work is that current compilers aren't very aggressive with reordering across atomic operations and x86 processors have pretty strong guarantees.
The first problem is that without synchronization, there is no guarantee that the client of this data structure will even see the fields of the node object as initialized. The next issue is that without synchronization, the push operation can read arbitrarily old values for the head's tag.
We have developed a tool, CDSChecker, that simulates most behaviors that the memory model allows. It is open source and free. Run it on your data structure to see some interesting executions.
Proving anything about code that utilizes relaxed atomics is a big challenge at this point. Most proof methods break down because they are typically inductive in nature, and you don't have an order to induct on. So you get out-of-thin-air read issues...

Why does my lock-free message queue segfault :(?

As a purely mental exercise, I'm trying to get this to work without locks or mutexes. The idea is that when the consumer thread is reading/executing messages, it atomically swaps which std::vector the producer thread uses for writes. Is this possible? I've tried playing around with thread fences to no avail. There's a race condition here somewhere, because it occasionally segfaults. I imagine it's somewhere in the enqueue function. Any ideas?
// should execute functions on the original thread
class message_queue {
public:
    using fn = std::function<void()>;
    using queue = std::vector<fn>;

    message_queue() : write_index(0) {
    }

    // should only be called from consumer thread
    void run () {
        // atomically gets the current pending queue and switches it with the other one:
        // for example, if we're writing to queues[0], we grab a reference to queues[0]
        // and tell the producer to write to queues[1]
        queue& active = queues[write_index.fetch_xor(1)];
        // skip if we don't have any messages
        if (active.size() == 0) return;
        // run all messages/callbacks
        for (auto fn : active) {
            fn();
        }
        // clear the active queue so it can be re-used
        active.clear();
        // swap active and pending queues
        write_index.fetch_xor(1);
    }

    void enqueue (fn value) {
        // loads the current pending queue and appends some work
        queues[write_index.load()].push_back(value);
    }

private:
    queue queues[2];
    std::atomic<bool> is_empty; // unused for now
    std::atomic<int> write_index;
};
int main(int argc, const char* argv[])
{
    message_queue queue{};
    // flag to stop the message loop
    // doesn't actually need to be atomic because it's only read/written on the main thread
    std::atomic<bool> done(false);

    std::thread worker([&queue, &done] {
        int count = 100;
        // send 100 messages
        while (--count) {
            queue.enqueue([count] {
                // should be executed in the main thread
                std::cout << count << "\n";
            });
        }
        // finally tell the main thread we're done
        queue.enqueue([&] {
            std::cout << "done!\n";
            done = true;
        });
    });

    // run messages until the done flag is set
    while (!done) queue.run();
    worker.join();
}
If I understand your code correctly, there are data races, e.g.:
// producer
int r0 = write_index.load();       // r0 == 0

// consumer
int r1 = write_index.fetch_xor(1); // r1 == 0
queue& active = queues[r1];
active.size();

// producer
queues[r0].push_back(...);
Now both threads access the same queue at the same time. That's a data race, and that means undefined behaviour.
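For what it's worth, one well-known lock-free shape for this "run everything queued so far on the original thread" pattern (my sketch, not a minimal fix of the code above) is a singly-linked list that producers CAS onto and the consumer detaches wholesale with exchange(nullptr):

#include <atomic>
#include <functional>

class message_queue {
    struct node { std::function<void()> fn; node* next; };
    std::atomic<node*> head_{nullptr};
public:
    void enqueue(std::function<void()> f) {
        node* n = new node{ std::move(f), nullptr };
        n->next = head_.load(std::memory_order_relaxed);
        // On failure, the CAS refreshes n->next with the current head, so we just retry.
        while (!head_.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed))
            ;
    }
    void run() {
        // Detach the whole list; producers can keep pushing onto the fresh head.
        node* n = head_.exchange(nullptr, std::memory_order_acquire);
        node* rev = nullptr; // reverse so messages run in FIFO order
        while (n) { node* next = n->next; n->next = rev; rev = n; n = next; }
        while (rev) { rev->fn(); node* done = rev; rev = rev->next; delete done; }
    }
};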
Your lock-free queue fails to work because you did not start with at least a semi-formal proof of correctness and then turn that proof into an algorithm, with the proof being the primary text, comments connecting the proof to the code, and the code itself interspersed.
Unless you are copy/pasting someone else's implementation who did do that, any attempt to write a lock-free algorithm will fail. If you are copy-pasting someone else's implementation, please provide it.
Lock free algorithms are not robust unless you have such a proof that they are correct, because the kind of errors that make them fail are subtle, and extreme care must be taken. Simply "rolling" a lock free algorithm, even if it fails to result in apparent problems during testing, is a recipe for unreliable code.
One way to get around writing a formal proof in this kind of situation is to track down someone who has written proven correct pseudo code or the like. Sketch out the pseudo code, together with the proof of correctness, in comments. Then fill in the code in the holes.
In general, proving that an "almost correct" lock-free algorithm is flawed is harder than writing a solid proof that a lock-free algorithm is correct if implemented in a particular way, then implementing it. Now, if your algorithm is so flawed that it is easy to find the flaws, then you aren't showing a basic understanding of the problem domain.
In short, by posting "why is my algorithm wrong", you are approaching how to write lock free algorithms incorrectly. "Where is the flaw in my proof?", "I proved this pseudo-code correct here, and then I implemented it, why do my tests show deadlocks?" are good lock-free questions. "Here is a bunch of code with comments that merely describe what the next line of code does, and no comments describing why I do the next line of code, or how that line of code maintains my lock-free invariants" is not a good lock-free question.
Step back. Find some proven-correct algorithms. Learn how the proofs work. Implement some proven-correct algorithms via monkey-see monkey-do. Look at the footnotes to note the issues their proofs overlooked (like ABA issues). After you have a bunch of those under your belt, try a variation, and do the proof, and check the proof, and do the implementation, and check the implementation.

simple custom made mutex failing

Can you spot the error in the code? The ticket count ends up going below 0, causing long stalls.
struct SContext {
    volatile unsigned long* mutex;
    volatile long* ticket;
    volatile bool* done;
};

static unsigned int MyThreadFunc(SContext* ctxt) {
    // -- keep going until we signal for the thread to close
    while (*ctxt->done == false) {
        while (*ctxt->ticket) { // while we have tickets waiting
            unsigned int lockedaquired = 0;
            do {
                if (*ctxt->mutex == 0) { // only try if someone doesn't have the mutex locked
                    // -- if the compare and swap doesn't work then the function
                    // -- returns the value it expects
                    lockedaquired = InterlockedCompareExchange(ctxt->mutex, 1, 0);
                }
            } while (lockedaquired != 0); // loop while we didn't acquire the lock

            // -- enter critical section
            // -- grab a ticket
            if (*ctxt->ticket > 0);
                (*ctxt->ticket)--;
            // -- exit critical section
            *ctxt->mutex = 0; // release lock
        }
    }
    return 0;
}
The calling function, waiting for the threads to finish:
for (unsigned int loops = 0; loops < eLoopCount; ++loops) {
    *ctxt.ticket = eNumThreads; // let the threads start!
    // -- wait for threads to finish
    while (*ctxt.ticket != 0)
        ;
}
done = true;
EDIT:
The answer to this question is simple, and unfortunately, after I spent the time trimming down the example to post a simplified version, I found the answer immediately after posting the question. Sigh...
I initialize lockedaquired to 0. Then, as an optimization to avoid taking up bus bandwidth, I don't do the CAS if the mutex is taken.
Unfortunately, in the case where the lock is taken, that while loop lets a second thread through!
Sorry for the extra question. I thought I didn't understand Windows low-level synchronization primitives, but really I just had a simple mistake.
I see another race in your code: one thread can cause *ctxt.ticket to hit 0, allowing the parent loop to go back and re-set *ctxt.ticket = eNumThreads without holding *ctxt.mutex. Some other thread may already hold the mutex (in fact, it probably does) and operate on *ctxt.ticket. For your simplified example this only prevents "batches" from being cleanly separated, but if you had more complex initialization (as in, more complex than a single word write) at the top of the loops loop, you could see strange behavior.
I posted a bug where I thought it was a legitimate multithreaded problem, but really it was just bad logic. I solved the bug as soon as I posted. Here are the problem lines and the answer:
unsigned int lockedaquired = 0;
I initialized lockedaquired to 0, and then added an if statement to skip the expensive CAS operation. This optimization caused the code to fall out of the while loop and into the critical section even when the lock was taken. Changing the code to
unsigned int lockedaquired = 1;
Fixes the problem. There is another hidden problem in the code that I found as well (I really shouldn't code late at night anymore). Anyone notice the semicolon after the if statement in the critical section? Sigh...
if(*ctxt->ticket > 0);
(*ctxt->ticket)--;
That should be
if(*ctxt->ticket > 0)
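With both fixes applied, the inner loop looks like this:

while (*ctxt->ticket) { // while we have tickets waiting
    unsigned int lockedaquired = 1;
    do {
        if (*ctxt->mutex == 0) // only try if nobody holds the mutex
            lockedaquired = InterlockedCompareExchange(ctxt->mutex, 1, 0);
    } while (lockedaquired != 0); // spin until the CAS actually succeeds

    // -- critical section
    if (*ctxt->ticket > 0)
        (*ctxt->ticket)--;
    *ctxt->mutex = 0; // release lock
}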
Also, Ben Jackson pointed out that a thread probably will be inside the critical section when we reset the ticket to eNumThreads. While this is perfectly fine in this sample code, if you were to apply it to a problem where you needed to do more operations it might not be safe, because the threads aren't running in lockstep; keep that in mind if you apply this to your code.
A final note: if anyone does decide to use this code for their own implementation of a mutex, please remember that your main driver thread is spinning idle. If you are doing a large operation in the critical section that takes a good deal of time and your ticket count is high, consider yielding your thread to let other software make use of the CPU while it's waiting. Also, consider using a spin lock if the critical section is large.
Thank you