SPSC lock free queue without atomics - c++

I have below a SPSC queue for my logger.
It is certainly not a general-use SPSC lock-free queue.
However, given a bunch of assumptions about how it will be used, the target architecture, etc., and a number of acceptable tradeoffs, which I go into in detail below, my question is basically: is it safe / does it work?
It will only be used on x86_64 architecture, so writes to uint16_t will be atomic.
Only the producer updates the tail.
Only the consumer updates the head.
If the producer reads an old value of head, it will look like there is less space in the queue than there really is, which is an acceptable limitation in the context in which it is used.
If the consumer reads an old value of tail, it will look like there is less data waiting in the queue than reality, again an acceptable limitation.
The limitations above are acceptable because:
the consumer may not get the latest tail immediately, but eventually the latest tail will arrive, and queued data will be logged.
the producer may not get the latest head immediately, so the queue will look more full than it really is. In our load testing we have found that, given the amount we log vs the size of the queue, and the speed at which the logger drains the queue, this limitation has no effect - there is always space in the queue.
A final point: the use of volatile is necessary to prevent the variable that each thread only reads from being optimised out.
My questions:
Is this logic correct?
Is the queue thread safe?
Is volatile sufficient?
Is volatile necessary?
My queue:
class LogBuffer
{
public:
    bool is_empty() const { return head_ == tail_; }
    bool is_full() const { return uint16_t(tail_ + 1) == head_; }

    LogLine& head() { return log_buffer_[head_]; }
    LogLine& tail() { return log_buffer_[tail_]; }

    void advance_head() { ++head_; }
    void advance_tail() { ++tail_; }

private:
    volatile uint16_t tail_ = 0;        // write position
    LogLine log_buffer_[0xffff + 1];    // relies on the uint16_t overflowing
    volatile uint16_t head_ = 0;        // read position
};

Is this logic correct?
Yes.
Is the queue thread safe?
No.
Is volatile sufficient? Is volatile necessary?
No, to both. Volatile is not a magic keyword that makes any variable thread-safe. You still need to use atomic variables or memory barriers for the indexes to ensure memory ordering is correct when you produce or consume an item.
To be more specific, after you produce or consume an item for your queue you need to issue a memory barrier to guarantee that other threads will see the changes. Many atomic libraries will do this for you when you update an atomic variable.
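For illustration, here is a minimal sketch (not the original code, and untested) of what the same queue might look like with the indexes switched to std::atomic, assuming the producer only ever calls tail()/advance_tail() and the consumer only ever calls head()/advance_head(); the release stores publish the LogLine that was just written or consumed, and the acquire loads observe it:

#include <atomic>
#include <cstdint>

class LogBuffer
{
public:
    // Called by the consumer: acquire-load the producer-owned tail_.
    bool is_empty() const
    {
        return head_.load(std::memory_order_relaxed) ==
               tail_.load(std::memory_order_acquire);
    }
    // Called by the producer: acquire-load the consumer-owned head_.
    bool is_full() const
    {
        return uint16_t(tail_.load(std::memory_order_relaxed) + 1) ==
               head_.load(std::memory_order_acquire);
    }

    LogLine& head() { return log_buffer_[head_.load(std::memory_order_relaxed)]; }
    LogLine& tail() { return log_buffer_[tail_.load(std::memory_order_relaxed)]; }

    // Consumer: release the slot only after it has finished reading the LogLine.
    void advance_head()
    {
        head_.store(uint16_t(head_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }
    // Producer: publish the slot only after it has finished writing the LogLine.
    void advance_tail()
    {
        tail_.store(uint16_t(tail_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }

private:
    std::atomic<uint16_t> tail_{0};         // write position, owned by the producer
    LogLine log_buffer_[0xffff + 1];        // relies on the uint16_t overflowing
    std::atomic<uint16_t> head_{0};         // read position, owned by the consumer
};

On x86_64 the acquire loads and release stores compile to plain loads and stores, so this costs essentially nothing extra at run time while also stopping the compiler from reordering or caching the index accesses.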
As an aside, use "was_empty" instead of "is_empty" to be clear about what it does. The result of this call is one instance in time which may have changed by the time you act on its value.

Related

lock-free "closable" MPSC queue

Multiple producers, single consumer scenario, except consumption happens once, and after that the queue is "closed" and no more work is allowed. I have an MPSC queue, so I tried to add a lock-free algorithm to "close" the queue. I believe it's correct and it passes my tests. The problem is that when I try to optimise the memory order it stops working (I think work is lost, e.g. enqueued after the queue is closed) - even on x64, which has a "kind of" strong memory model, and even with a single producer.
My attempt to fine-tune memory order is commented out:
// thread-safe for multi producers single consumer use
// linked-list based, and so it's growable
MPSC_queue work_queue;
std::atomic<bool> closed{ false };
std::atomic<int32_t> producers_num{ 0 };

bool produce(Work&& work)
{
    bool res = false;
    ++producers_num;
    // producers_num.fetch_add(1, std::memory_order_release);
    if (!closed)
    // if (!closed.load(std::memory_order_acquire))
    {
        work_queue.push(std::move(work));
        res = true;
    }
    --producers_num;
    // producers_num.fetch_sub(1, std::memory_order_release);
    return res;
}

void consume()
{
    closed = true;
    // closed.store(true, std::memory_order_release);
    while (producers_num != 0)
    // while (producers_num.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();

    Work work;
    while (work_queue.pop(work))
        process(work);
}
I also tried std::memory_order_acq_rel for read-modify-write ops on producers_num, doesn't work either.
A bonus question:
This algorithm is used with MPSC queue, which already does some synchronisation inside. It would be nice to combine them for better performance. Do you know any such algorithm for "closable" MPSC queue?
I think closed = true; does need to be seq_cst to make sure it's visible to other threads before you check producers_num the first time. Otherwise this ordering is possible:
producer: ++producers_num;
consumer: producers_num == 0
producer: if (!closed) finds it still open
consumer: closed.store(true, release) becomes globally visible.
consumer: work_queue.pop(work) finds the queue empty
producer: work_queue.push(std::move(work)); adds work to the queue after consumer has stopped looking.
You can still avoid seq_cst if you have the consumer check producers_num == 0 before returning, like
while (producers_num != 0)
// while (producers_num.load(std::memory_order_acquire) != 0)
    std::this_thread::yield();

do {
    Work work;
    while (work_queue.pop(work))
        process(work);
} while (producers_num.load(std::memory_order_acquire) != 0);
// safe if pop included a full barrier, I think
I'm not 100% sure I have this right, but I think checking producers_num after a full barrier is sufficient.
However, the producer side does need ++producers_num; to be at least acq_rel, otherwise it can reorder past if (!closed). (An acquire fence after it, before if(!closed) might also work).
Since you only want to use the queue once, it doesn't need to wrap around and can probably be quite a lot simpler. Like an atomic producer-position counter that writers increment to claim a spot, and if they get a position > size then the queue was full. I haven't thought through the full details, though.
That might allow a cleaner solution to the above problem, perhaps by having the consumer look at that write index to see if there were any producers.
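To make that single-use idea concrete, here is a rough sketch of my own (every name and the fixed capacity are assumptions, and it is untested): producers claim slots with one fetch_add, per-slot flags publish the data, and the consumer "closes" the queue simply by pushing the claim counter past the capacity. The per-slot ready flag plays the role that producers_num plays in the question: the consumer waits out any producer that claimed a slot before the close.

#include <array>
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>

// Sketch only: fixed-capacity, single-use MPSC queue. T must be default-constructible.
template <typename T, std::size_t Capacity>
class single_use_mpsc
{
public:
    // Returns false if the queue is full or has been closed.
    bool produce(T&& value)
    {
        std::size_t slot = write_pos_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= Capacity)
            return false;                                     // full or closed
        slots_[slot] = std::move(value);
        ready_[slot].store(true, std::memory_order_release);  // publish the slot
        return true;
    }

    // Called exactly once, by the single consumer.
    template <typename Fn>
    void consume(Fn&& process)
    {
        // Close: jump the counter to Capacity so every later produce() fails,
        // and learn how many slots were claimed before the close.
        std::size_t claimed = write_pos_.exchange(Capacity, std::memory_order_acq_rel);
        std::size_t n = claimed < Capacity ? claimed : Capacity;
        for (std::size_t i = 0; i < n; ++i)
        {
            // Wait for the producer that claimed slot i to finish writing it.
            while (!ready_[i].load(std::memory_order_acquire))
                std::this_thread::yield();
            process(std::move(slots_[i]));
        }
    }

private:
    std::atomic<std::size_t>                write_pos_{0};
    std::array<std::atomic<bool>, Capacity> ready_{};
    std::array<T, Capacity>                 slots_{};
};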

Using a mutex to block execution from outside the critical section

I'm not sure I got the terminology right but here goes - I have this function that is used by multiple threads to write data (using pseudo code in comments to illustrate what I want)
// these are initialized in the constructor
int* data;
std::atomic<size_t> size;

void write(int value) {
    // wait here while "read_lock"
    // set "write_lock" to "write_lock" + 1
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = value;
    // set "write_lock" to "write_lock" - 1
}
the order of the writes is not important; all I need here is for each write to go to a unique slot
Every once in a while though, I need one thread to read the data using this function
int* read() {
    // set "read_lock" to true
    // wait here while "write_lock"
    int* ret = data;
    data = new int[capacity];
    size = 0;
    // set "read_lock" to false
    return ret;
}
so it basically swaps out the buffer and returns the old one (I've removed capacity logic to make the snippets shorter)
In theory this should lead to 2 operating scenarios:
1 - just a bunch of threads writing into the container
2 - when some thread executes the read function, all new writers will have to wait; the reader will wait until all existing writes are finished, then do the read logic, and scenario 1 can continue.
The question part is that I don't know what kind of a barrier to use for the locks -
A spinlock would be wasteful, since there are many containers like this and they all need CPU cycles
I don't know how to apply std::mutex since I only want the write function to be in a critical section if the read function is triggered. Wrapping the whole write function in a mutex would cause unnecessary slowdown for operating scenario 1.
So what would be the optimal solution here?
If you have C++14 capability then you can use a std::shared_timed_mutex to separate out readers and writers. In this scenario it seems you need to give your writer threads shared access (allowing other writer threads at the same time) and your reader threads unique access (kicking all other threads out).
So something like this may be what you need:
class MyClass
{
public:
    using mutex_type  = std::shared_timed_mutex;
    using shared_lock = std::shared_lock<mutex_type>;
    using unique_lock = std::unique_lock<mutex_type>;

private:
    mutable mutex_type mtx;

public:
    // All updater threads can operate at the same time
    auto lock_for_updates() const
    {
        return shared_lock(mtx);
    }

    // Reader threads need to kick all the updater threads out
    auto lock_for_reading() const
    {
        return unique_lock(mtx);
    }
};

// many threads can call this
void do_writing_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_updates();
    // update the data here
}

// access the data from one thread only
void do_reading_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_reading();
    // read the data here
}
The shared_locks allow other threads to gain a shared_lock at the same time but prevent a unique_lock gaining simultaneous access. When a reader thread tries to gain a unique_lock all shared_locks will be vacated before the unique_lock gets exclusive control.
You can also do this with regular mutexes and condition variables rather than shared. Supposedly shared_mutex has higher overhead, so I'm not sure which will be faster. With Gallik's solution you'd presumably be paying to lock the shared mutex on every write call; I got the impression from your post that write gets called way more than read so maybe this is undesirable.
int* data; // initialized somewhere
std::atomic<size_t> size = 0;
std::atomic<bool> reading = false;
std::atomic<int> num_writers = 0;
std::mutex entering;
std::mutex leaving;
std::condition_variable cv;

void write(int x) {
    ++num_writers;
    if (reading) {
        --num_writers;
        if (num_writers == 0)
        {
            std::lock_guard l(leaving);
            cv.notify_one();
        }
        { std::lock_guard l(entering); }  // block here until the reader is done
        ++num_writers;
    }
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = x;
    --num_writers;
    if (reading && num_writers == 0)
    {
        std::lock_guard l(leaving);
        cv.notify_one();
    }
}

int* read() {
    int* other_data = new int[capacity];
    {
        std::unique_lock enter_lock(entering);
        reading = true;
        std::unique_lock leave_lock(leaving);
        cv.wait(leave_lock, [] () { return num_writers == 0; });
        std::swap(data, other_data);
        size = 0;
        reading = false;
    }
    return other_data;
}
It's a bit complicated and took me some time to work out, but I think this should serve the purpose pretty well.
In the common case where only writing is happening, reading is always false. So you do the usual, and pay for two additional atomic increments and two untaken branches. So the common path does not need to lock any mutexes, unlike the solution involving a shared mutex, which is supposedly expensive: http://permalink.gmane.org/gmane.comp.lib.boost.devel/211180.
Now, suppose read is called. The expensive, slow heap allocation happens first; meanwhile, writing continues uninterrupted. Next, the entering lock is acquired, which has no immediate effect. Now, reading is set to true. Immediately, any new calls to write enter the first branch, and eventually hit the entering lock which they are unable to acquire (as it's already taken), and those threads then get put to sleep.
Meanwhile, the read thread is now waiting on the condition that the number of writers is 0. If we're lucky, this could actually go through right away. If however there are threads in write in either of the two locations between incrementing and decrementing num_writers, then it will not. Each time a write thread decrements num_writers, it checks whether it has reduced that number to zero, and when it does it signals the condition variable. Because num_writers is atomic, which prevents various reordering shenanigans, it is guaranteed that the last thread will see num_writers == 0; it could also be notified more than once, but this is ok and cannot result in bad behavior.
Once that condition variable has been signalled, that shows that all writers are either trapped in the first branch or are done modifying the array. So the read thread can now safely swap the data, and then unlock everything, and then return what it needs to.
As mentioned before, in typical operation there are no locks, just increments and untaken branches. Even when a read does occur, the read thread will have one lock and one condition variable wait, whereas a typical write thread will have about one lock/unlock of a mutex and that's all (one, or a small number of write threads, will also perform a condition variable notification).
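For completeness, a usage sketch under the assumption that `capacity` (which the question deliberately left out) is large enough for every write issued between two reads:

#include <thread>
#include <vector>

int main() {
    data = new int[capacity];            // the globals declared above

    std::vector<std::thread> writers;
    for (int t = 0; t < 4; ++t)
        writers.emplace_back([t] {
            for (int i = 0; i < 1000; ++i)
                write(t * 1000 + i);     // many concurrent writers
        });

    delete[] read();                     // occasional reader: swaps the buffer out
                                         // while the writers keep going
    for (auto& w : writers)
        w.join();

    delete[] read();                     // collect what was written after the swap
    delete[] data;                       // the fresh buffer left by the last read
}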

How should boost::lockfree::spsc_queue's read_available and write_available be used?

Boost documentation for spsc_queue says:
read_available(): Thread-safe and wait-free, should only be called from the producer thread
write_available(): Thread-safe and wait-free, should only be called from the consumer thread
I would expect the most common use case to be just the other way around: producer thread (thread writing data to the queue) would need write_available(), and consumer thread (thread reading data from the queue) would need read_available().
If I need to know how much I can write to the queue in the producer thread, should I use QUEUE_CAPACITY - read_available()?
Any kind of size assessment is going to be a race in the lockfree world.
Simple reason being that the size might be changed on other threads before you act on the "measured size".
Single-Producer/Single-Consumer is special in the sense that the consumer knows nobody else can read from the queue (so "read_available" will never decrease unless the consumer reads from it itself). Similarly for the producer side.
It is clear that write_available is what you need. Sure, it could be more by the time you actually write, but you can't get more accurate. At least it will never be less (after all there is only 1 producer thread).
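For instance, the producer side might look like the following sketch (`queue` and `batch` are my own placeholder names):

// Producer thread only. From the producer's point of view write_available()
// can only grow (only the consumer frees slots), so it is a safe lower bound.
std::size_t can_write = queue.write_available();
std::size_t to_push   = std::min(can_write, batch.size());
for (std::size_t i = 0; i < to_push; ++i)
    queue.push(batch[i]);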
Note that the documentation appears to be in error (swapping the allowed threads).
This made me double-check, and sure enough they use the functions internally in the expected ways (contradicting the erroneous documentation claim):
ConstIterator push(ConstIterator begin, ConstIterator end, T * internal_buffer, size_t max_size)
{
    const size_t write_index = write_index_.load(memory_order_relaxed); // only written from push thread
    const size_t read_index  = read_index_.load(memory_order_acquire);
    const size_t avail = write_available(write_index, read_index, max_size);
I heartily suggest using the range-push shown here to automatically push the exact number of items possible. E.g.:
auto b = messages.begin(), e = messages.end();
do {
    b = a.push(b, e);
} while (b != e);

Lock-free stack - Is this a correct usage of c++11 relaxed atomics? Can it be proven?

I've written a container for a very simple piece of data that needs to be synchronized across threads. I want the top performance. I don't want to use locks.
I want to use "relaxed" atomics. Partly for that little bit of extra oomph, and partly to really understand them.
I've been working on this a lot, and I'm at the point where this code passes all tests I throw at it. That's not quite "proof" though, and so I'm wondering if there's anything I'm missing, or any other ways I can test this?
Here's my premise:
It is only important that a Node be properly pushed and popped, and that the Stack can never be invalidated.
I believe that the order of operations in memory is only important in one place:
Between the compare_exchange operations themselves. This is guaranteed, even with relaxed atomics.
The "ABA" problem is solved by adding identification numbers to the pointers. On 32 bit systems, this requires a double-word compare_exchange, and on 64 bit systems the unused 16 bits of the pointer are filled with id numbers.
Therefore: the stack will always be in a valid state. (right?)
Here's what I'm thinking. "Normally", the way we reason about code that we're reading is to look at the order in which it's written. Memory can be read or written to "out of order", but not in a way that invalidates the correctness of the program.
That changes in a multi-threaded environment. That's what memory fences are for - so that we can still look at the code and be able to reason about how it's going to work.
So if everything can go all out-of-order here, what am I doing with relaxed atomics? Isn't that a bit too far?
I don't think so, but that's why I'm here asking for help.
The compare_exchange operations themselves give a guarantee of sequential consistency with each other.
The only other time there is a read from or write to an atomic is to get the head's initial value before a compare_exchange. It is set as part of the initialization of a variable. As far as I can tell, it would be irrelevant whether or not this operation brings back a "proper" value.
Current code:
struct node
{
    node *n_;

#if PROCESSOR_BITS == 64
    inline constexpr node() : n_{ nullptr } { }
    inline constexpr node(node* n) : n_{ n } { }
    inline void tag(const stack_tag_t t) { reinterpret_cast<stack_tag_t*>(this)[3] = t; }
    inline stack_tag_t read_tag() { return reinterpret_cast<stack_tag_t*>(this)[3]; }
    inline void clear_pointer() { tag(0); }
#elif PROCESSOR_BITS == 32
    stack_tag_t t_;
    inline constexpr node() : n_{ nullptr }, t_{ 0 } { }
    inline constexpr node(node* n) : n_{ n }, t_{ 0 } { }
    inline void tag(const stack_tag_t t) { t_ = t; }
    inline stack_tag_t read_tag() { return t_; }
    inline void clear_pointer() { }
#endif

    inline void set(node* n, const stack_tag_t t) { n_ = n; tag(t); }
};

using std::memory_order_relaxed;

class stack
{
public:
    constexpr stack() : head_{} { }

    void push(node* n)
    {
        node next{n}, head{head_.load(memory_order_relaxed)};
        do
        {
            n->n_ = head.n_;
            next.tag(head.read_tag() + 1);
        } while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));
    }

    bool pop(node*& n)
    {
        node clean, next, head{head_.load(memory_order_relaxed)};
        do
        {
            clean.set(head.n_, 0);
            if (!clean.n_)
                return false;
            next.set(clean.n_->n_, head.read_tag() + 1);
        } while (!head_.compare_exchange_weak(head, next, memory_order_relaxed, memory_order_relaxed));
        n = clean.n_;
        return true;
    }

protected:
    std::atomic<node> head_;
};
What's different about this question compared to others? Relaxed atomics. They make a big difference to the question.
So, what do you think? Is there anything I'm missing?
push is broken, since you do not update node->_next after a compareAndSwap failure. It's possible that the node you originally stored with node->setNext has been popped from the top of stack by another thread when the next compareAndSwap attempt succeeds. As a result, some thread thinks it has popped a node from the stack but this thread has put it back in the stack. It should be:
void push(Node* node) noexcept
{
    Node* n = _head.next();
    do {
        node->setNext(n);
    } while (!_head.compareAndSwap(n, node));
}
Also, since next and setNext use memory_order_relaxed, there's no guarantee that _head.next() here is returning the node most recently pushed. It's possible to leak nodes from the top of the stack. The same problem obviously exists in pop as well: _head.next() may return a node that was previously, but is no longer, at the top of the stack. If the returned value is nullptr, you may fail to pop when the stack is not actually empty.
pop can also have undefined behavior if two threads try to pop the last node from the stack at the same time. They both see the same value for _head.next(); one thread successfully completes the pop. The other thread enters the while loop - since the observed node pointer is not nullptr - but the compareAndSwap loop soon updates it to nullptr since the stack is now empty. On the next iteration of the loop, that nullptr is dereferenced to get its _next pointer and much hilarity ensues.
pop is also clearly suffering from ABA. Two threads can see the same node at the top of the stack. Say one thread gets to the point of evaluating the _next pointer and then blocks. The other thread successfully pops the node, pushes 5 new nodes, and then pushes that original node again all before the other thread wakes. That other thread's compareAndSwap will succeed - the top-of-stack node is the same - but store the old _next value into _head instead of the new one. The five nodes pushed by the other thread are all leaked. This would be the case with memory_order_seq_cst as well.
Leaving to one side the difficulty of implementing the pop operation, I think memory_order_relaxed is inadequate. Before pushing the node, one assumes that some value(s) will be written into it, which will be read when the node is popped. You need some synchronization mechanism to ensure that the values have actually been written before they are read. memory_order_relaxed is not providing that synchronization... memory_order_acquire/memory_order_release would.
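As a hedged sketch of what that looks like on the stack from the question (same structure, only the memory orders changed; this addresses the publication problem only, not the other issues raised above):

void push(node* n)
{
    node next{n}, head{head_.load(std::memory_order_relaxed)};
    do
    {
        n->n_ = head.n_;
        next.tag(head.read_tag() + 1);
        // Release on success: everything written into *n before the push
        // becomes visible to the thread that later pops it.
    } while (!head_.compare_exchange_weak(head, next,
                                          std::memory_order_release,
                                          std::memory_order_relaxed));
}

bool pop(node*& n)
{
    // Acquire so the node we walk through was fully published by its push.
    node clean, next, head{head_.load(std::memory_order_acquire)};
    do
    {
        clean.set(head.n_, 0);
        if (!clean.n_)
            return false;
        next.set(clean.n_->n_, head.read_tag() + 1);
    } while (!head_.compare_exchange_weak(head, next,
                                          std::memory_order_acquire,
                                          std::memory_order_acquire));
    n = clean.n_;
    return true;
}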
This code is completely broken.
The only reason this appears to work is that current compilers aren't very aggressive with reordering across atomic operations and x86 processors have pretty strong guarantees.
The first problem is that without synchronization, there is no guarantee that the client of this data structure will even see the fields of the node object to be initialized. The next issue is that without synchronization, the push operation can read arbitrarily old values for the head's tag.
We have developed a tool, CDSChecker, that simulates most behaviors that the memory model allows. It is open source and free. Run it on your data structure to see some interesting executions.
Proving anything about code that utilizes relaxed atomics is a big challenge at this point. Most proof methods break down because they are typically inductive in nature, and you don't have an order to induct on. So you get out-of-thin-air read issues...

How to make thread synchronization without using mutex, semaphore, spinlock and futex?

This is an interview question; the interview has been done.
How to make thread synchronization without using mutex, semaphore, spinlock and futex?
Given 5 threads, how to make 4 of them wait at the same point for a signal from the remaining (fifth) thread?
That is, when threads 1-4 reach a certain point in their thread functions, they stop and wait for thread 5 to send a signal; until it does, they must not proceed.
My idea:
Use a global bool variable as a flag; as long as thread 5 has not set it to true, all the other threads wait at one point and each sets its own flag variable to true. Once thread 5 finds that all the threads' flag variables are true, it sets its own flag to true.
It is a busy-wait.
Any better ideas ?
Thanks
the pseudo code:
bool globalflag = false;
bool a[10] = { false };

int main()
{
    for (int i = 0; i < 10; i++)
        pthread_create(threadfunc, i);

    while (1)
    {
        bool b = true;
        for (int i = 0; i < 10; i++)
        {
            b = a[i] & b;
        }
        if (b) break;
    }
    globalflag = true;   // release the waiting threads
}

void threadfunc(i)
{
    a[i] = true;
    while (!globalflag);
}
Start with an empty linked list of waiting threads. The head should be set to 0.
Use CAS, compare-and-swap, to insert a thread at the head of the list of waiters. If the head == -1, then do not insert or wait. You can safely use CAS to insert items at the head of a linked list if you do it right.
After being inserted, the waiting thread should wait on SIGUSR1. Use sigwait() to do this.
When ready, the signaling thread uses CAS to set the head of the wait list to -1. This prevents any more threads from adding themselves to the wait list. Then the signaling thread iterates the threads in the wait list and calls pthread_kill(&thread, SIGUSR1) to wake up each waiting thread.
If SIGUSR1 is sent before a call to sigwait, sigwait will return immediately. Thus, there will not be a race between adding a thread to the wait list and calling sigwait.
EDIT:
Why is CAS faster than a mutex? Layman's answer (I'm a layman). It's faster for some things in some situations, because it has lower overhead when there is NO race. So if you can reduce your concurrent problem down to needing to change 8-16-32-64-128 bits of contiguous memory, and a race is not going to happen very often, CAS wins. CAS is basically a slightly more fancy/expensive mov instruction right where you were going to do a regular "mov" anyway. It's a "lock cmpxchg" or something like that.
A mutex, on the other hand, is a whole bunch of extra stuff that gets other cache lines dirty and uses more memory barriers, etc. Although CAS acts as a memory barrier on x86, x64, etc. Then of course you have to unlock the mutex, which is probably about the same amount of extra stuff.
Here is how you add an item to a linked list using CAS:
while (1)
{
    pOldHead = pHead;                     // snapshot of the world. Start of the race.
    pItem->pNext = pHead;
    if (CAS(&pHead, pOldHead, pItem))     // end of the race if pHead still is pOldHead
        break;                            // success
}
So how often do you think your code is going to have multiple threads at that CAS line at the exact same time? In reality....not very often. We did tests that just looped adding millions of items with multiple threads at the same time and it happens way less than 1% of the time. In a real program, it might never happen.
Obviously if there is a race you have to go back and do that loop again, but in the case of a linked list, what does that cost you?
The downside is that you can't do very complex things to that linked list if you are going to use that method to add items to the head. Try implementing a double linked list. What a pain.
EDIT:
In the code above I use a macro CAS. If you are using Linux, CAS = a macro using __sync_bool_compare_and_swap; see the GCC atomic builtins. If you are using Windows, CAS = a macro using something like InterlockedCompareExchange. Here is what the inline functions on Windows might look like:
inline bool CAS(volatile WORD* p, const WORD nOld, const WORD nNew) {
    return InterlockedCompareExchange16((short*)p, nNew, nOld) == nOld;
}
inline bool CAS(volatile DWORD* p, const DWORD nOld, const DWORD nNew) {
    return InterlockedCompareExchange((long*)p, nNew, nOld) == nOld;
}
inline bool CAS(volatile QWORD* p, const QWORD nOld, const QWORD nNew) {
    return InterlockedCompareExchange64((LONGLONG*)p, nNew, nOld) == nOld;
}
inline bool CAS(void* volatile* p, const void* pOld, const void* pNew) {
    return InterlockedCompareExchangePointer(p, (PVOID)pNew, (PVOID)pOld) == pOld;
}
Choose a signal to use, say SIGUSR1.
Use pthread_sigmask to block SIGUSR1.
Create the threads (they inherit the signal mask, hence the blocking above must be done first!).
Threads 1-4 call sigwait, blocking until SIGUSR1 is received.
Thread 5 calls kill() or pthread_kill 4 times with SIGUSR1. Since POSIX specifies that signals will be delivered to a thread which is not blocking the signal, it will be delivered to one of the threads waiting in sigwait(). There is thus no need to keep track of which threads have already received the signal and which haven't, with associated synchronization.
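A hedged sketch of those steps (the thread count, the sleep standing in for thread 5's decision, and the helper names are mine; here the main thread plays the role of thread 5):

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static sigset_t usr1_set;

void* worker(void*)                      // threads 1-4
{
    // ... work up to the synchronization point ...
    int sig = 0;
    sigwait(&usr1_set, &sig);            // blocks until SIGUSR1 arrives
    // ... continue past the synchronization point ...
    return nullptr;
}

int main()
{
    sigemptyset(&usr1_set);
    sigaddset(&usr1_set, SIGUSR1);
    // Block SIGUSR1 before creating the workers so they inherit the mask.
    pthread_sigmask(SIG_BLOCK, &usr1_set, nullptr);

    pthread_t workers[4];
    for (auto& t : workers)
        pthread_create(&t, nullptr, worker, nullptr);

    sleep(1);                            // stand-in for thread 5 deciding it is
                                         // time to release the others
    for (auto& t : workers)
        pthread_kill(t, SIGUSR1);        // one signal per waiter

    for (auto& t : workers)
        pthread_join(t, nullptr);
}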
You can do this using SSE3's MONITOR and MWAIT instructions, available via the _mm_mwait and _mm_monitor intrinsics; Intel has an article on them.
(There is also a patent on using memory-monitor-wait for lock contention that may be of interest.)
I think you are looking for Peterson's algorithm or Dekker's algorithm.
They synchronize threads based only on shared memory.
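For reference, a minimal sketch of Peterson's algorithm for two threads: mutual exclusion built from nothing but shared memory (std::atomic with the default seq_cst ordering stands in for the strict ordering the classic algorithm assumes):

#include <atomic>

// Peterson's algorithm: mutual exclusion for exactly two threads (ids 0 and 1).
std::atomic<bool> wants[2] = {{false}, {false}};
std::atomic<int>  turn{0};

void lock(int me)
{
    int other = 1 - me;
    wants[me].store(true);               // announce interest
    turn.store(other);                   // politely give the other thread priority
    while (wants[other].load() && turn.load() == other)
        ;                                // spin until it is our turn or the other leaves
}

void unlock(int me)
{
    wants[me].store(false);
}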