Problem with thread-safe queue? - c++

I'm trying to write a thread-safe queue using pthreads in C++. My program works 93% of the time. The other 7% of the time it either spits out garbage, or it seems to hang. I'm wondering if there is some flaw in my queue where a context switch would break it?
// thread-safe queue
// inspired by http://msmvps.com/blogs/vandooren/archive/2007/01/05/creating-a-thread-safe-producer-consumer-queue-in-c-without-using-locks.aspx
// only works with one producer and one consumer
#include <pthread.h>
#include <exception>
template<class T>
class tsqueue
{
private:
volatile int m_ReadIndex, m_WriteIndex;
volatile T *m_Data;
volatile bool m_Done;
const int m_Size;
pthread_mutex_t m_ReadMutex, m_WriteMutex;
pthread_cond_t m_ReadCond, m_WriteCond;
public:
tsqueue(const int &size);
~tsqueue();
void push(const T &elem);
T pop();
void terminate();
bool isDone() const;
};
template <class T>
tsqueue<T>::tsqueue(const int &size) : m_ReadIndex(0), m_WriteIndex(0), m_Size(size), m_Done(false) {
m_Data = new T[size];
pthread_mutex_init(&m_ReadMutex, NULL);
pthread_mutex_init(&m_WriteMutex, NULL);
pthread_cond_init(&m_WriteCond, NULL);
pthread_cond_init(&m_WriteCond, NULL);
}
template <class T>
tsqueue<T>::~tsqueue() {
delete[] m_Data;
pthread_mutex_destroy(&m_ReadMutex);
pthread_mutex_destroy(&m_WriteMutex);
pthread_cond_destroy(&m_ReadCond);
pthread_cond_destroy(&m_WriteCond);
}
template <class T>
void tsqueue<T>::push(const T &elem) {
int next = (m_WriteIndex + 1) % m_Size;
if(next == m_ReadIndex) {
pthread_mutex_lock(&m_WriteMutex);
pthread_cond_wait(&m_WriteCond, &m_WriteMutex);
pthread_mutex_unlock(&m_WriteMutex);
}
m_Data[m_WriteIndex] = elem;
m_WriteIndex = next;
pthread_cond_signal(&m_ReadCond);
}
template <class T>
T tsqueue<T>::pop() {
if(m_ReadIndex == m_WriteIndex) {
pthread_mutex_lock(&m_ReadMutex);
pthread_cond_wait(&m_ReadCond, &m_ReadMutex);
pthread_mutex_unlock(&m_ReadMutex);
if(m_Done && m_ReadIndex == m_WriteIndex) throw "queue empty and terminated";
}
int next = (m_ReadIndex +1) % m_Size;
T elem = m_Data[m_ReadIndex];
m_ReadIndex = next;
pthread_cond_signal(&m_WriteCond);
return elem;
}
template <class T>
void tsqueue<T>::terminate() {
m_Done = true;
pthread_cond_signal(&m_ReadCond);
}
template <class T>
bool tsqueue<T>::isDone() const {
return (m_Done && m_ReadIndex == m_WriteIndex);
}
This could be used like this:
// thread 1
while(cin.get(c)) {
queue1.push(c);
}
queue1.terminate();
// thread 2
while(!queue1.isDone()) {
try{ c = queue1.pop(); }
catch(char const* str){break;}
cout.put(c);
}
If anyone sees a problem with this, please say so :)

Yes, there are definitely problems here. All your accesses to queue member variables occur outside the mutexes. In fact, I'm not entirely sure what your mutexes are protecting, since they are just around a wait on a condition variable.
Also, it appears that your reader and writer will always operate in lock-step, never allowing the queue to grow beyond one element in size.

If this is your actual code, one problem right off the bat is that you're initializing m_WriteCond twice, and not initializing m_ReadCond at all.

You should treat this class as a monitor. You should have a "monitor lock" for each queue (a normal mutex). Whenever you enter a method that reads or writes any field in the queue, you should lock this mutex as soon as you enter it. This prevents more than one thread from interacting with the queue at a time. You should release the lock before you wait on a condition and when you leave a method so other threads may enter. Make sure to re-acquire the lock when you are done waiting on a condition.
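For illustration, here is a minimal sketch of that monitor approach applied to a bounded single-producer/single-consumer buffer like the one in the question. It is not a drop-in replacement (it omits terminate()/isDone(), and the names are illustrative); the point is simply that one mutex guards every field and pthread_cond_wait() handles the release/re-acquire for you:
// Minimal monitor-style sketch (illustrative names; one lock guards every field,
// pthread_cond_wait() releases and re-acquires that lock around each wait).
#include <pthread.h>

template <class T>
class monitor_queue
{
    T *m_Data;
    int m_ReadIndex, m_WriteIndex;
    const int m_Size;
    pthread_mutex_t m_Mutex;
    pthread_cond_t m_NotFull, m_NotEmpty;
public:
    explicit monitor_queue(int size)
        : m_Data(new T[size]), m_ReadIndex(0), m_WriteIndex(0), m_Size(size)
    {
        pthread_mutex_init(&m_Mutex, NULL);
        pthread_cond_init(&m_NotFull, NULL);
        pthread_cond_init(&m_NotEmpty, NULL);
    }
    ~monitor_queue()
    {
        delete[] m_Data;
        pthread_mutex_destroy(&m_Mutex);
        pthread_cond_destroy(&m_NotFull);
        pthread_cond_destroy(&m_NotEmpty);
    }
    void push(const T &elem)
    {
        pthread_mutex_lock(&m_Mutex);                       // enter the monitor
        while ((m_WriteIndex + 1) % m_Size == m_ReadIndex)  // full: wait and re-check
            pthread_cond_wait(&m_NotFull, &m_Mutex);
        m_Data[m_WriteIndex] = elem;
        m_WriteIndex = (m_WriteIndex + 1) % m_Size;
        pthread_cond_signal(&m_NotEmpty);
        pthread_mutex_unlock(&m_Mutex);                     // leave the monitor
    }
    T pop()
    {
        pthread_mutex_lock(&m_Mutex);
        while (m_ReadIndex == m_WriteIndex)                 // empty: wait and re-check
            pthread_cond_wait(&m_NotEmpty, &m_Mutex);
        T elem = m_Data[m_ReadIndex];
        m_ReadIndex = (m_ReadIndex + 1) % m_Size;
        pthread_cond_signal(&m_NotFull);
        pthread_mutex_unlock(&m_Mutex);
        return elem;
    }
};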

If you want anything with decent performance, I would strongly suggest dumping your R/W lock and just using a very simple spinlock. Or, if you really think you can get the performance you want with an R/W lock, I would roll your own based on this design (single-word R/W spinlock) from Joe Duffy.

It seems the problem is that you have a race condition: thread 2 CAN run before thread 1 ever does any cin.get(c). You need to make sure the data is initialized, and that when you're reading you handle the case where nothing has been entered yet.
Maybe this is just me not seeing the rest of the code where this is done, though.


C++ Blocking Queue Segfault w/ Boost

I had a need for a Blocking Queue in C++ with timeout-capable offer(). The queue is intended for multiple producers, one consumer. Back when I was implementing, I didn't find any good existing queues that fit this need, so I coded it myself.
I'm seeing segfaults come out of the take() method on the queue, but they are intermittent. I've been looking over the code for issues but I'm not seeing anything that looks problematic.
I'm wondering whether there is an existing library that does this reliably that I should use (Boost or header-only preferred), or whether anyone sees any obvious flaw in my code that I need to fix.
Here is the header:
class BlockingQueue
{
public:
BlockingQueue(unsigned int capacity) : capacity(capacity) { };
bool offer(const MyType & myType, unsigned int timeoutMillis);
MyType take();
void put(const MyType & myType);
unsigned int getCapacity();
unsigned int getCount();
private:
std::deque<MyType> queue;
unsigned int capacity;
};
And the relevant implementations:
boost::condition_variable cond;
boost::mutex mut;
bool BlockingQueue::offer(const MyType & myType, unsigned int timeoutMillis)
{
Timer timer;
// boost::unique_lock is a scoped lock - its destructor will call unlock().
// So no need for us to make that call here.
boost::unique_lock<boost::mutex> lock(mut);
// We use a while loop here because the monitor may have woken up because
// another producer did a PulseAll. In that case, the queue may not have
// room, so we need to re-check and re-wait if that is the case.
// We use an external stopwatch to stop the madness if we have taken too long.
while (queue.size() >= this->capacity)
{
int monitorTimeout = timeoutMillis - ((unsigned int) timer.getElapsedMilliSeconds());
if (monitorTimeout <= 0)
{
return false;
}
if (!cond.timed_wait(lock, boost::posix_time::milliseconds(timeoutMillis)))
{
return false;
}
}
cond.notify_all();
queue.push_back(myType);
return true;
}
void BlockingQueue::put(const MyType & myType)
{
// boost::unique_lock is a scoped lock - its destructor will call unlock().
// So no need for us to make that call here.
boost::unique_lock<boost::mutex> lock(mut);
// We use a while loop here because the monitor may have woken up because
// another producer did a PulseAll. In that case, the queue may not have
// room, so we need to re-check and re-wait if that is the case.
// We use an external stopwatch to stop the madness if we have taken too long.
while (queue.size() >= this->capacity)
{
cond.wait(lock);
}
cond.notify_all();
queue.push_back(myType);
}
MyType BlockingQueue::take()
{
// boost::unique_lock is a scoped lock - its destructor will call unlock().
// So no need for us to make that call here.
boost::unique_lock<boost::mutex> lock(mut);
while (queue.size() == 0)
{
cond.wait(lock);
}
cond.notify_one();
MyType myType = this->queue.front();
this->queue.pop_front();
return myType;
}
unsigned int BlockingQueue::getCapacity()
{
return this->capacity;
}
unsigned int BlockingQueue::getCount()
{
return this->queue.size();
}
And yes, I didn't implement the class using templates - that is next on the list :)
Any help is greatly appreciated. Threading issues can be really hard to pin down.
-Ben
Why are cond and mut globals? I would expect them to be members of your BlockingQueue object. I don't know what else is touching those things, but there may be an issue there.
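As a sketch of that change (illustrative only; MyType as in the question, accessors omitted, method bodies otherwise unchanged), the synchronization primitives would become per-instance members:
// Sketch: the lock and the condition variable become members, so each
// BlockingQueue instance protects its own deque (MyType as in the question).
#include <deque>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>

class BlockingQueue
{
public:
    explicit BlockingQueue(unsigned int capacity) : capacity(capacity) { }
    bool offer(const MyType & myType, unsigned int timeoutMillis);
    MyType take();
    void put(const MyType & myType);
private:
    std::deque<MyType> queue;
    unsigned int capacity;
    boost::mutex mut;                 // protects 'queue'
    boost::condition_variable cond;   // signalled whenever 'queue' changes
};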
I too have implemented a ThreadSafeQueue as part of a larger project:
https://github.com/cdesjardins/QueuePtr/blob/master/include/ThreadSafeQueue.h
It is a similar concept to yours, except that the enqueue (aka offer) functions are non-blocking because there is basically no maximum capacity. To enforce a capacity I typically have a pool with N buffers added at system init time, and a queue for message passing at run time; this also eliminates the need for memory allocation at run time, which I consider a good thing (I typically work on embedded applications).
The only difference between a pool, and a queue is that a pool gets a bunch of buffers enqueued at system init time. So you have something like this:
ThreadSafeQueue<BufferDataType*> pool;
ThreadSafeQueue<BufferDataType*> queue;
void init()
{
for (int i = 0; i < NUM_BUFS; i++)
{
pool.enqueue(new BufferDataType);
}
}
Then when you want to send a message, you do something like the following:
void producerA()
{
BufferDataType *buf;
if (pool.waitDequeue(buf, timeout) == true)
{
initBufWithMyData(buf);
queue.enqueue(buf);
}
}
This way the enqueue function is quick and easy, but if the pool is empty then you will block until someone puts a buffer back into the pool. The idea is that some other thread will be blocking on the queue and will return buffers to the pool once they have been processed, as follows:
void consumer()
{
BufferDataType *buf;
if (queue.waitDequeue(buf, timeout) == true)
{
processBufferData(buf);
pool.enqueue(buf);
}
}
Anyway, take a look at it; maybe it will help.
I suppose the problem in your code is the deque being modified by several threads. Look:
you're waiting for a condition from another thread;
then you immediately signal the other threads that the deque is unlocked, just before you actually modify it;
then you modify the deque while the other threads think it is already unlocked and start doing the same.
So, try to place all the cond.notify_*() calls after modifying the deque, i.e.:
void BlockingQueue::put(const MyType & myType)
{
boost::unique_lock<boost::mutex> lock(mut);
while (queue.size() >= this->capacity)
{
cond.wait(lock);
}
queue.push_back(myType); // <- modify first
cond.notify_all(); // <- then say to others that deque is free
}
For a better understanding, I suggest reading about pthread_cond_wait(); the sketch below shows the bare pattern it follows.
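For reference, a minimal sketch of the canonical pthread_cond_wait() pattern that this rule comes from (illustrative only, not tied to the Boost code above): change the shared state under the mutex first, then signal, and always re-check the predicate in a loop on the waiting side.
// Minimal sketch of the canonical pthread_cond_wait() pattern (illustrative only).
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
int ready = 0;                          // the shared state / predicate

void wait_until_ready() {
    pthread_mutex_lock(&m);
    while (!ready)                      // loop guards against spurious wakeups
        pthread_cond_wait(&c, &m);      // atomically releases m, re-locks before returning
    pthread_mutex_unlock(&m);
}

void make_ready() {
    pthread_mutex_lock(&m);
    ready = 1;                          // modify the shared state first...
    pthread_cond_signal(&c);            // ...then notify the waiter
    pthread_mutex_unlock(&m);
}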

Thread locks occurring using boost::thread. What's wrong with my condition variables?

I wrote a Link class for passing data between two nodes in a network. I've implemented it with two deques (one for data going from node 0 to node 1, and the other for data going from node 1 to node 0). I'm trying to multithread the application, but I'm getting threadlocks. I'm trying to prevent reading from and writing to the same deque at the same time. In reading more about how I originally implemented this, I think I'm using the condition variables incorrectly (and maybe shouldn't be using the boolean variables?). Should I have two mutexes, one for each deque? Please help if you can. Thanks!
class Link {
public:
// other stuff...
void push_back(int sourceNodeID, Data newData);
void get(int destinationNodeID, std::vector<Data> &linkData);
private:
// other stuff...
std::vector<int> nodeIDs_;
// qVector_ has two deques, one for Data from node 0 to node 1 and
// one for Data from node 1 to node 0
std::vector<std::deque<Data> > qVector_;
void initialize(int nodeID0, int nodeID1);
boost::mutex mutex_;
std::vector<boost::shared_ptr<boost::condition_variable> > readingCV_;
std::vector<boost::shared_ptr<boost::condition_variable> > writingCV_;
std::vector<bool> writingData_;
std::vector<bool> readingData_;
};
The push_back function:
void Link::push_back(int sourceNodeID, Data newData)
{
int idx;
if (sourceNodeID == nodeIDs_[0]) idx = 1;
else
{
if (sourceNodeID == nodeIDs_[1]) idx = 0;
else throw runtime_error("Link::push_back: Invalid node ID");
}
boost::unique_lock<boost::mutex> lock(mutex_);
// pause to avoid multithreading collisions
while (readingData_[idx]) readingCV_[idx]->wait(lock);
writingData_[idx] = true;
qVector_[idx].push_back(newData);
writingData_[idx] = false;
writingCV_[idx]->notify_all();
}
The get function:
void Link::get(int destinationNodeID,
std::vector<Data> &linkData)
{
int idx;
if (destinationNodeID == nodeIDs_[0]) idx = 0;
else
{
if (destinationNodeID == nodeIDs_[1]) idx = 1;
else throw runtime_error("Link::get: Invalid node ID");
}
boost::unique_lock<boost::mutex> lock(mutex_);
// pause to avoid multithreading collisions
while (writingData_[idx]) writingCV_[idx]->wait(lock);
readingData_[idx] = true;
std::copy(qVector_[idx].begin(),qVector_[idx].end(),back_inserter(linkData));
qVector_[idx].erase(qVector_[idx].begin(),qVector_[idx].end());
readingData_[idx] = false;
readingCV_[idx]->notify_all();
return;
}
and here's initialize (in case it's helpful)
void Link::initialize(int nodeID0, int nodeID1)
{
readingData_ = std::vector<bool>(2,false);
writingData_ = std::vector<bool>(2,false);
for (int i = 0; i < 2; ++i)
{
readingCV_.push_back(make_shared<boost::condition_variable>());
writingCV_.push_back(make_shared<boost::condition_variable>());
}
nodeIDs_.reserve(2);
nodeIDs_.push_back(nodeID0);
nodeIDs_.push_back(nodeID1);
qVector_.reserve(2);
qVector_.push_back(std::deque<Data>());
qVector_.push_back(std::deque<Data>());
}
I'm trying to multithread the application, but I'm getting threadlocks.
What is a "threadlock"? It's difficult to see what your code is trying to accomplish. Consider, first, your push_back() code, whose synchronized portion looks like this:
boost::unique_lock<boost::mutex> lock(mutex_);
while (readingData_[idx]) readingCV_[idx]->wait(lock);
writingData_[idx] = true;
qVector_[idx].push_back(newData);
writingData_[idx] = false;
writingCV_[idx]->notify_all();
Your writingData[idx] boolean starts off as false, and becomes true only momentarily while a thread has the mutex locked. By the time the mutex is released, it is false again. So for any other thread that has to wait to acquire the mutex, writingData[idx] will never be true.
But in your get() code, you have
boost::unique_lock<boost::mutex> lock(mutex_);
// pause to avoid multithreading collisions
while (writingData_[idx]) writingCV_[idx]->wait(lock);
By the time a thread gets the lock on the mutex, writingData[idx] is back to false and so the while loop (and wait on the CV) is never entered.
An exactly symmetric analysis applies to the readingData[idx] boolean, which also is always false outside the mutex lock.
So your condition variables are never waited on. You need to completely rethink your design.
Start with one mutex per queue (the deque is overkill for simply passing data), and for each queue associate a condition variable with the queue being non-empty. The get() method will thus wait until the queue is non-empty, which will be signalled in the push_back() method. Something like this (untested code):
#include <queue>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>
template <typename Data>
class BasicQueue
{
public:
void push( Data const& data )
{
boost::unique_lock<boost::mutex> _lock( mutex_ );
queue_.push( data );
not_empty_.notify_all();
}
void get ( Data& data )
{
boost::unique_lock<boost::mutex> _lock( mutex_ );
while ( queue_.empty() )
not_empty_.wait( _lock ); // this releases the mutex
// mutex is reacquired here, with the queue non-empty
data = queue_.front();
queue_.pop();
}
private:
std::queue<Data> queue_;
boost::mutex mutex_;
boost::condition_variable not_empty_;
};
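To connect this back to the Link class, a rough sketch (illustrative only; it reuses the question's node-ID convention and Data type, and unlike the original get() it retrieves one item at a time) could simply hold one such queue per direction:
// Rough sketch: a Link built from two BasicQueue<Data> instances, one per direction.
// Illustrative only; node-ID handling follows the convention from the question.
class Link {
public:
    Link(int nodeID0, int nodeID1) { nodeIDs_[0] = nodeID0; nodeIDs_[1] = nodeID1; }
    // data sent by node 0 lands in queue 1 (read by node 1), and vice versa
    void push_back(int sourceNodeID, Data const& d) {
        queues_[sourceNodeID == nodeIDs_[0] ? 1 : 0].push(d);
    }
    // blocks until a piece of data is available for the destination node
    void get(int destinationNodeID, Data& d) {
        queues_[destinationNodeID == nodeIDs_[0] ? 0 : 1].get(d);
    }
private:
    int nodeIDs_[2];
    BasicQueue<Data> queues_[2];
};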
Yes. You need two mutexes. Your deadlocks are almost certainly a result of contention on the single mutex. If you break into your running program with a debugger, you will see where the threads are hanging. Also, I don't see why you would need the bools. (EDIT: it may be possible to come up with a design that uses a single mutex, but it's simpler and safer to stick with one mutex per shared data structure.)
A rule of thumb would be to have one mutex per shared data structure you are trying to protect. That mutex guards the data structure against concurrent access and provides thread safety. In your case one mutex per deque. E.g.:
class SafeQueue
{
private:
std::deque<Data> q_;
boost::mutex m_;
boost::condition_variable v_;
public:
void push_back(Data newData)
{
boost::lock_guard<boost::mutex> lock(m_);
q_.push_back(newData);
// notify etc.
}
// ...
};
In terms of notification via condition variables see here:
Using condition variable in a producer-consumer situation
So there would also be one condition_variable per object which the producer would notify and the consumer would wait on. Now you can create two of these queues for communicating in both directions. Keep in mind that with only two threads you can still deadlock if both threads are blocked (waiting for data) and both queues are empty.

Does a multiple producer single consumer lock-free queue exist for c++? [closed]

The more I read the more confused I become... I would have thought it trivial to find a formally correct MPSC queue implemented in C++.
Every time I find another stab at it, further research seems to suggest there are issues such as ABA or other subtle race conditions.
Many talk of the need for garbage collection. This is something I want to avoid.
Is there an accepted correct open-source implementation out there?
You may want to check Disruptor; it's available in C++ here: http://lmax-exchange.github.io/disruptor/
You can also find an explanation of how it works here on Stack Overflow. Basically it's a circular buffer with no locking, optimized for passing FIFO messages between threads in fixed-size slots.
Here are two implementations which I found useful: Lock-free Multi-producer Multi-consumer Queue on Ring Buffer at the NatSys Lab blog, and Yet another implementation of a lock-free circular array queue at CodeProject.
NOTE: the code below is incorrect; I leave it only as an example of how tricky these things can be.
If you don't like the complexity of the Google version, here is something similar from me - it's much simpler, but I leave it as an exercise for the reader to make it work (it's part of a larger project, not portable at the moment). The whole idea is to maintain a circular buffer for data and a small set of counters to identify slots for writing/written and reading/read. Since each counter is in its own cache line, and (normally) each is only atomically updated once in the life of a message, they can all be read without any synchronisation. There is one potential contention point between writing threads in post_done; it's required for the FIFO guarantee. The counters (head_, wrtn_, rdng_, tail_) were selected to ensure correctness and FIFO, so dropping FIFO would also require a change of counters (and that might be difficult to do without sacrificing correctness). It is possible to slightly improve performance for scenarios with one consumer, but I would not bother - you would have to undo it if other use cases with multiple readers are found.
On my machine latency looks like following (percentile on left, mean within this percentile on right, unit is microsecond, measured by rdtsc):
total=1000000 samples, avg=0.24us
50%=0.214us, avg=0.093us
90%=0.23us, avg=0.151us
99%=0.322us, avg=0.159us
99.9%=15.566us, avg=0.173us
These results are for a single polling consumer, i.e. a worker thread calling wheel.read() in a tight loop and checking whether the slot is not empty (scroll to the bottom for an example). Waiting consumers (much lower CPU utilization) would wait on an event (one of the acquire... functions); this adds about 1-2us to the average latency due to the context switch.
Since there is very little contention on read, consumers scale very well with the number of worker threads, e.g. for 3 threads on my machine:
total=1500000 samples, avg=0.07us
50%=0us, avg=0us
90%=0.155us, avg=0.016us
99%=0.361us, avg=0.038us
99.9%=8.723us, avg=0.044us
Patches will be welcome :)
// Copyright (c) 2011-2012, Bronislaw (Bronek) Kozicki
//
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#pragma once
#include <core/api.hxx>
#include <core/wheel/exception.hxx>
#include <boost/noncopyable.hpp>
#include <boost/type_traits.hpp>
#include <boost/lexical_cast.hpp>
#include <typeinfo>
namespace core { namespace wheel
{
struct bad_size : core::exception
{
template<typename T> explicit bad_size(const T&, size_t m)
: core::exception(std::string("Slot capacity exceeded, sizeof(")
+ typeid(T).name()
+ ") = "
+ boost::lexical_cast<std::string>(sizeof(T))
+ ", capacity = "
+ boost::lexical_cast<std::string>(m)
)
{}
};
// inspired by Disruptor
template <typename Header>
class wheel : boost::noncopyable
{
__declspec(align(64))
struct slot_detail
{
// slot write: (memory barrier in wheel) > post_done > (memory barrier in wheel)
// slot read: (memory barrier in wheel) > read_done > (memory barrier in wheel)
// done writing or reading, must update wrtn_ or tail_ in wheel, as appropriate
template <bool Writing>
void done(wheel* w)
{
if (Writing)
w->post_done(sequence);
else
w->read_done();
}
// cache line for sequence number and header
long long sequence;
Header header;
// there is no such thing as data type with variable size, but we need it to avoid thrashing
// cache - so we invent one. The memory is reserved in runtime and we simply go beyond last element.
// This is well into UB territory! Using template parameter for this is not good, since it
// results in this small implementation detail leaking to all possible user interfaces.
__declspec(align(8))
char data[8];
};
// use this as a storage space for slot_detail, to guarantee 64 byte alignment
__declspec(align(64))
struct slot_block { long long padding[8]; };
public:
// wrap slot data to outside world
template <bool Writable>
class slot
{
template<typename> friend class wheel;
slot& operator=(const slot&); // moveable but non-assignable
// may only be constructed by wheel
slot(slot_detail* impl, wheel<Header>* w, size_t c)
: slot_(impl) , wheel_(w) , capacity_(c)
{}
public:
slot(slot&& s)
: slot_(s.slot_) , wheel_(s.wheel_) , capacity_(s.capacity_)
{
s.slot_ = NULL;
}
~slot()
{
if (slot_)
{
slot_->done<Writable>(wheel_);
}
}
// slot accessors - use Header to store information on what type is actually stored in data
bool empty() const { return !slot_; }
long long sequence() const { return slot_->sequence; }
Header& header() { return slot_->header; }
char* data() { return slot_->data; }
template <typename T> T& cast()
{
static_assert(boost::is_pod<T>::value, "Data type must be POD");
if (sizeof(T) > capacity_)
throw bad_size(T(), capacity_);
if (empty())
throw no_data();
return *((T*) data());
}
private:
slot_detail* slot_;
wheel<Header>* wheel_;
const size_t capacity_;
};
private:
// dynamic size of slot, with extra capacity, expressed in 64 byte blocks
static size_t sizeof_slot(size_t s)
{
size_t m = sizeof(slot_detail);
// add capacity less 8 bytes already within sizeof(slot_detail)
m += max(8, s) - 8;
// round up to 64 bytes, i.e. alignment of slot_detail
size_t r = m & ~(unsigned int)63;
if (r < m)
r += 64;
r /= 64;
return r;
}
// calculate actual slot capacity back from number of 64 byte blocks
static size_t slot_capacity(size_t s)
{
return s*64 - sizeof(slot_detail) + 8;
}
// round up to power of 2
static size_t round_size(size_t s)
{
// enforce minimum size
if (s <= min_size)
return min_size;
// find rounded value
--s;
size_t r = 1;
while (s)
{
s >>= 1;
r <<= 1;
};
return r;
}
slot_detail& at(long long sequence)
{
// find index from sequence number and return slot at found index of the wheel
return *((slot_detail*) &wheel_[(sequence & (size_ - 1)) * blocks_]);
}
public:
wheel(size_t capacity, size_t size)
: head_(0) , wrtn_(0) , rdng_(0) , tail_(0) , event_()
, blocks_(sizeof_slot(capacity)) , capacity_(slot_capacity(blocks_)) , size_(round_size(size))
{
static_assert(boost::is_pod<Header>::value, "Header type must be POD");
static_assert(sizeof(slot_block) == 64, "This was unexpected");
wheel_ = new slot_block[size_ * blocks_];
// all slots must be initialised to 0
memset(wheel_, 0, size_ * 64 * blocks_);
active_ = 1;
}
~wheel()
{
stop();
delete[] wheel_;
}
// all accessors needed
size_t capacity() const { return capacity_; } // capacity of a single slot
size_t size() const { return size_; } // number of slots available
size_t queue() const { return (size_t)head_ - (size_t)tail_; }
bool active() const { return active_ == 1; }
// enough to call it just once, to fine tune slot capacity
template <typename T>
void check() const
{
static_assert(boost::is_pod<T>::value, "Data type must be POD");
if (sizeof(T) > capacity_)
throw bad_size(T(), capacity_);
}
// stop the wheel - safe to execute many times
size_t stop()
{
InterlockedExchange(&active_, 0);
// must wait for current read to complete
while (rdng_ != tail_)
Sleep(10);
return size_t(head_ - tail_);
}
// return first available slot for write
slot<true> post()
{
if (!active_)
throw stopped();
// the only memory barrier on head seq. number we need, if not overflowing
long long h = InterlockedIncrement64(&head_);
while(h - (long long) size_ > tail_)
{
if (InterlockedDecrement64(&head_) == h - 1)
throw overflowing();
// protection against case of race condition when we are overflowing
// and two or more threads try to post and two or more messages are read,
// all at the same time. If this happens we must re-try, otherwise we
// could have skipped a sequence number - causing infinite wait in post_done
Sleep(0);
h = InterlockedIncrement64(&head_);
}
slot_detail& r = at(h);
r.sequence = h;
// wrap in writeable slot
return slot<true>(&r, this, capacity_);
}
// return first available slot for write, nothrow variant
slot<true> post(std::nothrow_t)
{
if (!active_)
return slot<true>(NULL, this, capacity_);
// the only memory barrier on head seq. number we need, if not overflowing
long long h = InterlockedIncrement64(&head_);
while(h - (long long) size_ > tail_)
{
if (InterlockedDecrement64(&head_) == h - 1)
return slot<true>(NULL, this, capacity_);
// must retry if race condition described above
Sleep(0);
h = InterlockedIncrement64(&head_);
}
slot_detail& r = at(h);
r.sequence = h;
// wrap in writeable slot
return slot<true>(&r, this, capacity_);
}
// read first available slot for read
slot<false> read()
{
slot_detail* r = NULL;
// compare rdng_ and wrtn_ early to avoid unnecessary memory barrier
if (active_ && rdng_ < wrtn_)
{
// the only memory barrier on reading seq. number we need
const long long h = InterlockedIncrement64(&rdng_);
// check if this slot has been written, step back if not
if (h > wrtn_)
InterlockedDecrement64(&rdng_);
else
r = &at(h);
}
// wrap in readable slot
return slot<false>(r , this, capacity_);
}
// waiting for new post, to be used by non-polling clients
void acquire()
{
event_.acquire();
}
bool try_acquire()
{
return event_.try_acquire();
}
bool try_acquire(unsigned long timeout)
{
return event_.try_acquire(timeout);
}
void release()
{}
private:
void post_done(long long sequence)
{
const long long t = sequence - 1;
// the only memory barrier on written seq. number we need
while(InterlockedCompareExchange64(&wrtn_, sequence, t) != t)
Sleep(0);
// this is outside of critical path for polling clients
event_.set();
}
void read_done()
{
// the only memory barrier on tail seq. number we need
InterlockedIncrement64(&tail_);
}
// each in its own cache line
// head_ - wrtn_ = no. of messages being written at this moment
// rdng_ - tail_ = no. of messages being read at the moment
// head_ - tail_ = no. of messages to read (including those being written and read)
// wrtn_ - rdng_ = no. of messages to read (excluding those being written or read)
__declspec(align(64)) volatile long long head_; // currently writing or written
__declspec(align(64)) volatile long long wrtn_; // written
__declspec(align(64)) volatile long long rdng_; // currently reading or read
__declspec(align(64)) volatile long long tail_; // read
__declspec(align(64)) volatile long active_; // flag switched to 0 when stopped
__declspec(align(64))
api::event event_; // set when new message is posted
const size_t blocks_; // number of 64-byte blocks in a single slot_detail
const size_t capacity_; // capacity of data() section per single slot. Initialisation depends on blocks_
const size_t size_; // number of slots available, always power of 2
slot_block* wheel_;
};
}}
Here is what polling consumer worker thread may look like:
while (wheel.active())
{
core::wheel::wheel<int>::slot<false> slot = wheel.read();
if (!slot.empty())
{
Data& d = slot.cast<Data>();
// do work
}
// uncomment below for waiting consumer, saving CPU cycles
// else
// wheel.try_acquire(10);
}
Edit: added consumer example.
The most suitable implementation depends on the desired properties of the queue. Should it be unbounded, or is a bounded one fine? Should it be linearizable, or would less strict requirements be fine? How strong a FIFO guarantee do you need? Are you willing to pay the cost of having the consumer reverse the list (there exists a very simple implementation where the consumer grabs the tail of a singly-linked list, thus getting at once all items put by producers up to that moment)? Should it guarantee that no thread is ever blocked, or are tiny chances of some thread getting blocked acceptable? And so on.
Some useful links:
Is multiple-producer, single-consumer possible in a lockfree setting?
http://www.1024cores.net/home/lock-free-algorithms/queues
http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
https://groups.google.com/group/comp.programming.threads/browse_frm/thread/33f79c75146582f3
Hope that helps.
Below is the technique I used for my Cooperative Multi-tasking / Multi-threading library (MACE) http://bytemaster.github.com/mace/. It has the benefit of being lock-free except for when the queue is empty.
struct task {
boost::function<void()> func;
task* next;
};
boost::mutex task_ready_mutex;
boost::condition_variable task_ready;
boost::atomic<task*> task_in_queue;
// this can be called from any thread
void thread::post_task( task* t ) {
// atomically post the task to the queue.
task* stale_head = task_in_queue.load(boost::memory_order_relaxed);
do { t->next = stale_head;
} while( !task_in_queue.compare_exchange_weak( stale_head, t, boost::memory_order_release ) );
// Because only one thread can post the 'first task', only that thread will attempt
// to acquire the lock and therefore there should be no contention on this lock except
// when *this thread is about to block on a wait condition.
if( !stale_head ) {
boost::unique_lock<boost::mutex> lock(task_ready_mutex);
task_ready.notify_one();
}
}
// this is the consumer thread.
void process_tasks() {
while( !done ) {
// this will atomically pop everything that has been posted so far.
task* pending = task_in_queue.exchange(0,boost::memory_order_consume);
// pending is a linked list in 'reverse post order', so process them
// from tail to head if you want to maintain order.
if( !pending ) { // lock scope
boost::unique_lock<boost::mutex> lock(task_ready_mutex);
// check one last time while holding the lock before blocking.
if( !task_in_queue ) task_ready.wait( lock );
}
}
}
I'm guessing no such thing exists - and if it does, it either isn't portable or isn't open source.
Conceptually, you are trying to control two pointers simultaneously: the tail pointer and the tail->next pointer. That can't generally be done with just lock-free primitives.

Asynchronous thread-safe logging in C++ (no mutex)

I'm actually looking for a way to do asynchronous, thread-safe logging in my C++ application.
I have already explored thread-safe logging solutions like log4cpp, log4cxx, Boost:log and rlog, but it seems that all of them use a mutex. And as far as I know, a mutex is a synchronous solution, which means that threads trying to write their messages are blocked while another thread is writing.
Do you know a solution?
I think your statement is wrong: using a mutex is not necessarily equivalent to a synchronous solution. Yes, a mutex is for synchronization control, but it can be used for many different things. We can use a mutex in, for example, a producer-consumer queue while the logging still happens asynchronously.
Honestly, I haven't looked into the implementation of these logging libraries, but it should be feasible to write an asynchronous appender (for a log4j-like library) in which the logger writes to a producer-consumer queue and another worker thread is responsible for writing to a file (or even delegating to another appender), in case one is not provided.
Edit:
I just had a brief scan of log4cxx; it does provide an AsyncAppender, which does what I suggested: it buffers the incoming logging events and delegates to the attached appenders asynchronously.
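If you do end up rolling your own, here is a minimal, hedged sketch of such an asynchronous logger in C++11 (illustrative names, not the log4cxx implementation): callers only take a short queue lock, and a single worker thread performs the file I/O.
// Sketch of an asynchronous logger (illustrative, C++11): callers only pay for
// a short queue lock; the actual file I/O happens on a dedicated thread.
#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path)
        : out_(path.c_str()), done_(false), worker_(&AsyncLogger::run, this) {}
    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void log(const std::string& line) {               // called by any thread
        { std::lock_guard<std::mutex> lock(m_); q_.push_back(line); }
        cv_.notify_one();
    }
private:
    void run() {                                       // single consumer thread
        std::unique_lock<std::mutex> lock(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string line = q_.front();
                q_.pop_front();
                lock.unlock();                         // do the I/O outside the lock
                out_ << line << '\n';
                lock.lock();
            }
        }
    }
    std::ofstream out_;
    bool done_;
    std::deque<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
    std::thread worker_;
};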
I'd recommend avoiding the problem by using only one thread for logging. For passing the necessary data to log, you can use a lock-free FIFO queue (thread-safe as long as producer and consumer are strictly separated and only one thread has each role -- therefore you will need one queue for each producer).
An example of a fast lock-free queue is included:
queue.h:
#ifndef QUEUE_H
#define QUEUE_H
template<typename T> class Queue
{
public:
virtual void Enqueue(const T &element) = 0;
virtual T Dequeue() = 0;
virtual bool Empty() = 0;
};
#endif // QUEUE_H
hybridqueue.h:
#ifndef HYBRIDQUEUE_H
#define HYBRIDQUEUE_H
#include "queue.h"
template <typename T, int size> class HybridQueue : public Queue<T>
{
public:
virtual bool Empty();
virtual T Dequeue();
virtual void Enqueue(const T& element);
HybridQueue();
virtual ~HybridQueue();
private:
struct ItemList
{
int start;
T list[size];
int end;
ItemList volatile * volatile next;
};
ItemList volatile * volatile start;
char filler[256];
ItemList volatile * volatile end;
};
/**
* Implementation
*
*/
#include <stdio.h>
template <typename T, int size> bool HybridQueue<T, size>::Empty()
{
return (this->start == this->end) && (this->start->start == this->start->end);
}
template <typename T, int size> T HybridQueue<T, size>::Dequeue()
{
if(this->Empty())
{
return NULL;
}
if(this->start->start >= size)
{
ItemList volatile * volatile old;
old = this->start;
this->start = this->start->next;
delete old;
}
T tmp;
tmp = this->start->list[this->start->start];
this->start->start++;
return tmp;
}
template <typename T, int size> void HybridQueue<T, size>::Enqueue(const T& element)
{
if(this->end->end >= size) {
this->end->next = new ItemList();
this->end->next->start = 0;
this->end->next->list[0] = element;
this->end->next->end = 1;
this->end = this->end->next;
}
else
{
this->end->list[this->end->end] = element;
this->end->end++;
}
}
template <typename T, int size> HybridQueue<T, size>::HybridQueue()
{
this->start = this->end = new ItemList();
this->start->start = this->start->end = 0;
}
template <typename T, int size> HybridQueue<T, size>::~HybridQueue()
{
}
#endif // HYBRIDQUEUE_H
If I get your question right, you are concerned about doing an I/O operation (probably a write to a file) inside the logger's critical section.
Boost:log lets you define a custom writer object. You can define operator() to call async I/O, or to pass the message to your logging thread (which does the I/O).
http://www.torjo.com/log2/doc/html/workflow.html#workflow_2b
No libraries will do this as far as I know - it's too complex. You'll have to roll your own, and here's an idea I just had: create a per-thread log file, ensure that the first item in each entry is a timestamp, and then merge the logs after the run and sort them (by timestamp) to get the final log file.
You could use thread-local storage, maybe (say, a FILE handle; AFAIK it won't be possible to store a stream object in thread-local storage), and look this handle up on each log line and write to that specific file.
All this complexity vs. locking the mutex? I don't know the performance requirements of your application, but if it is sensitive, why would you be logging (excessively)? Think of other ways to obtain the information you require without logging.
Also, one other thing to consider is to hold the mutex for the least amount of time possible, i.e. construct your log entry first and then acquire the lock just before writing to the file.
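A minimal sketch of that last point (illustrative names): format the entry outside the lock and hold the mutex only around the actual write.
#include <fstream>
#include <mutex>
#include <sstream>
#include <string>

// Build the entry outside the lock; hold the mutex only around the actual write.
void log_line(std::ofstream& out, std::mutex& m, const std::string& tag, const std::string& text) {
    std::ostringstream entry;                  // formatting needs no lock
    entry << "[" << tag << "] " << text << '\n';
    std::lock_guard<std::mutex> lock(m);       // lock held only for the file write
    out << entry.str();
}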
In a Windows program, we use a user-defined Windows message. First, memory is allocated for the log entry on the heap. Then PostMessage is called, with the pointer as the LPARAM and the record size as the WPARAM. The receiver window extracts the record, displays it, and saves it in the log file; since PostMessage returns immediately, the receiver is also the one that deallocates the record. This approach is thread-safe, and you don't have to use mutexes. Concurrency is handled by the message-queue mechanism of Windows. Not very elegant, but it works.
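A rough Win32 sketch of that approach (illustrative message id and helper name; since PostMessage is asynchronous, ownership of the heap record passes to the receiver, which frees it):
// Sketch of the PostMessage approach (Win32; WM_APP_LOG is an assumed message id).
#include <windows.h>
#include <string>

const UINT WM_APP_LOG = WM_APP + 1;      // user-defined message

void post_log(HWND logWindow, const std::string& text) {
    std::string* entry = new std::string(text);           // allocated by the sender
    PostMessage(logWindow, WM_APP_LOG,
                (WPARAM)entry->size(), (LPARAM)entry);     // pointer travels in LPARAM
}

// Inside the log window's WndProc:
//   case WM_APP_LOG: {
//       std::string* entry = reinterpret_cast<std::string*>(lParam);
//       writeToLogFile(*entry);   // assumed helper that appends to the log file
//       delete entry;             // receiver releases the record
//       return 0;
//   }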
Lock-free algorithms are not necessarily the fastest ones. Define your boundaries: how many threads are there for logging? How much will be written in a single log operation at most?
I/O-bound operations are much, much slower than the thread context switching caused by blocking/waking threads. Using a lock-free/spin-lock algorithm with 10 writing threads will bring a heavy load to the CPU.
In short, block the other threads while you are writing to a file.

C++ Critical Section not working

My critical section code does not work!
Backgrounder::Run IS able to modify MESSAGE_QUEUE g_msgQueue even though LockSection's destructor hasn't been called yet!
Extra code:
typedef std::vector<int> MESSAGE_LIST; // SHARED OBJECT .. MUST LOCK!
class MESSAGE_QUEUE : MESSAGE_LIST{
public:
MESSAGE_LIST * m_pList;
MESSAGE_QUEUE(MESSAGE_LIST* pList){ m_pList = pList; }
~MESSAGE_QUEUE(){ }
/* This class will be shared between threads that means any
* attempt to access it MUST be inside a critical section.
*/
void Add( int messageCode ){ if(m_pList) m_pList->push_back(messageCode); }
int getLast()
{
if(m_pList){
if(m_pList->size() == 1){
Add(0x0);
}
m_pList->pop_back();
return m_pList->back();
}
}
void removeLast()
{
if(m_pList){
m_pList->erase(m_pList->end()-1,m_pList->end());
}
}
};
class Backgrounder{
public:
MESSAGE_QUEUE* m_pMsgQueue;
static void __cdecl Run( void* args){
MESSAGE_QUEUE* s_pMsgQueue = (MESSAGE_QUEUE*)args;
if(s_pMsgQueue->getLast() == 0x45)printf("It's a success!");
else printf("It's a trap!");
}
Backgrounder(MESSAGE_QUEUE* pMsgQueue)
{
m_pMsgQueue = pMsgQueue;
_beginthread(Run,0,(void*)m_pMsgQueue);
}
~Backgrounder(){ }
};
int main(){
MESSAGE_LIST g_List;
CriticalSection crt;
ErrorHandler err;
LockSection lc(&crt,&err); // Does not work , see question #2
MESSAGE_QUEUE g_msgQueue(&g_List);
g_msgQueue.Add(0x45);
printf("%d",g_msgQueue.getLast());
Backgrounder back_thread(&g_msgQueue);
while(!kbhit());
return 0;
}
#ifndef CRITICALSECTION_H
#define CRITICALSECTION_H
#include <windows.h>
#include "ErrorHandler.h"
class CriticalSection{
long m_nLockCount;
long m_nThreadId;
typedef CRITICAL_SECTION cs;
cs m_tCS;
public:
CriticalSection(){
::InitializeCriticalSection(&m_tCS);
m_nLockCount = 0;
m_nThreadId = 0;
}
~CriticalSection(){ ::DeleteCriticalSection(&m_tCS); }
void Enter(){ ::EnterCriticalSection(&m_tCS); }
void Leave(){ ::LeaveCriticalSection(&m_tCS); }
void Try();
};
class LockSection{
CriticalSection* m_pCS;
ErrorHandler * m_pErrorHandler;
bool m_bIsClosed;
public:
LockSection(CriticalSection* pCS,ErrorHandler* pErrorHandler){
m_bIsClosed = false;
m_pCS = pCS;
m_pErrorHandler = pErrorHandler;
// 0x1AE is code prefix for critical section header
if(!m_pCS)m_pErrorHandler->Add(0x1AE1);
if(m_pCS)m_pCS->Enter();
}
~LockSection(){
if(!m_pCS)m_pErrorHandler->Add(0x1AE2);
if(m_pCS && m_bIsClosed == false)m_pCS->Leave();
}
void ForceCSectionClose(){
if(!m_pCS)m_pErrorHandler->Add(0x1AE3);
if(m_pCS){m_pCS->Leave();m_bIsClosed = true;}
}
};
/*
Safe class basic structure;
class SafeObj
{
CriticalSection m_cs;
public:
void SafeMethod()
{
LockSection myLock(&m_cs);
//add code to implement the method ...
}
};
*/
#endif
Two questions in one. I don't know about the first, but the critical section part is easy to explain. The background thread isn't trying to claim the lock and so, of course, is not blocked. You need to make the critical section object crt visible to the thread so that it can lock it.
The way to use this lock class is that each section of code that you want serialised must create a LockSection object and hold on to it until the end of the serialised block:
Thread 1:
{
LockSection lc(&crt,&err);
//operate on shared object from thread 1
}
Thread 2:
{
LockSection lc(&crt,&err);
//operate on shared object from thread 2
}
Note that it has to be the same critical section instance crt that is used in each block of code that is to be serialised.
This code has a number of problems.
First of all, deriving from the standard containers is almost always a poor idea. In this case you're using private inheritance, which reduces the problems, but doesn't eliminate them entirely. In any case, you don't seem to put the inheritance to much (any?) use anyway. Even though you've derived your MESSAGE_QUEUE from MESSAGE_LIST (which is actually std::vector<int>), you embed a pointer to an instance of a MESSAGE_LIST into MESSAGE_QUEUE anyway.
Second, if you're going to use a queue to communicate between threads (which I think is generally a good idea) you should make the locking inherent in the queue operations rather than requiring each thread to manage the locking correctly on its own.
Third, a vector isn't a particularly suitable data structure for representing a queue anyway, unless you're going to make it fixed size, and use it roughly like a ring buffer. That's not a bad idea either, but it's quite a bit different from what you've done. If you're going to make the size dynamic, you'd probably be better off starting with a deque instead.
Fourth, the magic numbers in your error handling (0x1AE1, 0x1AE2, etc.) are quite opaque. At the very least, you need to give these meaningful names. The one comment you have does not make the usage anywhere close to clear.
Finally, if you're going to go to all the trouble of writing code for a thread-safe queue, you might as well make it generic so it can hold essentially any kind of data you want, instead of dedicating it to one specific type.
Ultimately, your code doesn't seem to save the client much work or trouble over using the Windows functions directly. For the most part, you've just provided the same capabilities under slightly different names.
IMO, a thread-safe queue should handle almost all the work internally, so that client code can use it about like it would any other queue.
// Warning: untested code.
// Assumes: `T::T(T const &) throw()`
//
#include <windows.h>
#include <process.h>
#include <cstdio>
#include <deque>
template <class T>
class queue {
std::deque<T> data;
CRITICAL_SECTION cs;
HANDLE semaphore;
public:
queue() {
InitializeCriticalSection(&cs);
semaphore = CreateSemaphore(NULL, 0, 2048, NULL);
}
~queue() {
DeleteCriticalSection(&cs);
CloseHandle(semaphore);
}
void push(T const &item) {
EnterCriticalSection(&cs);
data.push_back(item);
LeaveCriticalSection(&cs);
ReleaseSemaphore(semaphore, 1, NULL);
}
T pop() {
WaitForSingleObject(semaphore, INFINITE);
EnterCriticalSection(&cs);
T item = data.front();
data.pop_front();
LeaveCriticalSection(&cs);
return item;
}
};
HANDLE done;
typedef queue<int> msgQ;
enum commands { quit, print };
void backgrounder(void *qq) {
// I haven't quite puzzled out what your background thread
// was supposed to do, so I've kept it really simple, executing only
// the two commands listed above.
msgQ *q = (msgQ *)qq;
int command;
while (quit != (command = q->pop()))
printf("Print\n");
SetEvent(done);
}
int main() {
msgQ q;
done = CreateEvent(NULL, false, false, NULL);
_beginthread(backgrounder, 0, (void*)&q);
for (int i=0; i<20; i++)
q.push(print);
q.push(quit);
WaitForSingleObject(done, INFINITE);
return 0;
}
Your background thread needs access to the same CriticalSection object and it needs to create LockSection objects to lock it -- the locking is collaborative.
You are trying to return the last element after popping it.