Lock-Free Multiple Producer/Consumer Queue in C++11

I'm trying to implement a lock-free multiple-producer, multiple-consumer queue in C++11. I'm doing this as a learning exercise, so I'm well aware that I could just use an existing open-source implementation, but I'd really like to find out why my code doesn't work. The data is stored in a ring buffer, so apparently it's a "bounded MPMC queue".
I've modelled it pretty closely on what I've read of the Disruptor. The thing I've noticed is that it works absolutely fine with a single consumer and single/multiple producers; it's just multiple consumers that seem to break it.
Here's the queue:
template <typename T>
class Queue : public IQueue<T>
{
public:
    explicit Queue( int capacity );
    ~Queue();
    bool try_push( T value );
    bool try_pop( T& value );
private:
    typedef struct
    {
        bool readable;
        T value;
    } Item;
    std::atomic<int> m_head;
    std::atomic<int> m_tail;
    int m_capacity;
    Item* m_items;
};

template <typename T>
Queue<T>::Queue( int capacity ) :
    m_head( 0 ),
    m_tail( 0 ),
    m_capacity( capacity ),
    m_items( new Item[capacity] )
{
    for( int i = 0; i < capacity; ++i )
    {
        m_items[i].readable = false;
    }
}

template <typename T>
Queue<T>::~Queue()
{
    delete[] m_items;
}

template <typename T>
bool Queue<T>::try_push( T value )
{
    while( true )
    {
        // See that there's room
        int tail = m_tail.load(std::memory_order_acquire);
        int new_tail = ( tail + 1 );
        int head = m_head.load(std::memory_order_acquire);
        if( ( new_tail - head ) >= m_capacity )
        {
            return false;
        }
        if( m_tail.compare_exchange_weak( tail, new_tail, std::memory_order_acq_rel ) )
        {
            // In try_pop, m_head is incremented before the reading of the value has completed,
            // so though we've acquired this slot, a consumer thread may be in the middle of reading
            tail %= m_capacity;
            std::atomic_thread_fence( std::memory_order_acquire );
            while( m_items[tail].readable )
            {
            }
            m_items[tail].value = value;
            std::atomic_thread_fence( std::memory_order_release );
            m_items[tail].readable = true;
            return true;
        }
    }
}

template <typename T>
bool Queue<T>::try_pop( T& value )
{
    while( true )
    {
        int head = m_head.load(std::memory_order_acquire);
        int tail = m_tail.load(std::memory_order_acquire);
        if( head == tail )
        {
            return false;
        }
        int new_head = ( head + 1 );
        if( m_head.compare_exchange_weak( head, new_head, std::memory_order_acq_rel ) )
        {
            head %= m_capacity;
            std::atomic_thread_fence( std::memory_order_acquire );
            while( !m_items[head].readable )
            {
            }
            value = m_items[head].value;
            std::atomic_thread_fence( std::memory_order_release );
            m_items[head].readable = false;
            return true;
        }
    }
}
And here's the test I'm using:
void Test( std::string name, Queue<int>& queue )
{
    const int NUM_PRODUCERS = 64;
    const int NUM_CONSUMERS = 2;
    const int NUM_ITERATIONS = 512;
    bool table[NUM_PRODUCERS*NUM_ITERATIONS];
    memset(table, 0, NUM_PRODUCERS*NUM_ITERATIONS*sizeof(bool));

    std::vector<std::thread> threads(NUM_PRODUCERS+NUM_CONSUMERS);
    std::chrono::system_clock::time_point start, end;
    start = std::chrono::system_clock::now();

    std::atomic<int> pop_count (NUM_PRODUCERS * NUM_ITERATIONS);
    std::atomic<int> push_count (0);

    for( int thread_id = 0; thread_id < NUM_PRODUCERS; ++thread_id )
    {
        threads[thread_id] = std::thread([&queue,thread_id,&push_count]()
        {
            int base = thread_id * NUM_ITERATIONS;
            for( int i = 0; i < NUM_ITERATIONS; ++i )
            {
                while( !queue.try_push( base + i ) ){};
                push_count.fetch_add(1);
            }
        });
    }

    for( int thread_id = 0; thread_id < ( NUM_CONSUMERS ); ++thread_id )
    {
        threads[thread_id+NUM_PRODUCERS] = std::thread([&]()
        {
            int v;
            while( pop_count.load() > 0 )
            {
                if( queue.try_pop( v ) )
                {
                    if( table[v] )
                    {
                        std::cout << v << " already set" << std::endl;
                    }
                    table[v] = true;
                    pop_count.fetch_sub(1);
                }
            }
        });
    }

    for( int i = 0; i < ( NUM_PRODUCERS + NUM_CONSUMERS ); ++i )
    {
        threads[i].join();
    }

    end = std::chrono::system_clock::now();
    std::chrono::duration<double> duration = end - start;
    std::cout << name << " " << duration.count() << std::endl;

    std::atomic_thread_fence( std::memory_order_acq_rel );

    bool result = true;
    for( int i = 0; i < NUM_PRODUCERS * NUM_ITERATIONS; ++i )
    {
        if( !table[i] )
        {
            std::cout << "failed at " << i << std::endl;
            result = false;
        }
    }
    std::cout << name << " " << ( result? "success" : "fail" ) << std::endl;
}
Any nudging in the right direction would be greatly appreciated. I'm pretty new to memory fences rather than just using a mutex for everything, so I'm probably just fundamentally misunderstanding something.
Cheers
J

I'd take a look at Moody Camel's implementation.
It's a fast, general-purpose lock-free queue for C++, written entirely in C++11. The documentation seems to be rather good, along with a few performance tests.
Among all the other interesting things (and they're worth a read anyway), it's all contained in a single header and available under the simplified BSD license. Just drop it into your project and enjoy!
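For a sense of the API, usage is roughly like this (a sketch based on the project's README; check the actual header for details):

#include "concurrentqueue.h"   // the single header from the moodycamel repo

moodycamel::ConcurrentQueue<int> q;   // MPMC, lock-free

// any number of producer threads:
q.enqueue(42);

// any number of consumer threads:
int item;
if (q.try_dequeue(item)) {
    // got one; try_dequeue returns false when the queue appears empty
}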

The simplest approach uses a circular buffer. That is, it's like an array of 256 elements where you use a uint8_t as the index, so it wraps around and starts at the beginning when it overflows.
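For instance, the wraparound comes for free from unsigned arithmetic:

uint8_t idx = 255;   // last slot of a 256-element buffer
++idx;               // wraps to 0: uint8_t arithmetic is modulo 256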
The simplest primitive you can build on is the single-producer, single-consumer case.
The buffer has two heads:
Write head: it points to the element which will be written next.
Read head: it points to the element which will be read next.
Operation of the producer:
If write head + 1 == read head, the buffer is full; return a buffer-full error.
Write the content to the element.
Insert a memory barrier to sync the CPU cores.
Move the write head forward.
In the buffer-full case there is still one free slot, but we reserve it to distinguish full from empty (so a 256-slot buffer holds at most 255 elements).
Operation of the consumer:
If read head == write head, the buffer is empty; return a buffer-empty error.
Read the content of the element.
Insert a memory barrier to sync the CPU cores.
Move the read head forward.
The producer owns the write head and the consumer owns the read head, so there is no concurrency on those. Also, each head is updated only after its operation completes: this ensures the producer leaves behind only fully written elements, and the consumer leaves behind only fully consumed, empty cells.
Create two of these pipes, one in each direction, whenever you fork off a new thread, and you have bidirectional communication with your threads.
Given that we are talking about lock-freedom, it also means none of the threads are ever blocked; when there is nothing to do, the threads spin empty, so you may want to detect this and add some sleep when it happens.
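A minimal C++11 sketch of this scheme (my illustration, not tested production code): the two heads are uint8_t indices wrapped in std::atomic, and the release/acquire pair plays the role of the memory-barrier steps above.

#include <atomic>
#include <cstdint>

template <typename T>
class SpscRing {
public:
    bool try_push(const T& v) {
        uint8_t w = write_.load(std::memory_order_relaxed);
        if (uint8_t(w + 1) == read_.load(std::memory_order_acquire))
            return false;                                    // full (one slot kept in reserve)
        buf_[w] = v;                                         // write the element first...
        write_.store(w + 1, std::memory_order_release);      // ...then publish it
        return true;
    }
    bool try_pop(T& out) {
        uint8_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire))
            return false;                                    // empty
        out = buf_[r];                                       // read the element first...
        read_.store(r + 1, std::memory_order_release);       // ...then release the slot
        return true;
    }
private:
    T buf_[256];
    std::atomic<uint8_t> write_{0};  // owned by the producer
    std::atomic<uint8_t> read_{0};   // owned by the consumer
};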

How about this lock-free queue?
It is a memory-ordering-based lock-free queue, but you need to pre-set the number of concurrent threads when initializing the queue. For example:
int *int_data;
int i = 0;
int max_concurrent_thread = 16;
lfqueue_t my_queue;

lfqueue_init(&my_queue, max_concurrent_thread);

/** Wrap this scope in other threads **/
int_data = (int*) malloc(sizeof(int));
assert(int_data != NULL);
*int_data = i++;
/* Enqueue */
while (lfqueue_enq(&my_queue, int_data) == -1) {
    printf("ENQ Full ?\n");
}

/** Wrap this scope in other threads **/
/* Dequeue */
while ((int_data = (int*) lfqueue_deq(&my_queue)) == NULL) {
    printf("DEQ EMPTY ..\n");
}
// printf("%d\n", *int_data);
free(int_data);
/** End **/

lfqueue_destroy(&my_queue);
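To make the "wrap this scope in other threads" comments concrete, here is a sketch of the enqueue side using C++11 threads (my illustration; four producer threads, each enqueueing one heap-allocated int):

#include <thread>
#include <vector>

std::vector<std::thread> producers;
for (int t = 0; t < 4; ++t) {
    producers.emplace_back([&my_queue, t] {
        int* int_data = (int*) malloc(sizeof(int));
        *int_data = t;
        while (lfqueue_enq(&my_queue, int_data) == -1) {
            // queue full: spin until a consumer frees a slot
        }
    });
}
for (auto& p : producers) p.join();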

On another similar question, I presented a solution to this problem. I believe it is the smallest found so far.
I will not repeat the same answer here, but the repository has a fully functional C++ implementation of the lock-free queue you desire.
EDIT: Thanks to a code review from @PeterCordes, I've found a bug in the solution when using 64-bit templates, but now it's working perfectly.
This is the output I receive when running the tests:
Creating 4 producers & 4 consumers
to flow 10.000.000 items through the queue.
Produced: 10.743.668.245.000.000
Consumed: 5.554.289.678.184.004
Produced: 10.743.668.245.000.000
Consumed: 15.217.833.969.059.643
Produced: 10.743.668.245.000.000
Consumed: 7.380.542.769.600.801
Produced: 10.743.668.245.000.000
Consumed: 14.822.006.563.155.552
Checksum: 0 (it must be zero)


Fast processing of std::bitset<65536> in parallel

There was once a question that I wrote a huge answer to, but it was deleted and its author refused to undelete it.
So I'm posting a short summary of that question here, and then immediately answering it myself, just to share my results.
The question was: given a std::bitset<65536> that is processed (by some formula) bit-by-bit inside an inner loop, how can we speed this computation up?
The outer loop just calls the inner loop many times (let's say 50,000 times), and the outer loop can't be processed in parallel, because each iteration depends on the results of the previous one.
Example code of this process:
std::bitset<65536> bits{};
uint64_t hash = 0;
for (size_t i = 0; i < 50000; ++i) {
    // Process Bits
    for (size_t j = 0; j < bits.size(); ++j)
        bits[j] = ModifyBit(i, j, hash, bits[j]);
    hash = Hash(bits, hash);
}
The code above is just one sample way of processing; it is not the real case. The real case is that we process a std::bitset<65536> many times, in such a way that all bits can be processed independently.
The question is how we can process the bits in parallel, as fast as possible, inside the inner loop.
One important note: the formula that modifies the bits is generic, meaning that we don't know it in advance and can't turn it into SIMD instructions.
But what we do know is that all bits can be processed independently, and that we need to parallelize this processing. Also, we can't parallelize the outer loop, as each of its iterations depends on the results of the previous one.
Another note: std::bitset<65536> is quite small, just 1K of 64-bit words. So directly using a pool of std::thread or std::async threads will not work, as each thread's work would be just around 50-200 nanoseconds, a very tiny time in which to start and stop threads and send work to them. Even a std::mutex takes 75 nanoseconds on my Windows machine (although 20 nanoseconds on Linux), so using std::mutex is also a big overhead.
One may assume that the ModifyBit() function above takes around the same time for each bit; otherwise there is no way to schedule a balanced parallelization of the loop, other than slicing it into very many tiny tasks and hoping that longer tasks will be balanced out by several shorter ones.
I implemented quite a large and complex solution for your task, but one which works very fast. On my 4-core (8 hardware threads) laptop I get a 6x multi-core speedup compared to the single-threaded version (your version of the code).
The main idea of the solution below is to implement a very fast multi-core thread pool for running arbitrary tasks with small overhead. My implementation can handle up to 1-10 million tasks per second (depending on CPU speed and core count).
The regular way of asynchronously starting multiple tasks is through std::async, or just by creating a std::thread. Both of these are considerably slower than my own implementation: they can't give a throughput of 5 million tasks per second like my implementation does, and your code needs millions of tasks per second to run for good speed. That's why I implemented everything from scratch.
With a fast thread pool implemented, we can now slice your 64K bitset into smaller sub-sets and process these sub-sets in parallel. I sliced the 64K bitset into 16 equal parts (see BitSize / 16 in the code); you can set some other number of parts equal to a power of two, but not too many, otherwise the thread pool overhead will be too large. Usually it is good to slice into a number of parts equal to twice the number of hardware threads (or four times the number of cores).
I implemented several classes in the C++ code. The AtomicMutex class uses std::atomic_flag to implement a very fast replacement for a mutex, based on a spin-locking approach. This AtomicMutex is used to protect the queue of tasks submitted for running on the thread pool.
The RingBuffer class is based on std::vector and implements a simple and fast queue for storing any objects. It is implemented using two pointers (head and tail) pointing into the vector. When a new element is added to the queue, the tail pointer advances to the right; if this pointer reaches the end of the vector it wraps around to position 0. Similarly, when an element is taken out of the queue, the head pointer also advances to the right, with wrap-around. The RingBuffer is used to store thread pool tasks.
The Queue class is a wrapper around RingBuffer, but with AtomicMutex protection. This spin-lock mutex is used to protect simultaneous adding/taking of elements to/from the queue by multiple worker threads.
Pool implements the multi-core pool of tasks itself. It creates as many worker threads as there are CPU hardware threads (double the number of cores), minus one. Each worker thread just polls new tasks from the queue and executes them immediately, while the main thread adds new tasks to the queue. Pool also has a Wait() capability to wait until all current tasks are finished; this waiting is used as a barrier to wait until the whole 64K bitset is processed (all sub-parts are processed). Pool accepts any lambdas (function closures) to run. You can see that the 64K bitset, sliced into smaller parts, is processed by calling pool.Emplace(lambda), and later pool.Wait() is used to wait until all sub-parts are finished. Exceptions from pool workers are collected and reported to the user if there is any error. When doing Wait(), the pool also runs tasks inside the main thread, so as not to waste one core just waiting for tasks to finish.
Timings reported to the console are done with std::chrono.
It's possible to run both versions - single-threaded (your original version) and multi-threaded, using all cores. Switching between single/multi is done by passing the MultiThreaded = true template parameter to the function ProcessBitset().
#include <cstdint>
#include <atomic>
#include <vector>
#include <array>
#include <queue>
#include <functional>
#include <thread>
#include <future>
#include <exception>
#include <optional>
#include <memory>
#include <iostream>
#include <iomanip>
#include <bitset>
#include <string>
#include <chrono>
#include <algorithm>
#include <any>
#include <type_traits>

class AtomicMutex {
    class LockerC;
public:
    void lock() {
        while (f_.test_and_set(std::memory_order_acquire))
            //f_.wait(true, std::memory_order_acquire)
            ;
    }
    void unlock() {
        f_.clear(std::memory_order_release);
        //f_.notify_all();
    }
    LockerC Locker() { return LockerC(*this); }
private:
    class LockerC {
    public:
        LockerC() = delete;
        LockerC(AtomicMutex & mux) : pmux_(&mux) { mux.lock(); }
        LockerC(LockerC const & other) = delete;
        LockerC(LockerC && other) : pmux_(other.pmux_) { other.pmux_ = nullptr; }
        ~LockerC() { if (pmux_) pmux_->unlock(); }
        LockerC & operator = (LockerC const & other) = delete;
        LockerC & operator = (LockerC && other) = delete;
    private:
        AtomicMutex * pmux_ = nullptr;
    };
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
};

template <typename T>
class RingBuffer {
public:
    RingBuffer() : buf_(1 << 8), last_(buf_.size() - 1) {}
    T & front() { return buf_[first_]; }
    T const & front() const { return buf_[first_]; }
    T & back() { return buf_[last_]; }
    T const & back() const { return buf_[last_]; }
    size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }
    template <typename ... Args>
    void emplace(Args && ... args) {
        while (size_ >= buf_.size()) {
            std::rotate(&buf_[0], &buf_[first_], &buf_[buf_.size()]);
            first_ = 0;
            last_ = buf_.size() - 1;
            buf_.resize(buf_.size() * 2);
        }
        ++size_;
        ++last_;
        if (last_ >= buf_.size())
            last_ = 0;
        buf_[last_] = T(std::forward<Args>(args)...);
    }
    void pop() {
        if (size_ == 0)
            return;
        --size_;
        ++first_;
        if (first_ >= buf_.size())
            first_ = 0;
    }
private:
    std::vector<T> buf_;
    size_t first_ = 0, last_ = 0, size_ = 0;
};

template <typename T>
class Queue {
public:
    size_t Size() const { return q_.size(); }
    bool Empty() const { return q_.size() == 0; }
    template <typename ... Args>
    void Emplace(Args && ... args) {
        auto lock = m_.Locker();
        q_.emplace(std::forward<Args>(args)...);
    }
    T Pop(std::function<void()> const & on_empty = []{},
          std::function<void()> const & on_full = []{}) {
        while (true) {
            if (q_.empty()) {
                on_empty();
                continue;
            }
            auto lock = m_.Locker();
            if (q_.empty()) {
                on_empty();
                continue;
            }
            on_full();
            T val = std::move(q_.front());
            q_.pop();
            return std::move(val);
        }
    }
    std::optional<T> TryPop() {
        auto lock = m_.Locker();
        if (q_.empty())
            return std::nullopt;
        T val = std::move(q_.front());
        q_.pop();
        return std::move(val);
    }
private:
    AtomicMutex m_;
    RingBuffer<T> q_;
};

class RunInDestr {
public:
    RunInDestr(std::function<void()> const & f) : f_(f) {}
    ~RunInDestr() { f_(); }
private:
    std::function<void()> const & f_;
};

class Pool {
public:
    struct FinishExc {};
    struct Worker {
        std::unique_ptr<std::atomic<bool>> pdone = std::make_unique<std::atomic<bool>>(true);
        std::unique_ptr<std::exception_ptr> pexc = std::make_unique<std::exception_ptr>();
        std::unique_ptr<std::thread> thr;
    };
    Pool(size_t nthreads = size_t(-1)) {
        if (nthreads == size_t(-1))
            nthreads = std::thread::hardware_concurrency() - 1;
        std::cout << "Pool has " << nthreads << " worker threads." << std::endl;
        for (size_t i = 0; i < nthreads; ++i) {
            workers_.emplace_back(Worker{});
            workers_.back().thr = std::make_unique<std::thread>(
                [&, pdone = workers_.back().pdone.get(), pexc = workers_.back().pexc.get()]{
                    try {
                        std::function<void()> f_done = [pdone]{
                            pdone->store(true, std::memory_order_relaxed);
                        }, f_empty = [this]{
                            CheckFinish();
                        }, f_full = [pdone]{
                            pdone->store(false, std::memory_order_relaxed);
                        };
                        while (true) {
                            RunInDestr set_done(f_done);
                            tasks_.Pop(f_empty, f_full)();
                        }
                    } catch (...) {
                        exc_.store(true, std::memory_order_relaxed);
                        *pexc = std::current_exception();
                    }
                });
        }
    }
    ~Pool() {
        Wait();
        Finish();
    }
    void CheckExc() {
        if (!exc_.load(std::memory_order_relaxed))
            return;
        Finish();
        throw std::runtime_error("Pool: Exception occurred!");
    }
    void Finish() {
        finish_ = true;
        for (auto & w: workers_)
            try {
                w.thr->join();
                if (*w.pexc)
                    std::rethrow_exception(*w.pexc);
            } catch (FinishExc const &) {}
        workers_.clear();
    }
    template <typename ... Args>
    void Emplace(Args && ... args) {
        CheckExc();
        tasks_.Emplace(std::forward<Args>(args)...);
    }
    void Wait() {
        while (true) {
            auto task = tasks_.TryPop();
            if (!task)
                break;
            (*task)();
        }
        while (true) {
            bool done = true;
            for (auto & w: workers_)
                if (!w.pdone->load(std::memory_order_relaxed)) {
                    done = false;
                    break;
                }
            if (done)
                break;
        }
        CheckExc();
    }
private:
    void CheckFinish() {
        if (finish_)
            throw FinishExc{};
    }
    Queue<std::function<void()>> tasks_;
    std::vector<Worker> workers_;
    bool finish_ = false;
    std::atomic<bool> exc_ = false;
};

template <bool MultiThreaded = true, size_t BitSize>
void ProcessBitset(Pool & pool, std::bitset<BitSize> & bset,
        std::string const & businessLogicCriteria) {
    static size_t constexpr block = BitSize / 16;
    for (int j = 0; j < BitSize; j += block) {
        auto task = [&bset, j]{
            int const hi = std::min(j + block, BitSize);
            for (int i = j; i < hi; ++i) {
                if (i % 2 == 0)
                    bset[i] = 0;
                else
                    bset[i] = 1;
            }
        };
        if constexpr(MultiThreaded)
            pool.Emplace(std::move(task));
        else
            task();
    }
    if constexpr(MultiThreaded)
        pool.Wait();
}

static auto const gtb = std::chrono::high_resolution_clock::now();
double Time() {
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

void Compute() {
    Pool pool;
    std::bitset<65536> bset;
    std::string businessLogicCriteria;
    int const hi = 50000;
    for (int j = 0; j < hi; ++j) {
        if ((j & 0x1FFF) == 0 || j + 1 >= hi)
            std::cout << j / 1000 << "K (" << std::fixed << std::setprecision(3)
                << Time() << " sec), " << std::flush;
        ProcessBitset(pool, bset, businessLogicCriteria);
        businessLogicCriteria = "...";
    }
}

void TimeMeasure() {
    size_t constexpr A = 1 << 16, B = 1 << 5;
    {
        Pool pool;
        auto const tb = Time();
        int64_t volatile x = 0;
        for (size_t i = 0; i < A; ++i) {
            for (size_t j = 0; j < B; ++j)
                pool.Emplace([&]{ x = x + 1; });
            pool.Wait();
        }
        std::cout << "AtomicPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
            << " sec, speed " << A * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
            << 1'000'000 / (A * B / (Time() - tb)) << " sec/M-task, no-collisions "
            << std::setprecision(7) << double(x) / (A * B) << std::endl;
    }
    {
        auto const tb = Time();
        //size_t const nthr = std::thread::hardware_concurrency();
        size_t constexpr C = A / 8;
        std::vector<std::future<void>> asyncs;
        int64_t volatile x = 0;
        for (size_t i = 0; i < C; ++i) {
            asyncs.clear();
            for (size_t j = 0; j < B; ++j)
                asyncs.emplace_back(std::async(std::launch::async, [&]{ x = x + 1; }));
            asyncs.clear();
        }
        std::cout << "AsyncPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
            << " sec, speed " << C * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
            << 1'000'000 / (C * B / (Time() - tb)) << " sec/M-task, no-collisions "
            << std::setprecision(7) << double(x) / (C * B) << std::endl;
    }
}

int main() {
    try {
        TimeMeasure();
        Compute();
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    } catch (...) {
        std::cout << "Unknown Exception!" << std::endl;
        return -1;
    }
}
Output for 4 cores (8 hardware threads):
Pool has 7 worker threads.
AtomicPool time 0.903 sec, speed 2321.831 empty K-tasks/sec, 0.431 sec/M-task, no-collisions 0.9999967
AsyncPool time 0.982 sec, speed 266.789 empty K-tasks/sec, 3.750 sec/M-task, no-collisions 0.9999123
Pool has 7 worker threads.
0K (0.074 sec), 8K (0.670 sec), 16K (1.257 sec), 24K (1.852 sec), 32K (2.435 sec), 40K (2.984 sec), 49K (3.650 sec), 49K (3.711 sec),
For comparison, below are the single-threaded version's timings, which are 6x slower:
0K (0.125 sec), 8K (3.786 sec), 16K (7.754 sec), 24K (11.202 sec), 32K (14.662 sec), 40K (18.056 sec), 49K (21.470 sec), 49K (21.841 sec),
You have this inner loop you want to parallelize:
for (size_t j = 0; j < bits.size(); ++j)
    bits[j] = ModifyBit(i, j, hash, bits[j]);
So a good idea is to split it into chunks and have multiple threads each process a chunk in parallel. You can hand chunks to workers easily with a std::atomic<int> counter that increments to identify which chunk to work on. You can also make sure the threads all stop working after one loop, before starting the next, with a std::barrier:
std::bitset<65536> bits{};
std::thread pool[8];  // Change size accordingly
std::atomic<int> task_number{0};
constexpr std::size_t tasks_per_loop = 32;  // Arbitrarily chosen
constexpr std::size_t block_size = (bits.size()+tasks_per_loop-1) / tasks_per_loop;
// (only written to by one thread by the barrier, so not atomic)
uint64_t hash = 0;
int i = 0;

std::barrier barrier(std::size(pool), [&]() {
    task_number = 0;
    ++i;
    hash = Hash(bits, hash);
});

for (std::thread& t : pool) {
    t = std::thread([&]{
        while (i < 50000) {
            for (int t; (t = task_number++) < tasks_per_loop;) {
                int block_start = t * block_size;
                int block_end = std::min(block_start + block_size, bits.size());
                for (int j = block_start; j < block_end; ++j) {
                    bits[j] = ModifyBit(i, j, hash, bits[j]);
                }
            }
            // Wait for other threads to finish and hash
            // to be calculated before starting next loop
            barrier.arrive_and_wait();
        }
    });
}
for (std::thread& t : pool) t.join();
(The seemingly easy way of parallelizing the for loop with OpenMP's #pragma omp parallel for seemed slower in some testing, perhaps because the tasks were so small.)
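For reference, that OpenMP variant would look roughly like this (assuming the same loop variables as above and compilation with -fopenmp):

// the compiler splits the iterations across a team of threads
#pragma omp parallel for
for (int j = 0; j < (int)bits.size(); ++j)
    bits[j] = ModifyBit(i, j, hash, bits[j]);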
Here it is against your implementation running similar code: https://godbolt.org/z/en76Kv4nn
And on my machine, running this a few times with 1 million iterations took 28 to 32 seconds with my approach and 44 to 50 seconds with your general thread pool approach (granted, this is much less general, because it can't execute arbitrary std::function<void()> tasks).

Troubles with simple Lock-Free MPSC Ring Buffer

I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer. The main idea is to have atomic head and tail indices. When pushing an element to the queue, the head is incremented atomically to reserve a slot in the buffer:
#include <atomic>
#include <chrono>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <thread>
#include <vector>

template <class T> class MPSC {
private:
    int MAX_SIZE;
    std::atomic<int> head{0};  ///< index of first free slot
    std::atomic<int> tail{0};  ///< index of first occupied slot
    std::unique_ptr<T[]> data;
    std::unique_ptr<std::atomic<bool>[]> valid;  ///< indicates whether data at an
                                                 ///< index has been fully written
    /// Compute next index modulo size.
    inline int advance(int x) { return (x + 1) % MAX_SIZE; }

public:
    explicit MPSC(int size) {
        if (size <= 0)
            throw std::invalid_argument("size must be greater than 0");
        MAX_SIZE = size + 1;
        data = std::make_unique<T[]>(MAX_SIZE);
        valid = std::make_unique<std::atomic<bool>[]>(MAX_SIZE);
    }

    /// Add an element to the queue.
    ///
    /// If the queue is full, this method blocks until a slot is available for
    /// writing. This method is not starvation-free, i.e. it is possible that one
    /// thread always fills up the queue and prevents others from pushing.
    void push(const T &msg) {
        int idx;
        int next_idx;
        int k = 100;
        do {
            idx = head;
            next_idx = advance(idx);
            while (next_idx == tail) {        // queue is full
                k = k >= 100000 ? k : k * 2;  // exponential backoff
                std::this_thread::sleep_for(std::chrono::nanoseconds(k));
            }  // spin
        } while (!head.compare_exchange_weak(idx, next_idx));
        if (valid[idx])
            // this throws, suggesting that two threads are writing to the same
            // index. I have no idea how this is possible.
            throw std::runtime_error("message slot already written");
        data[idx] = msg;
        valid[idx] = true;  // this was set to false by the reader,
                            // set it to true to indicate completed data write
    }

    /// Read an element from the queue.
    ///
    /// If the queue is empty, this method blocks until a message is available.
    /// This method is only safe to be called from one single reader thread.
    T pop() {
        int k = 100;
        while (is_empty() || !valid[tail]) {
            k = k >= 100000 ? k : k * 2;
            std::this_thread::sleep_for(std::chrono::nanoseconds(k));
        }  // spin
        T res = data[tail];
        valid[tail] = false;
        tail = advance(tail);
        return res;
    }

    bool is_full() { return (head + 1) % MAX_SIZE == tail; }
    bool is_empty() { return head == tail; }
};
When there is a lot of contention, some messages get overwritten by other threads. Hence there must be something fundamentally wrong with what I'm doing here.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
Even if a producer were to pause just before writing its data, the tail could not increase past this thread's idx, and hence no other thread should be able to overtake and claim that same idx.
EDIT
At the risk of posting too much code, here is a simple program that reproduces the problem. It sends some incrementing numbers from many threads and checks whether all numbers are received by the consumer:
#include "mpsc.hpp"  // or whatever; the above queue
#include <thread>
#include <iostream>

int main() {
    static constexpr int N_THREADS = 10;  ///< number of threads
    static constexpr int N_MSG = 1E+5;    ///< number of messages per thread
    struct msg {
        int t_id;
        int i;
    };
    MPSC<msg> q(N_THREADS / 2);
    std::thread threads[N_THREADS];

    // consumer
    threads[0] = std::thread([&q] {
        int expected[N_THREADS] {};
        for (int i = 0; i < N_MSG * (N_THREADS - 1); ++i) {
            msg m = q.pop();
            std::cout << "Got message from T-" << m.t_id << ": " << m.i << std::endl;
            if (expected[m.t_id] != m.i) {
                std::cout << "T-" << m.t_id << " unexpected msg " << m.i
                          << "; expected " << expected[m.t_id] << std::endl;
                return;  // stop the consumer on the first mismatch
            }
            expected[m.t_id] = m.i + 1;
        }
    });

    // producers
    for (int id = 1; id < N_THREADS; ++id) {
        threads[id] = std::thread([id, &q] {
            for (int i = 0; i < N_MSG; ++i) {
                q.push(msg{id, i});
            }
        });
    }

    for (auto &t : threads)
        t.join();
}
I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer.
I assume you are doing this as a learning exercise. Implementing a lock-free queue yourself is most probably the wrong thing to do if you want to solve a real problem.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
The combination of that producer spinlock with the outer CAS loop does not work in the intended way:
do {
    idx = head;
    next_idx = advance(idx);
    while (next_idx == tail) {        // queue is full
        k = k >= 100000 ? k : k * 2;  // exponential backoff
        std::this_thread::sleep_for(std::chrono::nanoseconds(k));
    }  // spin
    //
    // ...
    //
    // All other threads (producers and consumers) can progress.
    //
    // ...
    //
} while (!head.compare_exchange_weak(idx, next_idx));
The queue may be full at the moment the CAS happens, because those two checks are performed independently. In addition, the CAS may succeed even though the state has changed completely in the meantime, because other threads may have advanced head so that it once again exactly matches idx (an ABA problem).
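One way to shrink (though not fully close) that window is to re-check fullness against a freshly reloaded head on every CAS retry. This is only a sketch of the idea, not a complete fix: the ABA wraparound case remains, which is why production bounded MPMC queues typically use per-slot sequence counters instead of a valid flag.

void push(const T &msg) {
    int idx = head.load();
    int next_idx;
    do {
        next_idx = advance(idx);
        while (next_idx == tail.load()) {  // full: back off, then re-read
            std::this_thread::yield();
            idx = head.load();
            next_idx = advance(idx);
        }
        // On failure, compare_exchange_weak reloads idx, so the fullness
        // check above is re-evaluated against the current head each retry.
    } while (!head.compare_exchange_weak(idx, next_idx));
    // ... write data[idx] and set valid[idx] as before ...
}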

Dividing work between a fixed number of threads with pthread

I have n jobs, with no shared resources between them, and m threads. How can I efficiently divide the jobs among the threads in such a way that there is no idle thread until everything is processed?
This is a prototype of my program:
class Job {
    // constructor and other stuff
    // ...
public:
    void doWork();
};

struct JobParams {
    int threadId;
    Job job;
};

void* doWorksOnThread(void* job) {
    JobParams* j = (JobParams*)job;  // cast argument
    cout << "Thread #" << j->threadId << " started" << endl;
    j->job.doWork();
    return (void*)0;
}
Then in my main file I have something like:
int main() {
    vector<Job> jobs;  // let's say it has 17 jobs
    int numThreads = 4;
    pthread_t* threads = new pthread_t[numThreads];
    JobParams* jps = new JobParams[jobs.size()];
    for (int i = 0; i < jobs.size(); i++) {
        jps[i].job = jobs[i];
    }
    for (int i = 0; i < numThreads; i++) {
        pthread_create(&threads[i], NULL, doWorksOnThread, &jps[i]);
    }
    // another for loop and call join on 4 threads...
    return 0;
}
How can I efficiently make sure that there is no idle thread until all jobs are completed?
You'll need to add a loop that identifies which threads have completed and then starts new ones, making sure you always have up to 4 threads running.
Here is a very basic way to do that. Using a sleep, as proposed, could be a good start and will do the job (even if it adds an extra delay before you figure out that the last thread has completed). Ideally, you would use a condition variable, notified by each thread when its job is done, to wake up the main loop (the sleep instruction would then be replaced by a wait on the condition).
struct JobParams {
    int threadId;
    Job job;
    std::atomic<bool> done;  // flag to know when job is done, could also be an attribute of Job class!
};

void* doWorksOnThread(void* job) {
    JobParams* j = (JobParams*)job;  // cast argument
    cout << "Thread #" << j->threadId << " started" << endl;
    j->job.doWork();
    j->done = true;  // signal job completed
    return (void*)0;
}

int main() {
    ....
    std::map<JobParams*, pthread_t> runningThreads;  // to keep track of running jobs
    for (int i = 0; i < jobs.size(); i++) {
        jps[i].job = jobs[i];
        jps[i].done = false;  // mark as not done yet
    }
    while (true)
    {
        vector<JobParams*> todo;
        for (int i = 0; i < jobs.size(); i++)
        {
            if (!jps[i].done)
            {
                if (runningThreads.find(&jps[i]) == runningThreads.end())
                    todo.push_back(&jps[i]);  // job not started yet, mark as to be done
                // else, a thread is already processing the job and did not complete it yet
            }
            else
            {
                if (runningThreads.find(&jps[i]) != runningThreads.end())
                {
                    // thread just completed the job!
                    // let's join to wait for the thread to end cleanly
                    // I'm not familiar with pthread, hope this is correct
                    void* res;
                    pthread_join(runningThreads[&jps[i]], &res);
                    runningThreads.erase(&jps[i]);  // not running anymore
                }
                // else, job was already done and thread joined in a previous iteration
            }
        }
        if (todo.empty() && runningThreads.empty())
            break;  // all jobs done
        // some jobs remain undone
        if (runningThreads.size() < numThreads && !todo.empty())
        {
            // some new threads shall be started...
            int newThreadsToBeCreatedCount = numThreads - runningThreads.size();
            // make sure you don't end up with too many threads running
            if (todo.size() > newThreadsToBeCreatedCount)
                todo.resize(newThreadsToBeCreatedCount);
            for (auto jobParam : todo)
            {
                // operator[] default-constructs the pthread_t slot for this job
                pthread_create(&runningThreads[jobParam], NULL, doWorksOnThread, jobParam);
            }
        }
        // else: you already have 4 running jobs
        // sanity check that everything went as expected:
        assert(runningThreads.size() <= numThreads);
        usleep(100 * 1000);  // give a chance for some jobs to complete (100 ms)
        // adjust sleep duration if necessary
    }
}
Note: I'm not very familiar with pthread. Hope the syntax is correct.
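As an aside, a simpler scheme that also keeps every thread busy (my sketch, untested): start exactly m threads once and let each one claim the next unprocessed job through a shared atomic counter, which avoids the polling loop and re-spawning of threads entirely.

#include <atomic>
#include <pthread.h>
#include <vector>

struct WorkSet {
    std::vector<Job>* jobs;      // the 17 jobs from the question
    std::atomic<int> next{0};    // index of the next unclaimed job
};

void* worker(void* arg) {
    WorkSet* ws = static_cast<WorkSet*>(arg);
    int i;
    // fetch_add hands every index to exactly one thread;
    // a thread only goes idle when no jobs are left
    while ((i = ws->next.fetch_add(1)) < (int)ws->jobs->size())
        (*ws->jobs)[i].doWork();
    return nullptr;
}

// usage: pthread_create numThreads threads running worker(&ws), then join them all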

One producer, two consumers acting on one 'queue' produced by producer

Preface: I'm new to multithreaded programming and a little rusty with C++. My requirements are to use one mutex and two conditions, mNotEmpty and mEmpty. I must also create and populate the vectors in the way mentioned below.
I have one producer thread creating a vector of random numbers of size n*2, and two consumers inserting those values into two separate vectors, each of size n.
I am doing the following in the producer:
Lock the mutex: pthread_mutex_lock(&mMutex1)
Wait for consumer to say vector is empty: pthread_cond_wait(&mEmpty,&mMutex1)
Push back a value into the vector
Signal the consumer that the vector isn't empty anymore: pthread_cond_signal(&mNotEmpty)
Unlock the mutex: pthread_mutex_unlock(&mMutex1)
Return to step 1
In the consumer:
Lock the mutex: pthread_mutex_lock(&mMutex1)
Check to see if the vector is empty, and if so signal the producer: pthread_cond_signal(&mEmpty)
Else insert value into one of two new vectors (depending on which thread) and remove from original vector
Unlock the mutex: pthread_mutex_unlock(&mMutex1)
Return to step 1
What's wrong with my process? I keep getting segmentation faults or infinite loops.
Edit: Here's the code:
void Producer()
{
    srand(time(NULL));
    for(unsigned int i = 0; i < mTotalNumberOfValues; i++){
        pthread_mutex_lock(&mMutex1);
        pthread_cond_wait(&mEmpty, &mMutex1);
        mGeneratedNumber.push_back((rand() % 100) + 1);
        pthread_cond_signal(&mNotEmpty);
        pthread_mutex_unlock(&mMutex1);
    }
}

void Consumer(const unsigned int index)
{
    for(unsigned int i = 0; i < mNumberOfValuesPerVector; i++){
        pthread_mutex_lock(&mMutex1);
        if(mGeneratedNumber.empty()){
            pthread_cond_signal(&mEmpty);
        }else{
            mThreadVector.at(index).push_back[mGeneratedNumber.at(0)];
            mGeneratedNumber.pop_back();
        }
        pthread_mutex_unlock(&mMutex1);
    }
}
I'm not sure I understand the rationale behind the way you're doing things. In the usual consumer-provider idiom, the provider pushes as many items as possible into the channel, waiting only if there is insufficient space in the channel; it doesn't wait for empty. So the usual idiom would be:
provider (to push one item):
pthread_mutex_lock( &mutex );
while ( ! spaceAvailable() ) {
    pthread_cond_wait( &spaceAvailableCondition, &mutex );
}
pushTheItem();
pthread_cond_signal( &itemAvailableCondition );
pthread_mutex_unlock( &mutex );
and on the consumer side, to get an item:
pthread_mutex_lock( &mutex );
while ( ! itemAvailable() ) {
    pthread_cond_wait( &itemAvailableCondition, &mutex );
}
getTheItem();
pthread_cond_signal( &spaceAvailableCondition );
pthread_mutex_unlock( &mutex );
Note that for each condition, one side signals, and the other waits. (I don't see any wait in your consumer.) And if there is more than one process on either side, I'd recommend using pthread_cond_broadcast rather than pthread_cond_signal.
There are a number of other issues in your code. Some of them look more like typos: you should copy/paste actual code to avoid this. Do you really mean to read and pop mGeneratedValues when you push into mGeneratedNumber, and check whether that is empty? (If you actually do have two different queues, then you're popping from a queue where no one has pushed.) And you don't have any loops waiting for the conditions; you keep iterating through the number of elements you expect (incrementing the counter each time, so you're likely to terminate long before you should). I can't see an infinite loop, but I can readily see an endless wait in pthread_cond_wait in the producer. I don't see a core dump offhand, but consider what happens when one of the processes terminates (probably the consumer, because it never waits for anything): if it ends up destroying the mutex or the condition variables, you could get a core dump when another process attempts to use them.
In the producer, call pthread_cond_wait only when the queue is not empty. Otherwise you get blocked forever due to a race condition.
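In other words, guard the wait with a predicate loop, roughly like this (a sketch reusing the question's members):

pthread_mutex_lock(&mMutex1);
while (!mGeneratedNumber.empty())          // wait only while data is still pending
    pthread_cond_wait(&mEmpty, &mMutex1);  // re-check the predicate on every wakeup
mGeneratedNumber.push_back((rand() % 100) + 1);
pthread_cond_signal(&mNotEmpty);
pthread_mutex_unlock(&mMutex1);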
You might want to consider taking mutex only after condition is fulfilled, e.g.
producer()
{
    while true
    {
        waitForEmpty();
        takeMutex();
        produce();
        releaseMutex();
    }
}

consumer()
{
    while true
    {
        waitForNotEmpty();
        takeMutex();
        consume();
        releaseMutex();
    }
}
Here is a solution to a problem similar to yours. In this program the producer produces a number, writes it to an array (the buffer), appends it to a file, and then updates a status (in a status array) about it. When data appears in the buffer, the consumers start to consume it (read it and write it to their own files) and update the status to record that it has been consumed. When the producer sees that both consumers have consumed the data, it overwrites the data with a new value and carries on. For convenience, the code here is restricted to producing 2000 numbers.
// Producer-consumer //
#include <iostream>
#include <fstream>
#include <pthread.h>

#define MAX 100

using namespace std;

int dataCount = 2000;
int buffer_g[100];
int status_g[100];

void *producerFun(void *);
void *consumerFun1(void *);
void *consumerFun2(void *);

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t dataNotProduced = PTHREAD_COND_INITIALIZER;
pthread_cond_t dataNotConsumed = PTHREAD_COND_INITIALIZER;

int main()
{
    for(int i = 0; i < MAX; i++)
        status_g[i] = 0;

    pthread_t producerThread, consumerThread1, consumerThread2;
    int retProducer = pthread_create(&producerThread, NULL, producerFun, NULL);
    int retConsumer1 = pthread_create(&consumerThread1, NULL, consumerFun1, NULL);
    int retConsumer2 = pthread_create(&consumerThread2, NULL, consumerFun2, NULL);

    pthread_join(producerThread, NULL);
    pthread_join(consumerThread1, NULL);
    pthread_join(consumerThread2, NULL);
    return 0;
}

void *producerFun(void *)
{
    // file to write produced data by producer
    const char *producerFileName = "producer.txt";
    ofstream producerFile(producerFileName);
    int index = 0, producerCount = 0;
    while(1)
    {
        pthread_mutex_lock(&mutex);
        if(index == MAX)
        {
            index = 0;
        }
        if(status_g[index] == 0)
        {
            static int data = 0;
            data++;
            cout << "Produced: " << data << endl;
            buffer_g[index] = data;
            producerFile << data << endl;
            status_g[index] = 5;
            index++;
            producerCount++;
            pthread_cond_broadcast(&dataNotProduced);
        }
        else
        {
            cout << ">> Producer is in wait.." << endl;
            pthread_cond_wait(&dataNotConsumed, &mutex);
        }
        pthread_mutex_unlock(&mutex);
        if(producerCount == dataCount)
        {
            producerFile.close();
            return NULL;
        }
    }
}

void *consumerFun1(void *)
{
    const char *consumerFileName = "consumer1.txt";
    ofstream consumerFile(consumerFileName);
    int index = 0, consumerCount = 0;
    while(1)
    {
        pthread_mutex_lock(&mutex);
        if(index == MAX)
        {
            index = 0;
        }
        if(status_g[index] != 0 && status_g[index] != 2)
        {
            int data = buffer_g[index];
            cout << "Consumer1 consumed: " << data << endl;
            consumerFile << data << endl;
            status_g[index] -= 3;
            index++;
            consumerCount++;
            pthread_cond_signal(&dataNotConsumed);
        }
        else
        {
            cout << "Consumer1 is in wait.." << endl;
            pthread_cond_wait(&dataNotProduced, &mutex);
        }
        pthread_mutex_unlock(&mutex);
        if(consumerCount == dataCount)
        {
            consumerFile.close();
            return NULL;
        }
    }
}

void *consumerFun2(void *)
{
    const char *consumerFileName = "consumer2.txt";
    ofstream consumerFile(consumerFileName);
    int index = 0, consumerCount = 0;
    while(1)
    {
        pthread_mutex_lock(&mutex);
        if(index == MAX)
        {
            index = 0;
        }
        if(status_g[index] != 0 && status_g[index] != 3)
        {
            int data = buffer_g[index];
            cout << "Consumer2 consumed: " << data << endl;
            consumerFile << data << endl;
            status_g[index] -= 2;
            index++;
            consumerCount++;
            pthread_cond_signal(&dataNotConsumed);
        }
        else
        {
            cout << ">> Consumer2 is in wait.." << endl;
            pthread_cond_wait(&dataNotProduced, &mutex);
        }
        pthread_mutex_unlock(&mutex);
        if(consumerCount == dataCount)
        {
            consumerFile.close();
            return NULL;
        }
    }
}
There is just one remaining problem here: the producer is not independent when producing, in that it needs to take a lock on the whole array (buffer) before it produces new data, and if the mutex is locked by a consumer it has to wait, and vice versa. I am still trying to find a way around that.

C++ Lock free producer/consumer queue

I was looking at the sample code for a lock-free queue at:
http://drdobbs.com/high-performance-computing/210604448?pgno=2
(Also reference in many SO questions such as Is there a production ready lock-free queue or hash implementation in C++)
This looks like it should work for a single producer/consumer, although there were a number of typos in the original code. I've updated the code to read as shown below, but it's crashing on me. Does anybody have suggestions why?
In particular, should divider and last be declared as something like:
atomic<Node *> divider, last; // shared
I don't have a compiler supporting C++0x on this machine, so perhaps that's all I need...
// Implementation from http://drdobbs.com/high-performance-computing/210604448
// Note that the code in that article (10/26/11) is broken.
// The attempted fixed version is below.
template <typename T>
class LockFreeQueue {
private:
    struct Node {
        Node( T val ) : value(val), next(0) { }
        T value;
        Node* next;
    };
    Node *first,           // for producer only
         *divider, *last;  // shared
public:
    LockFreeQueue()
    {
        first = divider = last = new Node(T());  // add dummy separator
    }
    ~LockFreeQueue()
    {
        while( first != 0 )  // release the list
        {
            Node* tmp = first;
            first = tmp->next;
            delete tmp;
        }
    }
    void Produce( const T& t )
    {
        last->next = new Node(t);  // add the new item
        last = last->next;         // publish it
        while( first != divider )  // trim unused nodes
        {
            Node* tmp = first;
            first = first->next;
            delete tmp;
        }
    }
    bool Consume( T& result )
    {
        if( divider != last )               // if queue is nonempty
        {
            result = divider->next->value;  // C: copy it back
            divider = divider->next;        // D: publish that we took it
            return true;                    // and report success
        }
        return false;                       // else report empty
    }
};
I wrote the following code to test this. Main (not shown) just calls TestQ().
#include "LockFreeQueue.h"

const int numThreads = 1;
std::vector<LockFreeQueue<int> > q(numThreads);

void *Solver(void *whichID)
{
    int id = (long)whichID;
    printf("Thread %d initialized\n", id);
    int result = 0;
    do {
        if (q[id].Consume(result))
        {
            int y = 0;
            for (int x = 0; x < result; x++)
            { y++; }
            y = 0;
        }
    } while (result != -1);
    return 0;
}

void TestQ()
{
    std::vector<pthread_t> threads;
    for (int x = 0; x < numThreads; x++)
    {
        pthread_t thread;
        pthread_create(&thread, NULL, Solver, (void *)x);
        threads.push_back(thread);
    }
    for (int y = 0; y < 1000000; y++)
    {
        for (unsigned int x = 0; x < threads.size(); x++)
        {
            q[x].Produce(y);
        }
    }
    for (unsigned int x = 0; x < threads.size(); x++)
    {
        q[x].Produce(-1);
    }
    for (unsigned int x = 0; x < threads.size(); x++)
        pthread_join(threads[x], 0);
}
Update: It turns out that the crash is caused by the queue declaration:
std::vector<LockFreeQueue<int> > q(numThreads);
When I change this to a simple array, it runs fine. (I implemented a version with locks and it was crashing too.) I can see that the destructor is being called immediately after the constructor, resulting in doubly-freed memory. But does anyone know WHY the destructor would be called immediately with a std::vector?
You'll need to make several of the pointers std::atomic, as you note, and you'll need to use compare_exchange_weak in a loop to update them atomically. Otherwise, multiple consumers might consume the same node and multiple producers might corrupt the list.
It's critically important that these writes (just one example from your code) occur in order:
last->next = new Node(t);  // add the new item
last = last->next;         // publish it
That's not guaranteed by C++ -- the optimizer can rearrange things however it likes, as long as the current thread always acts as-if the program ran exactly the way you wrote it. And then the CPU cache can come along and reorder things further.
You need memory fences. Making the pointers use the atomic type should have that effect.
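Concretely, making divider and last atomic could look like this sketch (my illustration, still single-producer/single-consumer only, as the design above requires; the release/acquire pairs supply the ordering the answer describes):

std::atomic<Node*> divider, last;  // shared between the two threads

void Produce( const T& t )
{
    Node* l = last.load(std::memory_order_relaxed);  // only the producer writes last
    l->next = new Node(t);                           // build the node fully first...
    last.store(l->next, std::memory_order_release);  // ...then publish it
    while( first != divider.load(std::memory_order_acquire) )  // trim unused nodes
    {
        Node* tmp = first;
        first = first->next;
        delete tmp;
    }
}

bool Consume( T& result )
{
    Node* d = divider.load(std::memory_order_relaxed);  // only the consumer writes divider
    if( d != last.load(std::memory_order_acquire) )     // pairs with the release in Produce
    {
        result = d->next->value;
        divider.store(d->next, std::memory_order_release);  // hand the old node back for trimming
        return true;
    }
    return false;
}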
This could be totally off the mark, but I can't help but wonder whether you're having some sort of static initialization related issue... For laughs, try declaring q as a pointer to a vector of lock-free queues and allocating it on the heap in main().