Troubles with simple Lock-Free MPSC Ring Buffer - C++

I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer. The main idea is to have atomic head and tail indices. When pushing an element to the queue, the head is increased atomically to reserve a slot in the buffer:
#include <atomic>
#include <chrono>
#include <iostream>
#include <memory> // std::unique_ptr / std::make_unique used below
#include <stdexcept>
#include <thread>
#include <vector>
template <class T> class MPSC {
private:
int MAX_SIZE;
std::atomic<int> head{0}; ///< index of first free slot
std::atomic<int> tail{0}; ///< index of first occupied slot
std::unique_ptr<T[]> data;
std::unique_ptr<std::atomic<bool>[]> valid; ///< indicates whether data at an
///< index has been fully written
/// Compute next index modulo size.
inline int advance(int x) { return (x + 1) % MAX_SIZE; }
public:
explicit MPSC(int size) {
if (size <= 0)
throw std::invalid_argument("size must be greater than 0");
MAX_SIZE = size + 1;
data = std::make_unique<T[]>(MAX_SIZE);
valid = std::make_unique<std::atomic<bool>[]>(MAX_SIZE);
}
/// Add an element to the queue.
///
/// If the queue is full, this method blocks until a slot is available for
/// writing. This method is not starvation-free, i.e. it is possible that one
/// thread always fills up the queue and prevents others from pushing.
void push(const T &msg) {
int idx;
int next_idx;
int k = 100;
do {
idx = head;
next_idx = advance(idx);
while (next_idx == tail) { // queue is full
k = k >= 100000 ? k : k * 2; // exponential backoff
std::this_thread::sleep_for(std::chrono::nanoseconds(k));
} // spin
} while (!head.compare_exchange_weak(idx, next_idx));
if (valid[idx])
// this throws, suggesting that two threads are writing to the same index. I have no idea how this is possible.
throw std::runtime_error("message slot already written");
data[idx] = msg;
valid[idx] = true; // this was set to false by the reader,
// set it to true to indicate completed data write
}
/// Read an element from the queue.
///
/// If the queue is empty, this method blocks until a message is available.
/// This method is only safe to be called from one single reader thread.
T pop() {
int k = 100;
while (is_empty() || !valid[tail]) {
k = k >= 100000 ? k : k * 2;
std::this_thread::sleep_for(std::chrono::nanoseconds(k));
} // spin
T res = data[tail];
valid[tail] = false;
tail = advance(tail);
return res;
}
bool is_full() { return (head + 1) % MAX_SIZE == tail; }
bool is_empty() { return head == tail; }
};
When there is a lot of contention, some messages get overwritten by other threads. Hence there must be something fundamentally wrong with what I'm doing here.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
Even if a producer were to pause just before writing its data, the tail could not increase past this thread's idx, and hence no other thread should be able to overtake and claim that same idx.
EDIT
At the risk of posting too much code, here is a simple program that reproduces the problem. It sends some incrementing numbers from many threads and checks whether all numbers are received by the consumer:
#include "mpsc.hpp" // or whatever; the above queue
#include <thread>
#include <iostream>
int main() {
static constexpr int N_THREADS = 10; ///< number of threads
static constexpr int N_MSG = 1E+5; ///< number of messages per thread
struct msg {
int t_id;
int i;
};
MPSC<msg> q(N_THREADS / 2);
std::thread threads[N_THREADS];
// consumer
threads[0] = std::thread([&q] {
int expected[N_THREADS] {};
for (int i = 0; i < N_MSG * (N_THREADS - 1); ++i) {
msg m = q.pop();
std::cout << "Got message from T-" << m.t_id << ": " << m.i << std::endl;
if (expected[m.t_id] != m.i) {
std::cout << "T-" << m.t_id << " unexpected msg " << m.i << "; expected " << expected[m.t_id] << std::endl;
return -1;
}
expected[m.t_id] = m.i + 1;
}
});
// producers
for (int id = 1; id < N_THREADS; ++id) {
threads[id] = std::thread([id, &q] {
for (int i = 0; i < N_MSG; ++i) {
q.push(msg{id, i});
}
});
}
for (auto &t : threads)
t.join();
}

I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer.
I assume you are doing this as a learning exercise. Implementing a lock-free queue yourself is most probably the wrong thing to do if you want to solve a real problem.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
The combination of that producer spinlock with the outer CAS loop does not work in the intended way:
do {
idx = head;
next_idx = advance(idx);
while (next_idx == tail) { // queue is full
k = k >= 100000 ? k : k * 2; // exponential backoff
std::this_thread::sleep_for(std::chrono::nanoseconds(k));
} // spin
//
// ...
//
// All other threads (producers and consumers) can progress.
//
// ...
//
} while (!head.compare_exchange_weak(idx, next_idx));
The queue may already be full again by the time the CAS happens, because the fullness check and the CAS are performed independently. In addition, the CAS may succeed even though idx is stale: while one producer is paused, the other threads may have advanced head all the way around the ring so that it once again exactly matches idx (an ABA problem). At that point two producers have reserved the same slot, which is the double write you are observing.
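One common way to avoid both problems is to let head and tail grow monotonically as 64-bit counters and reduce them modulo the capacity only when indexing the array: a successful CAS then proves that head has not moved at all since the fullness check, and a wrapped-around head can never be mistaken for a fresh snapshot. Below is a minimal sketch of that idea; it is an illustration, not the poster's code, and it keeps the question's blocking-on-full behaviour and the data/valid members (the extra "one free slot" of MAX_SIZE = size + 1 is no longer needed, because full and empty are distinguished by head - tail):

#include <atomic>
#include <cstdint>
#include <memory>
#include <thread>

template <class T> class MPSC64 {            // illustrative sketch, not a drop-in replacement
    std::size_t capacity;
    std::unique_ptr<T[]> data;
    std::unique_ptr<std::atomic<bool>[]> valid;
    std::atomic<std::uint64_t> head{0};      // next slot a producer will reserve
    std::atomic<std::uint64_t> tail{0};      // next slot the single consumer will read
public:
    explicit MPSC64(std::size_t cap)
        : capacity(cap), data(std::make_unique<T[]>(cap)),
          valid(std::make_unique<std::atomic<bool>[]>(cap)) {}

    void push(const T &msg) {
        std::uint64_t idx;
        for (;;) {
            idx = head.load(std::memory_order_relaxed);
            if (idx - tail.load(std::memory_order_acquire) >= capacity) {
                std::this_thread::yield();   // full: back off, then re-read both indices
                continue;
            }
            // If this succeeds, head was still exactly idx, so fewer than `capacity`
            // slots are in flight and slot idx % capacity has already been consumed.
            if (head.compare_exchange_weak(idx, idx + 1, std::memory_order_acq_rel,
                                           std::memory_order_relaxed))
                break;
        }
        data[idx % capacity] = msg;
        valid[idx % capacity].store(true, std::memory_order_release); // publish the write
    }

    T pop() {                                // single consumer only
        const std::uint64_t idx = tail.load(std::memory_order_relaxed);
        while (!valid[idx % capacity].load(std::memory_order_acquire))
            std::this_thread::yield();       // empty, or a producer is still writing
        T res = data[idx % capacity];
        valid[idx % capacity].store(false, std::memory_order_relaxed);
        tail.store(idx + 1, std::memory_order_release); // hands the slot back to producers
        return res;
    }
};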

Related

Producer/consumer of type <int *>, how to avoid segmentation fault?

I found a great producer/consumer double buffer example in this question. However, instead of setting T to "int", I want to use "int *". Unfortunately, it doesn't work, and I keep getting segmentation faults when writing to the buffer.
The code below is what I've tried. Would anyone mind telling me how to solve that problem?
void processing(DoubleBuffer<int *> *const buf, int num_elts)
{
std::thread producer([&]() {
for (int i = 0; i != num_elts; ++i) {
int **item = buf->start_writing();
if (item != nullptr) { // Always true
**item = i;
}
buf->end_writing();
}
});
/*
std::thread consumer([&]() {
int prev = -1;
for (int i = 0; i != 10; ++i) {
int* item = *buf->start_reading();
std::cout << "Consumer: " << item << std::endl;
if (item != nullptr) {
assert(*item > prev);
prev = *item;
std::cout << "prev: " << prev << std::endl;
}
buf->end_reading();
}
});
*/
producer.join();
//consumer.join();
}
int main(void) {
int* buffer_a;
int* buffer_b;
int num_elts = 10;
ProducerConsumerDoubleBuffer<int *> buf;
buf.m_buf[0] = buffer_a + num_elts;
buf.m_buf[1] = buffer_b + num_elts;
processing(&buf, num_elts);
}
I modified the class from the post you linked to work better with pointer types. Playing around with the member variables from the outside seemed wrong, so I added an allocating constructor (and a matching destructor) to avoid the segfaults.
#include <atomic>
#include <cstdint>
#include <thread>
#include <cassert>
#include <iostream>
#include <type_traits> // std::remove_pointer
template <typename T>
class ProducerConsumerDoubleBuffer {
public:
ProducerConsumerDoubleBuffer() : m_state(0) { m_buf[0] = m_buf[1] = nullptr; }
ProducerConsumerDoubleBuffer(const int buf_size) : m_state(0) {
m_buf[0] = new typename std::remove_pointer<T>::type[buf_size];
m_buf[1] = new typename std::remove_pointer<T>::type[buf_size];
}
~ProducerConsumerDoubleBuffer() { delete[] m_buf[0]; delete[] m_buf[1]; } // free what the sizing constructor allocated
// Never returns nullptr
T* start_writing() {
// Increment active users; once we do this, no one
// can swap the active cell on us until we're done
auto state = m_state.fetch_add(0x2, std::memory_order_relaxed);
return &m_buf[state & 1];
}
void end_writing() {
// We want to swap the active cell, but only if we were the last
// ones concurrently accessing the data (otherwise the consumer
// will do it for us when *it's* done accessing the data)
auto state = m_state.load(std::memory_order_relaxed);
std::uint32_t flag = (8 << (state & 1)) ^ (state & (8 << (state & 1)));
state = m_state.fetch_add(flag - 0x2, std::memory_order_release) + flag - 0x2;
if ((state & 0x6) == 0) {
// The consumer wasn't in the middle of a read, we should
// swap (unless the consumer has since started a read or
// already swapped or read a value and is about to swap).
// If we swap, we also want to clear the full flag on what
// will become the active cell, otherwise the consumer could
// eventually read two values out of order (it reads a new
// value, then swaps and reads the old value while the
// producer is idle).
m_state.compare_exchange_strong(state, (state ^ 0x1) & ~(0x10 >> (state & 1)), std::memory_order_release);
}
}
// Returns nullptr if there appears to be no more data to read yet
T* start_reading() {
m_readState = m_state.load(std::memory_order_relaxed);
if ((m_readState & (0x10 >> (m_readState & 1))) == 0) {
// Nothing to read here!
return nullptr;
}
// At this point, there is guaranteed to be something to
// read, because the full flag is never turned off by the
// producer thread once it's on; the only thing that could
// happen is that the active cell changes, but that can
// only happen after the producer wrote a value into it,
// in which case there's still a value to read, just in a
// different cell.
m_readState = m_state.fetch_add(0x2, std::memory_order_acquire) + 0x2;
// Now that we've incremented the user count, nobody can swap until
// we decrement it
return &m_buf[(m_readState & 1) ^ 1];
}
void end_reading() {
if ((m_readState & (0x10 >> (m_readState & 1))) == 0) {
// There was nothing to read; shame to repeat this
// check, but if these functions are inlined it might
// not matter. Otherwise the API could be changed.
// Or just don't call this method if start_reading()
// returns nullptr -- then you could also get rid
// of m_readState.
return;
}
// Alright, at this point the active cell cannot change on
// us, but the active cell's flag could change and the user
// count could change. We want to release our user count
// and remove the flag on the value we read.
auto state = m_state.load(std::memory_order_relaxed);
std::uint32_t sub = (0x10 >> (state & 1)) | 0x2;
state = m_state.fetch_sub(sub, std::memory_order_relaxed) - sub;
if ((state & 0x6) == 0 && (state & (0x8 << (state & 1))) != 0) { // != 0: a single-bit mask can never equal 1
// Oi, we were the last ones accessing the data when we released our cell.
// That means we should swap, but only if the producer isn't in the middle
// of producing something, and hasn't already swapped, and hasn't already
// set the flag we just reset (which would mean they swapped an even number
// of times). Note that we don't bother swapping if there's nothing to read
// in the other cell.
m_state.compare_exchange_strong(state, state ^ 0x1, std::memory_order_relaxed);
}
}
private:
T m_buf[2];
// The bottom (lowest) bit will be the active cell (the one for writing).
// The active cell can only be switched if there's at most one concurrent
// user. The next two bits of state will be the number of concurrent users.
// The fourth bit indicates if there's a value available for reading
// in m_buf[0], and the fifth bit has the same meaning but for m_buf[1].
std::atomic<std::uint32_t> m_state;
std::uint32_t m_readState;
};
void processing(ProducerConsumerDoubleBuffer<int *> *const buf, int num_elts)
{
std::thread producer([&]() {
for (int i = 0; i != num_elts; ++i) {
int **item = buf->start_writing();
if (item != nullptr) { // Always true
**item = i;
}
buf->end_writing();
}
});
std::thread consumer([&]() {
int prev = -1;
for (int i = 0; i != num_elts; ++i) {
int** item = buf->start_reading();
//std::cout << "Consumer: " << item << std::endl;
if (item != nullptr) {
assert(**item > prev);
prev = **item;
std::cout << "item: " << **item << std::endl;
}
buf->end_reading();
}
});
producer.join();
consumer.join();
}
int main(void) {
int num_elts = 500000;
ProducerConsumerDoubleBuffer<int *> buf(num_elts);
processing(&buf, num_elts);
}

Fast process std::bitset<65536> in parallel

There was once a question that I wrote a huge answer to, but it was deleted and its author refused to undelete it.
So I am posting a short summary of that question here, and immediately answering it myself, just to share my results.
The question was: given a std::bitset<65536> that is processed (by some formula) bit-by-bit inside an inner loop, how can we speed this computation up?
The outer loop just calls the inner loop many times (say 50,000 times), and the outer loop cannot be parallelized, because each iteration depends on the results of the previous iteration.
Example code of this process:
std::bitset<65536> bits{};
uint64_t hash = 0;
for (size_t i = 0; i < 50000; ++i) {
// Process Bits
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
hash = Hash(bits, hash);
}
The code above is just one sample way of processing; it is not the real case. The real case is that we repeatedly process a std::bitset<65536> in such a way that all bits can be processed independently.
The question is how to process the bits in the inner loop in parallel as fast as possible.
One important note: the formula that modifies the bits is generic, meaning we don't know it in advance and can't turn it into SIMD instructions.
What we do know is that all bits can be processed independently and that we need to parallelize this processing. We also can't parallelize the outer loop, as each of its iterations depends on the results of the previous one.
Another note: std::bitset<65536> is quite small, just 1K of 64-bit words. So directly using a pool of std::thread or std::async tasks will not work, as each thread's work would take only around 50-200 nanoseconds, far too little time to start and stop threads and send work to them. Even a std::mutex takes 75 nanoseconds on my Windows machine (although 20 nanoseconds on Linux), so using std::mutex is also a big overhead.
One may assume that the ModifyBit() function above takes roughly the same time for each bit; otherwise there is no way to schedule a balanced parallelization of the loop, other than slicing it into very many tiny tasks and hoping that longer tasks get balanced out by several shorter ones.
I implemented quite a large and complex solution for your task, but it works very fast. On my 4-core (8 hardware threads) laptop I get a 6x multi-core speedup compared to the single-threaded version (your version of the code).
The main idea of the solution below is to implement a very fast multi-core thread pool with small overhead for running arbitrary tasks. My implementation can handle up to 1-10 million tasks per second (depending on CPU speed and core count).
The regular way of asynchronously starting multiple tasks is to use std::async or to just create a std::thread. Both of these are considerably slower than my own implementation; they can't give a throughput of 5 million tasks per second like my implementation does. And your code needs millions of tasks per second to run at good speed. That's why I implemented everything from scratch.
Once a fast thread pool is implemented, we can slice your 64K bitset into smaller sub-sets and process these sub-sets in parallel. I sliced the 64K bitset into 16 equal parts (see BitSize / 16 in the code); you can set the number of parts to some other power of two, but not too many, otherwise the thread pool overhead becomes too large. Usually it is good to slice into a number of parts equal to twice the number of hardware threads (or 4 times the number of cores).
I implemented several classes in the C++ code. The AtomicMutex class uses std::atomic_flag to implement a very fast spin-lock-based replacement for a mutex. This AtomicMutex is used to protect the queue of tasks submitted for running on the thread pool.
The RingBuffer class is based on std::vector and implements a simple, fast queue that can store any objects. It is implemented using two indices (head and tail) pointing into the vector. When a new element is added to the queue, the tail index advances to the right; if it reaches the end of the vector it wraps around to position 0. In the same way, when an element is taken out of the queue, the head index advances to the right with wrap-around. RingBuffer is used to store the thread pool's tasks.
The Queue class is a wrapper around RingBuffer with AtomicMutex protection. This spin-lock mutex protects simultaneous adding/removing of elements to/from the queue by multiple worker threads.
Pool implements the multi-core task pool itself. It creates as many worker threads as there are CPU hardware threads (double the number of cores) minus one. Each worker thread just polls new tasks from the queue and executes them immediately, while the main thread adds new tasks to the queue. Pool also has a Wait() capability to wait until all current tasks are finished; this waiting is used as a barrier to wait until the whole 64K bitset (all sub-parts) is processed. Pool accepts any lambdas (function closures) to run. You can see that the 64K bitset, sliced into smaller parts, is processed by calling pool.Emplace(lambda), and later pool.Wait() is used to wait until all sub-parts are finished. Exceptions from pool workers are collected and reported to the user if there is any error. While doing Wait(), the pool also runs tasks inside the main thread, so as not to waste one core just waiting for tasks to finish.
Timings reported in the console are measured with std::chrono.
Both versions can be run: single-threaded (your original version) and multi-threaded using all cores. Switching between single and multi is done by passing the MultiThreaded template parameter to the ProcessBitset() function.
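For example, the two call sites would look like this (just an illustration; ProcessBitset, pool, bset and businessLogicCriteria are the names used in the code below):

ProcessBitset<true>(pool, bset, businessLogicCriteria);  // multi-threaded, the default
ProcessBitset<false>(pool, bset, businessLogicCriteria); // single-threaded reference run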
Try it online!
#include <cstdint>
#include <atomic>
#include <vector>
#include <array>
#include <queue>
#include <functional>
#include <thread>
#include <future>
#include <exception>
#include <optional>
#include <memory>
#include <iostream>
#include <iomanip>
#include <bitset>
#include <string>
#include <chrono>
#include <algorithm>
#include <any>
#include <type_traits>
class AtomicMutex {
class LockerC;
public:
void lock() {
while (f_.test_and_set(std::memory_order_acquire))
//f_.wait(true, std::memory_order_acquire)
;
}
void unlock() {
f_.clear(std::memory_order_release);
//f_.notify_all();
}
LockerC Locker() { return LockerC(*this); }
private:
class LockerC {
public:
LockerC() = delete;
LockerC(AtomicMutex & mux) : pmux_(&mux) { mux.lock(); }
LockerC(LockerC const & other) = delete;
LockerC(LockerC && other) : pmux_(other.pmux_) { other.pmux_ = nullptr; }
~LockerC() { if (pmux_) pmux_->unlock(); }
LockerC & operator = (LockerC const & other) = delete;
LockerC & operator = (LockerC && other) = delete;
private:
AtomicMutex * pmux_ = nullptr;
};
std::atomic_flag f_ = ATOMIC_FLAG_INIT;
};
template <typename T>
class RingBuffer {
public:
RingBuffer() : buf_(1 << 8), last_(buf_.size() - 1) {}
T & front() { return buf_[first_]; }
T const & front() const { return buf_[first_]; }
T & back() { return buf_[last_]; }
T const & back() const { return buf_[last_]; }
size_t size() const { return size_; }
bool empty() const { return size_ == 0; }
template <typename ... Args>
void emplace(Args && ... args) {
while (size_ >= buf_.size()) {
std::rotate(&buf_[0], &buf_[first_], &buf_[buf_.size()]);
first_ = 0;
last_ = buf_.size() - 1;
buf_.resize(buf_.size() * 2);
}
++size_;
++last_;
if (last_ >= buf_.size())
last_ = 0;
buf_[last_] = T(std::forward<Args>(args)...);
}
void pop() {
if (size_ == 0)
return;
--size_;
++first_;
if (first_ >= buf_.size())
first_ = 0;
}
private:
std::vector<T> buf_;
size_t first_ = 0, last_ = 0, size_ = 0;
};
template <typename T>
class Queue {
public:
size_t Size() const { return q_.size(); }
bool Empty() const { return q_.size() == 0; }
template <typename ... Args>
void Emplace(Args && ... args) {
auto lock = m_.Locker();
q_.emplace(std::forward<Args>(args)...);
}
T Pop(std::function<void()> const & on_empty = []{},
std::function<void()> const & on_full = []{}) {
while (true) {
if (q_.empty()) {
on_empty();
continue;
}
auto lock = m_.Locker();
if (q_.empty()) {
on_empty();
continue;
}
on_full();
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
}
std::optional<T> TryPop() {
auto lock = m_.Locker();
if (q_.empty())
return std::nullopt;
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
private:
AtomicMutex m_;
RingBuffer<T> q_;
};
class RunInDestr {
public:
RunInDestr(std::function<void()> const & f) : f_(f) {}
~RunInDestr() { f_(); }
private:
std::function<void()> const & f_;
};
class Pool {
public:
struct FinishExc {};
struct Worker {
std::unique_ptr<std::atomic<bool>> pdone = std::make_unique<std::atomic<bool>>(true);
std::unique_ptr<std::exception_ptr> pexc = std::make_unique<std::exception_ptr>();
std::unique_ptr<std::thread> thr;
};
Pool(size_t nthreads = size_t(-1)) {
if (nthreads == size_t(-1))
nthreads = std::thread::hardware_concurrency() - 1;
std::cout << "Pool has " << nthreads << " worker threads." << std::endl;
for (size_t i = 0; i < nthreads; ++i) {
workers_.emplace_back(Worker{});
workers_.back().thr = std::make_unique<std::thread>(
[&, pdone = workers_.back().pdone.get(), pexc = workers_.back().pexc.get()]{
try {
std::function<void()> f_done = [pdone]{
pdone->store(true, std::memory_order_relaxed);
}, f_empty = [this]{
CheckFinish();
}, f_full = [pdone]{
pdone->store(false, std::memory_order_relaxed);
};
while (true) {
RunInDestr set_done(f_done);
tasks_.Pop(f_empty, f_full)();
}
} catch (...) {
exc_.store(true, std::memory_order_relaxed);
*pexc = std::current_exception();
}
});
}
}
~Pool() {
Wait();
Finish();
}
void CheckExc() {
if (!exc_.load(std::memory_order_relaxed))
return;
Finish();
throw std::runtime_error("Pool: Exception occured!");
}
void Finish() {
finish_ = true;
for (auto & w: workers_)
try {
w.thr->join();
if (*w.pexc)
std::rethrow_exception(*w.pexc);
} catch (FinishExc const &) {}
workers_.clear();
}
template <typename ... Args>
void Emplace(Args && ... args) {
CheckExc();
tasks_.Emplace(std::forward<Args>(args)...);
}
void Wait() {
while (true) {
auto task = tasks_.TryPop();
if (!task)
break;
(*task)();
}
while (true) {
bool done = true;
for (auto & w: workers_)
if (!w.pdone->load(std::memory_order_relaxed)) {
done = false;
break;
}
if (done)
break;
}
CheckExc();
}
private:
void CheckFinish() {
if (finish_)
throw FinishExc{};
}
Queue<std::function<void()>> tasks_;
std::vector<Worker> workers_;
bool finish_ = false;
std::atomic<bool> exc_ = false;
};
template <bool MultiThreaded = true, size_t BitSize>
void ProcessBitset(Pool & pool, std::bitset<BitSize> & bset,
std::string const & businessLogicCriteria) {
static size_t constexpr block = BitSize / 16;
for (int j = 0; j < BitSize; j += block) {
auto task = [&bset, j]{
int const hi = std::min(j + block, BitSize);
for (int i = j; i < hi; ++i) {
if (i % 2 == 0)
bset[i] = 0;
else
bset[i] = 1;
}
};
if constexpr(MultiThreaded)
pool.Emplace(std::move(task));
else
task();
}
if constexpr(MultiThreaded)
pool.Wait();
}
static auto const gtb = std::chrono::high_resolution_clock::now();
double Time() {
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
void Compute() {
Pool pool;
std::bitset<65536> bset;
std::string businessLogicCriteria;
int const hi = 50000;
for (int j = 0; j < hi; ++j) {
if ((j & 0x1FFF) == 0 || j + 1 >= hi)
std::cout << j / 1000 << "K (" << std::fixed << std::setprecision(3) << Time() << " sec), " << std::flush;
ProcessBitset(pool, bset, businessLogicCriteria);
businessLogicCriteria = "...";
}
}
void TimeMeasure() {
size_t constexpr A = 1 << 16, B = 1 << 5;
{
Pool pool;
auto const tb = Time();
int64_t volatile x = 0;
for (size_t i = 0; i < A; ++i) {
for (size_t j = 0; j < B; ++j)
pool.Emplace([&]{ x = x + 1; });
pool.Wait();
}
std::cout << "AtomicPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << A * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (A * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (A * B) << std::endl;
}
{
auto const tb = Time();
//size_t const nthr = std::thread::hardware_concurrency();
size_t constexpr C = A / 8;
std::vector<std::future<void>> asyncs;
int64_t volatile x = 0;
for (size_t i = 0; i < C; ++i) {
asyncs.clear();
for (size_t j = 0; j < B; ++j)
asyncs.emplace_back(std::async(std::launch::async, [&]{ x = x + 1; }));
asyncs.clear();
}
std::cout << "AsyncPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << C * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (C * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (C * B) << std::endl;
}
}
int main() {
try {
TimeMeasure();
Compute();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
} catch (...) {
std::cout << "Unknown Exception!" << std::endl;
return -1;
}
}
Output for 4 cores (8 hardware threads):
Pool has 7 worker threads.
AtomicPool time 0.903 sec, speed 2321.831 empty K-tasks/sec, 0.431 sec/M-task, no-collisions 0.9999967
AsyncPool time 0.982 sec, speed 266.789 empty K-tasks/sec, 3.750 sec/M-task, no-collisions 0.9999123
Pool has 7 worker threads.
0K (0.074 sec), 8K (0.670 sec), 16K (1.257 sec), 24K (1.852 sec), 32K (2.435 sec), 40K (2.984 sec), 49K (3.650 sec), 49K (3.711 sec),
For comparison, below are the timings of the single-threaded version, which is 6x slower:
0K (0.125 sec), 8K (3.786 sec), 16K (7.754 sec), 24K (11.202 sec), 32K (14.662 sec), 40K (18.056 sec), 49K (21.470 sec), 49K (21.841 sec),
You have this inner loop you want to parallelize:
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
So a good idea is to split it into chunks, and have multiple threads do each chunk in parallel. You can submit chunks to workers easily with a std::atomic<int> counter that increments to identify which chunk to work on. You can also make sure the threads all stop working after one loop before starting the next with a std::barrier:
std::bitset<65536> bits{};
std::thread pool[8]; // Change size accordingly
std::atomic<int> task_number{0};
constexpr std::size_t tasks_per_loop = 32; // Arbitrarily chosen
constexpr std::size_t block_size = (bits.size()+tasks_per_loop-1) / tasks_per_loop;
// (only written to by one thread by the barrier, so not atomic)
uint64_t hash = 0;
int i = 0;
std::barrier barrier(std::size(pool), [&]() {
task_number = 0;
++i;
hash = Hash(bits, hash);
});
for (std::thread& t : pool) {
t = std::thread([&]{
while (i < 50000) {
for (int t; (t = task_number++) < tasks_per_loop;) {
int block_start = t * block_size;
int block_end = std::min(block_start + block_size, bits.size());
for (int j = block_start; j < block_end; ++j) {
bits[j] = ModifyBit(i, j, hash, bits[j]);
}
}
// Wait for other threads to finish and hash
// to be calculated before starting next loop
barrier.arrive_and_wait();
}
});
}
for (std::thread& t : pool) t.join();
(The seemingly easy way of parallelizing the for loop with OpenMP #pragma omp parallel for seemed slower with some testing, perhaps because the tasks were so small)
Here it is against your implementation running similar code: https://godbolt.org/z/en76Kv4nn
And on my machine, running this a few times with 1 million iterations took 28 to 32 seconds with my approach and 44 to 50 seconds with your general thread pool approach (granted this is much less general because it can't execute arbitrary std::function<void()> tasks).

How to apply a concurrent solution to a Producer-Consumer like situation

I have an XML file with a sequence of nodes. Each node represents an element that I need to parse and add to an ordered list (the order must be the same as that of the nodes found in the file).
At the moment I am using a sequential solution:
struct Graphic
{
bool parse()
{
// parsing...
return parse_outcome;
}
};
vector<unique_ptr<Graphic>> graphics;
void producer()
{
for (size_t i = 0; i < N_GRAPHICS; i++)
{
auto g = new Graphic();
if (g->parse())
graphics.emplace_back(g);
else
delete g;
}
}
So the graphic (which is actually an instance of a class derived from Graphic, such as a Line, a Rectangle and so on, hence the new) is added to my data structure only if it can be properly parsed.
Since I only care about the order in which the graphics are added to my list, I thought of calling the parse method asynchronously, such that the producer has the task of reading each node from the file and adding the graphic to the data structure, while the consumers have the task of parsing each graphic whenever a new one is ready to be parsed.
Now I have several consumer threads (created in the main) and my code looks like the following:
queue<pair<Graphic*, size_t>> q;
mutex m;
atomic<size_t> n_elements;
void producer()
{
for (size_t i = 0; i < N_GRAPHICS; i++)
{
auto g = new Graphic();
graphics.emplace_back(g);
q.emplace(make_pair(g, i));
}
n_elements = graphics.size();
}
void consumer()
{
pair<Graphic*, size_t> item;
while (true)
{
{
std::unique_lock<std::mutex> lk(m);
if (n_elements == 0)
return;
n_elements--;
item = q.front();
q.pop();
}
if (!item.first->parse())
{
// here I should remove the item from the vector
assert(graphics[item.second].get() == item.first);
delete item.first;
graphics[item.second] = nullptr;
}
}
}
I run the producer first of all in my main, so that when the first consumer starts the queue is already completely full.
int main()
{
producer();
vector<thread> threads;
for (auto i = 0; i < N_THREADS; i++)
threads.emplace_back(consumer);
for (auto& t : threads)
t.join();
return 0;
}
The concurrent version seems to be at least twice as fast as the original one.
The full code has been uploaded here.
Now I am wondering:
Are there any (synchronization) errors in my code?
Is there a way to achieve the same result faster (or better)?
Also, I noticed that on my computer I get the best result (in terms of elapsed time) if I set the number of threads to 8. More (or fewer) threads give me worse results. Why?
Are there any (synchronization) errors in my code?
There are no synchronization errors, but your memory management could be better: your code will leak if parse() throws an exception.
Is there a way to achieve the same result faster (or better)?
Probably. You could use a simple thread pool implementation and a lambda that does the parse() for you.
The code below illustrates this approach. I use the thread pool implementation from here.
#include <iostream>
#include <stdexcept>
#include <vector>
#include <memory>
#include <chrono>
#include <utility>
#include <cassert>
#include <ThreadPool.h>
using namespace std;
using namespace std::chrono;
#define N_GRAPHICS (1000*1000*1)
#define N_THREADS 8
struct Graphic;
using GPtr = std::unique_ptr<Graphic>;
static vector<GPtr> graphics;
struct Graphic
{
Graphic()
: status(false)
{
}
bool parse()
{
// waste time
try
{
throw runtime_error("");
}
catch (runtime_error)
{
}
status = true;
//return false;
return true;
}
bool status;
};
int main()
{
auto start = system_clock::now();
auto producer_unit = []()-> GPtr {
std::unique_ptr<Graphic> g(new Graphic);
if(!g->parse()){
g.reset(); // if g don't parse, return nullptr
}
return g;
};
using ResultPool = std::vector<std::future<GPtr>>;
ResultPool results;
// ThreadPool pool(thread::hardware_concurrency());
ThreadPool pool(N_THREADS);
for(int i = 0; i <N_GRAPHICS; ++i){
// Running async task
results.emplace_back(pool.enqueue(producer_unit));
}
for(auto &t : results){
auto value = t.get();
if(value){
graphics.emplace_back(std::move(value));
}
}
auto duration = duration_cast<milliseconds>(system_clock::now() - start);
cout << "Elapsed: " << duration.count() << endl;
for (size_t i = 0; i < graphics.size(); i++)
{
if (!graphics[i]->status)
{
cerr << "Assertion failed! (" << i << ")" << endl;
break;
}
}
cin.get();
return 0;
}
It is a bit faster (by about 1 s) on my machine, more readable, and removes the need for shared data (synchronization is evil; avoid it or hide it in a reliable and efficient way).

Lock-Free Multiple Producer/Consumer Queue in C++11

I'm trying to implement a lock-free multiple-producer, multiple-consumer queue in C++11. I'm doing this as a learning exercise, so I'm well aware that I could just use an existing open-source implementation, but I'd really like to find out why my code doesn't work. The data is stored in a ring buffer; apparently it is a "bounded MPMC queue".
I've modelled it pretty closely on what I've read of the Disruptor. The thing I've noticed is that it works absolutely fine with a single consumer and single/multiple producers; it's just multiple consumers which seem to break it.
Here's the queue:
template <typename T>
class Queue : public IQueue<T>
{
public:
explicit Queue( int capacity );
~Queue();
bool try_push( T value );
bool try_pop( T& value );
private:
typedef struct
{
bool readable;
T value;
} Item;
std::atomic<int> m_head;
std::atomic<int> m_tail;
int m_capacity;
Item* m_items;
};
template <typename T>
Queue<T>::Queue( int capacity ) :
m_head( 0 ),
m_tail( 0 ),
m_capacity(capacity),
m_items( new Item[capacity] )
{
for( int i = 0; i < capacity; ++i )
{
m_items[i].readable = false;
}
}
template <typename T>
Queue<T>::~Queue()
{
delete[] m_items;
}
template <typename T>
bool Queue<T>::try_push( T value )
{
while( true )
{
// See that there's room
int tail = m_tail.load(std::memory_order_acquire);
int new_tail = ( tail + 1 );
int head = m_head.load(std::memory_order_acquire);
if( ( new_tail - head ) >= m_capacity )
{
return false;
}
if( m_tail.compare_exchange_weak( tail, new_tail, std::memory_order_acq_rel ) )
{
// In try_pop, m_head is incremented before the reading of the value has completed,
// so though we've acquired this slot, a consumer thread may be in the middle of reading
tail %= m_capacity;
std::atomic_thread_fence( std::memory_order_acquire );
while( m_items[tail].readable )
{
}
m_items[tail].value = value;
std::atomic_thread_fence( std::memory_order_release );
m_items[tail].readable = true;
return true;
}
}
}
template <typename T>
bool Queue<T>::try_pop( T& value )
{
while( true )
{
int head = m_head.load(std::memory_order_acquire);
int tail = m_tail.load(std::memory_order_acquire);
if( head == tail )
{
return false;
}
int new_head = ( head + 1 );
if( m_head.compare_exchange_weak( head, new_head, std::memory_order_acq_rel ) )
{
head %= m_capacity;
std::atomic_thread_fence( std::memory_order_acquire );
while( !m_items[head].readable )
{
}
value = m_items[head].value;
std::atomic_thread_fence( std::memory_order_release );
m_items[head].readable = false;
return true;
}
}
}
And here's the test I'm using:
void Test( std::string name, Queue<int>& queue )
{
const int NUM_PRODUCERS = 64;
const int NUM_CONSUMERS = 2;
const int NUM_ITERATIONS = 512;
bool table[NUM_PRODUCERS*NUM_ITERATIONS];
memset(table, 0, NUM_PRODUCERS*NUM_ITERATIONS*sizeof(bool));
std::vector<std::thread> threads(NUM_PRODUCERS+NUM_CONSUMERS);
std::chrono::system_clock::time_point start, end;
start = std::chrono::system_clock::now();
std::atomic<int> pop_count (NUM_PRODUCERS * NUM_ITERATIONS);
std::atomic<int> push_count (0);
for( int thread_id = 0; thread_id < NUM_PRODUCERS; ++thread_id )
{
threads[thread_id] = std::thread([&queue,thread_id,&push_count]()
{
int base = thread_id * NUM_ITERATIONS;
for( int i = 0; i < NUM_ITERATIONS; ++i )
{
while( !queue.try_push( base + i ) ){};
push_count.fetch_add(1);
}
});
}
for( int thread_id = 0; thread_id < ( NUM_CONSUMERS ); ++thread_id )
{
threads[thread_id+NUM_PRODUCERS] = std::thread([&]()
{
int v;
while( pop_count.load() > 0 )
{
if( queue.try_pop( v ) )
{
if( table[v] )
{
std::cout << v << " already set" << std::endl;
}
table[v] = true;
pop_count.fetch_sub(1);
}
}
});
}
for( int i = 0; i < ( NUM_PRODUCERS + NUM_CONSUMERS ); ++i )
{
threads[i].join();
}
end = std::chrono::system_clock::now();
std::chrono::duration<double> duration = end - start;
std::cout << name << " " << duration.count() << std::endl;
std::atomic_thread_fence( std::memory_order_acq_rel );
bool result = true;
for( int i = 0; i < NUM_PRODUCERS * NUM_ITERATIONS; ++i )
{
if( !table[i] )
{
std::cout << "failed at " << i << std::endl;
result = false;
}
}
std::cout << name << " " << ( result? "success" : "fail" ) << std::endl;
}
Any nudging in the right direction would be greatly appreciated. I'm pretty new to memory fences rather than just using a mutex for everything, so I'm probably just fundamentally misunderstanding something.
Cheers
J
I'd take a look at moodycamel's implementation.
It is a fast general-purpose lock-free queue for C++, written entirely in C++11. The documentation seems rather good, along with a few performance tests.
Among other interesting things (they're worth a read anyway), it's all contained in a single header and available under the simplified BSD license. Just drop it into your project and enjoy!
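For reference, here is a minimal usage sketch from memory of its README (the header name and the enqueue/try_dequeue calls are what I recall, so double-check against the actual documentation):

#include "concurrentqueue.h" // the single header from moodycamel/concurrentqueue
#include <cassert>

int main() {
    moodycamel::ConcurrentQueue<int> q; // MPMC, lock-free
    q.enqueue(25);                      // may be called from any number of threads
    int item = 0;
    bool found = q.try_dequeue(item);   // may be called from any number of threads
    assert(found && item == 25);
}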
The simplest approach uses a circular buffer. That is, it's like an array of 256 elements where you use a uint8_t as the index, so it wraps around and starts at the beginning when you overflow it.
The simplest primitive you can build upon is the one with a single producer thread and a single consumer thread.
The buffer has two heads:
Write head: It points the element which will be written next.
Read head: It points to the element which will be read next.
Operation of the producer:
If write head + 1 == read head, the buffer is full; return a buffer-full error.
Write content to the element.
Insert memory barrier to sync CPU cores.
Move the write head forward.
In the buffer-full case there is still one slot free, but we reserve it in order to distinguish the full case from the empty case.
Operation of the consumer:
If read head == write head, the buffer is empty; return a buffer-empty error.
Read content of the element.
Insert memory barrier to sync CPU cores.
Move the read head forward.
The producer owns the write head and the consumer owns the read head; there is no concurrency on those. Also, each head is updated only after the operation has completed; this ensures the producer leaves behind fully written elements and the consumer leaves behind fully consumed (empty) cells.
Create two of these pipes, one in each direction, whenever you fork off a new thread, and you have bidirectional communication with your threads.
Given that we are talking about lock-freedom, it also means none of the threads ever block; when there is nothing to do, the threads spin empty, so you may want to detect this and add some sleep when it happens.
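As a concrete sketch of this description (my own illustration with assumed names; the 256-slot array and uint8_t indices are exactly the trick described, and the release/acquire pairs play the role of the memory barriers):

#include <atomic>
#include <cstdint>
#include <optional>

template <class T>
class SpscRing {
    T slots[256];                                 // size 256 so a uint8_t index wraps for free
    std::atomic<std::uint8_t> write_head{0};      // owned by the producer
    std::atomic<std::uint8_t> read_head{0};       // owned by the consumer
public:
    bool try_push(const T &v) {                   // call from the producer thread only
        std::uint8_t w = write_head.load(std::memory_order_relaxed);
        if (static_cast<std::uint8_t>(w + 1) == read_head.load(std::memory_order_acquire))
            return false;                         // full (one slot deliberately kept free)
        slots[w] = v;                             // write the element first...
        write_head.store(w + 1, std::memory_order_release); // ...then publish it
        return true;
    }
    std::optional<T> try_pop() {                  // call from the consumer thread only
        std::uint8_t r = read_head.load(std::memory_order_relaxed);
        if (r == write_head.load(std::memory_order_acquire))
            return std::nullopt;                  // empty
        T v = slots[r];                           // read the element first...
        read_head.store(r + 1, std::memory_order_release); // ...then hand the slot back
        return v;
    }
};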
How about this lock-free queue?
It is a lock-free queue based on memory ordering, but you need to pre-set the number of concurrent threads when initializing the queue.
For example:-
int* ret;
int max_concurrent_thread = 16;
lfqueue_t my_queue;
lfqueue_init(&my_queue, max_concurrent_thread );
/** Wrap This scope in other threads **/
int_data = (int*) malloc(sizeof(int));
assert(int_data != NULL);
*int_data = i++;
/*Enqueue*/
while (lfqueue_enq(&my_queue, int_data) == -1) {
printf("ENQ Full ?\n");
}
/** Wrap This scope in other threads **/
/*Dequeue*/
while ( (int_data = lfqueue_deq(&my_queue)) == NULL) {
printf("DEQ EMPTY ..\n");
}
// printf("%d\n", *int_data);
free(int_data); /* free the dequeued pointer that was malloc'ed above */
/** End **/
lfqueue_destroy(&my_queue);
On another similar question, I presented a solution to this problem. I believe it is the smallest found so far.
I will not repeat the same answer here, but the repository has a fully functional C++ implementation of the lock-free queue you desire.
EDIT: Thanks to a code review from @PeterCordes, I found a bug in the solution when using 64-bit templates, but now it's working perfectly.
This is the output I receive when running the tests
Creating 4 producers & 4 consumers
to flow 10.000.000 items trough the queue.
Produced: 10.743.668.245.000.000
Consumed: 5.554.289.678.184.004
Produced: 10.743.668.245.000.000
Consumed: 15.217.833.969.059.643
Produced: 10.743.668.245.000.000
Consumed: 7.380.542.769.600.801
Produced: 10.743.668.245.000.000
Consumed: 14.822.006.563.155.552
Checksum: 0 (it must be zero)

One producer, two consumers acting on one 'queue' produced by producer

Preface: I'm new to multithreaded programming, and a little rusty with C++. My requirements are to use one mutex, and two conditions mNotEmpty and mEmpty. I must also create and populate the vectors in the way mentioned below.
I have one producer thread creating a vector of random numbers of size n*2, and two consumers inserting those values into two separate vectors of size n.
I am doing the following in the producer:
Lock the mutex: pthread_mutex_lock(&mMutex1)
Wait for consumer to say vector is empty: pthread_cond_wait(&mEmpty,&mMutex1)
Push back a value into the vector
Signal the consumer that the vector isn't empty anymore: pthread_cond_signal(&mNotEmpty)
Unlock the mutex: pthread_mutex_unlock(&mMutex1)
Return to step 1
In the consumer:
Lock the mutex: pthread_mutex_lock(&mMutex1)
Check to see if the vector is empty, and if so signal the producer: pthread_cond_signal(&mEmpty)
Else insert value into one of two new vectors (depending on which thread) and remove from original vector
Unlock the mutex: pthread_mutex_unlock(&mMutex1)
Return to step 1
What's wrong with my process? I keep getting segmentation faults or infinite loops.
Edit: Here's the code:
void Producer()
{
srand(time(NULL));
for(unsigned int i = 0; i < mTotalNumberOfValues; i++){
pthread_mutex_lock(&mMutex1);
pthread_cond_wait(&mEmpty,&mMutex1);
mGeneratedNumber.push_back((rand() % 100) + 1);
pthread_cond_signal(&mNotEmpty);
pthread_mutex_unlock(&mMutex1);
}
}
void Consumer(const unsigned int index)
{
for(unsigned int i = 0; i < mNumberOfValuesPerVector; i++){
pthread_mutex_lock(&mMutex1);
if(mGeneratedNumber.empty()){
pthread_cond_signal(&mEmpty);
}else{
mThreadVector.at(index).push_back[mGeneratedNumber.at(0)];
mGeneratedNumber.pop_back();
}
pthread_mutex_unlock(&mMutex1);
}
}
I'm not sure I understand the rationale behind the way you're doing
things. In the usual consumer-provider idiom, the provider pushes as
many items as possible into the channel, waiting only if there is
insufficient space in the channel; it doesn't wait for empty. So the
usual idiom would be:
provider (to push one item):
pthread_mutex_lock( &mutex );
while ( ! spaceAvailable() ) {
pthread_cond_wait( &spaceAvailableCondition, &mutex );
}
pushTheItem();
pthread_cond_signal( &itemAvailableCondition );
pthread_mutex_unlock( &mutex );
and on the consumer side, to get an item:
pthread_mutex_lock( &mutex );
while ( ! itemAvailable() ) {
pthread_cond_wait( &itemAvailableCondition, &mutex );
}
getTheItem();
pthread_cond_signal( &spaceAvailableCondition );
pthread_mutex_unlock( &mutex );
Note that for each condition, one side signals, and the other waits. (I
don't see any wait in your consumer.) And if there is more than one
process on either side, I'd recommend using pthread_cond_broadcast,
rather than pthread_cond_signal.
There are a number of other issues in your code. Some of them look more
like typos: you should copy/paste actual code to avoid this. Do you
really mean to read and pop mGeneratedValues, when you push into
mGeneratedNumber, and check whether that is empty? (If you actually
do have two different queues, then you're popping from a queue where no
one has pushed.) And you don't have any loops waiting for the
conditions; you keep iterating through the number of elements you
expect (incrementing the counter each time, so you're likely to
terminate long before you should). I can't see an infinite loop,
but I can readily see an endless wait in pthread_cond_wait in the
producer. I don't see a core dump off hand, but what happens when one
of the processes terminates (probably the consumer, because it never
waits for anything); if it ends up destroying the mutex or the condition
variables, you could get a core dump when another process attempts to
use them.
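For concreteness, here is the usual idiom mapped onto the functions from the question. The names mMutex1, mEmpty, mNotEmpty, mGeneratedNumber, mThreadVector and the loop bounds come from the question; mMaxQueueSize is a made-up capacity added just for this sketch:

void Producer()
{
    srand(time(NULL));
    for (unsigned int i = 0; i < mTotalNumberOfValues; i++) {
        pthread_mutex_lock(&mMutex1);
        while (mGeneratedNumber.size() >= mMaxQueueSize)  // wait for space, always in a loop
            pthread_cond_wait(&mEmpty, &mMutex1);
        mGeneratedNumber.push_back((rand() % 100) + 1);
        pthread_cond_broadcast(&mNotEmpty);               // two consumers, so broadcast
        pthread_mutex_unlock(&mMutex1);
    }
}

void Consumer(const unsigned int index)
{
    for (unsigned int i = 0; i < mNumberOfValuesPerVector; i++) {
        pthread_mutex_lock(&mMutex1);
        while (mGeneratedNumber.empty())                  // wait for data, always in a loop
            pthread_cond_wait(&mNotEmpty, &mMutex1);
        mThreadVector.at(index).push_back(mGeneratedNumber.back());
        mGeneratedNumber.pop_back();
        pthread_cond_signal(&mEmpty);                     // a slot was freed for the producer
        pthread_mutex_unlock(&mMutex1);
    }
}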
In the producer, call pthread_cond_wait only when the queue is not empty. Otherwise you get blocked forever due to a race condition.
You might want to consider taking the mutex only after the condition is fulfilled, e.g.
producer()
{
while true
{
waitForEmpty();
takeMutex();
produce();
releaseMutex();
}
}
consumer()
{
while true
{
waitForNotEmpty();
takeMutex();
consume();
releaseMutex();
}
}
Here is a solution to a problem similar to yours. In this program the producer produces a number, writes it to an array (the buffer) and to a file it maintains, and then updates a status (in a status array) for that slot. When data appears in the array (buffer), the consumers start to consume it (read it and write it to their own files) and update the status to record that they have consumed it. When the producer sees that both consumers have consumed the data, it overwrites it with a new value and carries on. For convenience I have restricted the code here to run for 2000 numbers.
// Producer-consumer //
#include <iostream>
#include <fstream>
#include <pthread.h>
#define MAX 100
using namespace std;
int dataCount = 2000;
int buffer_g[100];
int status_g[100];
void *producerFun(void *);
void *consumerFun1(void *);
void *consumerFun2(void *);
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t dataNotProduced = PTHREAD_COND_INITIALIZER;
pthread_cond_t dataNotConsumed = PTHREAD_COND_INITIALIZER;
int main()
{
for(int i = 0; i < MAX; i++)
status_g[i] = 0;
pthread_t producerThread, consumerThread1, consumerThread2;
int retProducer = pthread_create(&producerThread, NULL, producerFun, NULL);
int retConsumer1 = pthread_create(&consumerThread1, NULL, consumerFun1, NULL);
int retConsumer2 = pthread_create(&consumerThread2, NULL, consumerFun2, NULL);
pthread_join(producerThread, NULL);
pthread_join(consumerThread1, NULL);
pthread_join(consumerThread2, NULL);
return 0;
}
void *producerFun(void *)
{
//file to write produced data by producer
const char *producerFileName = "producer.txt";
ofstream producerFile(producerFileName);
int index = 0, producerCount = 0;
while(1)
{
pthread_mutex_lock(&mutex);
if(index == MAX)
{
index = 0;
}
if(status_g[index] == 0)
{
static int data = 0;
data++;
cout << "Produced: " << data << endl;
buffer_g[index] = data;
producerFile << data << endl;
status_g[index] = 5;
index ++;
producerCount ++;
pthread_cond_broadcast(&dataNotProduced);
}
else
{
cout << ">> Producer is in wait.." << endl;
pthread_cond_wait(&dataNotConsumed, &mutex);
}
pthread_mutex_unlock(&mutex);
if(producerCount == dataCount)
{
producerFile.close();
return NULL;
}
}
}
void *consumerFun1(void *)
{
const char *consumerFileName = "consumer1.txt";
ofstream consumerFile(consumerFileName);
int index = 0, consumerCount = 0;
while(1)
{
pthread_mutex_lock(&mutex);
if(index == MAX)
{
index = 0;
}
if(status_g[index] != 0 && status_g[index] != 2)
{
int data = buffer_g[index];
cout << "Cosumer1 consumed: " << data << endl;
consumerFile << data << endl;
status_g[index] -= 3;
index ++;
consumerCount ++;
pthread_cond_signal(&dataNotConsumed);
}
else
{
cout << "Consumer1 is in wait.." << endl;
pthread_cond_wait(&dataNotProduced, &mutex);
}
pthread_mutex_unlock(&mutex);
if(consumerCount == dataCount)
{
consumerFile.close();
return NULL;
}
}
}
void *consumerFun2(void *)
{
const char *consumerFileName = "consumer2.txt";
ofstream consumerFile(consumerFileName);
int index = 0, consumerCount = 0;
while(1)
{
pthread_mutex_lock(&mutex);
if(index == MAX)
{
index = 0;
}
if(status_g[index] != 0 && status_g[index] != 3)
{
int data = buffer_g[index];
cout << "Consumer2 consumed: " << data << endl;
consumerFile << data << endl;
status_g[index] -= 2;
index ++;
consumerCount ++;
pthread_cond_signal(&dataNotConsumed);
}
else
{
cout << ">> Consumer2 is in wait.." << endl;
pthread_cond_wait(&dataNotProduced, &mutex);
}
pthread_mutex_unlock(&mutex);
if(consumerCount == dataCount)
{
consumerFile.close();
return NULL;
}
}
}
There is only one problem: the producer is not independent when producing, i.e. it needs to take the lock on the whole array (buffer) before it produces new data, and if the mutex is held by a consumer it has to wait, and vice versa. I am still trying to address that.