Fast process std::bitset<65536> in parallel - c++

There was once a question that I wrote a long answer to, but the question was deleted and its author refused to undelete it.
So I am posting a short summary of that question here and answering it myself right away, just to share my results.
The question was: given a std::bitset<65536> that is processed (by some formula) bit by bit inside an inner loop, how can we speed up this computation?
The outer loop just calls the inner loop many times (let's say 50 000 times), and the outer loop can't be processed in parallel, because each iteration depends on the results of the previous one.
Example code of this process:
std::bitset<65536> bits{};
uint64_t hash = 0;
for (size_t i = 0; i < 50000; ++i) {
    // Process Bits
    for (size_t j = 0; j < bits.size(); ++j)
        bits[j] = ModifyBit(i, j, hash, bits[j]);
    hash = Hash(bits, hash);
}
The code above is just one sample of such processing, not the real case. In the real case, std::bitset<65536> is processed many times in such a way that all bits can be processed independently.
The question is how to process the bits in parallel as fast as possible inside the inner loop.
One important note: the formula that modifies the bits is generic, meaning that we don't know it in advance and can't turn it into SIMD instructions.
What we do know is that all bits can be processed independently, and that this processing needs to be parallelized. We also can't parallelize the outer loop, because each iteration depends on the results of the previous one.
Another note: std::bitset<65536> is quite small, just 1K 64-bit words. This means that directly using a pool of std::thread or std::async threads will not work, because each thread's work takes only around 50-200 nanoseconds, far too little to pay for starting and stopping threads and sending work to them. Even std::mutex takes 75 nanoseconds on my Windows machine (although 20 nanoseconds on Linux), so using std::mutex is also a big overhead.
One may assume that the ModifyBit() function above takes about the same time for each bit. Otherwise there is no good way to schedule a balanced parallelization of the loop, other than slicing it into very many tiny tasks and hoping that longer tasks will be balanced out by several shorter ones.

I implemented a fairly large and complex solution for your task, but it works very fast. On my 4-core (8 hardware threads) laptop I get a 6x multi-core speedup compared to the single-threaded version (your version of the code).
The main idea of the solution below is to implement a very fast multi-core thread pool with small overhead for running arbitrary tasks. My implementation can handle 1-10 million tasks per second (depending on CPU speed and core count).
The regular way of starting multiple tasks asynchronously is std::async or creating std::thread objects directly. Both are considerably slower than my own implementation: they can't give a throughput of 5 million tasks per second, as my implementation does. And your code needs millions of tasks per second for good speed. That's why I implemented everything from scratch.
With a fast thread pool in place, we can slice your 64K bitset into smaller subsets and process these subsets in parallel. I sliced the 64K bitset into 16 equal parts (see BitSize / 16 in the code). You can choose another number of parts that is a power of two, but not too many, otherwise the thread pool overhead becomes too large. Usually it is good to slice into a number of parts equal to twice the number of hardware threads (or 4 times the number of cores).
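As a rough illustration of the slicing arithmetic, here is a minimal sketch that splits a bit range into contiguous blocks. The names Block and SliceBits are made up for this example and are not part of the code below. It also assumes that the bitset stores its bits in 64-bit words, which is typical but not guaranteed by the standard, and rounds block sizes up to whole words so that two threads never write into the same word.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: half-open bit range [begin, end).
struct Block {
    std::size_t begin;
    std::size_t end;
};

// Split `bit_count` bits into at most `parts` contiguous blocks.
// Assumption: block sizes are rounded up to a multiple of 64 bits,
// so that no two blocks share a 64-bit word.
std::vector<Block> SliceBits(std::size_t bit_count, std::size_t parts) {
    assert(parts > 0);
    std::size_t block = (bit_count + parts - 1) / parts;  // ceiling division
    block = (block + 63) / 64 * 64;                        // round up to whole words
    std::vector<Block> out;
    for (std::size_t b = 0; b < bit_count; b += block)
        out.push_back({b, std::min(b + block, bit_count)});
    return out;
}
```

For a 65536-bit set and 16 parts this gives 16 blocks of 4096 bits each; every block can then be handed to the pool as one task.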
I implemented several classes in the C++ code. The AtomicMutex class uses std::atomic_flag to implement a very fast spin-lock replacement for a mutex. This AtomicMutex protects the queue of tasks submitted to the thread pool.
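To show the spin-lock idea in isolation, here is a minimal sketch built on std::atomic_flag. The names SpinLock and CountWithSpinLock are made up for this example; the full AtomicMutex appears in the listing further down and differs in detail. The yield inside the loop is an optional choice of this sketch, and pure spinning works as well.

```cpp
#include <atomic>
#include <mutex>   // std::lock_guard
#include <thread>
#include <vector>

// Minimal spin lock. It has lock()/unlock(), so it works with std::lock_guard.
class SpinLock {
public:
    void lock() {
        // test_and_set returns the previous value: loop until we are the one who set it.
        while (flag_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() { flag_.clear(std::memory_order_release); }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical demo: each of `threads` threads increments a plain counter
// `per_thread` times while holding the lock.
long CountWithSpinLock(int threads, int per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<SpinLock> guard(lock);
                ++counter;
            }
        });
    for (auto & th : pool) th.join();
    return counter;
}
```

A spin lock like this only pays off when the protected section is very short, as it is for pushing or popping one task.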
The RingBuffer class is based on std::vector and implements a simple and fast queue that can store any objects. It uses two indices (head and tail) pointing into the vector. When a new element is added to the queue, the tail index advances to the right; if it reaches the end of the vector, it wraps around to position 0. In the same way, when an element is taken out of the queue, the head index advances to the right with wrap-around. RingBuffer is used to store the thread pool's tasks.
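The wrap-around behaviour can be sketched with a fixed-capacity variant. FixedRing is a made-up name for this example; unlike the RingBuffer in the listing below, this sketch never grows and simply asserts when it is full.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity FIFO queue on top of std::vector.
template <typename T>
class FixedRing {
public:
    explicit FixedRing(std::size_t capacity) : buf_(capacity) { assert(capacity > 0); }
    bool empty() const { return size_ == 0; }
    bool full() const { return size_ == buf_.size(); }
    std::size_t size() const { return size_; }

    // Store at the tail, then advance the tail with wrap-around.
    void push(T value) {
        assert(!full());
        buf_[tail_] = std::move(value);
        tail_ = (tail_ + 1) % buf_.size();
        ++size_;
    }
    // Take from the head, then advance the head with wrap-around.
    T pop() {
        assert(!empty());
        T value = std::move(buf_[head_]);
        head_ = (head_ + 1) % buf_.size();
        --size_;
        return value;
    }
private:
    std::vector<T> buf_;
    std::size_t head_ = 0, tail_ = 0, size_ = 0;
};
```

With capacity 3, pushing 1, 2, 3, popping once and pushing 4 stores the 4 in slot 0 again, while the pop order stays 1, 2, 3, 4.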
The Queue class is a wrapper around RingBuffer with AtomicMutex protection. The spin-lock mutex guards against simultaneous adding/taking of elements to/from the queue by multiple worker threads.
Pool implements the multi-core task pool itself. It creates as many worker threads as there are CPU hardware threads (double the number of cores) minus one. Each worker thread just polls the queue for new tasks and executes them immediately. The main thread adds new tasks to the queue. Pool also has a Wait() method that waits until all current tasks are finished; it is used as a barrier to wait until the whole 64K bitset (all sub-parts) is processed. Pool accepts any lambdas (function closures) to run. You can see that each sub-part of the 64K bitset is submitted with pool.Emplace(lambda), and later pool.Wait() waits until all sub-parts are finished. Exceptions from pool workers are collected and reported to the user if there is any error. While in Wait(), the pool also runs tasks inside the main thread, so as not to waste one core on just waiting for tasks to finish.
Timings reported in the console are measured with std::chrono.
Both versions can be run: single-threaded (your original version) and multi-threaded using all cores. You switch between them by passing the MultiThreaded = true template parameter to the function ProcessBitset().
#include <cstdint>
#include <atomic>
#include <vector>
#include <array>
#include <queue>
#include <functional>
#include <thread>
#include <future>
#include <exception>
#include <optional>
#include <memory>
#include <iostream>
#include <iomanip>
#include <bitset>
#include <string>
#include <chrono>
#include <algorithm>
#include <any>
#include <type_traits>
class AtomicMutex {
class LockerC;
void lock() {
while (f_.test_and_set(std::memory_order_acquire))
//f_.wait(true, std::memory_order_acquire)
void unlock() {
LockerC Locker() { return LockerC(*this); }
class LockerC {
LockerC() = delete;
LockerC(AtomicMutex & mux) : pmux_(&mux) { mux.lock(); }
LockerC(LockerC const & other) = delete;
LockerC(LockerC && other) : pmux_(other.pmux_) { other.pmux_ = nullptr; }
~LockerC() { if (pmux_) pmux_->unlock(); }
LockerC & operator = (LockerC const & other) = delete;
LockerC & operator = (LockerC && other) = delete;
AtomicMutex * pmux_ = nullptr;
std::atomic_flag f_ = ATOMIC_FLAG_INIT;
template <typename T>
class RingBuffer {
RingBuffer() : buf_(1 << 8), last_(buf_.size() - 1) {}
T & front() { return buf_[first_]; }
T const & front() const { return buf_[first_]; }
T & back() { return buf_[last_]; }
T const & back() const { return buf_[last_]; }
size_t size() const { return size_; }
bool empty() const { return size_ == 0; }
template <typename ... Args>
void emplace(Args && ... args) {
while (size_ >= buf_.size()) {
std::rotate(&buf_[0], &buf_[first_], &buf_[buf_.size()]);
first_ = 0;
last_ = buf_.size() - 1;
buf_.resize(buf_.size() * 2);
if (last_ >= buf_.size())
last_ = 0;
buf_[last_] = T(std::forward<Args>(args)...);
void pop() {
if (size_ == 0)
if (first_ >= buf_.size())
first_ = 0;
std::vector<T> buf_;
size_t first_ = 0, last_ = 0, size_ = 0;
template <typename T>
class Queue {
size_t Size() const { return q_.size(); }
bool Empty() const { return q_.size() == 0; }
template <typename ... Args>
void Emplace(Args && ... args) {
auto lock = m_.Locker();
T Pop(std::function<void()> const & on_empty = []{},
std::function<void()> const & on_full = []{}) {
while (true) {
if (q_.empty()) {
auto lock = m_.Locker();
if (q_.empty()) {
T val = std::move(q_.front());
return std::move(val);
std::optional<T> TryPop() {
auto lock = m_.Locker();
if (q_.empty())
return std::nullopt;
T val = std::move(q_.front());
return std::move(val);
AtomicMutex m_;
RingBuffer<T> q_;
class RunInDestr {
RunInDestr(std::function<void()> const & f) : f_(f) {}
~RunInDestr() { f_(); }
std::function<void()> const & f_;
class Pool {
struct FinishExc {};
struct Worker {
std::unique_ptr<std::atomic<bool>> pdone = std::make_unique<std::atomic<bool>>(true);
std::unique_ptr<std::exception_ptr> pexc = std::make_unique<std::exception_ptr>();
std::unique_ptr<std::thread> thr;
Pool(size_t nthreads = size_t(-1)) {
if (nthreads == size_t(-1))
nthreads = std::thread::hardware_concurrency() - 1;
std::cout << "Pool has " << nthreads << " worker threads." << std::endl;
for (size_t i = 0; i < nthreads; ++i) {
workers_.back().thr = std::make_unique<std::thread>(
[&, pdone = workers_.back().pdone.get(), pexc = workers_.back().pexc.get()]{
try {
std::function<void()> f_done = [pdone]{
pdone->store(true, std::memory_order_relaxed);
}, f_empty = [this]{
}, f_full = [pdone]{
pdone->store(false, std::memory_order_relaxed);
while (true) {
RunInDestr set_done(f_done);
tasks_.Pop(f_empty, f_full)();
} catch (...) {
exc_.store(true, std::memory_order_relaxed);
*pexc = std::current_exception();
~Pool() {
void CheckExc() {
if (!exc_.load(std::memory_order_relaxed))
throw std::runtime_error("Pool: Exception occured!");
void Finish() {
finish_ = true;
for (auto & w: workers_)
try {
if (*w.pexc)
} catch (FinishExc const &) {}
template <typename ... Args>
void Emplace(Args && ... args) {
void Wait() {
while (true) {
auto task = tasks_.TryPop();
if (!task)
while (true) {
bool done = true;
for (auto & w: workers_)
if (!w.pdone->load(std::memory_order_relaxed)) {
done = false;
if (done)
void CheckFinish() {
if (finish_)
throw FinishExc{};
Queue<std::function<void()>> tasks_;
std::vector<Worker> workers_;
bool finish_ = false;
std::atomic<bool> exc_ = false;
template <bool MultiThreaded = true, size_t BitSize>
void ProcessBitset(Pool & pool, std::bitset<BitSize> & bset,
        std::string const & businessLogicCriteria) {
    // Slice the bitset into 16 equal parts, one task per part.
    static size_t constexpr block = BitSize / 16;
    for (int j = 0; j < BitSize; j += block) {
        auto task = [&bset, j]{
            int const hi = std::min(j + block, BitSize);
            for (int i = j; i < hi; ++i) {
                if (i % 2 == 0)
                    bset[i] = 0;
                else
                    bset[i] = 1;
            }
        };
        if constexpr(MultiThreaded)
            pool.Emplace(std::move(task));
        else
            task();
    }
    // Barrier: wait until all sub-parts are processed.
    if constexpr(MultiThreaded)
        pool.Wait();
}
static auto const gtb = std::chrono::high_resolution_clock::now();
// Seconds elapsed since program start.
double Time() {
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}
void Compute() {
    Pool pool;
    std::bitset<65536> bset;
    std::string businessLogicCriteria;
    int const hi = 50000;
    for (int j = 0; j < hi; ++j) {
        if ((j & 0x1FFF) == 0 || j + 1 >= hi)
            std::cout << j / 1000 << "K (" << std::fixed << std::setprecision(3) << Time() << " sec), " << std::flush;
        ProcessBitset(pool, bset, businessLogicCriteria);
        businessLogicCriteria = "...";
    }
}
void TimeMeasure() {
size_t constexpr A = 1 << 16, B = 1 << 5;
Pool pool;
auto const tb = Time();
int64_t volatile x = 0;
for (size_t i = 0; i < A; ++i) {
for (size_t j = 0; j < B; ++j)
pool.Emplace([&]{ x = x + 1; });
std::cout << "AtomicPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << A * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (A * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (A * B) << std::endl;
auto const tb = Time();
//size_t const nthr = std::thread::hardware_concurrency();
size_t constexpr C = A / 8;
std::vector<std::future<void>> asyncs;
int64_t volatile x = 0;
for (size_t i = 0; i < C; ++i) {
for (size_t j = 0; j < B; ++j)
asyncs.emplace_back(std::async(std::launch::async, [&]{ x = x + 1; }));
std::cout << "AsyncPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << C * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (C * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (C * B) << std::endl;
int main() {
    try {
        TimeMeasure();
        Compute();
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    } catch (...) {
        std::cout << "Unknown Exception!" << std::endl;
        return -1;
    }
}
Output for 4 cores (8 hardware threads):
Pool has 7 worker threads.
AtomicPool time 0.903 sec, speed 2321.831 empty K-tasks/sec, 0.431 sec/M-task, no-collisions 0.9999967
AsyncPool time 0.982 sec, speed 266.789 empty K-tasks/sec, 3.750 sec/M-task, no-collisions 0.9999123
Pool has 7 worker threads.
0K (0.074 sec), 8K (0.670 sec), 16K (1.257 sec), 24K (1.852 sec), 32K (2.435 sec), 40K (2.984 sec), 49K (3.650 sec), 49K (3.711 sec),
For comparison, below are the timings of the single-threaded version, which is 6x slower:
0K (0.125 sec), 8K (3.786 sec), 16K (7.754 sec), 24K (11.202 sec), 32K (14.662 sec), 40K (18.056 sec), 49K (21.470 sec), 49K (21.841 sec),

You have this inner loop you want to parallelize:
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
So a good idea is to split it into chunks, and have multiple threads do each chunk in parallel. You can submit chunks to workers easily with a std::atomic<int> counter that increments to identify which chunk to work on. You can also make sure the threads all stop working after one loop before starting the next with a std::barrier:
std::bitset<65536> bits{};
std::thread pool[8];  // Change size accordingly
std::atomic<int> task_number{0};
constexpr std::size_t tasks_per_loop = 32;  // Arbitrarily chosen
constexpr std::size_t block_size = (bits.size()+tasks_per_loop-1) / tasks_per_loop;
// (only written to by one thread by the barrier, so not atomic)
uint64_t hash = 0;
int i = 0;
// The completion function runs once per phase, after all threads have arrived.
std::barrier barrier(std::size(pool), [&]() noexcept {
    task_number = 0;
    ++i;
    hash = Hash(bits, hash);
});
for (std::thread& t : pool) {
    t = std::thread([&]{
        while (i < 50000) {
            // Claim the next unprocessed block until none are left.
            for (int t; (t = task_number++) < tasks_per_loop;) {
                int block_start = t * block_size;
                int block_end = std::min(block_start + block_size, bits.size());
                for (int j = block_start; j < block_end; ++j) {
                    bits[j] = ModifyBit(i, j, hash, bits[j]);
                }
            }
            // Wait for other threads to finish and hash
            // to be calculated before starting next loop
            barrier.arrive_and_wait();
        }
    });
}
for (std::thread& t : pool) t.join();
(The seemingly easy way of parallelizing the for loop with OpenMP #pragma omp parallel for seemed slower with some testing, perhaps because the tasks were so small)
Here it is against your implementation running similar code:
And on my machine, running this a few times with 1 million iterations took 28 to 32 seconds with my approach and 44 to 50 seconds with your general thread pool approach (granted this is much less general because it can't execute arbitrary std::function<void()> tasks).


C++ async and deferred show no difference in time compared to only async

I am creating a C++ program that uses 100 random number generators. The number generators are split into two groups: ones that create 100 numbers and ones that create 10 000 000 numbers.
I am trying to see the difference between:
Using deferred launching for the 100 numbers and async for the 10 000 000 numbers.
Using only async for both types of number generators.
There's no difference in time, so my code has something wrong with it, but so far I haven't been able to find it because I am a beginner with C++.
Below is the code. I've commented the part that uses only async.
#include <iostream>
#include <chrono>
#include <future>
#include <list>
Using both deferred and async launchings: 5119 ms
Using only async launching: 5139 ms
using namespace std;
class RandomNumberGenerator
{
public:
    enum class task { LIGHT, HEAVY };
    task taskType;
    // Randomly pick whether this generator is a light or a heavy one.
    RandomNumberGenerator(): taskType(task::LIGHT)
    {
        int rnd = rand() % 2;
        if (rnd == 0)
            taskType = task::LIGHT;
        else
            taskType = task::HEAVY;
    }
    bool generateNumbers()
    {
        int number;
        if(taskType == task::LIGHT)
            for (int i = 0; i < 100; i++)
                number = rand();
        else
            for (int i = 0; i < 1000000; i++)
                number = rand();
        return true;
    }
};
int main()
{
    cout << "Starting to generate numbers\n";
    RandomNumberGenerator objects[100];
    auto start = chrono::system_clock::now();
    for (int i = 0; i < 100; i++)
    {
        future<bool> gotNumbers;
        if (objects[i].taskType == RandomNumberGenerator::task::LIGHT)
            gotNumbers = async(launch::deferred, &RandomNumberGenerator::generateNumbers, &objects[i]);
        else
            gotNumbers = async(launch::async, &RandomNumberGenerator::generateNumbers, &objects[i]);
        bool result = gotNumbers.get();
        //future<bool> gotNumbers = async(launch::async, &RandomNumberGenerator::generateNumbers, &objects[i]);
        //bool result = gotNumbers.get();
    }
    auto end = chrono::system_clock::now();
    // The duration is cast to milliseconds, so report it in ms.
    cout << "Total time = " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << " ms\n";
}
Whether you use launch::deferred or launch::async, the same amount of work still needs to be done. The only difference is whether it is done on another thread while the current thread blocks in gotNumbers.get() waiting for that thread to finish, or whether the result is calculated directly in the current thread when you call gotNumbers.get(). Either way you aren't gaining any performance from additional threads, as only one thread is ever executing at a time.
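The difference between the two policies can be observed directly by comparing thread ids. The helper name RunsOnCallingThread is made up for this small sketch.

```cpp
#include <future>
#include <thread>

// Returns true if a task started with the given launch policy
// ended up running on the thread that called this function.
bool RunsOnCallingThread(std::launch policy) {
    const std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> f =
        std::async(policy, [] { return std::this_thread::get_id(); });
    // For launch::deferred the lambda runs right here, inside get().
    return f.get() == caller;
}
```

RunsOnCallingThread(std::launch::deferred) is true and RunsOnCallingThread(std::launch::async) is false, yet in both cases the caller does nothing else until get() returns.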
If you start executing the async work before calling objects[i].generateNumbers() you might see more difference (though the overhead of std::async might still outweigh the performance increase).
#if 1
future<bool> gotNumbers;
if ( objects[ i ].taskType == RandomNumberGenerator::task::LIGHT )
    gotNumbers = async( launch::deferred, &RandomNumberGenerator::generateNumbers, &objects[ i ] );
else
    gotNumbers = async( launch::async, &RandomNumberGenerator::generateNumbers, &objects[ i ] );
#else
future<bool> gotNumbers = async(launch::async, &RandomNumberGenerator::generateNumbers, &objects[i]);
#endif
objects[ i ].generateNumbers();
bool result = gotNumbers.get();
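Going one step further, tasks can only overlap if all of them are started before any of them is waited on. Here is a minimal sketch of that pattern. BusyWork is a made-up stand-in for generateNumbers() that avoids rand(), whose thread safety is implementation-defined, and RunAllThenCollect is likewise a name invented for this example.

```cpp
#include <future>
#include <vector>

// Hypothetical stand-in for generateNumbers(): CPU-bound work that returns a value.
long BusyWork(int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i) sum += i % 7;
    return sum;
}

// Start every task before waiting on any of them, so that they can run concurrently.
long RunAllThenCollect(int tasks, int n) {
    std::vector<std::future<long>> futures;
    futures.reserve(tasks);
    for (int t = 0; t < tasks; ++t)
        futures.push_back(std::async(std::launch::async, BusyWork, n));
    long total = 0;
    for (auto & f : futures) total += f.get();  // blocks only on tasks still running
    return total;
}
```

Whether this is faster in practice still depends on how heavy each task is compared to the cost of std::async itself.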

Cyclic splitting of execution into several threads (1-N-1-N-1...)

Consider this case:
for (...)
const size_t count = ...
for (size_t i = 0; i < count; ++i)
calculate(i); // thread-safe function
What is the most elegant solution to maximize performance using C++17 and/or boost?
Cyclically creating and joining threads makes no sense because of the huge overhead (which in my case exactly equals the possible gain).
So I have to create N threads only once and keep them synchronized with the main one (using mutex, shared_mutex, condition_variable, atomic, etc.). That turned out to be a rather difficult task for such a common and clear situation (if everything is to be really safe and fast). After sticking with it for days, I have a feeling of reinventing the wheel...
Update 1: calculate(x) and calculate(y) can (and should) run in parallel
Update 2: std::atomic::fetch_add (or something similar) is preferable to a queue (or something similar)
Update 3: extreme computations (i.e. millions of "outer" calls and hundreds of "inner")
Update 4: calculate() changes internal object's data without returning a value
Intermediate solution
For some reason "async + wait" is much faster than "create + join" threads. So these two examples give a 100% speed increase:
Example 1
for (...)
{
    const size_t count = ...
    future<void> execution[cpu_cores];
    for (size_t x = 0; x < cpu_cores; ++x)
        execution[x] = async(launch::async, ref(*this), x, count);
    for (size_t x = 0; x < cpu_cores; ++x)
        execution[x].wait();
}
// Static split: task x handles indices x, x + cpu_cores, x + 2 * cpu_cores, ...
void operator()(const size_t x, const size_t count)
{
    for (size_t i = x; i < count; i += cpu_cores)
        calculate(i);
}
Example 2
for (...)
{
    index = 0;
    const size_t count = ...
    future<void> execution[cpu_cores];
    for (size_t x = 0; x < cpu_cores; ++x)
        execution[x] = async(launch::async, ref(*this), count);
    for (size_t x = 0; x < cpu_cores; ++x)
        execution[x].wait();
}
atomic<size_t> index;
// Dynamic split: each task claims the next free index with fetch_add.
void operator()(const size_t count)
{
    for (size_t i = index.fetch_add(1); i < count; i = index.fetch_add(1))
        calculate(i);
}
Is it possible to make it even faster by creating threads only once and then synchronize them with a small overhead?
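One possible shape of a "create the threads once" scheme is shown below as a minimal sketch, not a tuned implementation. RoundRunner and all of its members are made-up names. It assumes that busy-waiting with yield between rounds is acceptable, and it uses default sequentially consistent atomics for simplicity. Workers are started once, each call to Run() publishes a new round through an atomic counter, indices are claimed with fetch_add, and Run() returns only after every worker has reported the round as finished.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: worker threads are created once and reused for every round.
class RoundRunner {
public:
    explicit RoundRunner(std::size_t workers) {
        for (std::size_t w = 0; w < workers; ++w)
            threads_.emplace_back([this] { WorkerLoop(); });
    }
    ~RoundRunner() {
        stop_ = true;
        for (auto & t : threads_) t.join();
    }
    // Calls fn(i) for every i in [0, count) and returns when all calls are done.
    void Run(std::size_t count, std::function<void(std::size_t)> fn) {
        fn_ = std::move(fn);
        count_ = count;
        index_ = 0;
        pending_ = threads_.size();
        ++round_;                        // publishes the new round to the workers
        Drain();                         // the calling thread helps too
        while (pending_ != 0) std::this_thread::yield();
    }
private:
    // Claim indices one by one until the range is exhausted.
    void Drain() {
        for (std::size_t i = index_.fetch_add(1); i < count_; i = index_.fetch_add(1))
            fn_(i);
    }
    void WorkerLoop() {
        std::size_t seen = 0;
        while (true) {
            while (round_ == seen) {     // wait for the next round or for shutdown
                if (stop_) return;
                std::this_thread::yield();
            }
            seen = round_;
            Drain();
            --pending_;                  // tell Run() this worker finished the round
        }
    }
    std::vector<std::thread> threads_;
    std::function<void(std::size_t)> fn_;
    std::size_t count_ = 0;
    std::atomic<std::size_t> index_{0}, pending_{0}, round_{0};
    std::atomic<bool> stop_{false};
};
```

Usage would look like RoundRunner runner(cpu_cores - 1); followed by runner.Run(count, [&](size_t i) { calculate(i); }); inside the outer loop. Because Run() waits for all workers before returning, fn_ and count_ are never modified while a worker may still read them.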
Final solution
Additional +20% of speed increase in comparison to std::async!
for (size_t i = 0; i < _countof(index); ++i) { index[i] = i; }
for_each_n(par_unseq, index, count, [&](const size_t i) { calculate(i); });
Is it possible to avoid redundant array "index"?
for_each_n(par_unseq, counting_iterator<size_t>(0), count,
    [&](const size_t i)
    {
        calculate(i);
    });
In the past, you'd use OpenMP, GNU Parallel, Intel TBB.¹
If you have c++17², I'd suggest using execution policies with standard algorithms.
It's really better than what you can expect to achieve by doing things yourself, although it requires some forethought to choose types that are amenable to standard algorithms, and it still helps if you know what will happen under the hood.
Here's a simple example without further ado:
#include <thread>
#include <algorithm>
#include <random>
#include <execution>
#include <iostream>
using namespace std::chrono_literals;
static size_t s_random_seed = std::random_device{}();
static auto generate_param() {
static std::mt19937 prng {s_random_seed};
static std::uniform_int_distribution<> dist;
return dist(prng);
struct Task {
Task(int p = generate_param()) : param(p), output(0) {}
int param;
int output;
struct ByParam { bool operator()(Task const& a, Task const& b) const { return a.param < b.param; } };
struct ByOutput { bool operator()(Task const& a, Task const& b) const { return a.output < b.output; } };
static void calculate(Task& task) {
task.output = task.param ^ 0xf0f0f0f0;
int main(int argc, char** argv) {
if (argc>1) {
s_random_seed = std::stoull(argv[1]);
std::vector<Task> jobs;
auto now = std::chrono::high_resolution_clock::now;
auto start = now();
1ull << 28, // reduce for small RAM!
auto laptime = [&](auto caption) {
std::cout << caption << " in " << (now() - start)/1.0s << "s" << std::endl;
start = now();
laptime("generate randum input");
begin(jobs), end(jobs),
laptime("sort by param");
begin(jobs), end(jobs),
begin(jobs), end(jobs),
laptime("sort by output");
auto const checksum = std::transform_reduce(
begin(jobs), end(jobs),
0, std::bit_xor<>{},
std::cout << "Checksum: " << checksum << "\n";
When run with the seed 42, prints:
generate randum input in 10.8819s
sort by param in 8.29467s
calculate in 0.22513s
sort by output in 5.64708s
reduce in 0.108768s
Checksum: 683872090
CPU utilization is 100% on all cores except for the first (random-generation) step.
¹ (I think I have answers demoing all of these on this site).
² See Are C++17 Parallel Algorithms implemented already?

Troubles with simple Lock-Free MPSC Ring Buffer

I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer. The main idea is to have atomic head and tail indices. When pushing an element to the queue, the head is increased atomically to reserve a slot in the buffer:
#include <atomic>
#include <chrono>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <vector>
template <class T> class MPSC {
std::atomic<int> head{0}; ///< index of first free slot
std::atomic<int> tail{0}; ///< index of first occupied slot
std::unique_ptr<T[]> data;
std::unique_ptr<std::atomic<bool>[]> valid; ///< indicates whether data at an
///< index has been fully written
/// Compute next index modulo size.
inline int advance(int x) { return (x + 1) % MAX_SIZE; }
explicit MPSC(int size) {
if (size <= 0)
throw std::invalid_argument("size must be greater than 0");
MAX_SIZE = size + 1;
data = std::make_unique<T[]>(MAX_SIZE);
valid = std::make_unique<std::atomic<bool>[]>(MAX_SIZE);
/// Add an element to the queue.
/// If the queue is full, this method blocks until a slot is available for
/// writing. This method is not starvation-free, i.e. it is possible that one
/// thread always fills up the queue and prevents others from pushing.
void push(const T &msg) {
int idx;
int next_idx;
int k = 100;
do {
idx = head;
next_idx = advance(idx);
while (next_idx == tail) { // queue is full
k = k >= 100000 ? k : k * 2; // exponential backoff
} // spin
} while (!head.compare_exchange_weak(idx, next_idx));
if (valid[idx])
// this throws, suggesting that two threads are writing to the same index. I have no idea how this is possible.
throw std::runtime_error("message slot already written");
data[idx] = msg;
valid[idx] = true; // this was set to false by the reader,
// set it to true to indicate completed data write
/// Read an element from the queue.
/// If the queue is empty, this method blocks until a message is available.
/// This method is only safe to be called from one single reader thread.
T pop() {
int k = 100;
while (is_empty() || !valid[tail]) {
k = k >= 100000 ? k : k * 2;
} // spin
T res = data[tail];
valid[tail] = false;
tail = advance(tail);
return res;
bool is_full() { return (head + 1) % MAX_SIZE == tail; }
bool is_empty() { return head == tail; }
When there is a lot of congestion, some messages get overwritten by other threads. Hence there must be something fundamentally wrong with what I'm doing here.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
Even if a producer were to pause just before writing its data, the tail could not advance past this thread's idx, and hence no other thread should be able to overtake it and claim that same idx.
At the risk of posting too much code, here is a simple program that reproduces the problem. It sends some incrementing numbers from many threads and checks whether all numbers are received by the consumer:
#include "mpsc.hpp" // or whatever; the above queue
#include <thread>
#include <iostream>
int main() {
static constexpr int N_THREADS = 10; ///< number of threads
static constexpr int N_MSG = 1E+5; ///< number of messages per thread
struct msg {
int t_id;
int i;
MPSC<msg> q(N_THREADS / 2);
std::thread threads[N_THREADS];
// consumer
threads[0] = std::thread([&q] {
int expected[N_THREADS] {};
for (int i = 0; i < N_MSG * (N_THREADS - 1); ++i) {
msg m = q.pop();
std::cout << "Got message from T-" << m.t_id << ": " << m.i << std::endl;
if (expected[m.t_id] != m.i) {
std::cout << "T-" << m.t_id << " unexpected msg " << m.i << "; expected " << expected[m.t_id] << std::endl;
return -1;
expected[m.t_id] = m.i + 1;
// producers
for (int id = 1; id < N_THREADS; ++id) {
threads[id] = std::thread([id, &q] {
for (int i = 0; i < N_MSG; ++i) {
q.push(msg{id, i});
for (auto &t : threads)
I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer.
I assume you are doing this as a learning exercise. Implementing a lock-free queue yourself is most probably the wrong thing to do if you want to solve a real problem.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
The combination of that producer spinlock with the outer CAS loop does not work in the intended way:
do {
idx = head;
next_idx = advance(idx);
while (next_idx == tail) { // queue is full
k = k >= 100000 ? k : k * 2; // exponential backoff
} // spin
// ...
// All other threads (producers and consumers) can progress.
// ...
} while (!head.compare_exchange_weak(idx, next_idx));
The queue may be full when the CAS happens because those checks are performed independently. In addition, the CAS may succeed because the other threads may have advanced head to exactly match idx.
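One known way to close both gaps is to stop reusing wrapped indices as tickets. The sketch below swaps in a different technique from the one in the question: positions increase monotonically, and every slot carries its own sequence number, in the style of Dmitry Vyukov's bounded queue. It is not a patch of the code above. BoundedQueue, try_push and try_pop are names chosen for this example, T is assumed to be default-constructible, and the capacity must be at least 2.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Sketch of a bounded multi-producer / single-consumer queue.
template <class T>
class BoundedQueue {
    struct Cell {
        std::atomic<std::size_t> seq{0};  // tells which position may use this cell next
        T value{};
    };
public:
    explicit BoundedQueue(std::size_t capacity)
        : cells_(std::make_unique<Cell[]>(capacity)), capacity_(capacity) {
        assert(capacity >= 2);
        for (std::size_t i = 0; i < capacity; ++i)
            cells_[i].seq.store(i, std::memory_order_relaxed);
    }
    // Safe to call from many producer threads. Returns false when the queue is full.
    bool try_push(const T & v) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Cell & c = cells_[pos % capacity_];
            const std::size_t seq = c.seq.load(std::memory_order_acquire);
            if (seq == pos) {
                // The cell is free for exactly this position: try to claim the position.
                if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    c.value = v;
                    c.seq.store(pos + 1, std::memory_order_release);  // mark as written
                    return true;
                }
                // On failure the CAS reloaded pos; retry with the new value.
            } else if (seq < pos) {
                return false;  // the item from the previous lap is still in the cell
            } else {
                pos = head_.load(std::memory_order_relaxed);  // another producer got ahead
            }
        }
    }
    // Must only be called from the single consumer thread.
    bool try_pop(T & out) {
        Cell & c = cells_[tail_ % capacity_];
        if (c.seq.load(std::memory_order_acquire) != tail_ + 1)
            return false;  // empty, or the producer has not finished writing yet
        out = c.value;
        c.seq.store(tail_ + capacity_, std::memory_order_release);  // free for next lap
        ++tail_;
        return true;
    }
private:
    std::unique_ptr<Cell[]> cells_;
    std::size_t capacity_;
    std::atomic<std::size_t> head_{0};  // next position to write, never wraps
    std::size_t tail_ = 0;              // next position to read, consumer only
};
```

Because a producer can only claim position pos while the cell's sequence number equals pos, the "is there room" check and the claim refer to the same ticket, and a head value that has wrapped around the buffer can no longer be mistaken for the old one. A blocking push can be layered on top by retrying try_push with a backoff.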

How can I refactor this code into multi-thread version?

There is a loop which takes quite a long time and I'm considering refactoring this code into multi-thread version. And here is the model.
Photon photon;
for (int i=0;i<1000000;++i){
// do something
I have to call this function thousands and thousands of times, so I was wondering how I can create some threads to run this function at the same time.
But the photon has to be individual every single time.
The index i can be converted to this:
atomic<int> i{0};
// do something
With threading you have to pay attention to object lifetime and sharing far more than normal.
But the basic solution is
void do_tasks( std::size_t count, std::function<void( std::size_t start, std::size_t finish )> task ) {
  auto thread_count = std::thread::hardware_concurrency();
  if (thread_count <= 0) thread_count = 1;
  // One chunk runs on the calling thread, so only thread_count-1 futures are needed.
  std::vector<std::future<void>> threads( thread_count-1 );
  auto get_task = [=](std::size_t index) {
    auto start = count * index / thread_count;
    auto finish = count * (index+1) / thread_count;
    // std::cout << "from " << start << " to " << finish << "\n";
    return [task, start, finish]{ task(start, finish); };
  };
  for( auto& thread : threads ) {
    auto index = &thread-threads.data();
    thread = std::async( std::launch::async, get_task(index) );
  }
  get_task( threads.size() )();
  for (auto& thread : threads) {
    thread.get();
  }
}
This is a little multi threading library.
You use it like this:
do_tasks( 100, [&](size_t start, size_t finish) {
// do subtasks starting at index start, up to and not including finish
There are other more complex threading libraries, but writing a small half-decent one isn't hard so I did it.
To be explicit:
Photon photon;
do_tasks( 1000000, [&](size_t start, size_t finish) {
for (int i = start; i < finish; ++i) {
but you'll have to be extremely careful making sure there is no unsafe data sharing between the threads, and you aren't just blocking each thread on a common mutex.
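As a small illustration of "no unsafe data sharing", here is a sketch in which every chunk returns its own partial result instead of writing to shared state. The function name and the counting task are made up for this example. The range split uses the same count * index / parts arithmetic as do_tasks above.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical example: count the multiples of 3 in [0, count) using `parts` tasks.
// Each task works on its own half-open range and returns its own partial result,
// so no mutable state is shared between the threads.
std::size_t CountMultiplesOf3(std::size_t count, std::size_t parts) {
    std::vector<std::future<std::size_t>> partials;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t start = count * p / parts;
        const std::size_t finish = count * (p + 1) / parts;
        partials.push_back(std::async(std::launch::async, [start, finish] {
            std::size_t n = 0;
            for (std::size_t i = start; i < finish; ++i)
                if (i % 3 == 0) ++n;
            return n;
        }));
    }
    // Combine the partial results on the calling thread.
    std::size_t total = 0;
    for (auto & f : partials) total += f.get();
    return total;
}
```

The partial results are only combined after each future is ready, so no mutex or atomic is needed.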
An awful lot depends on how and to what extent photon.launch() can be parallelised.
The code below divides a range into (approximately) equal segments and then executes each segment in a separate thread.
As stated whether that helps will depend on how much of photon.launch() can be done in parallel. If it spends most of its time modifying a shared state and essentially has the form:
void launch(int index){
std::lock_guard<std::mutex> guard{m};
Where m is a member of Photon then little if anything will be gained.
If (at the other extreme) the individual calls to launch never contend for the same data then it can be parallelised up to the number of cores the system can provide.
#include <functional>
#include <thread>
#include <vector>
class Photon {
public:
    void launch(int index){
        //... what goes here matters a lot...
    }
};
// Runs photon.launch(i) for every i in the closed range [from, to].
void photon_launch(Photon& photon,int from,int to){
    for(auto i=from;i<=to;++i){
        photon.launch(i);
    }
}
int main() {
    const size_t loop_count=100000;//How big is the loop?
    const size_t thread_count=4;//How many threads can we utilize?
    std::vector< std::thread > threads;
    Photon photon;
    int from=1;
    for(size_t i=1;i<=thread_count;++i){
        //If loop_count isn't divisible by thread_count evens out the remainder.
        int to=(loop_count*i)/thread_count;
        threads.emplace_back(photon_launch,std::ref(photon),from,to);
        from=to+1;
    }
    //Now the threads are launched we block until they all finish.
    //If we don't the program may (will?) finish before the threads.
    for(auto& curr : threads){
        curr.join();
    }
    return 0;
}

Parallel for_each more than two times slower than std::for_each

I'm reading C++ Concurrency in Action by Anthony Williams. In the chapter about designing concurrent code there is a parallel version of the std::for_each algorithm. Here is slightly modified code from the book:
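#pragma once
#include <vector>
#include <thread>
// RAII helper: joins all still-joinable threads on scope exit.
class join_threads
{
public:
    explicit join_threads(std::vector<std::thread>& threads)
        : threads_(threads) {}
    ~join_threads()
    {
        for (size_t i = 0; i < threads_.size(); ++i)
            if (threads_[i].joinable()) threads_[i].join();
    }
private:
    std::vector<std::thread>& threads_;
};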
#pragma once
#include <future>
#include <algorithm>
#include "join_threads.hpp"
template<typename Iterator, typename Func>
void parallel_for_each(Iterator first, Iterator last, Func func)
{
    const auto length = std::distance(first, last);
    if (0 == length) return;
    const auto min_per_thread = 25u;
    const unsigned max_threads = (length + min_per_thread - 1) / min_per_thread;
    const auto hardware_threads = std::thread::hardware_concurrency();
    const auto num_threads= std::min(hardware_threads != 0 ?
        hardware_threads : 2u, max_threads);
    const auto block_size = length / num_threads;
    std::vector<std::future<void>> futures(num_threads - 1);
    std::vector<std::thread> threads(num_threads-1);
    join_threads joiner(threads);
    auto block_start = first;
    for (unsigned i = 0; i < num_threads - 1; ++i)
    {
        auto block_end = block_start;
        std::advance(block_end, block_size);
        std::packaged_task<void (void)> task([block_start, block_end, func]()
        {
            std::for_each(block_start, block_end, func);
        });
        futures[i] = task.get_future();
        threads[i] = std::thread(std::move(task));
        block_start = block_end;
    }
    // The calling thread processes the last block itself.
    std::for_each(block_start, last, func);
    for (size_t i = 0; i < num_threads - 1; ++i)
        futures[i].get();
}
I benchmarked it against the sequential std::for_each using the following program:
#include <iostream>
#include <random>
#include <chrono>
#include <atomic>
#include "parallel_for_each.hpp"
using namespace std;
constexpr size_t ARRAY_SIZE = 500'000'000;
typedef std::vector<uint64_t> Array;
// Runs fe(a.begin(), a.end(), f) once and prints the result and the elapsed milliseconds.
template <class FE, class F>
void test_for_each(const Array& a, FE fe, F f, atomic<uint64_t>& result)
{
    auto time_begin = chrono::high_resolution_clock::now();
    result = 0;
    fe(a.begin(), a.end(), f);
    auto time_end = chrono::high_resolution_clock::now();
    cout << "Result = " << result << endl;
    cout << "Time: " << chrono::duration_cast<chrono::milliseconds>(
        time_end - time_begin).count() << endl;
}
int main()
{
    random_device device;
    default_random_engine engine(device());
    // Note: uint8_t is not one of the integer types the standard permits here; libstdc++ accepts it.
    uniform_int_distribution<uint8_t> distribution(0, 255);
    Array a;
    cout << "Generating array ... " << endl;
    for (size_t i = 0; i < ARRAY_SIZE; ++i)
        a.push_back(distribution(engine));
    atomic<uint64_t> result;
    // Every element is added to the shared atomic, in both runs.
    auto acc = [&result](uint64_t value) { result += value; };
    cout << "parallel_for_each ..." << endl;
    test_for_each(a, parallel_for_each<Array::const_iterator, decltype(acc)>, acc, result);
    cout << "for_each ..." << endl;
    test_for_each(a, for_each<Array::const_iterator, decltype(acc)>, acc, result);
    return 0;
}
On my machine the parallel version of the algorithm is more than two times slower than the sequential one (times in milliseconds):
parallel_for_each ...
Result = 63750301073
Time: 5448
for_each ...
Result = 63750301073
Time: 2496
I'm using the GCC 6.2 compiler on Ubuntu Linux running on an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz.
How can this behavior be explained? Is it because the atomic<uint64_t> variable is shared between threads, causing cache ping-pong?
I profiled both separately with perf. For the parallel version the stats are the following:
1137982167 cache-references
247652893 cache-misses # 21,762 % of all cache refs
60868183996 cycles
27409239189 instructions # 0,45 insns per cycle
3287117194 branches
80895 faults
4 migrations
And for the sequential one:
402791485 cache-references
246561299 cache-misses # 61,213 % of all cache refs
40284812779 cycles
26515783790 instructions # 0,66 insns per cycle
3188784664 branches
48179 faults
3 migrations
The parallel version clearly generates far more cache references, cycles and faults, but why?
You are sharing the same result variable: all the threads are accumulating on atomic<uint64_t> result, thrashing the cache!
Every time a thread writes to result, all the caches in the other cores are invalidated: this leads to cache line contention.
More information:
"Sharing Is the Root of All Contention".
[...] to write to a memory location a core must additionally have exclusive ownership of the cache line containing that location. While one core has exclusive use, all other cores trying to write the same memory location must wait and take turns — that is, they must run serially. Conceptually, it's as if each cache line were protected by a hardware mutex, where only one core can hold the hardware lock on that cache line at a time.
This article on "false sharing", which covers a similar issue, explains more in depth what happens in the caches.
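To make the false sharing point concrete, here is a minimal sketch (not code from the question or from the linked article) in which every worker writes only to its own counter, and each counter is padded to its own cache line. The names PaddedCounter and count_without_sharing are hypothetical, and the 64-byte line size is an assumption that holds on most x86-64 CPUs but is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: cache lines are 64 bytes. The alignment also pads the struct
// to 64 bytes, so neighbouring counters never land on the same line.
struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};

// Each worker adds `per_thread` ones to its own padded slot, so no two
// workers ever write to the same cache line. The slots are summed at the end.
std::uint64_t count_without_sharing(unsigned num_threads, std::uint64_t per_thread)
{
    assert(num_threads > 0);
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i)
                counters[t].value += 1;  // plain, non-atomic write to a private slot
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Calling count_without_sharing(4, 1000) returns 4000. Removing the alignas(64) would pack the counters next to each other, which is the layout where false sharing can show up. Aligned allocation of over-aligned types inside std::vector is only guaranteed from C++17 on.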
I made some modifications to your program and achieved the following results (on a machine with an i7-4770K [4 cores, 8 threads with hyperthreading]):
Generating array ...
parallel_for_each ...
Result = 63748111806
Time: 195
for_each ...
Result = 63748111806
Time: 2727
The parallel version takes roughly 93% less time than the serial version (195 ms versus 2727 ms, about a 14x speedup).
std::future and std::packaged_task are heavyweight abstractions. In this case, a std::experimental::latch is sufficient.
Every task is sent to a thread pool. This minimizes thread creation overhead.
Every task has its own accumulator. This eliminates sharing.
Every task has its own accumulator. This eliminates sharing.
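The three points above can be sketched with the standard library alone. This is an illustration under assumptions, not the code from the repository: simple_latch is a hypothetical minimal latch built from std::mutex and std::condition_variable (C++20 provides std::latch for the same purpose), parallel_sum is a hypothetical name, and plain std::thread objects stand in for the thread pool.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal latch: a counter that tasks decrement and a waiter
// that blocks until the counter reaches zero.
class simple_latch
{
public:
    explicit simple_latch(std::size_t count) : count_(count) {}

    // Called once by each task when it has finished.
    void count_down()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(count_ > 0);
        if (--count_ == 0) cv_.notify_all();
    }

    // Blocks until every task has called count_down().
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// Sums `data` in `num_threads` blocks. Every block accumulates into a local
// variable and adds its partial sum to the shared atomic exactly once.
std::uint64_t parallel_sum(const std::vector<std::uint64_t>& data, unsigned num_threads)
{
    assert(num_threads > 0);
    std::atomic<std::uint64_t> result{0};
    simple_latch done(num_threads - 1);
    const auto block_size = static_cast<std::ptrdiff_t>(data.size() / num_threads);

    std::vector<std::thread> workers;  // stand-in for a thread pool
    workers.reserve(num_threads - 1);
    auto block_start = data.begin();
    for (unsigned i = 0; i + 1 < num_threads; ++i) {
        const auto block_end = block_start + block_size;
        workers.emplace_back([block_start, block_end, &result, &done] {
            std::uint64_t local = 0;  // private to this task, nothing shared
            std::for_each(block_start, block_end,
                          [&local](std::uint64_t x) { local += x; });
            result += local;          // the shared atomic is touched once per block
            done.count_down();
        });
        block_start = block_end;
    }

    // The calling thread handles the last block the same way.
    std::uint64_t local = 0;
    std::for_each(block_start, data.end(),
                  [&local](std::uint64_t x) { local += x; });
    result += local;

    done.wait();                       // every worker block has been added
    for (auto& w : workers) w.join();  // needed only because these are plain threads
    return result.load();
}
```

For example, parallel_sum(std::vector<std::uint64_t>(1000, 3), 4) returns 3000. With a real pool the join loop disappears and the latch is the only thing the caller waits on. The sketch does not handle exceptions thrown while the workers are being started.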
The code is available here on my GitHub. It uses some personal dependencies, but you should understand the changes regardless.
Here are the most important changes:
// A latch is being used instead of a vector of futures.
ecst::latch l(num_threads - 1);
auto block_start = first;
for (unsigned i = 0; i < num_threads - 1; ++i)
auto block_end = block_start;
std::advance(block_end, block_size);
// `p` is a thread pool.
// Every task posted in the thread pool has its own `tempacc` accumulator.
[&, block_start, block_end, tempacc = 0ull]() mutable
// The task accumulator is filled up...
std::for_each(block_start, block_end, [&tempacc](auto x){ tempacc += x; });
// ...and then the atomic variable is incremented ONCE.
block_start = block_end;
// Same idea here: accumulate to local non-atomic counter, then
// add the partial result to the atomic counter ONCE.
auto tempacc2 = 0ull;
std::for_each(block_start, last, [&tempacc2](auto x){ tempacc2 += x; });