Spurious underflow in C++ lock-free queue implementation

I'm trying to implement a lock-free queue that uses a linear circular buffer to store data. In contrast to a general-purpose lock-free queue, I have the following relaxed requirements:
I know the worst-case number of elements that will ever be stored in the queue. The queue is part of a system that operates on a fixed set of elements. The code will never attempt to store more elements in the queue than there are elements in this fixed set.
No multi-producer/multi-consumer. The queue will either be used in a multi-producer/single-consumer or a single-producer/multi-consumer setting.
Conceptually, the queue is implemented as follows
Standard power-of-two ring buffer. The underlying data structure is a standard ring buffer using the power-of-two trick. Read and write indices are only ever incremented. When indexing into the array they are wrapped to the size of the underlying array using a simple bitmask. The read pointer is atomically incremented in pop(), the write pointer is atomically incremented in push().
Size variable gates access to pop(). An additional "size" variable tracks the number of elements in the queue. This eliminates the need to perform arithmetic on the read and write indices. The size variable is atomically incremented after the entire write operation has taken place, i.e. the data has been written to the backing storage and the write cursor has been incremented. I'm using a compare-and-swap (CAS) operation to atomically decrement size in pop() and only continue if size is non-zero. This way pop() should be guaranteed to return valid data.
My queue implementation is as follows. Note the debug code that halts execution whenever pop() attempts to read past the memory that has previously been written by push(). This should never happen, since ‒ at least conceptually ‒ pop() may only proceed if there are elements on the queue (there should be no underflows).
#include <atomic>
#include <cstdint>
#include <csignal> // XXX for debugging
template <typename T>
class Queue {
private:
uint32_t m_data_size; // Number of elements allocated
std::atomic<T> *m_data; // Queue data, size is power of two
uint32_t m_mask; // Bitwise AND mask for m_rd_ptr and m_wr_ptr
std::atomic<uint32_t> m_rd_ptr; // Circular buffer read pointer
std::atomic<uint32_t> m_wr_ptr; // Circular buffer write pointer
std::atomic<uint32_t> m_size; // Number of elements in the queue
static uint32_t upper_power_of_two(uint32_t v) {
v--; // https://graphics.stanford.edu/~seander/bithacks.html
v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
v++;
return v;
}
public:
struct Optional { // Minimal replacement for std::optional
bool good;
T value;
Optional() : good(false) {}
Optional(T value) : good(true), value(std::move(value)) {}
explicit operator bool() const { return good; }
};
Queue(uint32_t max_size)
: // XXX Allocate 1 MiB of additional memory for debugging purposes
m_data_size(upper_power_of_two(1024 * 1024 + max_size)),
m_data(new std::atomic<T>[m_data_size]),
m_mask(m_data_size - 1),
m_rd_ptr(0),
m_wr_ptr(0),
m_size(0) {
// XXX Debug code begin
// Fill the memory with a marker so we can detect invalid reads
for (uint32_t i = 0; i < m_data_size; i++) {
m_data[i] = 0xDEADBEAF;
}
// XXX Debug code end
}
~Queue() { delete[] m_data; }
Optional pop() {
// Atomically decrement the size variable
uint32_t size = m_size.load();
while (size != 0 && !m_size.compare_exchange_weak(size, size - 1)) {
}
// The queue is empty, abort
if (size <= 0) {
return Optional();
}
// Read the actual element, atomically increase the read pointer
T res = m_data[(m_rd_ptr++) & m_mask].load();
// XXX Debug code begin
if (res == T(0xDEADBEAF)) {
std::raise(SIGTRAP);
}
// XXX Debug code end
return res;
}
void push(T t) {
m_data[(m_wr_ptr++) & m_mask].store(t);
m_size++;
}
bool empty() const { return m_size == 0; }
};
However, underflows do occur and can easily be triggered in a multi-threaded stress-test. In this particular test I maintain two queues q1 and q2. In the main thread I feed a fixed number of elements into q1. Two worker threads read from q1 and push onto q2 in a tight loop. The main thread reads data from q2 and feeds it back to q1.
This works fine if there is only one worker-thread (single-producer/single-consumer) or as long as all worker-threads are on the same CPU as the main thread. However, it fails as soon as there are two worker threads that are explicitly scheduled onto a different CPU than the main thread.
The following code implements this test
#include <pthread.h>
#include <thread>
#include <vector>
static void queue_stress_test_main(std::atomic<uint32_t> &done_count,
Queue<int> &queue_rd, Queue<int> &queue_wr) {
for (size_t i = 0; i < (1UL << 24); i++) {
auto res = queue_rd.pop();
if (res) {
queue_wr.push(res.value);
}
}
done_count++;
}
static void set_thread_affinity(pthread_t thread, int cpu) {
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
if (pthread_setaffinity_np(thread, sizeof(cpu_set_t),
&cpuset) != 0) {
throw "Error while calling pthread_setaffinity_np";
}
}
int main() {
static constexpr uint32_t n_threads{2U}; // Number of worker threads
//static constexpr uint32_t n_threads{1U}; // < Works fine
static constexpr uint32_t max_size{16U}; // Elements in the queue
std::atomic<uint32_t> done_count{0}; // Number of finished threads
Queue<int> queue1(max_size), queue2(max_size);
// Launch n_threads threads, make sure the main thread and the two worker
// threads are on different CPUs.
std::vector<std::thread> threads;
for (uint32_t i = 0; i < n_threads; i++) {
threads.emplace_back(queue_stress_test_main, std::ref(done_count),
std::ref(queue1), std::ref(queue2));
set_thread_affinity(threads.back().native_handle(), 0);
}
set_thread_affinity(pthread_self(), 1);
//set_thread_affinity(pthread_self(), 0); // < Works fine
// Pump data from queue2 into queue1
uint32_t elems_written = 0;
while (done_count < n_threads || !queue2.empty()) {
// Initially fill queue1 with all values from 0..max_size-1
if (elems_written < max_size) {
queue1.push(elems_written++);
}
// Read elements from queue2 and put them into queue1
auto res = queue2.pop();
if (res) {
queue1.push(res.value);
}
}
// Wait for all threads to finish
for (uint32_t i = 0; i < n_threads; i++) {
threads[i].join();
}
}
Most of the time this program triggers the trap in the queue code, which means that pop() attempts to read memory that has never been touched by push() ‒ although pop() should only succeed if push() has been called at least as often as pop().
You can compile and run the above program with GCC/clang on Linux using
c++ -std=c++11 queue.cpp -o queue -lpthread && ./queue
Either just concatenate the above two code blocks or download the complete program here.
Note that I'm a complete novice when it comes to lock-free data structures. I'm perfectly aware that there are plenty of battle-tested lock-free queue implementations for C++. However, I simply can't figure out why the above code does not work as intended.

You have two bugs, one of which can cause the failure you observe.
Let's look at your push code, except we'll allow only one operation per statement:
void push(T t)
{
auto const claimed_index = m_wr_ptr++; /* 1 */
auto const claimed_offset = claimed_index & m_mask; /* 2 */
auto& claimed_data = m_data[claimed_offset]; /* 3 */
claimed_data.store(t); /* 4 */
m_size++; /* 5 */
}
Now, for a queue with two producers, there is a window of vulnerability to a race condition between operations 1 and 4:
Before:
m_rd_ptr == 1
m_wr_ptr == 1
m_size == 0
Producer A:
/* 1 */ claimed_index = 1; m_wr_ptr = 2;
/* 2 */ claimed_offset = 1;
Scheduler puts Producer A to sleep here
Producer B:
/* 1 */ claimed_index = 2; m_wr_ptr = 3;
/* 2 */ claimed_offset = 2;
/* 3 */ claimed_data = m_data[2];
/* 4 */ claimed_data.store(t);
/* 5 */ m_size = 1;
After:
m_size == 1
m_rd_ptr == 1
m_wr_ptr == 3
m_data[1] == 0xDEADBEAF
m_data[2] == value_produced_by_B
The consumer now runs, sees m_size > 0, and reads from m_data[1] while increasing m_rd_ptr from 1 to 2. But m_data[1] hasn't been written by Producer A yet, and Producer B wrote to m_data[2].
The second bug is the complementary case in pop() when a consumer thread is interrupted between the m_rd_ptr++ action and the .load() call. It can result in reading values out of order, potentially so far out of order that the queue has completely circled and overwritten the original value.
Just because two operations in a single source statement are atomic does not make the entire statement atomic.
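One common way to close both windows is to publish each element through its own slot instead of through a shared size counter. The sketch below is not the original code; it follows the per-slot sequence-number scheme of Dmitry Vyukov's bounded MPMC queue: a producer claims an index, writes the element, and only then marks that particular slot as readable, so a consumer can never observe a claimed-but-unwritten slot.
#include <atomic>
#include <cstdint>

// Sketch only: bounded queue with a per-slot sequence number. SizePow2 must
// be a power of two. Not drop-in compatible with the Queue<T> above.
template <typename T, uint32_t SizePow2>
class SeqQueue {
    struct Slot {
        std::atomic<uint32_t> seq;
        T value;
    };
    Slot m_slots[SizePow2];
    std::atomic<uint32_t> m_wr_ptr{0};
    std::atomic<uint32_t> m_rd_ptr{0};
    static const uint32_t c_mask = SizePow2 - 1;

public:
    SeqQueue() {
        for (uint32_t i = 0; i < SizePow2; i++) {
            m_slots[i].seq.store(i, std::memory_order_relaxed);
        }
    }

    bool push(const T &t) {
        uint32_t pos = m_wr_ptr.load(std::memory_order_relaxed);
        for (;;) {
            Slot &s = m_slots[pos & c_mask];
            uint32_t seq = s.seq.load(std::memory_order_acquire);
            int32_t diff = (int32_t)(seq - pos);
            if (diff == 0) { // slot is free for this ticket
                if (m_wr_ptr.compare_exchange_weak(pos, pos + 1)) {
                    s.value = t;
                    s.seq.store(pos + 1, std::memory_order_release); // publish this slot
                    return true;
                }
            } else if (diff < 0) {
                return false; // queue is full
            } else {
                pos = m_wr_ptr.load(std::memory_order_relaxed);
            }
        }
    }

    bool pop(T &out) {
        uint32_t pos = m_rd_ptr.load(std::memory_order_relaxed);
        for (;;) {
            Slot &s = m_slots[pos & c_mask];
            uint32_t seq = s.seq.load(std::memory_order_acquire);
            int32_t diff = (int32_t)(seq - (pos + 1));
            if (diff == 0) { // slot has been published by its producer
                if (m_rd_ptr.compare_exchange_weak(pos, pos + 1)) {
                    out = s.value;
                    s.seq.store(pos + SizePow2, std::memory_order_release); // recycle slot
                    return true;
                }
            } else if (diff < 0) {
                return false; // queue is empty
            } else {
                pos = m_rd_ptr.load(std::memory_order_relaxed);
            }
        }
    }
};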

Related

Why would adding a delay improve data throughput in this multithreaded environment?

In my application, I have two threads, a producer (thread 1) and a consumer (thread 2). Each thread has an input and output interface (effectively a pointer to a list) that is connected to a third thread which serves as a router.
When the producer writes, it calls memcpy to copy data into a buffer and pushes the buffer into a list. Meanwhile, the router thread is round-robin searching through all the threads that are connected to it and monitoring their interfaces to see if any thread has data to send out. When it sees that thread 1's list is non-empty, it checks to determine which thread the data is intended for. The data is spliced into the destination thread's (in this case thread 2) input list, at which point thread 2 will malloc some memory, memcpy the data into it and return the pointer to this new region.
For my test, I'm measuring throughput to see how long it takes to send 100k messages of varying sizes. Thread 1 sends data of some size, thread 2 reads it and sends back a small reply message, which thread 1 reads. This would be one complete exchange. In the first test, in thread 1, I'm sending all 100k messages, and then reading 100k replies. In the second test, in thread 1, I'm alternating sending a message and waiting for the reply and repeating 100k times. In both tests, thread 2 is in a loop reading the message and sending a reply. I would expect test 1 to have higher throughput because the threads should spend less time waiting around. However, it has markedly worse throughput than test 2. I've measured how long individual function calls (to read/write) take in the two test cases, and they invariably take longer in test 1 (based on the means and medians, with no delay added), though the numbers are of the same order of magnitude.
When I add a loop doing nothing into thread 1's sending loop in test 1, I see dramatically improved throughput for this case as opposed to not having the delay. My only guess is that adding a delay slows down the producer so the consumer can absorb the data which prevents its input list from growing very large. I'm wondering if there may be other explanations and if so, how I can test for them.
Edit
Unfortunately, my own code is just the test I described above which calls a library that actually performs the reads/writes, creates that third thread etc. It's difficult to make a minimal example out of it because the library is complex and not mine. I provide some pseudocode to illustrate the setup in more detail.
int NUM_ITERATIONS = 100000;
int msg_reply = 2; // size of the reply message in words
int msg_size = 512; // indicates 512 64 bit words
void generate(int iterations, int size, interface* out){
std::vector<long long> vec(size);
for(int i = 0; i < size; i++)
vec[i] = (long long) i;
for(int i = 0; i < iterations; i++)
out->lib_write((char*) vec.data(), size);
}
void receive(int iterations, int size, interface* in){
for(int i = 0; i < iterations; i++) {
char* data = in->lib_read(size);
}
}
void producer(interface* in, interface* out){
// test 1
start = std::chrono::high_resolution_clock::now();
// write data of size msg_size, NUM_ITERATIONS times to out
generate(NUM_ITERATIONS, msg_size, out);
// read data of size msg_reply, NUM_ITERATIONS times from in
receive(NUM_ITERATIONS, msg_reply, in);
end = std::chrono::high_resolution_clock::now();
// using NUM_ITERATIONS, msg_size and time, compute and print throughput to stdout
print_throughput(end-start, "throughput_0", msg_size);
// test 2
start = std::chrono::high_resolution_clock::now();
for(int j = 0; j < NUM_ITERATIONS; j++){
generate(1, msg_size, out);
receive(1, msg_reply, in);
}
end = std::chrono::high_resolution_clock::now();
print_throughput(end-start, "throughput_1", msg_size);
}
void consumer(interface* in, interface* out){
for(int i = 0; i < 2; i++){
for(int j = 0; j < NUM_ITERATIONS; j++){
receive(1, msg_size, in);
generate(1, msg_reply, out);
}
}
}
The calls to lib_write() and lib_read() become fairly complex. To elaborate on the description above, the data gets memcpy'd into a buffer and then moved into a list. The interface has a condition variable member and the write calls its notify_one() method. The third thread is looping through all the interface pointers it has and checking to see if their lists are non-empty. If so, the data is spliced from one output list to the destination's input list using the splice() method in std::list. Meanwhile, the consumer calls the lib_read() which waits on the condition variable while the interface is empty, and then memcpy's the data into a new region and returns it.
// note: these will not compile as is. Undefined variables are class members
char * interface::lib_read(size_t * _size){
char * ret;
{
std::unique_lock<std::mutex> lock(mutex);
// packets is an std::list containing the incoming data
while (packets.empty()) {
cv.wait(lock);
}
curr_read_it = packets.begin();
}
size_t buff_size = curr_read_it->size;
ret = (char *)malloc(buff_size);
memcpy((char *)ret, (char *)curr_read_it->data, buff_size);
{
std::unique_lock<std::mutex> lock(mutex);
packets.erase(curr_read_it);
curr_read_it = packets.end();
}
return ret;
}
void interface::lib_write(char * data, int size){
// indicates the destination thread id
long long header = 1;
// buffer is a just an array that's max packet sized
memcpy((char *)buffer.data, &header, sizeof(long long));
memcpy((char *)buffer.data + sizeof(long long), (char *)data, size * sizeof(long long));
std::lock_guard<std::mutex> guard(mutex);
packets.push_back(std::move(buffer));
cv.notify_one();
}
// this is on thread 3
void route(){
do{
// this is a vector containing all the "out" interfaces
for(int i = 0; i < out_ptrs.size(); i++){
interface <long long> * _out = out_ptrs[i];
if(!_out->empty()){
// this just returns the header id (also locks the mutex)
long long dest= _out->get_dest();
// looks up the correct interface based on the id and splices
// a packet into from _out to the appropriate one. Locks mutex
in_ptrs[dest_map[dest]]->splice(_out);
}
}
}while(!done());
}
I was looking for general advice on what factors may influence multithreading performance and what to test for in order to better understand what was going on.
I talked to some other people, and the advice that proved helpful was to determine whether OS scheduling was the issue (which is what I suspected but was unsure how to test). Essentially, I used taskset and sched_setaffinity() to force the application to run on one core or on a subset of cores and looked at how those cases compared to each other and to the unrestricted case.
Based on the restrictions, I got dramatically different results and could see some trends, so I'm pretty confident in saying that it's an OS scheduling issue. Different restrictions can yield better performance under different workloads.
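For reference, the shell-level restriction is just taskset -c 0 ./app (or taskset -c 0,1 ./app for two cores). Programmatically, pinning the calling thread looks roughly like the sketch below; this is illustrative only, not the code from my library or test.
#include <sched.h>   // sched_setaffinity, CPU_ZERO, CPU_SET (glibc; g++ defines _GNU_SOURCE by default)
#include <cstdio>

// Pin the calling thread to a single CPU. pid 0 means "the calling thread".
static bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return false;
    }
    return true;
}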

Deadlock with blocking queue and barrier in C++

I have this very simple and small C++ program that creates a thread pool, then put messages in a blocking queue shared between threads to say to each thread what to do.
Message can be: -1 (end of stream -> terminate), -2 (barrier -> wait for all threads to reach it, then continue), other values to do random computation. The loop is done in this order: some computation, barrier, some computation, barrier, ..., barrier, end of stream, thread join, exit.
I'm not able to understand why I obtain a deadlock even with 2 threads in the pool. The queue never manages to become empty, even though the order in which I push and pop messages should always lead to an empty queue!
The blocking queue implementation is the one proposed here (C++ Equivalent to Java's BlockingQueue) with just two methods added. I have also copied the queue code below.
Any help?
Main.cpp
#include <iostream>
#include <vector>
#include <thread>
#include "Queue.hpp"
using namespace std;
// function executed by each thread
void f(int i, Queue<int> &q){
while(1){
// take a message from blocking queue
int j= q.pop();
// if it is end of stream then exit
if (j==-1) break;
// if it is barrier, wait for other threads to reach it
if (j==-2){
// active wait! BAD, but anyway...
while(q.size() > 0){
;
}
}
else{
// random stuff
int x = 0;
for(int i=0;i<j;i++)
x += 4;
}
}
}
int main(){
Queue<int> queue; //blocking queue
vector<thread> tids; // thread pool
int nt = 2; // number of threads
int dim = 8; // number to control number of operations
// create thread pool, passing thread id and queue
for(int i=0;i<nt;i++)
tids.push_back(thread(f,i, std::ref(queue)));
for(int dist=1; dist<=dim; dist++){ // without this outer loop the program works fine
// push random number
for(int j=0;j<dist;j++){
queue.push(4);
}
// push barrier code
for(int i=0;i<nt;i++){
queue.push(-2);
}
// active wait! BAD, but anyway...
while (queue.size()>0){
;
}
}
// push end of stream
for(int i=0;i<nt;i++)
queue.push(-1);
// join thread pool
for(int i=0;i<nt;i++){
tids[i].join();
}
return 0;
}
Queue.hpp
#include <deque>
#include <mutex>
#include <condition_variable>
template <typename T>
class Queue
{
private:
std::mutex d_mutex;
std::condition_variable d_condition;
std::deque<T> d_queue;
public:
void push(T const& value) {
{
std::unique_lock<std::mutex> lock(this->d_mutex);
d_queue.push_front(value);
}
this->d_condition.notify_one();
}
T pop() {
std::unique_lock<std::mutex> lock(this->d_mutex);
this->d_condition.wait(lock, [=]{ return !this->d_queue.empty(); });
T rc(std::move(this->d_queue.back()));
this->d_queue.pop_back();
return rc;
}
bool empty(){
std::unique_lock<std::mutex> lock(this->d_mutex);
return this->d_queue.empty();
}
int size(){
std::unique_lock<std::mutex> lock(this->d_mutex);
return this->d_queue.size();
}
};
I think the problem is the active wait that you describe as "BAD, but anyway...", and using the size of the queue as a barrier instead of using a true synchronization barrier.
For dim = 1 you push values onto the queue so it holds 4, -2, -2. One thread will grab the 4 and a -2 while the other grabs the remaining -2. At this point the queue is empty and you have three threads (the two workers and the main thread) doing an active wait, racing to see if the queue has been emptied. There is a mutex on size that only lets one of them read the size at a time. If the main thread is scheduled first and determines that the queue is empty, it will push -1, -1 to signal end of stream. Now the queue is no longer empty, but one or both of the two worker threads are waiting for it to empty. Since they are waiting for it to be empty before taking another item, the queue is deadlocked in this state.
For the case where dim > 1 there is likely a similar issue: the main thread pushes the next set of values into the queue before both workers have seen the empty queue and exited the active wait.
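For reference, a true reusable barrier can be built from a mutex, a condition variable, and a generation counter. The sketch below is illustrative only (not taken from the question's code); with it, each worker would call wait() when it pops a -2 instead of spinning on q.size(), and the main thread could participate in the same barrier instead of its own active wait.
#include <mutex>
#include <condition_variable>

// Minimal reusable (cyclic) barrier for a fixed number of participants.
class Barrier {
    std::mutex m;
    std::condition_variable cv;
    const int count;     // participants per cycle
    int waiting = 0;     // participants currently blocked
    int generation = 0;  // bumped each time the barrier opens
public:
    explicit Barrier(int count) : count(count) {}
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        int gen = generation;
        if (++waiting == count) {
            // last participant to arrive releases everyone and resets the barrier
            waiting = 0;
            ++generation;
            cv.notify_all();
        } else {
            cv.wait(lock, [&] { return gen != generation; });
        }
    }
};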
I ran your code and I understand the problem. The problem is with the "-2" option. When the two threads arrive at this point, your main thread has already pushed more values onto the queue. So if the queue grew between the time your threads got the "-2" value and the time they handle the "-2" option, your code will get stuck:
Thread 1: get -2.
Thread 2: get -2.
Thread main: push -1.
Thread main: push -1.
Thread 1: wait until the whole queue is empty.
Thread 2: wait until the whole queue is empty.
queue:
-1
-1
This is the case when dim equals 1. In your code, dim equals 8; you don't want to see what that looks like.
To solve this, all I did was to disable the following loop:
for(int i=0;i<nt;i++){
queue.push(-2);
}
With this part disabled, the code runs perfectly.
This is how I checked it:
std::mutex guarder;
// function executed by each thread
void f(int i, Queue<int> &q){
while(1){
// take a message from blocking queue
guarder.lock();
int j= q.pop();
guarder.unlock();
// if it is end of stream then exit
if (j==-1) break;
// if it is barrier, wait for other threads to reach it
if (j==-2){
// active wait! BAD, but anyway...
while(q.size() > 0){
;
}
}
else{
// random stuff
int x = 0;
for(int i=0;i<j;i++)
x += 4;
guarder.lock();
cout << x << std::endl;
guarder.unlock();
}
}
}
int main(){
Queue<int> queue; //blocking queue
vector<thread> tids; // thread pool
int nt = 2; // number of threads
int dim = 8; // number to control number of operations
// create thread pool, passing thread id and queue
for(int i=0;i<nt;i++)
tids.push_back(thread(f,i, std::ref(queue)));
for(int dist=1; dist<=dim; dist++){ // without this outer loop the program works fine
// push random number
for(int j=0;j<dist;j++){
queue.push(dist);
}
/*// push barrier code
for(int i=0;i<nt;i++){
queue.push(-2);
}*/
// active wait! BAD, but anyway...
while (queue.size()>0){
;
}
}
// push end of stream
for(int i=0;i<nt;i++)
queue.push(-1);
// join thread pool
for(int i=0;i<nt;i++){
tids[i].join();
}
return 0;
}
The result:
4
8
8
12
12
12
16
16
16
20
20
16
20
20
20
24
24
24
24
24
24
28
28
28
28
28
28
28
32
32
32
32
32
32
32
32
BTW, the hang did not occur because of your "active wait" part. It is not good, but it usually causes other problems (like slowing down your system).

sem_wait() failed to wake up on Linux

I have a real-time application that uses a shared FIFO. There are several writer processes and one reader process. Data is periodically written into the FIFO and constantly drained. Theoretically the FIFO should never overflow because the reading speed is faster than all writers combined. However, the FIFO does overflow.
I tried to reproduce the problem and finally worked out the following (simplified) code:
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <pthread.h>
#include <semaphore.h>
#include <sys/time.h>
#include <unistd.h>
class Fifo
{
public:
Fifo() : _deq(0), _wptr(0), _rptr(0), _lock(0)
{
memset(_data, 0, sizeof(_data));
sem_init(&_data_avail, 1, 0);
}
~Fifo()
{
sem_destroy(&_data_avail);
}
void Enqueue()
{
struct timeval tv;
gettimeofday(&tv, NULL);
uint64_t enq = tv.tv_usec + tv.tv_sec * 1000000;
while (__sync_lock_test_and_set(&_lock, 1))
sched_yield();
uint8_t wptr = _wptr;
uint8_t next_wptr = (wptr + 1) % c_entries;
int retry = 0;
while (next_wptr == _rptr) // will become full
{
printf("retry=%u enq=%lu deq=%lu count=%d\n", retry, enq, _deq, Count());
for (uint8_t i = _rptr; i != _wptr; i = (i+1)%c_entries)
printf("%u: %lu\n", i, _data[i]);
assert(retry++ < 2);
usleep(500);
}
assert(__sync_bool_compare_and_swap(&_wptr, wptr, next_wptr));
_data[wptr] = enq;
__sync_lock_release(&_lock);
sem_post(&_data_avail);
}
int Dequeue()
{
struct timeval tv;
gettimeofday(&tv, NULL);
uint64_t deq = tv.tv_usec + tv.tv_sec * 1000000;
_deq = deq;
uint8_t rptr = _rptr, wptr = _wptr;
uint8_t next_rptr = (rptr + 1) % c_entries;
bool empty = Count() == 0;
assert(!sem_wait(&_data_avail));// bug in sem_wait?
_deq = 0;
uint64_t enq = _data[rptr]; // enqueue time
assert(__sync_bool_compare_and_swap(&_rptr, rptr, next_rptr));
int latency = deq - enq; // latency from enqueue to dequeue
if (empty && latency < -500)
{
printf("before dequeue: w=%u r=%u; after dequeue: w=%u r=%u; %d\n", wptr, rptr, _wptr, _rptr, latency);
}
return latency;
}
int Count()
{
int count = 0;
assert(!sem_getvalue(&_data_avail, &count));
return count;
}
static const unsigned c_entries = 16;
private:
sem_t _data_avail;
uint64_t _data[c_entries];
volatile uint64_t _deq; // non-0 indicates when dequeue happened
volatile uint8_t _wptr, _rptr; // write, read pointers
volatile uint8_t _lock; // write lock
};
static const unsigned c_total = 10000000;
static const unsigned c_writers = 3;
static Fifo s_fifo;
// writer thread
void* Writer(void* arg)
{
for (unsigned i = 0; i < c_total; i++)
{
int t = rand() % 200 + 200; // [200, 399]
usleep(t);
s_fifo.Enqueue();
}
return NULL;
}
int main()
{
pthread_t thread[c_writers];
for (unsigned i = 0; i < c_writers; i++)
pthread_create(&thread[i], NULL, Writer, NULL);
for (unsigned total = 0; total < c_total*c_writers; total++)
s_fifo.Dequeue();
}
When Enqueue() overflows, the debug print indicates that Dequeue() is stuck (because _deq is not 0). The only place where Dequeue() can get stuck is sem_wait(). However, since the FIFO is full (also confirmed by sem_getvalue()), I don't understand how that could happen. Even after several retries (each waiting 500us) the FIFO was still full, even though Dequeue() should definitely drain it while Enqueue() is completely stopped (busy retrying).
In the code example, there are 3 writers, each writing every 200-400us. On my computer (8-core i7-2860 running CentOS 6.5, kernel 2.6.32-279.22.1.el6.x86_64, g++ 4.4.7 20120313), the code fails within a few minutes. I also tried it on several other CentOS systems and it failed the same way.
I know that making the FIFO bigger can reduce the overflow probability (in fact, the program still fails with c_entries=128), but in my real-time application there is a hard constraint on enqueue-dequeue latency, so data must be drained quickly. If it's not a bug in sem_wait(), then what prevents it from getting the semaphore?
P.S. If I replace
assert(!sem_wait(&_data_avail));// bug in sem_wait?
with
while (sem_trywait(&_data_avail) < 0) sched_yield();
then the program runs fine. So it seems that there's something wrong in sem_wait() and/or scheduler.
You need to use a combination of sem_wait/sem_post calls to manage your read and write threads.
Your enqueue thread only performs a sem_post call and your dequeue thread only performs a sem_wait call. You need to add a sem_wait to the enqueue thread and a sem_post to the dequeue thread.
A long time ago, I implemented the ability to have multiple threads/processes read some shared memory while only one thread/process writes to it. I used two semaphores, a write semaphore and a read semaphore. The read threads would wait until the write semaphore was not set and then set the read semaphore. The write threads would set the write semaphore and then wait until the read semaphore was not set. The read and write threads would then unset the semaphores they had set once they completed their tasks. The read semaphore can be held by n threads at a time while the write semaphore can be held by a single thread at a time.
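Applied to the FIFO in the question, that idea usually takes the shape of two counting semaphores: one counting free slots (waited on by writers) and one counting filled slots (waited on by the reader). This is only a sketch, assuming the slot updates are still serialized by something like the existing _lock spinlock:
#include <semaphore.h>
#include <cstdint>

// Sketch of a two-semaphore bounded FIFO: writers block while it is full,
// the reader blocks while it is empty. Slot updates themselves are assumed
// to be serialized (single writer, or an external lock as in the question).
struct BoundedFifo {
    static const unsigned c_entries = 16;
    uint64_t data[c_entries];
    unsigned wptr = 0, rptr = 0;
    sem_t free_slots, filled_slots;

    BoundedFifo() {
        sem_init(&free_slots, 0, c_entries); // all slots start out free
        sem_init(&filled_slots, 0, 0);       // nothing to read yet
    }
    ~BoundedFifo() {
        sem_destroy(&free_slots);
        sem_destroy(&filled_slots);
    }
    void enqueue(uint64_t v) {
        sem_wait(&free_slots);               // blocks instead of overflowing
        data[wptr] = v;
        wptr = (wptr + 1) % c_entries;
        sem_post(&filled_slots);
    }
    uint64_t dequeue() {
        sem_wait(&filled_slots);             // blocks while the FIFO is empty
        uint64_t v = data[rptr];
        rptr = (rptr + 1) % c_entries;
        sem_post(&free_slots);
        return v;
    }
};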
If it's not a bug in sem_wait(), then what prevents it from getting
the semaphore?
Your program's impatience prevents it. There is no guarantee that the Dequeue() thread is scheduled within a given number of retries. If you change
assert(retry++ < 2);
to
retry++;
you'll see that the program happily continues the reader process sometimes after 8 or perhaps even more retries.
Why does Enqueue have to retry?
It has to retry simply because the main thread's Dequeue() hasn't been scheduled by then.
Dequeue speed is much faster than all writers combined.
Your program shows that this assumption is sometimes false. While the execution time of Dequeue() is apparently much shorter than that of the writers (due to the usleep(t)), this does not imply that Dequeue() is scheduled by the Completely Fair Scheduler more often - and the main reason for this is that you used a nondeterministic scheduling policy. man sched_yield:
sched_yield() is intended for use with real-time scheduling policies
(i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with
nondeterministic scheduling policies such as SCHED_OTHER is
unspecified and very likely means your application design is broken.
If you insert
struct sched_param param = { .sched_priority = 1 };
if (sched_setscheduler(0, SCHED_FIFO, &param) < 0)
perror("sched_setscheduler");
at the start of main(), you'll likely see that your program performs as expected (when run with the appropriate privilege).

C++ Low-Latency Threaded Asynchronous Buffered Stream (intended for logging) – Boost

Question:
Three while loops below contain code that has been commented out. Search for "TAG1", "TAG2", and "TAG3" for easy identification. I simply want the while loops to wait for the tested condition to become true before proceeding, while consuming as few CPU resources as possible. I first tried using Boost condition variables, but there's a race condition. Putting the thread to sleep for 'x' microseconds is inefficient because there is no way to precisely time the wakeup. Finally, boost::this_thread::yield() does not seem to do anything, probably because I only have 2 active threads on a dual-core system. Specifically, how can I make the three tagged areas below run more efficiently while introducing as little unnecessary blocking as possible?
BACKGROUND
Objective:
I have an application that logs a lot of data. After profiling, I found that much time is consumed on the logging operations (logging text or binary to a file on the local hard disk). My objective is to reduce the latency on logData calls by replacing non-threaded direct write calls with calls to a threaded buffered stream logger.
Options Explored:
Upgrade 2005-era slow hard disk to SSD...possible. Cost is not prohibitive...but involves a lot of work... more than 200 computers would have to be upgraded...
Boost ASIO...I don't need all the proactor / networking overhead, looking for something simpler and more light-weight.
Design:
Producer and consumer thread pattern: the application writes data into a buffer and a background thread then writes it to disk sometime later. So the ultimate goal is to have the writeMessage function called by the application layer return as fast as possible while data is correctly / completely logged to the log file in FIFO order sometime later.
Only one application thread, only one writer thread.
Based on ring buffer. The reason for this decision is to use as few locks as possible and ideally...and please correct me if I'm wrong...I don't think I need any.
Buffer is a statically-allocated character array, but could move it to the heap if needed / desired for performance reasons.
Buffer has a start pointer that points to the next character that should be written to the file. Buffer has an end pointer that points to the array index after the last character to be written to the file. The end pointer NEVER passes the start pointer. If a message comes in that is larger than the buffer, then the writer waits until the buffer is emptied and writes the new message to the file directly without putting the over-sized message in the buffer (once the buffer is emptied, the worker thread won't be writing anything so no contention).
The writer (worker thread) only updates the ring buffer's start pointer.
The main (application thread) only updates the ring buffer's end pointer, and again, it only inserts new data into the buffer when there is available space...otherwise it either waits for space in the buffer to become available or writes directly as described above.
The worker thread continuously checks to see if there is data to be written (indicated by the case when the buffer start pointer != buffer end pointer). If there is no data to be written, the worker thread should ideally go to sleep and wake up once the application thread has inserted something into the buffer (and changed the buffer's end pointer such that it no longer points to the same index as the start pointer). What I have below involves while loops continuously checking that condition. It is a very bad / inefficient way of waiting on the buffer.
Results:
On my 2009-era dual-core laptop with SSD, I see that the total write time of the threaded / buffered benchmark vs. direct write is about 1 : 6 (0.609 sec vs. 0.095 sec), but highly variable. Often the buffered write benchmark is actually slower than direct write. I believe that the variability is due to the poor implementation of waiting for space to free up in the buffer, waiting for the buffer to empty, and having the worker-thread wait for work to become available. I have measured that some of the while loops consume over 10000 cycles and I suspect that those cycles are actually competing for hardware resources that the other thread (worker or application) requires to finish the computation being waited on.
Output seems to check out. With TEST mode enabled and a small buffer size of 10 as a stress test, I diffed hundreds of MBs of output and found it to equal the input.
Compiles with current version of Boost (1.55)
Header
#ifndef BufferedLogStream_h
#define BufferedLogStream_h
#include <stdio.h>
#include <iostream>
#include <iostream>
#include <cstdlib>
#include "boost\chrono\chrono.hpp"
#include "boost\thread\thread.hpp"
#include "boost\thread\locks.hpp"
#include "boost\thread\mutex.hpp"
#include "boost\thread\condition_variable.hpp"
#include <time.h>
using namespace std;
#define BENCHMARK_STR_SIZE 128
#define NUM_BENCHMARK_WRITES 524288
#define TEST 0
#define BENCHMARK 1
#define WORKER_LOOP_WAIT_MICROSEC 20
#define MAIN_LOOP_WAIT_MICROSEC 10
#if(TEST)
#define BUFFER_SIZE 10
#else
#define BUFFER_SIZE 33554432 // 32 MB
#endif
class BufferedLogStream {
public:
BufferedLogStream();
void openFile(char* filename);
void flush();
void close();
inline void writeMessage(const char* message, unsigned int length);
void writeMessage(string message);
bool operator() () { return start != end; }
private:
void threadedWriter();
inline bool hasSomethingToWrite();
inline unsigned int getFreeSpaceInBuffer();
void appendStringToBuffer(const char* message, unsigned int length);
FILE* fp;
char* start;
char* end;
char* endofringbuffer;
char ringbuffer[BUFFER_SIZE];
bool workerthreadkeepalive;
boost::mutex mtx;
boost::condition_variable waitforempty;
boost::mutex workmtx;
boost::condition_variable waitforwork;
#if(TEST)
struct testbuffer {
int length;
char message[BUFFER_SIZE * 2];
};
public:
void test();
private:
void getNextRandomTest(testbuffer &tb);
FILE* datatowrite;
#endif
#if(BENCHMARK)
public:
void runBenchmark();
private:
void initBenchmarkString();
void runDirectWriteBaseline();
void runBufferedWriteBenchmark();
char benchmarkstr[BENCHMARK_STR_SIZE];
#endif
};
#if(TEST)
int main() {
BufferedLogStream* bl = new BufferedLogStream();
bl->openFile("replicated.txt");
bl->test();
bl->close();
cout << "Done" << endl;
cin.get();
return 0;
}
#endif
#if(BENCHMARK)
int main() {
BufferedLogStream* bl = new BufferedLogStream();
bl->runBenchmark();
cout << "Done" << endl;
cin.get();
return 0;
}
#endif //for benchmark
#endif
Implementation
#include "BufferedLogStream.h"
BufferedLogStream::BufferedLogStream() {
fp = NULL;
start = ringbuffer;
end = ringbuffer;
endofringbuffer = ringbuffer + BUFFER_SIZE;
workerthreadkeepalive = true;
}
void BufferedLogStream::openFile(char* filename) {
if(fp) close();
workerthreadkeepalive = true;
boost::thread t2(&BufferedLogStream::threadedWriter, this);
fp = fopen(filename, "w+b");
}
void BufferedLogStream::flush() {
fflush(fp);
}
void BufferedLogStream::close() {
workerthreadkeepalive = false;
if(!fp) return;
while(hasSomethingToWrite()) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
flush();
fclose(fp);
fp = NULL;
}
void BufferedLogStream::threadedWriter() {
while(true) {
if(start != end) {
char* currentend = end;
if(start < currentend) {
fwrite(start, 1, currentend - start, fp);
}
else if(start > currentend) {
if(start != endofringbuffer) fwrite(start, 1, endofringbuffer - start, fp);
fwrite(ringbuffer, 1, currentend - ringbuffer, fp);
}
start = currentend;
waitforempty.notify_one();
}
else { //start == end...no work to do
if(!workerthreadkeepalive) return;
boost::unique_lock<boost::mutex> u(workmtx);
waitforwork.wait_for(u, boost::chrono::microseconds(WORKER_LOOP_WAIT_MICROSEC));
}
}
}
bool BufferedLogStream::hasSomethingToWrite() {
return start != end;
}
void BufferedLogStream::writeMessage(string message) {
writeMessage(message.c_str(), message.length());
}
unsigned int BufferedLogStream::getFreeSpaceInBuffer() {
if(end > start) return (start - ringbuffer) + (endofringbuffer - end) - 1;
if(end == start) return BUFFER_SIZE-1;
return start - end - 1; //case where start > end
}
void BufferedLogStream::appendStringToBuffer(const char* message, unsigned int length) {
if(end + length <= endofringbuffer) { //most common case for appropriately-sized buffer
memcpy(end, message, length);
end += length;
}
else {
int lengthtoendofbuffer = endofringbuffer - end;
if(lengthtoendofbuffer > 0) memcpy(end, message, lengthtoendofbuffer);
int remainderlength = length - lengthtoendofbuffer;
memcpy(ringbuffer, message + lengthtoendofbuffer, remainderlength);
end = ringbuffer + remainderlength;
}
}
void BufferedLogStream::writeMessage(const char* message, unsigned int length) {
if(length > BUFFER_SIZE - 1) { //if string is too large for buffer, wait for buffer to empty and bypass buffer, write directly to file
while(hasSomethingToWrite()) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
fwrite(message, 1, length, fp);
}
else {
//wait until there is enough free space to insert new string
while(getFreeSpaceInBuffer() < length) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
appendStringToBuffer(message, length);
}
waitforwork.notify_one();
}
#if(TEST)
void BufferedLogStream::getNextRandomTest(testbuffer &tb) {
tb.length = 1 + (rand() % (int)(BUFFER_SIZE * 1.05));
for(int i = 0; i < tb.length; i++) {
tb.message[i] = rand() % 26 + 65;
}
tb.message[tb.length] = '\n';
tb.length++;
tb.message[tb.length] = '\0';
}
void BufferedLogStream::test() {
cout << "Buffer size is: " << BUFFER_SIZE << endl;
testbuffer tb;
datatowrite = fopen("orig.txt", "w+b");
for(unsigned int i = 0; i < 7000000; i++) {
if(i % 1000000 == 0) cout << i << endl;
getNextRandomTest(tb);
writeMessage(tb.message, tb.length);
fwrite(tb.message, 1, tb.length, datatowrite);
}
fflush(datatowrite);
fclose(datatowrite);
}
#endif
#if(BENCHMARK)
void BufferedLogStream::initBenchmarkString() {
for(unsigned int i = 0; i < BENCHMARK_STR_SIZE - 1; i++) {
benchmarkstr[i] = rand() % 26 + 65;
}
benchmarkstr[BENCHMARK_STR_SIZE - 1] = '\n';
}
void BufferedLogStream::runDirectWriteBaseline() {
clock_t starttime = clock();
fp = fopen("BenchMarkBaseline.txt", "w+b");
for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
fwrite(benchmarkstr, 1, BENCHMARK_STR_SIZE, fp);
}
fflush(fp);
fclose(fp);
clock_t elapsedtime = clock() - starttime;
cout << "Direct write baseline took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}
void BufferedLogStream::runBufferedWriteBenchmark() {
clock_t starttime = clock();
openFile("BufferedBenchmark.txt");
cout << "Opend file" << endl;
for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
writeMessage(benchmarkstr, BENCHMARK_STR_SIZE);
}
cout << "Wrote" << endl;
close();
cout << "Close" << endl;
clock_t elapsedtime = clock() - starttime;
cout << "Buffered write took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}
void BufferedLogStream::runBenchmark() {
cout << "Buffer size is: " << BUFFER_SIZE << endl;
initBenchmarkString();
runDirectWriteBaseline();
runBufferedWriteBenchmark();
}
#endif
Update: November 25, 2013
I updated the code below to use boost::condition_variables, specifically the wait_for() method, as recommended by Evgeny Panasyuk. This avoids unnecessarily checking the same condition over and over again. I am currently seeing the buffered version run in about 1/6th of the time of the unbuffered / direct-write version. This is not the ideal case because both cases are limited by the hard disk (in my case a 2010-era SSD). I plan to use the code below in an environment where the hard disk will not be the bottleneck, and most if not all of the time the buffer should have space available to accommodate the writeMessage requests. That brings me to my next question. How big should I make the buffer? I don't mind allocating 32 MB or 64 MB to ensure that it never fills up. The code will be running on systems that can spare that. Intuitively, I feel that it's a bad idea to statically allocate a 32 MB character array. Is it? Anyhow, I expect that when I run the code below for my intended application, the latency of logData() calls will be greatly reduced, which will yield a significant reduction in overall processing time.
If anyone sees any way to make the code below better (faster, more robust, leaner, etc.), please let me know. I appreciate the feedback. Lazin, how would your approach be faster or more efficient than what I have posted below? I kinda like the idea of just having one buffer and making it large enough so that it practically never fills up. Then I don't have to worry about reading from different buffers. Evgeny Panasyuk, I like the approach of using existing code whenever possible, especially if it's an existing boost library. However, I also don't see how the spsc_queue is more efficient than what I have below. I'd rather deal with one large buffer than many smaller ones and have to worry about splitting my input stream on the input and splicing it back together on the output. Your approach would allow me to offload the formatting from the main thread onto the worker thread. That is a clever approach. But I'm not sure yet whether it will save a lot of time, and to realize the full benefit I would have to modify code that I do not own.
//End Update
General solution.
I think you must look at Nagle's algorithm. For one producer and one consumer this would look like this:
At the beginning the buffer is empty and the worker thread is idle, waiting for events.
The producer writes data to the buffer and notifies the worker thread.
The worker thread wakes up and starts the write operation.
The producer tries to write another message, but the buffer is in use by the worker, so the producer allocates another buffer and writes the message to it.
The producer tries to write another message; I/O is still in progress, so the producer writes the message to the previously allocated buffer.
The worker thread finishes writing the buffer to the file, sees that there is another buffer with data, grabs it, and starts to write.
The very first buffer is now used by the producer for all subsequent messages while the second write operation is in progress.
This scheme will help achieve the low-latency requirement: a single message will be written to disk almost instantaneously, while a large number of events will be written in large batches for greater throughput.
If your log messages have levels, you can improve this scheme a little. All error messages have high priority (level) and must be saved to disk immediately (because they are rare but very valuable), while debug and trace messages have low priority and can be buffered to save bandwidth (because they are very frequent but not as valuable as error and info messages). So when you write an error message, you must wait until the worker thread has finished writing your message (and all messages that are in the same buffer) and then continue, but debug and trace messages can simply be written to the buffer.
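A minimal sketch of this batching hand-off (one producer, one worker; priority levels, fixed-size buffers, and error handling omitted; not code from the question) could look like the following. The producer appends to a front buffer; the worker swaps front and back under a short lock and writes the whole back buffer in one batch:
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Sketch only: the producer appends to the "front" buffer, the worker swaps
// it for the "back" buffer under a short lock and writes the batch to disk.
class BatchingLogger {
    std::mutex m;
    std::condition_variable cv;
    std::vector<std::string> front, back;
    bool done = false;
    std::thread worker;

    void run(std::FILE *fp) {
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !front.empty(); });
                if (done && front.empty()) return;
                front.swap(back);                       // grab everything queued so far
            }
            for (const std::string &s : back)
                std::fwrite(s.data(), 1, s.size(), fp); // one large batch
            std::fflush(fp);
            back.clear();
        }
    }

public:
    explicit BatchingLogger(std::FILE *fp) : worker([this, fp] { run(fp); }) {}

    void log(std::string msg) {                         // called by the application thread
        {
            std::lock_guard<std::mutex> lock(m);
            front.push_back(std::move(msg));
        }
        cv.notify_one();
    }

    ~BatchingLogger() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
        worker.join();
    }
};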
Threading.
Spawning a worker thread for each application thread is too costly. You should use a single writer thread per log file. Write buffers must be shared between threads. Each buffer must have two pointers: commit_pointer and prepare_pointer. All buffer space between the beginning of the buffer and commit_pointer is available to the worker thread. Buffer space between commit_pointer and prepare_pointer is currently being updated by application threads. Invariant: commit_pointer <= prepare_pointer.
Write operations can be performed in two steps.
Prepare write. This operation reserves space in a buffer.
Producer calculates len(message) and atomically updates prepare_pointer;
The old prepare_pointer value and len are saved by the producer;
Commit write.
Producer writes message at the beginning of the reserved buffer space (old prepare_pointer value).
The producer busy-waits until commit_pointer is equal to the old prepare_pointer value that it saved in a local variable.
The producer commits the write operation by doing commit_pointer = commit_pointer + len atomically.
To prevent false sharing, len(message) can be rounded to cache line size and all extra space can be filled with spaces.
// pseudocode
void write(const char* message) {
int len = strlen(message); // TODO: round to cache line size
const char* old_prepare_ptr;
// Prepare step
while(1)
{
old_prepare_ptr = prepare_ptr;
if (
CAS(&prepare_ptr,
old_prepare_ptr,
prepare_ptr + len) == old_prepare_ptr
)
break;
// retry if another thread perform prepare op.
}
// Write message
memcpy((void*)old_prepare_ptr, (void*)message, len);
// Commit step
// Busy-wait until the writers that prepared before us have committed,
// i.e. until commit_ptr reaches our reserved region, then publish it.
while(1)
{
if (
CAS(&commit_ptr,
old_prepare_ptr,
old_prepare_ptr + len) == old_prepare_ptr
)
break;
// retry until commit_ptr catches up with old_prepare_ptr
}
notify_worker_thread();
}
concurrent_queue<T, Size>
The question that I have is how to make the worker thread go to work as soon as there is work to do and sleep when there is no work.
There is boost::lockfree::spsc_queue - wait-free single-producer single-consumer queue. It can be configured to have compile-time capacity (the size of the internal ringbuffer).
From what I understand, you want something similar to following configuration:
template<typename T, size_t N>
class concurrent_queue
{
// T can be wrapped into struct with padding in order to avoid false sharing
mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
mutable mutex m;
mutable condition_variable c;
void wait() const
{
unique_lock<mutex> u(m);
c.wait_for(u, chrono::microseconds(1)); // Or whatever period you need.
// Timeout is required, because modification happens not under mutex
// and notification can be lost.
// Another option is just to use sleep/yield, without notifications.
}
void notify() const
{
c.notify_one();
}
public:
void push(const T &t)
{
while(!q.push(t))
wait();
notify();
}
void pop(T &result)
{
while(!q.pop(result))
wait();
notify();
}
};
When there are elements in the queue, pop does not block. And when there is enough space in the internal buffer, push does not block.
concurrent<T>
I want to reduce both formatting and write times as much as possible so I plan to reduce both.
Check out Herb Sutter's talk at C++ and Beyond 2012: C++ Concurrency. On page 14 he shows an example of concurrent<T>. Basically it is a wrapper around an object of type T which starts a separate thread for performing all operations on that object. Usage is:
concurrent<ostream*> x(&cout); // starts thread internally
// ...
// x acts as a function object.
// Its function call operator accepts an action
// which is performed on the wrapped object in a separate thread.
int i = 42;
x([i](ostream *out){ *out << "i=" << i; }); // passing lambda as action
You can use similar pattern in order to offload all formatting work to consumer thread.
Small Object Optimization
Otherwise, new buffers are allocated and I want to avoid memory allocation after the buffer stream is constructed.
The concurrent_queue<T, Size> example above uses a fixed-size buffer which is fully contained within the queue and does not require additional allocations.
However, Herb's concurrent<T> example uses std::function to pass the action into the worker thread. That may incur a costly allocation.
std::function implementations may use the Small Object Optimization (and most implementations do): small function objects are copy-constructed in place in an internal buffer. But there is no guarantee, and for function objects bigger than the threshold a heap allocation will happen.
There are several options to avoid this allocation:
Implement a std::function analog with an internal buffer large enough to hold the target function objects (for example, you can try to modify boost::function or this version).
Use your own function object which would represent all types of log messages. Basically it would contain just the values required to format the message. As there are potentially different types of messages, consider using boost::variant (which is literally a union coupled with a type tag) to represent them.
Putting it all together, here is a proof of concept (using the second option):
LIVE DEMO
#include <boost/lockfree/spsc_queue.hpp>
#include <boost/optional.hpp>
#include <boost/variant.hpp>
#include <condition_variable>
#include <iostream>
#include <cstddef>
#include <thread>
#include <chrono>
#include <mutex>
using namespace std;
/*********************************************/
template<typename T, size_t N>
class concurrent_queue
{
mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
mutable mutex m;
mutable condition_variable c;
void wait() const
{
unique_lock<mutex> u(m);
c.wait_for(u, chrono::microseconds(1));
}
void notify() const
{
c.notify_one();
}
public:
void push(const T &t)
{
while(!q.push(t))
wait();
notify();
}
void pop(T &result)
{
while(!q.pop(result))
wait();
notify();
}
};
/*********************************************/
template<typename T, typename F>
class concurrent
{
typedef boost::optional<F> Job;
mutable concurrent_queue<Job, 16> q; // use custom size
mutable T x;
thread worker;
public:
concurrent(T x)
: x{x}, worker{[this]
{
Job j;
while(true)
{
q.pop(j);
if(!j) break;
(*j)(this->x); // you may need to handle exceptions in some way
}
}}
{}
void operator()(const F &f)
{
q.push(Job{f});
}
~concurrent()
{
q.push(Job{});
worker.join();
}
};
/*********************************************/
struct LogEntry
{
struct Formatter
{
typedef void result_type;
ostream *out;
void operator()(double x) const
{
*out << "floating point: " << x << endl;
}
void operator()(int x) const
{
*out << "integer: " << x << endl;
}
};
boost::variant<int, double> data;
void operator()(ostream *out)
{
boost::apply_visitor(Formatter{out}, data);
}
};
/*********************************************/
int main()
{
concurrent<ostream*, LogEntry> log{&cout};
for(int i=0; i!=1024; ++i)
{
log({i});
log({i/10.});
}
}

prevent std::atomic from overflowing

I have an atomic counter (std::atomic<uint32_t> count) which deals out sequentially incrementing values to multiple threads.
uint32_t my_val = ++count;
Before I get my_val I want to ensure that the increment won't overflow (ie: go back to 0)
if (count == std::numeric_limits<uint32_t>::max())
throw std::runtime_error("count overflow");
I'm thinking this is a naive check because if the check is performed by two threads before either increments the counter, the second thread to increment will get 0 back
if (count == std::numeric_limits<uint32_t>::max()) // if 2 threads execute this
throw std::runtime_error("count overflow");
uint32_t my_val = ++count; // before either gets here - possible overflow
As such I guess I need to use a CAS operation to make sure that when I increment my counter, I am indeed preventing a possible overflow.
So my questions are:
Is my implementation correct?
Is it as efficient as it can be (specifically do I need to check against max twice)?
My code (with working exemplar) follows:
#include <iostream>
#include <atomic>
#include <limits>
#include <stdexcept>
#include <thread>
std::atomic<uint16_t> count;
uint16_t get_val() // called by multiple threads
{
uint16_t my_val;
do
{
my_val = count;
// make sure I get the next value
if (count.compare_exchange_strong(my_val, my_val + 1))
{
// if I got the next value, make sure we don't overflow
if (my_val == std::numeric_limits<uint16_t>::max())
{
count = std::numeric_limits<uint16_t>::max() - 1;
throw std::runtime_error("count overflow");
}
break;
}
// if I didn't then check if there are still numbers available
if (my_val == std::numeric_limits<uint16_t>::max())
{
count = std::numeric_limits<uint16_t>::max() - 1;
throw std::runtime_error("count overflow");
}
// there are still numbers available, so try again
}
while (1);
return my_val + 1;
}
void run()
try
{
while (1)
{
if (get_val() == 0)
exit(1);
}
}
catch(const std::runtime_error& e)
{
// overflow
}
int main()
{
while (1)
{
count = 1;
std::thread a(run);
std::thread b(run);
std::thread c(run);
std::thread d(run);
a.join();
b.join();
c.join();
d.join();
std::cout << ".";
}
return 0;
}
Yes, you need to use a CAS operation.
std::atomic<uint16_t> g_count;
uint16_t get_next() {
uint16_t cur_val = g_count; // 1
uint16_t new_val = 0;
do {
if (cur_val == std::numeric_limits<uint16_t>::max()) { // 2
throw std::runtime_error("count overflow");
}
new_val = cur_val + 1; // 3
// on failure, compare_exchange_weak reloads cur_val before we retry
} while(!std::atomic_compare_exchange_weak(&g_count, &cur_val, new_val)); // 4
return new_val;
}
The idea is the following: once g_count == std::numeric_limits<uint16_t>::max(), get_next() function will always throw an exception.
Steps:
Get current value of the counter
If it is maximal, throw an exception (no numbers available anymore)
Get new value as increment of the current value
Try to atomically set new value. If we failed to set it (it was done by another thread already), try again.
If efficiency is a big concern then I'd suggest not being so strict on the check. I'm guessing that under normal use overflow won't be an issue, but do you really need the full 65K range (your example uses uint16)?
It would be easier if you assume some maximum on the number of threads you have running. This is a reasonable limit since no program has unlimited numbers of concurrency. So if you have N threads you can simply reduce your overflow limit to 65K - N. To compare if you overflow you don't need a CAS:
uint16_t current = count.load(std::memory_order_relaxed);
if( current >= (std::numeric_limits<uint16_t>::max() - num_threads - 1) )
throw std::runtime_error("count overflow");
count.fetch_add(1,std::memory_order_relaxed);
This creates a soft-overflow condition. If two threads come here at once both of them will potentially pass, but that's okay since the count variable itself never overflows. Any future arrivals at this point will logically overflow (until count is reduced again).
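Wrapped into a get_val()-style helper, that check might look like the sketch below; num_threads is an assumed upper bound on the number of threads that can call it concurrently.
#include <atomic>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Sketch: soft-overflow check followed by a plain fetch_add. Up to
// num_threads callers may pass the check at once, but the reduced limit
// guarantees the counter itself can never wrap.
uint16_t get_val(std::atomic<uint16_t> &count, uint16_t num_threads)
{
    uint16_t current = count.load(std::memory_order_relaxed);
    if (current >= std::numeric_limits<uint16_t>::max() - num_threads - 1)
        throw std::runtime_error("count overflow");
    return count.fetch_add(1, std::memory_order_relaxed) + 1; // value handed to this caller
}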
It seems to me that there's still a race condition where count will be set to 0 momentarily such that another thread will see the 0 value.
Assume that count is at std::numeric_limits<uint16_t>::max() and two threads try to get the incremented value. At the moment that Thread 1 performs the count.compare_exchange_strong(my_val, my_val + 1), count is set to 0 and that's what Thread 2 will see if it happens to call and complete get_val() before Thread 1 has a chance to restore count to max().