Consuming a linked-list queue from multiple threads - C++

I am learning the OpenMP parallel processing library in C++. I feel that I have the basic concepts down, and I am trying to test my knowledge by implementing a linked-list queue that is consumed from multiple threads.
The challenge is not to consume the same node twice. So I was considering sharing the queue between threads, but allowing only a single thread to update it (advance to the next node in the queue) at a time. For this purpose, I could use critical or a lock. However, without using either of them, it somehow seems to work perfectly; no race condition has occurred.
#include <iostream>
#include <omp.h>
#include <unistd.h> // for sleep()

struct Node {
    int data;
    struct Node* next = NULL;

    Node() {}
    Node(int data) {
        this->data = data;
    }
    Node(int data, Node* node) {
        this->data = data;
        this->next = node;
    }
};

void processNode(Node *pNode);

struct Queue {
    Node *head = NULL, *tail = NULL;

    Queue& add(int data) {
        add(new Node(data));
        return *this;
    }

    void add(Node *node) {
        if (head == NULL) {
            head = node;
            tail = node;
        } else {
            tail->next = node;
            tail = node;
        }
    }

    Node* remove() {
        Node *node;
        node = head;
        if (head != NULL)
            head = head->next;
        return node;
    }
};

int main() {
    srand(12);
    Queue queue;
    for (int i = 0; i < 6; ++i) {
        queue.add(i);
    }

    double timer_started = omp_get_wtime();
    omp_set_num_threads(3);
    #pragma omp parallel
    {
        Node *n;
        while ((n = queue.remove()) != NULL) {
            double started = omp_get_wtime();
            processNode(n);
            double elapsed = omp_get_wtime() - started;
            printf("Thread id: %d data: %d, took: %f \n", omp_get_thread_num(), n->data, elapsed);
        }
    }
    double elapsed = omp_get_wtime() - timer_started;
    std::cout << "end. took " << elapsed << " in total " << std::endl;
    return 0;
}

void processNode(Node *node) {
    int r = rand() % 3 + 1; // between 1 and 3
    sleep(r);
}
Output looks like this:
Thread id: 0 data: 0, took: 1.000136
Thread id: 2 data: 2, took: 1.000127
Thread id: 2 data: 4, took: 1.000208
Thread id: 1 data: 1, took: 3.001371
Thread id: 0 data: 3, took: 2.001041
Thread id: 2 data: 5, took: 2.004960
end. took 4.00583 in total
I've run this many times with different numbers of threads, but I couldn't trigger a race condition or anything going wrong. I was thinking it should be possible for two different threads to invoke remove and process a single node twice, but it did not happen. Why?
https://github.com/muatik/openmp-examples/blob/master/linkedlist/main.cpp

First and foremost, you can never prove multi-threaded code to be correct through testing. Your hunch that you need a lock / critical section is correct.
Your test is particularly easy on the queue. The following breaks your queue quickly:
for (int i = 0; i < 10000; ++i) {
    queue.add(i);
}
double timer_started = omp_get_wtime();
#pragma omp parallel
{
    size_t counter = 0;
    Node *n;
    while ((n = queue.remove()) != NULL) {
        processNode(n);
        counter++;
    }
    #pragma omp critical
    std::cout << "Thread " << omp_get_thread_num() << " processed " << counter << " nodes." << std::endl;
}

void processNode(Node *node) {}
It shows, for example, the following interesting result:
Thread 1 processed 11133 nodes.
Thread 0 processed 9039 nodes.
But again: even if your queue ran a million times correctly with this test code, that wouldn't mean the queue is implemented correctly.
In particular, it is not sufficient to protect just remove; you must properly protect each and every read and write to the queue data. To get an idea of how difficult this is to get right, watch this excellent talk by Herb Sutter.
Generally, I recommend using an existing parallel data structure, for example from Boost.Lockfree.
However, OpenMP and C++11 lock / atomic primitives unfortunately don't officially play well together. So strictly speaking, if you use OpenMP, you should stick to OpenMP synchronization primitives or libraries that use them.
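For illustration, here is a minimal sketch of the smallest fix using OpenMP's own primitives, assuming the Queue from the question (only these two member functions change): every access to the shared head/tail goes through one named critical section, and add and remove must use the same name so they exclude each other.
// Minimal sketch: serialize all access to head/tail with one named
// OpenMP critical section shared by add() and remove().
Node* remove() {
    Node *node;
    #pragma omp critical (queue_lock)
    {
        node = head;
        if (head != NULL)
            head = head->next;
    }
    return node;
}

void add(Node *node) {
    #pragma omp critical (queue_lock)
    {
        if (head == NULL) {
            head = node;
            tail = node;
        } else {
            tail->next = node;
            tail = node;
        }
    }
}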

Related

TSan race disappears with alignas(32)

I have an implementation of a lockfree queue, which I believe to be correct (or at least data race-free):
#include <atomic>
#include <cstdint> // for std::uintptr_t
#include <iostream>
#include <optional>
#include <thread>

struct Job {
    int id;
    int data;
};

class JobQueue {
    using stdmo = std::memory_order;

    struct Node {
        std::atomic<Node *> next = QUEUE_END;
        Job job;
    };

    static inline Node *const QUEUE_END = nullptr;
    static inline Node *const STACK_END = QUEUE_END + 1;

    struct GenNodePtr {
        Node *node;
        std::uintptr_t gen;
    };

    alignas(64) std::atomic<Node *> jobs_back;
    alignas(64) std::atomic<GenNodePtr> jobs_front;
    alignas(64) std::atomic<GenNodePtr> stack_top;

public:
    JobQueue()
        : jobs_back{new Node{}},
          jobs_front{GenNodePtr{jobs_back.load(stdmo::relaxed), 1}},
          stack_top{GenNodePtr{STACK_END, 1}} {}

    ~JobQueue() {
        Node *cur_queue = jobs_front.load(stdmo::relaxed).node;
        while (cur_queue != QUEUE_END) {
            Node *next = cur_queue->next;
            delete cur_queue;
            cur_queue = next;
        }
        Node *cur_stack = stack_top.load(stdmo::relaxed).node;
        while (cur_stack != STACK_END) {
            Node *next = cur_stack->next;
            delete cur_stack;
            cur_stack = next;
        }
    }

    Node *allocate_node() {
        GenNodePtr cur_stack = stack_top.load(stdmo::acquire);
        while (true) {
            if (cur_stack.node == STACK_END) {
                return new Node{};
            }
            Node *cur_stack_next = cur_stack.node->next.load(stdmo::relaxed);
            GenNodePtr new_stack{cur_stack_next, cur_stack.gen + 1};
            if (stack_top.compare_exchange_weak(cur_stack, new_stack,
                                                stdmo::acq_rel)) {
                return cur_stack.node;
            }
        }
    }

    void deallocate_node(Node *node) {
        GenNodePtr cur_stack = stack_top.load(stdmo::acquire);
        while (true) {
            node->next.store(cur_stack.node, stdmo::relaxed);
            GenNodePtr new_stack{node, cur_stack.gen + 1};
            if (stack_top.compare_exchange_weak(cur_stack, new_stack,
                                                stdmo::acq_rel)) {
                break;
            }
        }
    }

public:
    void enqueue(Job job) {
        Node *new_node = allocate_node();
        new_node->next.store(QUEUE_END, stdmo::relaxed);
        Node *old_dummy = jobs_back.exchange(new_node, stdmo::acq_rel);
        old_dummy->job = job;
        old_dummy->next.store(new_node, stdmo::release);
    }

    std::optional<Job> try_dequeue() {
        GenNodePtr old_front = jobs_front.load(stdmo::relaxed);
        while (true) {
            Node *old_front_next = old_front.node->next.load(stdmo::acquire);
            if (old_front_next == QUEUE_END) {
                return std::nullopt;
            }
            GenNodePtr new_front{old_front_next, old_front.gen + 1};
            if (jobs_front.compare_exchange_weak(old_front, new_front,
                                                 stdmo::relaxed)) {
                break;
            }
        }
        Job job = old_front.node->job;
        deallocate_node(old_front.node);
        return job;
    }
};

int main() {
    JobQueue queue;
    std::atomic<int> i = 0;

    std::thread consumer{[&queue, &i]() {
        // producer enqueues 1
        while (i.load(std::memory_order_relaxed) != 1) {}
        std::atomic_thread_fence(std::memory_order_acq_rel);
        std::cout << queue.try_dequeue().value_or(Job{-1, -1}).data
                  << std::endl;
        std::atomic_thread_fence(std::memory_order_acq_rel);
        i.store(2, std::memory_order_relaxed);
        // producer enqueues 2 and 3
    }};

    std::thread producer{[&queue, &i]() {
        queue.enqueue(Job{1, 1});
        std::atomic_thread_fence(std::memory_order_acq_rel);
        i.store(1, std::memory_order_relaxed);
        // consumer consumes here
        while (i.load(std::memory_order_relaxed) != 2) {}
        std::atomic_thread_fence(std::memory_order_acq_rel);
        queue.enqueue(Job{2, 2});
        queue.enqueue(Job{3, 3});
    }};

    producer.join();
    consumer.join();
    return 0;
}
This queue is implemented as a singly-linked, double-ended list. It uses a dummy node to decouple producers and consumers, and it uses a generation counter and node recycling (via an internal stack) to avoid the ABA problem and a use-after-free in try_dequeue.
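Since the generation counter is the part that usually raises eyebrows, here is a tiny self-contained sketch (the names GenPtr and top, and the scenario, are invented for illustration, not taken from the queue above) of the ABA hazard that pairing a pointer with a generation defends against:
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustration only: a {pointer, generation} pair makes a stale CAS fail
// even when the same node address reappears at the top of a stack.
struct Node {};
struct GenPtr {
    Node *node;
    std::uintptr_t gen;
};

int main() {
    Node a;
    // Note: a 16-byte atomic may not be lock-free, but it is still correct.
    std::atomic<GenPtr> top{GenPtr{&a, 1}};

    // Thread 1 reads the top of the stack, then gets delayed.
    GenPtr stale = top.load();

    // Meanwhile, other threads pop `a` and later push it back: the address
    // is the same as before, but every update bumped the generation.
    top.store(GenPtr{&a, 3});

    // Thread 1 resumes. A raw-pointer CAS would wrongly succeed here;
    // the {pointer, generation} CAS fails, as intended.
    bool ok = top.compare_exchange_strong(stale, GenPtr{nullptr, stale.gen + 1});
    assert(!ok);
    return 0;
}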
Running this under TSan compiled with Clang 13.0.1, Linux x64, I get the following race:
WARNING: ThreadSanitizer: data race (pid=223081)
Write of size 8 at 0x7b0400000008 by thread T2:
#0 JobQueue::enqueue(Job) .../bug4.cpp:85 (bug4.tsan+0xe3e53)
#1 operator() .../bug4.cpp:142 (bug4.tsan+0xe39ee)
...
Previous read of size 8 at 0x7b0400000008 by thread T1:
#0 JobQueue::try_dequeue() .../bug4.cpp:104 (bug4.tsan+0xe3c07)
#1 operator() .../bug4.cpp:121 (bug4.tsan+0xe381c)
...
Run on Godbolt (note: because of how Godbolt runs the program, TSan doesn't show line number information)
This race is between this previous read in try_dequeue called from the consumer thread:
Job job = old_front.node->job;
and this later write in enqueue, which is the third call to enqueue by the producer thread:
old_dummy->job = job;
I believe this race to be impossible, because the producer thread should synchronise with the consumer thread via the acquire-release compare-exchange to stack_top in allocate_node and deallocate_node.
Now, the weird thing is that making GenNodePtr alignas(32) removes the race.
Run on Godbolt
Questions:
Is this race actually possible?
Why does increasing the alignment of GenNodePtr make TSan no longer register a race?

Troubles with simple Lock-Free MPSC Ring Buffer

I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer. The main idea is to have atomic head and tail indices. When pushing an element to the queue, the head is increased atomically to reserve a slot in the buffer:
#include <atomic>
#include <chrono>
#include <iostream>
#include <memory> // for std::unique_ptr
#include <stdexcept>
#include <thread>
#include <vector>

template <class T> class MPSC {
private:
    int MAX_SIZE;
    std::atomic<int> head{0}; ///< index of first free slot
    std::atomic<int> tail{0}; ///< index of first occupied slot
    std::unique_ptr<T[]> data;
    std::unique_ptr<std::atomic<bool>[]> valid; ///< indicates whether data at an
                                                ///< index has been fully written

    /// Compute next index modulo size.
    inline int advance(int x) { return (x + 1) % MAX_SIZE; }

public:
    explicit MPSC(int size) {
        if (size <= 0)
            throw std::invalid_argument("size must be greater than 0");
        MAX_SIZE = size + 1;
        data = std::make_unique<T[]>(MAX_SIZE);
        valid = std::make_unique<std::atomic<bool>[]>(MAX_SIZE);
    }

    /// Add an element to the queue.
    ///
    /// If the queue is full, this method blocks until a slot is available for
    /// writing. This method is not starvation-free, i.e. it is possible that one
    /// thread always fills up the queue and prevents others from pushing.
    void push(const T &msg) {
        int idx;
        int next_idx;
        int k = 100;
        do {
            idx = head;
            next_idx = advance(idx);
            while (next_idx == tail) { // queue is full
                k = k >= 100000 ? k : k * 2; // exponential backoff
                std::this_thread::sleep_for(std::chrono::nanoseconds(k));
            } // spin
        } while (!head.compare_exchange_weak(idx, next_idx));

        if (valid[idx])
            // this throws, suggesting that two threads are writing to the same
            // index. I have no idea how this is possible.
            throw std::runtime_error("message slot already written");

        data[idx] = msg;
        valid[idx] = true; // this was set to false by the reader,
                           // set it to true to indicate completed data write
    }

    /// Read an element from the queue.
    ///
    /// If the queue is empty, this method blocks until a message is available.
    /// This method is only safe to be called from one single reader thread.
    T pop() {
        int k = 100;
        while (is_empty() || !valid[tail]) {
            k = k >= 100000 ? k : k * 2;
            std::this_thread::sleep_for(std::chrono::nanoseconds(k));
        } // spin
        T res = data[tail];
        valid[tail] = false;
        tail = advance(tail);
        return res;
    }

    bool is_full() { return (head + 1) % MAX_SIZE == tail; }
    bool is_empty() { return head == tail; }
};
When there is a lot of congestion, some messages get overwritten by other threads. Hence there must be something fundamentally wrong with what I'm doing here.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
Even if a producer were to pause just before writing its data, the tail could not increase past this thread's idx, and hence no other thread should be able to overtake and claim that same idx.
EDIT
At the risk of posting too much code, here is a simple program that reproduces the problem. It sends some incrementing numbers from many threads and checks whether all numbers are received by the consumer:
#include "mpsc.hpp" // or whatever; the above queue
#include <thread>
#include <iostream>
int main() {
static constexpr int N_THREADS = 10; ///< number of threads
static constexpr int N_MSG = 1E+5; ///< number of messages per thread
struct msg {
int t_id;
int i;
};
MPSC<msg> q(N_THREADS / 2);
std::thread threads[N_THREADS];
// consumer
threads[0] = std::thread([&q] {
int expected[N_THREADS] {};
for (int i = 0; i < N_MSG * (N_THREADS - 1); ++i) {
msg m = q.pop();
std::cout << "Got message from T-" << m.t_id << ": " << m.i << std::endl;
if (expected[m.t_id] != m.i) {
std::cout << "T-" << m.t_id << " unexpected msg " << m.i << "; expected " << expected[m.t_id] << std::endl;
return -1;
}
expected[m.t_id] = m.i + 1;
}
});
// producers
for (int id = 1; id < N_THREADS; ++id) {
threads[id] = std::thread([id, &q] {
for (int i = 0; i < N_MSG; ++i) {
q.push(msg{id, i});
}
});
}
for (auto &t : threads)
t.join();
}
I am trying to implement an array-based ring buffer that is thread-safe for multiple producers and a single consumer.
I assume you are doing this as a learning exercise. Implementing a lock-free queue yourself is most probably the wrong thing to do if you want to solve a real problem.
What seems to be happening is that two threads are acquiring the same index to write their data to. Why could that be?
The combination of that producer spinlock with the outer CAS loop does not work in the intended way:
do {
    idx = head;
    next_idx = advance(idx);
    while (next_idx == tail) { // queue is full
        k = k >= 100000 ? k : k * 2; // exponential backoff
        std::this_thread::sleep_for(std::chrono::nanoseconds(k));
    } // spin
    //
    // ...
    //
    // All other threads (producers and consumers) can progress.
    //
    // ...
    //
} while (!head.compare_exchange_weak(idx, next_idx));
The queue may be full when the CAS happens, because those two checks are performed independently. In addition, the CAS may succeed even though the queue has been modified in the meantime: other threads may have advanced head all the way around the ring so that it again exactly matches idx.
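One common way to repair this (a sketch, not a drop-in tested fix): make head and tail monotonically increasing counters that are only reduced modulo the buffer size when indexing. Then the fullness check and the CAS operate on the same snapshot, and the CAS can never succeed on a lapped value because counter values never repeat (a real version would use a 64-bit unsigned counter so overflow is a non-issue):
// Sketch: head and tail only ever grow; slots are addressed as
// counter % MAX_SIZE. pop()/is_empty()/is_full() need the matching
// change (tail grows monotonically too).
void push(const T &msg) {
    int k = 100;
    while (true) {
        int h = head.load();
        if (h - tail.load() >= MAX_SIZE - 1) { // full for this snapshot
            k = k >= 100000 ? k : k * 2;       // back off, then re-read
            std::this_thread::sleep_for(std::chrono::nanoseconds(k));
            continue;
        }
        if (head.compare_exchange_weak(h, h + 1)) { // reserve slot h
            data[h % MAX_SIZE] = msg;
            valid[h % MAX_SIZE] = true; // consumer resets this to false
            return;
        }
    }
}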

Trying to process linked list data in parallel with OpenMP

I am trying to process linked list data in parallel with OpenMP in C++. I'm pretty new to OpenMP and pretty rusty with C++. What I want to do is get several threads to break up the linked list, and output the data of the Nodes in their particular range. I don't care about the order in which the output occurs. If I can get this working, I want to replace the simple output with some actual processing of the Node data.
I've found several things on the internet (including a few questions on this site), and from what I found, I cobbled together code like this:
#include <iostream>
#include <omp.h>
// various and sundry other stuff ...

struct Node {
    int data;
    Node* next;
};

int main() {
    struct Node *newHead;
    struct Node *head = new Node;
    struct Node *currNode;
    int n;
    int tid;

    // create a bunch of Nodes in linked list with "data" ...

    // traverse the linked list:
    // examine data
    #pragma omp parallel private(tid)
    {
        currNode = head;
        tid = omp_get_thread_num();
        #pragma omp single
        {
            while (currNode) {
                #pragma omp task firstprivate(currNode)
                {
                    cout << "Node data: " << currNode->data << " " << tid << "\n";
                } // end of pragma omp task
                currNode = currNode->next;
            } // end of while
        } // end of pragma omp single
    } // end of pragma omp parallel

    // clean up etc. ...
} // end of main
So I run:
>: export OMP_NUM_THREADS=6
>: g++ -fopenmp ll_code.cpp
>: ./a.out
And the output is:
Node data: 5 0
Node data: 10 0
Node data: 20 0
Node data: 30 0
Node data: 35 0
Node data: 40 0
Node data: 45 0
Node data: 50 0
Node data: 55 0
Node data: 60 0
Node data: 65 0
Node data: 70 0
Node data: 75 0
So, tid is always 0. And that means, unless I'm really misunderstanding something, only one thread did anything with the linked list, and so the linked list was not traversed in parallel at all.
When I get rid of single, the code fails with a seg fault. I have tried moving a few variables in and out of the OpenMP directive scopes, with no change. Changing the number of threads has no effect. How can this be made to work?
A secondary question: Some sites say the firstprivate(currNode) is necessary and others say currNode is firstprivate by default. Who is right?
You certainly can traverse a linked list using multiple threads, but it will actually be slower than just using a single thread.
The reason is that, to know the address of node N != 0, you must know the address of node N-1.
Assume now that you have N threads, each responsible for starting at position i. The previous paragraph implies that thread i depends on the result of thread i-1, which in turn depends on the result of thread i-2, and so on.
What you end up with is a serial traversal anyway. But now, instead of just a simple loop, you also have to synchronize threads, making things inherently slower.
But if you're trying to do some heavy processing that would benefit from being run in parallel, then yes, this is the right approach. You just need to change how you get the thread id:
#include <iostream>
#include <omp.h>

struct Node {
    int data;
    Node* next;
};

int main() {
    struct Node *head = new Node;
    struct Node *currNode = head;
    head->data = 0;
    for (int i = 1; i < 10; ++i) {
        currNode->next = new Node;
        currNode = currNode->next;
        currNode->data = i;
    }

    // traverse the linked list:
    // examine data
    #pragma omp parallel
    {
        currNode = head;
        #pragma omp single
        {
            while (currNode) {
                #pragma omp task firstprivate(currNode)
                {
                    #pragma omp critical (cout)
                    std::cout << "Node data: " << currNode->data << " " << omp_get_thread_num() << "\n";
                }
                currNode = currNode->next;
            }
        }
    }
}
Possible output:
Node data: 0 4
Node data: 6 4
Node data: 7 4
Node data: 8 4
Node data: 9 4
Node data: 1 3
Node data: 2 5
Node data: 3 2
Node data: 4 1
Node data: 5 0
See it live!
Finally, for a more idiomatic approach, consider using a std::forward_list:
#include <forward_list>
#include <iostream>
#include <omp.h>

int main() {
    std::forward_list<int> list;
    for (int i = 0; i < 10; ++i) list.push_front(i);

    #pragma omp parallel
    #pragma omp single
    for (auto data : list) {
        #pragma omp task firstprivate(data)
        #pragma omp critical (cout)
        std::cout << "Node data: " << data << " " << omp_get_thread_num() << "\n";
    }
}

Blocking queue race condition?

I'm trying to implement a high-performance blocking queue backed by a circular buffer on top of pthreads, semaphore.h, and gcc atomic builtins. The queue needs to handle multiple simultaneous readers and writers from different threads.
I've isolated some sort of race condition, and I'm not sure if it's a faulty assumption about the behavior of some of the atomic operations and semaphores, or whether my design is fundamentally flawed.
I've extracted and simplified it into the standalone example below. I would expect this program never to return. It does, however, return after a few hundred thousand iterations, with corruption detected in the queue.
In the example below (for exposition) it doesn't actually store anything; it just sets a cell to 1 to represent that it holds data, and to 0 to represent an empty cell. There is a counting semaphore (vacancies) representing the number of vacant cells, and another counting semaphore (occupants) representing the number of occupied cells.
Writers do the following:
decrement vacancies
atomically get next head index (mod queue size)
write to it
increment occupants
Readers do the opposite:
decrement occupants
atomically get next tail index (mod queue size)
read from it
increment vacancies
I would expect that given the above, precisely one thread can be reading or writing any given cell at one time.
Any ideas about why it doesn't work or debugging strategies appreciated. Code and output below...
#include <stdlib.h>
#include <semaphore.h>
#include <pthread.h> // for pthread_create
#include <iostream>

using namespace std;

#define QUEUE_CAPACITY 8 // must be power of 2
#define NUM_THREADS 2

struct CountingSemaphore
{
    sem_t m;
    CountingSemaphore(unsigned int initial) { sem_init(&m, 0, initial); }
    void post() { sem_post(&m); }
    void wait() { sem_wait(&m); }
    ~CountingSemaphore() { sem_destroy(&m); }
};

struct BlockingQueue
{
    unsigned int head; // (head % capacity) is next head position
    unsigned int tail; // (tail % capacity) is next tail position
    CountingSemaphore vacancies; // how many cells are vacant
    CountingSemaphore occupants; // how many cells are occupied
    int cell[QUEUE_CAPACITY];
    // (cell[x] == 1) means occupied
    // (cell[x] == 0) means vacant

    BlockingQueue() :
        head(0),
        tail(0),
        vacancies(QUEUE_CAPACITY),
        occupants(0)
    {
        for (size_t i = 0; i < QUEUE_CAPACITY; i++)
            cell[i] = 0;
    }

    // put an item in the queue
    void put()
    {
        vacancies.wait();
        // atomic post increment
        set(__sync_fetch_and_add(&head, 1) % QUEUE_CAPACITY);
        occupants.post();
    }

    // take an item from the queue
    void take()
    {
        occupants.wait();
        // atomic post increment
        get(__sync_fetch_and_add(&tail, 1) % QUEUE_CAPACITY);
        vacancies.post();
    }

    // set cell i
    void set(unsigned int i)
    {
        // atomic compare and assign
        if (!__sync_bool_compare_and_swap(&cell[i], 0, 1))
        {
            corrupt("set", i);
            exit(-1);
        }
    }

    // get cell i
    void get(unsigned int i)
    {
        // atomic compare and assign
        if (!__sync_bool_compare_and_swap(&cell[i], 1, 0))
        {
            corrupt("get", i);
            exit(-1);
        }
    }

    // corruption detected
    void corrupt(const char* action, unsigned int i)
    {
        static CountingSemaphore sem(1);
        sem.wait();
        cerr << "corruption detected" << endl;
        cerr << "action = " << action << endl;
        cerr << "i = " << i << endl;
        cerr << "head = " << head << endl;
        cerr << "tail = " << tail << endl;
        for (unsigned int j = 0; j < QUEUE_CAPACITY; j++)
            cerr << "cell[" << j << "] = " << cell[j] << endl;
    }
};

BlockingQueue q;

// keep posting to the queue forever
void* Source(void*)
{
    while (true)
        q.put();
    return 0;
}

// keep taking from the queue forever
void* Sink(void*)
{
    while (true)
        q.take();
    return 0;
}

int main()
{
    pthread_t id;
    // start some pthreads to run Source function
    for (int i = 0; i < NUM_THREADS; i++)
        if (pthread_create(&id, NULL, &Source, 0))
            abort();
    // start some pthreads to run Sink function
    for (int i = 0; i < NUM_THREADS; i++)
        if (pthread_create(&id, NULL, &Sink, 0))
            abort();
    while (true);
}
Compile the above as follows:
$ g++ -pthread AboveCode.cpp
$ ./a.out
The output is different every time, but here is one example:
corruption detected
action = get
i = 6
head = 122685
tail = 122685
cell[0] = 0
cell[1] = 0
cell[2] = 1
cell[3] = 0
cell[4] = 1
cell[5] = 0
cell[6] = 1
cell[7] = 1
My system is Ubuntu 11.10 on Intel Core 2:
$ uname -a
Linux 3.0.0-14-generic #23-Ubuntu SMP \
Mon Nov 21 20:28:43 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo | grep Intel
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
$ g++ --version
g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
Thanks,
Andrew.
Here is one possible situation, traced step by step for two writer threads (W0, W1) and one reader thread (R0). W0 entered put() earlier than W1, was interrupted by the OS or hardware, and finished later.
      w0 (core 0)             w1 (core 1)             r0
t0    ----                    ---                     blocked on occupants.wait() / take
t1    entered put()           ---                     ---
t2    vacancies.wait()        entered put()           ---
t3    got new_head = 1        vacancies.wait()        ---
t4    <interrupted by OS>     got new_head = 2        ---
t5                            written 1 at cell[2]    ---
t6                            occupants.post();       ---
t7                            exited put()            waked up
t8                            ---                     got new_tail = 1
t9    <still in interrupt>    ---                     read 0 from cell[1]  !! corruption !!
t10   written 1 at cell[1]
t11   occupants.post();
t12   exited put()
From a design point of view, I would consider the whole queue as a shared resource and protect it with a single mutex (a minimal sketch follows the steps below).
Writers do the following:
take the mutex
write to the queue (including handling of indexes)
free the mutex
Readers do the following:
take the mutex
read from the queue (including handling of indexes)
free the mutex
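Here is that design as a minimal sketch, reusing the members and helpers from the question's BlockingQueue (the semaphores still provide the blocking; the mutex makes reserving an index and marking the cell one indivisible step):
// Sketch: one pthread mutex serializes all updates to head/tail/cell,
// so a reserved index is always fully written before it becomes visible.
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void put()
{
    vacancies.wait();          // block until a cell is vacant
    pthread_mutex_lock(&lock); // the whole queue is one shared resource
    set(head % QUEUE_CAPACITY);
    head++;
    pthread_mutex_unlock(&lock);
    occupants.post();
}

void take()
{
    occupants.wait();          // block until a cell is occupied
    pthread_mutex_lock(&lock);
    get(tail % QUEUE_CAPACITY);
    tail++;
    pthread_mutex_unlock(&lock);
    vacancies.post();
}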
I have a theory. It's a circular queue so one reading thread may be getting lapped. Say a reader takes index 0. Before it does anything it loses the CPU. Another reader thread takes index 1, then 2, then 3 ... then 7, then 0. The first reader wakes up and both threads think they have exclusive access to index 0. Not sure how to prove it. Hope that helps.

C++ Lock free producer/consumer queue

I was looking at the sample code for a lock-free queue at:
http://drdobbs.com/high-performance-computing/210604448?pgno=2
(Also reference in many SO questions such as Is there a production ready lock-free queue or hash implementation in C++)
This looks like it should work for a single producer/consumer, although there are a number of typos in the code. I've updated the code to read as shown below, but it's crashing on me. Anybody have suggestions why?
In particular, should divider and last be declared as something like:
atomic<Node *> divider, last; // shared
I don't have a compiler supporting C++0x on this machine, so perhaps that's all I need...
// Implementation from http://drdobbs.com/high-performance-computing/210604448
// Note that the code in that article (10/26/11) is broken.
// The attempted fixed version is below.
template <typename T>
class LockFreeQueue {
private:
    struct Node {
        Node( T val ) : value(val), next(0) { }
        T value;
        Node* next;
    };
    Node *first,          // for producer only
         *divider, *last; // shared

public:
    LockFreeQueue()
    {
        first = divider = last = new Node(T()); // add dummy separator
    }

    ~LockFreeQueue()
    {
        while( first != 0 ) // release the list
        {
            Node* tmp = first;
            first = tmp->next;
            delete tmp;
        }
    }

    void Produce( const T& t )
    {
        last->next = new Node(t); // add the new item
        last = last->next;        // publish it
        while (first != divider)  // trim unused nodes
        {
            Node* tmp = first;
            first = first->next;
            delete tmp;
        }
    }

    bool Consume( T& result )
    {
        if (divider != last)               // if queue is nonempty
        {
            result = divider->next->value; // C: copy it back
            divider = divider->next;       // D: publish that we took it
            return true;                   // and report success
        }
        return false;                      // else report empty
    }
};
I wrote the following code to test this. Main (not shown) just calls TestQ().
#include "LockFreeQueue.h"
const int numThreads = 1;
std::vector<LockFreeQueue<int> > q(numThreads);
void *Solver(void *whichID)
{
int id = (long)whichID;
printf("Thread %d initialized\n", id);
int result = 0;
do {
if (q[id].Consume(result))
{
int y = 0;
for (int x = 0; x < result; x++)
{ y++; }
y = 0;
}
} while (result != -1);
return 0;
}
void TestQ()
{
std::vector<pthread_t> threads;
for (int x = 0; x < numThreads; x++)
{
pthread_t thread;
pthread_create(&thread, NULL, Solver, (void *)x);
threads.push_back(thread);
}
for (int y = 0; y < 1000000; y++)
{
for (unsigned int x = 0; x < threads.size(); x++)
{
q[x].Produce(y);
}
}
for (unsigned int x = 0; x < threads.size(); x++)
{
q[x].Produce(-1);
}
for (unsigned int x = 0; x < threads.size(); x++)
pthread_join(threads[x], 0);
}
Update: It turns out that the crash is being caused by the queue declaration:
std::vector<LockFreeQueue<int> > q(numThreads);
When I change this to be a simple array, it runs fine. (I implemented a version with locks and it was crashing too.) I see that the destructor is being called immediately after the constructor, resulting in doubly-freed memory. But does anyone know WHY the destructor would be called immediately with a std::vector?
You'll need to make several of the pointers std::atomic, as you note, and you'll need to use compare_exchange_weak in a loop to update them atomically. Otherwise, multiple consumers might consume the same node and multiple producers might corrupt the list.
It's critically important that these writes (just one example from your code) occur in order:
last->next = new Node(t); // add the new item
last = last->next; // publish it
That's not guaranteed by C++ -- the optimizer can rearrange things however it likes, as long as the current thread always acts as-if the program ran exactly the way you wrote it. And then the CPU cache can come along and reorder things further.
You need memory fences. Making the pointers use the atomic type should have that effect.
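For the single-producer/single-consumer case the article targets, the change could look roughly like this (a sketch assuming C++11 <atomic>; only the shared pointers and the two methods shown change, and it remains safe for one producer and one consumer only):
// Sketch: divider and last become std::atomic<Node*>, so the publication
// order in Produce/Consume is enforced (release stores pair with acquire
// loads). Inside LockFreeQueue<T>:
Node *first;                      // touched by the producer only
std::atomic<Node*> divider, last; // shared between producer and consumer

void Produce(const T& t)
{
    Node* l = last.load(std::memory_order_relaxed); // only the producer writes last
    l->next = new Node(t);                          // fill in the new node first,
    last.store(l->next, std::memory_order_release); // then publish it
    while (first != divider.load(std::memory_order_acquire)) // trim consumed nodes
    {
        Node* tmp = first;
        first = first->next;
        delete tmp;
    }
}

bool Consume(T& result)
{
    Node* d = divider.load(std::memory_order_relaxed); // only the consumer writes divider
    if (d != last.load(std::memory_order_acquire))     // queue is nonempty
    {
        result = d->next->value;                            // copy the value out,
        divider.store(d->next, std::memory_order_release);  // then publish the take
        return true;
    }
    return false;
}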
This could be totally off the mark, but I can't help but wonder whether you're having some sort of static initialization related issue... For laughs, try declaring q as a pointer to a vector of lock-free queues and allocating it on the heap in main().
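That experiment would look roughly like this (a sketch only; numThreads and the header name are taken from the test code above, and whether it helps depends on what is actually going wrong):
// Sketch of the heap-allocation experiment suggested above: the vector of
// queues is built inside main() instead of during static initialization.
#include <vector>
#include "LockFreeQueue.h"

const int numThreads = 1;
std::vector<LockFreeQueue<int> > *q; // global pointer; no static construction

int main()
{
    q = new std::vector<LockFreeQueue<int> >(numThreads);
    // ... run the test as before, with q[id] becoming (*q)[id] ...
    delete q;
    return 0;
}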