Lock-Free Queue with boost::atomic - Am I doing this right? - c++

Short version:
I'm trying to replace the C++11 std::atomic used in a lock-free, single-producer, single-consumer queue implementation from here with boost::atomic. How do I do that correctly?
Long version:
I'm trying to get better performance out of our app using worker threads. Each thread has its own task queue, and we currently have to synchronize with a lock before dequeuing/enqueuing each task.
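Roughly, each per-thread queue currently looks like this (a simplified sketch of our setup, with boost::mutex assumed; names are illustrative):

#include <queue>
#include <boost/thread/mutex.hpp>

// Simplified sketch of the locked queue we want to replace.
template <typename T>
class LockedQueue
{
public:
    void Produce(const T& t)
    {
        boost::mutex::scoped_lock lock(m_mutex); // every enqueue takes the lock
        m_queue.push(t);
    }
    bool Consume(T& result)
    {
        boost::mutex::scoped_lock lock(m_mutex); // every dequeue takes the lock
        if (m_queue.empty())
            return false;
        result = m_queue.front();
        m_queue.pop();
        return true;
    }
private:
    std::queue<T> m_queue;
    boost::mutex m_mutex;
};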
Then I found Herb Sutter's article on a lock-free queue. It seems like an ideal replacement, but the code uses std::atomic from C++11, which I can't introduce into the project at this time.
More googling led to some examples, such as this one for Linux (echelon's) and this one for Windows (TINESWARE's). Both use platform-specific constructs such as WinAPI's InterlockedExchangePointer and GCC's __sync_lock_test_and_set.
I only need to support Windows & Linux, so maybe I can get away with some #ifdefs. But I thought it might be nicer to use what boost::atomic provides. Boost.Atomic is not part of the official Boost library yet, so I downloaded the source from http://www.chaoticmind.net/~hcb/projects/boost.atomic/ and use the include files with my project.
This is what I get so far:
#pragma once

#include <boost/atomic.hpp>

template <typename T>
class LockFreeQueue
{
private:
    struct Node
    {
        Node(T val) : value(val), next(NULL) { }
        T value;
        Node* next;
    };

    Node* first;                  // for producer only
    boost::atomic<Node*> divider; // shared
    boost::atomic<Node*> last;    // shared

public:
    LockFreeQueue()
    {
        first = new Node(T());
        divider = first;
        last = first;
    }

    ~LockFreeQueue()
    {
        while (first != NULL) // release the list
        {
            Node* tmp = first;
            first = tmp->next;
            delete tmp;
        }
    }

    void Produce(const T& t)
    {
        last.load()->next = new Node(t); // add the new item
        last = last.load()->next;
        while (first != divider) // trim unused nodes
        {
            Node* tmp = first;
            first = first->next;
            delete tmp;
        }
    }

    bool Consume(T& result)
    {
        if (divider != last) // if queue is nonempty
        {
            result = divider.load()->next->value; // C: copy it back
            divider = divider.load()->next;
            return true; // and report success
        }
        return false; // else report empty
    }
};
Some modifications to note:
boost::atomic<Node*> divider; // shared
boost::atomic<Node*> last; // shared
and
last.load()->next = new Node(t); // add the new item
last = last.load()->next;
and
result = divider.load()->next->value; // C: copy it back
divider = divider.load()->next;
Am I applying load() (and the implicit store()) from boost::atomic correctly here? Can we say this is equivalent to Sutter's original C++11 lock-free queue?
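For reference, here is how I imagine the same two methods would look with explicit memory orderings (this is my assumption about what the sequentially-consistent defaults are already guaranteeing, and part of what I'd like checked):

// My assumption: making the release/acquire pairing explicit.
void Produce(const T& t)
{
    Node* lastNode = last.load(boost::memory_order_relaxed); // producer is the only writer of 'last'
    lastNode->next = new Node(t);                            // add the new item
    last.store(lastNode->next, boost::memory_order_release); // publish it
    while (first != divider.load(boost::memory_order_acquire)) // trim consumed nodes
    {
        Node* tmp = first;
        first = first->next;
        delete tmp;
    }
}

bool Consume(T& result)
{
    Node* div = divider.load(boost::memory_order_relaxed);   // consumer is the only writer of 'divider'
    if (div != last.load(boost::memory_order_acquire))       // pairs with the producer's release store
    {
        result = div->next->value;                           // copy it back
        divider.store(div->next, boost::memory_order_release); // publish that we took it
        return true;
    }
    return false;
}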
PS. I studied many of the threads on SO, but none seems to provide an example of boost::atomic with a lock-free queue.

Have you tried Intel Threading Building Blocks' atomic<T>? Cross-platform and free.
Also...
Single producer/single consumer makes your problem much easier because your linearization point can be a single operation. It becomes easier still if you are prepared to accept a bounded queue.
A bounded queue offers advantages for cache performance because you can reserve a cache-aligned memory block to maximize your hits, e.g.:
#include <vector>
#include "tbb/atomic.h"
#include "tbb/cache_aligned_allocator.h"

template <typename T>
class SingleProducerSingleConsumerBoundedQueue {
    typedef std::vector<T, tbb::cache_aligned_allocator<T> > queue_type;
public:
    SingleProducerSingleConsumerBoundedQueue(int capacity) :
        queue(capacity) { // size (not just reserve) the buffer so the slots are constructed
        head = 0;
        tail = 0;
    }

    size_t capacity() {
        return queue.size();
    }

    bool try_pop(T& result) {
        if (tail - head == 0)
            return false;
        else {
            result = queue[head % queue.size()];
            head.fetch_and_increment(); // linearization point
            return true;
        }
    }

    bool try_push(const T& source) {
        if (tail - head == queue.size())
            return false;
        else {
            queue[tail % queue.size()] = source;
            tail.fetch_and_increment(); // linearization point
            return true;
        }
    }

    ~SingleProducerSingleConsumerBoundedQueue() {}

private:
    queue_type queue;
    tbb::atomic<size_t> head;
    tbb::atomic<size_t> tail;
};
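A hypothetical usage sketch (one producer thread pushing, one consumer thread popping; the spin loops are illustrative only):

#include <cstdio>

int main() {
    SingleProducerSingleConsumerBoundedQueue<int> q(1024);

    // producer side:
    while (!q.try_push(42)) { /* queue full: spin or yield */ }

    // consumer side:
    int v = 0;
    if (q.try_pop(v))
        std::printf("popped %d\n", v); // prints 42
    return 0;
}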

Check out this boost.atomic ringbuffer example from the documentation:
#include <boost/atomic.hpp>

template <typename T, size_t Size>
class ringbuffer
{
public:
    ringbuffer() : head_(0), tail_(0) {}

    bool push(const T & value)
    {
        size_t head = head_.load(boost::memory_order_relaxed);
        size_t next_head = next(head);
        if (next_head == tail_.load(boost::memory_order_acquire))
            return false;
        ring_[head] = value;
        head_.store(next_head, boost::memory_order_release);
        return true;
    }

    bool pop(T & value)
    {
        size_t tail = tail_.load(boost::memory_order_relaxed);
        if (tail == head_.load(boost::memory_order_acquire))
            return false;
        value = ring_[tail];
        tail_.store(next(tail), boost::memory_order_release);
        return true;
    }

private:
    size_t next(size_t current)
    {
        return (current + 1) % Size;
    }

    T ring_[Size];
    boost::atomic<size_t> head_, tail_;
};

// How to use
int main()
{
    ringbuffer<int, 32> r;

    // try to insert an element
    if (r.push(42)) { /* succeeded */ }
    else { /* buffer full */ }

    // try to retrieve an element
    int value;
    if (r.pop(value)) { /* succeeded */ }
    else { /* buffer empty */ }
}
The code's only limitation is that the buffer length has to be known at compile time (or at construction time, if you replace the array with a std::vector<T>). Allowing the buffer to grow and shrink is not trivial, as far as I understand.
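For example, a minimal sketch of the construction-time variant (my own adaptation of the example above, not from the Boost documentation):

#include <vector>
#include <boost/atomic.hpp>

// Same ring buffer, but with the capacity fixed at construction instead of compile time.
template <typename T>
class dynamic_ringbuffer
{
public:
    explicit dynamic_ringbuffer(size_t size) : ring_(size), head_(0), tail_(0) {}

    bool push(const T & value)
    {
        size_t head = head_.load(boost::memory_order_relaxed);
        size_t next_head = (head + 1) % ring_.size();
        if (next_head == tail_.load(boost::memory_order_acquire))
            return false;
        ring_[head] = value;
        head_.store(next_head, boost::memory_order_release);
        return true;
    }

    bool pop(T & value)
    {
        size_t tail = tail_.load(boost::memory_order_relaxed);
        if (tail == head_.load(boost::memory_order_acquire))
            return false;
        value = ring_[tail];
        tail_.store((tail + 1) % ring_.size(), boost::memory_order_release);
        return true;
    }

private:
    std::vector<T> ring_;                 // sized once in the constructor, never resized
    boost::atomic<size_t> head_, tail_;
};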


False sharing prevention with alignas is broken

I'm not used to posting questions on the internet, so please tell me if I'm doing something wrong.
In short
How do I correctly prevent false sharing on a 64-bit architecture where the CPU cache line size is 64 bytes?
How can the use of the C++ alignas specifier and a simple byte array (e.g. char[64]) affect multithreading efficiency?
Context
While working on a very efficient implementation of a single-producer single-consumer queue, I encountered behaviour from the GCC compiler that I can't explain while benchmarking my code.
Full story
I hope somebody will have the necessary knowledge to explain what is going on.
I'm currently using GCC 10.2.0 and its C++20 implementation on Arch Linux. My laptop is a Lenovo T470s with an i7-7500U processor.
Let me begin with the data structure:
class SPSCQueue
{
public:
    ...
private:
    alignas(64) std::atomic<size_t> _tail { 0 };            // Tail accessed by both producer and consumer
    Buffer _buffer {};                                      // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 };                           // Head cache for the producer
    char _pad0[64 - sizeof(Buffer) - sizeof(std::size_t)];  // 64-byte alignment padding

    alignas(64) std::atomic<size_t> _head { 0 };            // Head accessed by both producer and consumer
    Buffer _buffer2 {};                                     // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 };                           // Tail cache for the consumer
    char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)];  // 64-byte alignment padding
};
The data structure above achieves a fast and stable 20 ns per push/pop on my system.
However, merely changing the layout to the following members makes the benchmark unstable, giving between 20 and 30 ns:
alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
struct alignas(64) {
    Buffer _buffer {};            // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 }; // Head cache for the producer
};

alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
struct alignas(64) {
    Buffer _buffer2 {};           // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 }; // Tail cache for the consumer
};
Lastly, I got even more lost when I tried this configuration, which gives results between 40 and 55 ns:
std::atomic<size_t> _tail { 0 };             // Tail accessed by both producer and consumer
char _pad0[64 - sizeof(std::atomic<size_t>)];
Buffer _buffer {};                           // Buffer cache for the producer, equivalent to _buffer2
std::size_t _headCache { 0 };                // Head cache for the producer
char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)];

std::atomic<size_t> _head { 0 };             // Head accessed by both producer and consumer
char _pad2[64 - sizeof(std::atomic<size_t>)];
Buffer _buffer2 {};                          // Buffer cache for the consumer, equivalent to _buffer
std::size_t _tailCache { 0 };                // Tail cache for the consumer
char _pad3[64 - sizeof(Buffer) - sizeof(std::size_t)];
This time both push and pop oscillate between 40 and 55 ns.
I'm very lost at this point because I don't know where I should look for answers. Until now C++ memory layout has been very intuitive for me, but I've realized I'm still missing some important knowledge needed for high-frequency multithreading.
Minimal code sample
If you wish to compile the whole code to test it yourself, here are the few files needed:
SPSCQueue.hpp:
#pragma once

#include <atomic>
#include <cstdlib>
#include <cinttypes>
#include <new>         // placement new
#include <utility>     // std::move / std::forward
#include <type_traits> // std::is_constructible / std::is_move_assignable
#include <stdexcept>   // std::runtime_error (thrown in the constructor)

#define KF_ALIGN_CACHELINE alignas(kF::Core::Utils::CacheLineSize)

namespace kF::Core
{
    template<typename Type>
    class SPSCQueue;

    namespace Utils
    {
        /** @brief Helper used to perfect forward move / copy constructor */
        template<typename Type, bool ForceCopy = false>
        void ForwardConstruct(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_assignable_v<Type>)
                new (dest) Type(std::move(*source));
            else
                new (dest) Type(*source);
        }

        /** @brief Helper used to perfect forward move / copy assignment */
        template<typename Type, bool ForceCopy = false>
        void ForwardAssign(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_assignable_v<Type>)
                *dest = std::move(*source);
            else
                *dest = *source;
        }

        /** @brief Theoretical cache line size */
        constexpr std::size_t CacheLineSize = 64ul;
    }
}

/**
 * @brief The SPSC queue is a lock-free queue that only supports a Single Producer and a Single Consumer.
 * The queue is really fast compared to other, more flexible implementations because the fact that only
 * two threads can simultaneously read / write means that less synchronization is needed for each operation.
 * The queue supports ranged push / pop to insert multiple elements without performance impact.
 *
 * @tparam Type to be inserted
 */
template<typename Type>
class kF::Core::SPSCQueue
{
public:
    /** @brief Buffer structure containing all cells */
    struct Buffer
    {
        Type *data { nullptr };
        std::size_t capacity { 0 };
    };

    /** @brief Local thread cache */
    struct Cache
    {
        Buffer buffer {};
        std::size_t value { 0 };
    };

    /** @brief Default constructor initializes the queue */
    SPSCQueue(const std::size_t capacity);

    /** @brief Destruct and release all memory (unsafe) */
    ~SPSCQueue(void) { clear(); std::free(_buffer.data); }

    /** @brief Push a single element into the queue
     *  @return true if the element has been inserted */
    template<typename ...Args>
    [[nodiscard]] inline bool push(Args &&...args);

    /** @brief Pop a single element from the queue
     *  @return true if an element has been extracted */
    [[nodiscard]] inline bool pop(Type &value);

    /** @brief Clear all elements of the queue (unsafe) */
    void clear(void);

private:
    KF_ALIGN_CACHELINE std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    struct {
        Buffer _buffer {};            // Buffer cache for the producer, equivalent to _buffer2
        std::size_t _headCache { 0 }; // Head cache for the producer
        char _pad0[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    KF_ALIGN_CACHELINE std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    struct {
        Buffer _buffer2 {};           // Buffer cache for the consumer, equivalent to _buffer
        std::size_t _tailCache { 0 }; // Tail cache for the consumer
        char _pad1[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    /** @brief Copy and move constructors disabled */
    SPSCQueue(const SPSCQueue &other) = delete;
    SPSCQueue(SPSCQueue &&other) = delete;
};

static_assert(sizeof(kF::Core::SPSCQueue<int>) == 4 * kF::Core::Utils::CacheLineSize);

template<typename Type>
kF::Core::SPSCQueue<Type>::SPSCQueue(const std::size_t capacity)
{
    _buffer.capacity = capacity;
    if (_buffer.data = reinterpret_cast<Type *>(std::malloc(sizeof(Type) * capacity)); !_buffer.data)
        throw std::runtime_error("Core::SPSCQueue: Malloc failed");
    _buffer2 = _buffer;
}

template<typename Type>
template<typename ...Args>
bool kF::Core::SPSCQueue<Type>::push(Args &&...args)
{
    static_assert(std::is_constructible<Type, Args...>::value, "Type must be constructible from Args...");

    const auto tail = _tail.load(std::memory_order_relaxed);
    auto next = tail + 1;
    if (next == _buffer.capacity) [[unlikely]]
        next = 0;
    if (auto head = _headCache; next == head) [[unlikely]] {
        head = _headCache = _head.load(std::memory_order_acquire);
        if (next == head) [[unlikely]]
            return false;
    }
    new (_buffer.data + tail) Type{ std::forward<Args>(args)... };
    _tail.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
bool kF::Core::SPSCQueue<Type>::pop(Type &value)
{
    const auto head = _head.load(std::memory_order_relaxed);
    if (auto tail = _tailCache; head == tail) [[unlikely]] {
        tail = _tailCache = _tail.load(std::memory_order_acquire);
        if (head == tail) [[unlikely]]
            return false;
    }
    auto *elem = reinterpret_cast<Type *>(_buffer2.data + head);
    auto next = head + 1;
    if (next == _buffer2.capacity) [[unlikely]]
        next = 0;
    value = std::move(*elem);
    elem->~Type();
    _head.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
void kF::Core::SPSCQueue<Type>::clear(void)
{
    for (Type type; pop(type););
}
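For completeness, a minimal single-threaded usage sketch of the queue (my own, not part of the original files):

#include <cstdio>
#include "SPSCQueue.hpp"

int main()
{
    kF::Core::SPSCQueue<int> queue(8); // capacity fixed at construction

    if (queue.push(42))                // returns false if the ring is full
        std::printf("pushed\n");

    int value = 0;
    if (queue.pop(value))              // returns false if the ring is empty
        std::printf("popped %d\n", value);
    return 0;
}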
The benchmark uses Google Benchmark.
bench_SPSCQueue.cpp:
#include <thread>
#include <chrono> // needed for the manual timing below
#include <benchmark/benchmark.h>
#include "SPSCQueue.hpp"

using namespace kF;

using Queue = Core::SPSCQueue<std::size_t>;
constexpr std::size_t Capacity = 4096;

static void SPSCQueue_NoisyPush(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] {
        for (std::size_t tmp; running; benchmark::DoNotOptimize(queue.pop(tmp)));
    });
    for (auto _ : state) {
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.push(42ul));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        state.SetIterationTime(elapsed.count());
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPush)->UseManualTime();

static void SPSCQueue_NoisyPop(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] {
        while (running) benchmark::DoNotOptimize(queue.push(42ul));
    });
    for (auto _ : state) {
        std::size_t tmp;
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.pop(tmp));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        state.SetIterationTime(elapsed.count());
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPop)->UseManualTime();
Thanks to your useful comments (and mostly, thanks to Peter Cordes), it seems that the issue was coming from the L2 data prefetcher.
Because of my SPSC queue design, each thread must access two consecutive cache lines to push / pop the queue.
If the structure itself is not aligned to 128 bytes, those two lines can straddle a 128-byte-aligned pair that the L2 spatial prefetcher fetches as a unit, so one thread's accesses pull in a line the other thread is writing and the padding no longer isolates them.
Thus, the simple fix is:
template<typename Type>
class alignas(128) SPSCQueue { ... };
Here (section 2.5.5.4, Data Prefetching) is an interesting paper from Intel explaining optimization on their architectures and how prefetching is done at the various cache levels.
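As a more self-describing way to express the 128-byte unit, C++17 defines std::hardware_destructive_interference_size in <new>. This is my assumption of a portable spelling, not part of the original fix; not every standard library implements the constant (GCC 10 does not), so the sketch falls back to a hard-coded 128:

#include <new>
#include <cstddef>

// Sketch: derive the alignment from the implementation-provided constant when
// available. Doubling it covers the L2 spatial prefetcher's 128-byte line pair
// on recent Intel CPUs (where the constant itself is typically 64).
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t PrefetchPairSize = 2 * std::hardware_destructive_interference_size;
#else
constexpr std::size_t PrefetchPairSize = 128; // assumption for toolchains without the constant
#endif

template<typename Type>
class alignas(PrefetchPairSize) SPSCQueue { /* ... */ };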

lock free stack: what is the correct use of memory order?

The class below describes a lock-free stack of uint32_t sequential values (full code here). For instance, LockFreeIndexStack stack(5); declares a stack containing the numbers {0, 1, 2, 3, 4}. This class has pool semantics: the capacity of the stack is fixed, and only the values originally introduced into the stack can be extracted and reinserted. So at any particular point in time each of those values can be either inside the stack or outside, but not both. A thread may only push an index that it previously obtained via a pop. So the correct usage for a thread is:
auto index = stack.pop(); // get an index from the stack, if available
if (index.isValid()) {
    // do something with 'index'
    stack.push(index);    // return index to the stack
}
Both the push and pop methods are implemented with an atomic load and a CAS loop.
What memory order semantics should I use for the atomic operations in pop and push? (My guesses are in the comments.)
#include <atomic>
#include <cstdint>
#include <vector>

struct LockFreeIndexStack
{
    typedef uint64_t bundle_t;
    typedef uint32_t index_t;

private:
    static const index_t s_null = ~index_t(0);

    typedef std::atomic<bundle_t> atomic_bundle_t;

    union Bundle {
        Bundle(index_t index, index_t count)
        {
            m_value.m_index = index;
            m_value.m_count = count;
        }
        Bundle(bundle_t bundle)
        {
            m_bundle = bundle;
        }
        struct {
            index_t m_index;
            index_t m_count;
        } m_value;
        bundle_t m_bundle;
    };

public:
    LockFreeIndexStack(index_t n)
        : m_top(Bundle(0, 0).m_bundle)
        , m_next(n, s_null)
    {
        for (index_t i = 1; i < n; ++i)
            m_next[i - 1] = i;
    }

    index_t pop()
    {
        Bundle curtop(m_top.load()); // memory_order_acquire?
        while (true) {
            index_t candidate = curtop.m_value.m_index;
            if (candidate != s_null) { // stack is not empty?
                index_t next = m_next[candidate];
                Bundle newtop(next, curtop.m_value.m_count);
                // In the very remote eventuality that, between reading 'm_top' and
                // the CAS operation, other threads cause all of the following
                // circumstances to occur simultaneously:
                // - other threads execute exactly a multiple of 2^32 pop or push
                //   operations, so that 'm_count' assumes its original value again;
                // - the value read as 'candidate' 2^32 transactions ago is again
                //   top of the stack;
                // - the value 'm_next[candidate]' is no longer what it was 2^32
                //   transactions ago;
                // then the stack will get corrupted.
                if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle)) {
                    return candidate;
                }
            }
            else {
                // stack was empty, no point in spinning
                return s_null;
            }
        }
    }

    void push(index_t index)
    {
        Bundle curtop(m_top.load()); // memory_order_relaxed?
        while (true) {
            index_t current = curtop.m_value.m_index;
            m_next[index] = current;
            Bundle newtop = Bundle(index, curtop.m_value.m_count + 1);
            if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle)) {
                return;
            }
        }
    }

private:
    atomic_bundle_t m_top;
    std::vector<index_t> m_next;
};
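To make the question concrete, here are the two methods rewritten with my guessed orderings as drop-in replacements (unverified; this is exactly what I'd like confirmed or corrected). The reasoning: the consumer's read of m_next[candidate] must happen-after the producer's write of m_next[index], so the successful CAS in push needs release and the loads/CAS in pop need acquire.

index_t pop()
{
    Bundle curtop(m_top.load(std::memory_order_acquire));
    while (true) {
        index_t candidate = curtop.m_value.m_index;
        if (candidate == s_null)
            return s_null; // stack was empty, no point in spinning
        index_t next = m_next[candidate];
        Bundle newtop(next, curtop.m_value.m_count);
        if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle,
                                        std::memory_order_acquire,    // success: we own 'candidate'
                                        std::memory_order_acquire)) { // failure: re-read m_top
            return candidate;
        }
    }
}

void push(index_t index)
{
    Bundle curtop(m_top.load(std::memory_order_relaxed));
    while (true) {
        m_next[index] = curtop.m_value.m_index;
        Bundle newtop = Bundle(index, curtop.m_value.m_count + 1);
        if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle,
                                        std::memory_order_release,     // publish m_next[index]
                                        std::memory_order_relaxed)) {  // failure: retry
            return;
        }
    }
}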

Need to reference and update value from nested class C++

Bear with me, I'm new to C++. I'm trying to update a value which is stored in a vector, but I'm getting this error:
non-const lvalue reference to type 'Node'
I'm using a simple wrapper around std::vector so I can share methods like contains and others (similar to Java's ArrayList).
#ifndef A2_NEWFRAMES_H
#define A2_NEWFRAMES_H

#include <vector>
#include <stdexcept> // std::out_of_range

using namespace std;

template <class T>
class NewFrames {
public:
    // truncated ...

    bool contains(T data) {
        for (int i = 0; i < this->vec->size(); i++) {
            if (this->vec->at(i) == data) {
                return true;
            }
        }
        return false;
    }

    int indexOf(T data) {
        for (int i = 0; i < this->vec->size(); i++) {
            if (this->vec->at(i) == data) {
                return i;
            }
        }
        return -1;
    }

    T get(int index) {
        if (index > this->vec->size()) {
            throw std::out_of_range("Cannot get index that exceeds the capacity");
        }
        return this->vec->at(index);
    }

private:
    vector<T> *vec;
};

#endif // A2_NEWFRAMES_H
The class which utilizes this wrapper is defined as follows:
#include "Page.h"
#include "NewFrames.h"
class Algo {
private:
typedef struct Node {
unsigned reference:1;
int data;
unsigned long _time;
Node() { }
Node(int data) {
this->data = data;
this->reference = 0;
this->_time = (unsigned long) time(NULL);
}
} Node;
unsigned _faults;
Page page;
NewFrames<Node> *frames;
};
I'm at a point where I need to reference one of the Node objects inside of the vector, but I need to be able to change reference to a different value. From what I've found on SO, I need to do this:
const Node &n = this->frames->get(this->frames->indexOf(data));
I've tried just using:
Node n = this->frames->get(this->frames->indexOf(data));
n.reference = 1;
and then viewing the data in the debugger, but the value is not updated when I check later on. Consider this:
const int data = this->page.pages[i];
const bool contains = this->frames->contains(Node(data));
Node node = this->frames->get(index);
for (unsigned i = 0; i < this->page.pages.size(); i++) {
    if (node == NULL && !contains) {
        // add node
    } else if (contains) {
        Node n = this->frames->get(this->frames->indexOf(data));
        if (n.reference == 0) {
            n.reference = 1;
        } else {
            n.reference = 0;
        }
    } else {
        // do other stuff
    }
}
With subsequent passes of the loop, the node with that particular data value is somehow different.
But if I attempt to change n.reference, I'll get an error because const is preventing the object from changing. Is there a way I can get this node so I can change it? I'm coming from the friendly Java world where something like this would work, but I want to know/understand why this doesn't work in C++.
Node n = this->frames->get(this->frames->indexOf(data));
n.reference = 1;
This copies the Node from frames and stores the copy as the object n. Modifying the copy does not change the original node.
The simplest "fix" is to use a reference. That means changing the return type of get from T to T&, and changing the previous two lines to
Node& n = this->frames->get(this->frames->indexOf(data));
n.reference = 1;
That should get the code to work. But there is so much indirection in the code that there are likely to be other problems that haven't shown up yet. As @nwp said in a comment, using vector<T> instead of vector<T>* will save you many headaches.
And while I'm giving style advice: get rid of those this->s; they're just noise. Also simplify the belt-and-suspenders validity checks. When you loop from 0 to vec.size() you don't need to re-check the index when you access the element, so change vec.at(i) to vec[i]. In get, note that vec.at(index) will already throw an exception if index is out of bounds, so you can either drop the initial range check, or keep it (after fixing it so that it checks the actual range: index >= size is out of bounds, not just index > size) and, again, use vec[index] instead of vec.at(index).
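Putting that advice together, get could look something like this (a sketch of one possible cleanup, keeping an explicit check for illustration):

T& get(int index) {
    if (index < 0 || index >= static_cast<int>(vec->size()))
        throw std::out_of_range("index out of range");
    return (*vec)[index]; // return a reference, not a copy
}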

Why is dynamic memory allocation not linear in scale?

I am investigating data structures that satisfy O(1) get operations and came across a structure called a trie.
I have implemented the simple trie structure below to hold numbers (digits only).
Ignore the memory leak; it is not the topic here :)
The actual storage in the Data class is not relevant either.
#include <sstream>
#include <string>

struct Data
{
    Data() : m_nData(0) {}
    int m_nData;
};

struct Node
{
    Node() : m_pData(NULL)
    {
        for (size_t n = 0; n < 10; n++)
        {
            digits[n] = NULL;
        }
    }

    void m_zAddPartialNumber(std::string sNumber)
    {
        if (sNumber.empty() == true) // last digit
        {
            m_pData = new Data;
            m_pData->m_nData = 1;
        }
        else
        {
            size_t nDigit = *(sNumber.begin()) - '0';
            if (digits[nDigit] == NULL)
            {
                digits[nDigit] = new Node;
            }
            digits[nDigit]->m_zAddPartialNumber(sNumber.substr(1, sNumber.length() - 1));
        }
    }

    Data* m_pData;
    Node* digits[10];
};

struct DB
{
    DB() : root(NULL) {}

    void m_zAddNumber(std::string sNumber)
    {
        if (root == NULL)
        {
            root = new Node;
        }
        root->m_zAddPartialNumber(sNumber);
    }

    Node* root;
};

int main()
{
    DB oDB;
    for (size_t nNumber = 0; nNumber <= 10000; nNumber++)
    {
        std::ostringstream convert;
        convert << nNumber;
        std::string sNumber = convert.str();
        oDB.m_zAddNumber(sNumber);
    }
    return 0;
}
My main function simply inserts numbers into the data structure.
I examined the overall memory allocated using the Windows Task Manager and came across an interesting behaviour I can't explain, so I'm seeking your advice.
I re-executed my simple program with different numbers inserted into the structure (altering the for-loop stop condition); here is a table of the experiment results: [table not reproduced here]
Plotting the numbers on a logarithmically scaled graph reveals: [graph not reproduced here]
As you can see, the graph is not linear.
My question is: why?
I would expect the allocation to behave linearly across the range.
A linear relation of y on x has the form y = a + b·x. This is a straight line in a y-vs-x plot, but not in a log(y)-vs-log(x) plot, unless the constant a = 0. So I conjecture that your relation may still be (nearly) linear, with a ≈ 340 kB.
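For example, with the hypothetical values a = 340 kB and b = 0.9 kB per insertion: x = 1000 gives y = 1240 kB and x = 10000 gives y = 9340 kB, so the apparent slope between those two points on a log-log plot is log(9340/1240)/log(10) ≈ 0.88 rather than 1, and the curve bends even further below 1 at small x, where the constant a dominates. That is exactly the kind of bow a log-log graph shows even when the underlying relation is linear.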

C++ Lock free producer/consumer queue

I was looking at the sample code for a lock-free queue at:
http://drdobbs.com/high-performance-computing/210604448?pgno=2
(Also reference in many SO questions such as Is there a production ready lock-free queue or hash implementation in C++)
This looks like it should work for a single producer/consumer, although there are a number of typos in the code. I've updated the code to read as shown below, but it's crashing on me. Anybody have suggestions why?
In particular, should divider and last be declared as something like:
atomic<Node *> divider, last; // shared
I don't have a compiler supporting C++0x on this machine, so perhaps that's all I need...
// Implementation from http://drdobbs.com/high-performance-computing/210604448
// Note that the code in that article (10/26/11) is broken.
// The attempted fixed version is below.

template <typename T>
class LockFreeQueue {
private:
    struct Node {
        Node( T val ) : value(val), next(0) { }
        T value;
        Node* next;
    };

    Node *first,          // for producer only
         *divider, *last; // shared

public:
    LockFreeQueue()
    {
        first = divider = last = new Node(T()); // add dummy separator
    }

    ~LockFreeQueue()
    {
        while( first != 0 ) // release the list
        {
            Node* tmp = first;
            first = tmp->next;
            delete tmp;
        }
    }

    void Produce( const T& t )
    {
        last->next = new Node(t); // add the new item
        last = last->next;        // publish it
        while (first != divider)  // trim unused nodes
        {
            Node* tmp = first;
            first = first->next;
            delete tmp;
        }
    }

    bool Consume( T& result )
    {
        if (divider != last)               // if queue is nonempty
        {
            result = divider->next->value; // C: copy it back
            divider = divider->next;       // D: publish that we took it
            return true;                   // and report success
        }
        return false;                      // else report empty
    }
};
I wrote the following code to test this. Main (not shown) just calls TestQ().
#include "LockFreeQueue.h"
const int numThreads = 1;
std::vector<LockFreeQueue<int> > q(numThreads);
void *Solver(void *whichID)
{
int id = (long)whichID;
printf("Thread %d initialized\n", id);
int result = 0;
do {
if (q[id].Consume(result))
{
int y = 0;
for (int x = 0; x < result; x++)
{ y++; }
y = 0;
}
} while (result != -1);
return 0;
}
void TestQ()
{
std::vector<pthread_t> threads;
for (int x = 0; x < numThreads; x++)
{
pthread_t thread;
pthread_create(&thread, NULL, Solver, (void *)x);
threads.push_back(thread);
}
for (int y = 0; y < 1000000; y++)
{
for (unsigned int x = 0; x < threads.size(); x++)
{
q[x].Produce(y);
}
}
for (unsigned int x = 0; x < threads.size(); x++)
{
q[x].Produce(-1);
}
for (unsigned int x = 0; x < threads.size(); x++)
pthread_join(threads[x], 0);
}
Update: It turns out that the crash is being caused by the queue declaration:
std::vector<LockFreeQueue<int> > q(numThreads);
When I change this to a simple array, it runs fine. (I implemented a version with locks and it was crashing too.) I can see that the destructor is being called immediately after the constructor, resulting in doubly-freed memory. But does anyone know WHY the destructor would be called immediately with a std::vector?
You'll need to make several of the pointers std::atomic, as you note, and you'll need to use compare_exchange_weak in a loop to update them atomically. Otherwise, multiple consumers might consume the same node and multiple producers might corrupt the list.
It's critically important that these writes (just one example from your code) occur in order:
last->next = new Node(t); // add the new item
last = last->next; // publish it
That's not guaranteed by C++ -- the optimizer can rearrange things however it likes, as long as the current thread always acts as-if the program ran exactly the way you wrote it. And then the CPU cache can come along and reorder things further.
You need memory fences. Making the pointers use the atomic type should have that effect.
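Concretely, with atomic pointers the publication step in Produce might look like this (a sketch assuming C++11 std::atomic; boost::atomic spells the same thing with boost:: prefixes):

// Sketch: a release store ensures the node's contents are visible to the
// consumer before the pointer that publishes them.
Node* newNode = new Node(t);
last.load(std::memory_order_relaxed)->next = newNode; // producer is the only writer of 'last'
last.store(newNode, std::memory_order_release);       // publish it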
This could be totally off the mark, but I can't help but wonder whether you're having some sort of static initialization related issue... For laughs, try declaring q as a pointer to a vector of lock-free queues and allocating it on the heap in main().