Disruptor: claiming slots in the face of integer overflow - C++

I am implementing the disruptor pattern for inter-thread communication in C++ with multiple producers. In the LMAX implementation, the next(n) method in MultiProducerSequencer.java uses signed integers (okay, it's Java), but the C++ port (disruptor--) also uses signed integers. After a very (very) long time, overflow will result in undefined behavior.
Unsigned integers have multiple advantages:
well-defined wraparound behavior on overflow
no need for 64-bit integers
Here is my approach for claiming n slots (the source is attached at the end): next_ is the index of the next free slot that can be claimed, tail_ is the last free slot that can be claimed (it is updated elsewhere), and n is smaller than the buffer size. My approach is to normalize the next and tail positions for the intermediate calculation by subtracting tail from next. Adding n to the normalized next, norm, must give a value smaller than the buffer size for the slots between next_ and next_ + n to be claimed successfully. It is assumed that norm + n will not overflow.
1) Is it correct, or can next_ get past tail_ in some cases? Does it work with smaller integer types like uint32_t or uint16_t, provided the buffer size and n are restricted, e.g. to 1/10 of the maximum value of these types?
2) If it is not correct, then I would like to know a concrete failing case.
3) Is anything else wrong, or what can be improved? (I omitted the cache-line padding.)
class msg_ctrl
{
public:
    inline msg_ctrl();
    inline int claim(size_t n, uint64_t& seq);
    inline int publish(size_t n, uint64_t seq);
    inline int tail(uint64_t t);
public:
    std::atomic<uint64_t> next_;
    std::atomic<uint64_t> head_;
    std::atomic<uint64_t> tail_;
};
// Implementation -----------------------------------------
msg_ctrl::msg_ctrl() : next_(2), head_(1), tail_(0)
{}
int msg_ctrl::claim(size_t n, uint64_t& seq)
{
    uint64_t const size = msg_buffer::size();
    if (n > 1024) // please do not try to reserve too many slots
        return -1;
    uint64_t curr = 0;
    do
    {
        curr = next_.load();
        uint64_t tail = tail_.load();
        uint64_t norm = curr - tail;
        uint64_t next = norm + n;
        if (next > size)
            std::this_thread::yield(); // todo: some wait strategy
        else if (next_.compare_exchange_weak(curr, curr + n))
            break;
    } while (true);
    seq = curr;
    return 0;
}
int msg_ctrl::publish(size_t n, uint64_t seq)
{
    uint64_t tmp = seq - 1;
    uint64_t val = seq + n - 1;
    while (!head_.compare_exchange_weak(tmp, val))
    {
        tmp = seq - 1;
        std::this_thread::yield();
    }
    return 0;
}
int msg_ctrl::tail(uint64_t t)
{
    tail_.store(t);
    return 0;
}
Publishing to the ring buffer will look like:
size_t n = 15;
uint64_t seq = 0;
msg_ctrl->claim(n, seq);
//fill items in buffer
buffer[seq + 0] = an item
buffer[seq + 1] = an item
...
buffer[seq + n-1] = an item
msg_ctrl->publish(n, seq);
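To make question 1 concrete, here is a minimal standalone check (illustrative only, not part of the implementation) of the wraparound arithmetic that claim() relies on, using uint16_t in place of uint64_t:
#include <cassert>
#include <cstdint>

int main()
{
    // Suppose next_ has wrapped around zero while tail_ has not yet.
    uint16_t tail = 0xFFF0; // tail_, just below the wrap point
    uint16_t curr = 0x0010; // next_, already wrapped
    // Unsigned subtraction is modular, so the distance is still correct:
    uint16_t norm = curr - tail; // 0x0020, not a huge bogus value
    assert(norm == 0x20);
    // As in claim(): the n slots can be taken iff norm + n <= buffer size.
    uint16_t n = 8, size = 1024;
    assert(norm + n <= size);
}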

Related

Lock free stack: what is the correct use of memory order?

The class below describes a lock-free stack of uint32_t sequential values (full code here). For instance, LockFreeIndexStack stack(5); declares a stack containing the numbers {0, 1, 2, 3, 4}. This class has pool semantics: the capacity of the stack is fixed, and only the values originally introduced in the stack can be extracted and reinserted. So at any particular point in time, any of those values can be either inside the stack or outside, but not both. A thread may only push an index that it previously obtained via a pop. So the correct usage is for a thread to do:
auto index = stack.pop(); // get an index from the stack, if available
if (index.isValid()) {
    // do something with 'index'
    stack.push(index); // return index to the stack
}
Both the push and pop methods are implemented with an atomic load and a CAS loop.
What are the correct memory order semantics to use for the atomic operations in pop and push (my guesses are in the comments)?
struct LockFreeIndexStack
{
    typedef uint64_t bundle_t;
    typedef uint32_t index_t;
private:
    static const index_t s_null = ~index_t(0);
    typedef std::atomic<bundle_t> atomic_bundle_t;
    union Bundle {
        Bundle(index_t index, index_t count)
        {
            m_value.m_index = index;
            m_value.m_count = count;
        }
        Bundle(bundle_t bundle)
        {
            m_bundle = bundle;
        }
        struct {
            index_t m_index;
            index_t m_count;
        } m_value;
        bundle_t m_bundle;
    };
public:
    LockFreeIndexStack(index_t n)
        : m_top(Bundle(0, 0).m_bundle)
        , m_next(n, s_null)
    {
        for (index_t i = 1; i < n; ++i)
            m_next[i - 1] = i;
    }
    index_t pop()
    {
        Bundle curtop(m_top.load()); // memory_order_acquire?
        while (true) {
            index_t candidate = curtop.m_value.m_index;
            if (candidate != s_null) { // stack is not empty?
                index_t next = m_next[candidate];
                Bundle newtop(next, curtop.m_value.m_count);
                // In the very remote eventuality that, between reading 'm_top' and
                // the CAS operation, other threads cause all of the below
                // circumstances to occur simultaneously:
                // - other threads execute exactly a multiple of 2^32 pop or push
                //   operations, so that 'm_count' assumes again the original value;
                // - the value read as 'candidate' 2^32 transactions ago is again
                //   top of the stack;
                // - the value 'm_next[candidate]' is no longer what it was 2^32
                //   transactions ago;
                // then the stack will get corrupted.
                if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle)) {
                    return candidate;
                }
            }
            else {
                // stack was empty, no point in spinning
                return s_null;
            }
        }
    }
    void push(index_t index)
    {
        Bundle curtop(m_top.load()); // memory_order_relaxed?
        while (true) {
            index_t current = curtop.m_value.m_index;
            m_next[index] = current;
            Bundle newtop = Bundle(index, curtop.m_value.m_count + 1);
            if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle)) {
                return;
            }
        }
    }
private:
    atomic_bundle_t m_top;
    std::vector<index_t> m_next;
};
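For reference, one plausible assignment of the memory orders, a sketch only, consistent with the guesses in the comments above: pop must observe the store to m_next[candidate] made by the thread that pushed that index, so its loads use acquire and pair with a release on push's successful CAS; the initial load in push only feeds the compare-exchange, so relaxed suffices there.
index_t pop()
{
    Bundle curtop(m_top.load(std::memory_order_acquire));
    while (true) {
        index_t candidate = curtop.m_value.m_index;
        if (candidate == s_null)
            return s_null; // stack was empty
        index_t next = m_next[candidate]; // must see the pusher's write
        Bundle newtop(next, curtop.m_value.m_count);
        if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle,
                                        std::memory_order_acquire,  // success
                                        std::memory_order_acquire)) // failure: curtop is re-read
            return candidate;
    }
}

void push(index_t index)
{
    Bundle curtop(m_top.load(std::memory_order_relaxed));
    while (true) {
        m_next[index] = curtop.m_value.m_index; // published by the release below
        Bundle newtop = Bundle(index, curtop.m_value.m_count + 1);
        if (m_top.compare_exchange_weak(curtop.m_bundle, newtop.m_bundle,
                                        std::memory_order_release,  // success: publish m_next[index]
                                        std::memory_order_relaxed)) // failure: only the value is reused
            return;
    }
}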

Random access to array of raw buffers of different sizes?

I have an array of arrays: struct chunk { char * data; size_t size; }; chunk * chunks;. The data size in each chunk is dynamic and differs between chunks. Linear access to the data is easy with a nested for loop:
for (chunk * chunk_it = chunks; chunk_it != chunks + count; ++chunk_it) {
    for (char * it = chunk_it->data; it != chunk_it->data + chunk_it->size; ++it) {
        /* use it here */
    }
}
I want to turn this into random access to chunks->data using operator[] as an interface, spanning multiple chunks.
It works by linearly searching for the right chunk, then just calculating the offset of the data I want.
template <class T>
void random_access(int n) {
    chunk * c;
    for (int i = 0; i < count; ++i) {
        c = chunks + i;
        size_t size = c->size;
        if (static_cast<size_t>(n) >= size) {
            n -= size; // n lies beyond this chunk; skip it
        } else {
            break; // found: n is now an offset into this chunk
        }
    }
    T * data = reinterpret_cast<T *>(c->data + n);
    // use data here
}
Is there a more efficient way to do this? It would be crazy to do this every time I need a T from chunks. I plan to iterate over all the chunk data linearly, but I want to use the data outside of the function, and thus need to return it from the inner loop (hence why I want to turn it inside out). I also thought of using a function pointer for the inner loop, but I'd rather not, as just writing chunk_iterator[n] is much nicer.
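For comparison, a common alternative (a sketch with hypothetical names, not taken from the question): precompute cumulative end offsets once, then locate the chunk with a binary search, making each lookup O(log count) instead of O(count).
#include <algorithm>
#include <cstddef>
#include <vector>

struct chunk { char * data; std::size_t size; };

// ends[i] = total number of bytes in chunks[0..i]
std::vector<std::size_t> build_index(const chunk * chunks, std::size_t count)
{
    std::vector<std::size_t> ends(count);
    std::size_t total = 0;
    for (std::size_t i = 0; i < count; ++i)
        ends[i] = (total += chunks[i].size);
    return ends;
}

// n must be smaller than the total size recorded in ends.back()
char * at(chunk * chunks, const std::vector<std::size_t> & ends, std::size_t n)
{
    // first chunk whose cumulative end is greater than n
    std::size_t i = std::upper_bound(ends.begin(), ends.end(), n) - ends.begin();
    std::size_t offset = (i == 0) ? n : n - ends[i - 1];
    return chunks[i].data + offset;
}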
I understand your data structure is more complicated, but could you not do something like this?
I build a contiguous block of the chunk data and record the position and size of each chunk in an array:
class chunk_manager
{
    struct chunk
    {
        std::size_t position;
        std::size_t size;
        chunk(std::size_t position, std::size_t size)
            : position(position), size(size) {}
    };
public:
    void add_chunk(std::string const& chunk)
    {
        m_chunks.emplace_back(m_data.size(), chunk.size());
        m_data.append(chunk);
    }
    char* random_access(std::size_t n) { return &m_data[n]; }
    std::size_t size_in_bytes() const { return m_data.size(); }
private:
    std::vector<chunk> m_chunks;
    std::string m_data;
};
int main()
{
    chunk_manager cm;
    cm.add_chunk("abc");
    cm.add_chunk("def");
    cm.add_chunk("ghi");
    for (auto n = 0ULL; n < cm.size_in_bytes(); ++n)
        std::cout << *cm.random_access(n) << '\n'; // dereference to print one char
}

Why does random extra code improve performance?

struct Node {
    Node *N[SIZE];
    int value;
};
struct Trie {
    Node *root;
    Node* findNode(Key *key) {
        Node *C = root;
        char u;
        while (1) {
            u = key->next();
            if (u < 0) return C;
            // if (C->N[0] == C->N[0]); // this line will speed up execution significantly
            C = C->N[u];
            if (C == 0) return 0;
        }
    }
    void addNode(Key *key, int value) { ... };
};
In this implementation of a prefix tree (a.k.a. trie), I found that 90% of findNode()'s execution time is taken by a single operation: C = C->N[u];
In my attempt to speed up this code, I randomly added the line that is commented out in the snippet above, and the code became 30% faster! Why is that?
UPDATE
Here is the complete program.
#include "stdio.h"
#include "sys/time.h"
long time1000() {
timeval val;
gettimeofday(&val, 0);
val.tv_sec &= 0xffff;
return val.tv_sec * 1000 + val.tv_usec / 1000;
}
struct BitScanner {
void *p;
int count, pos;
BitScanner (void *p, int count) {
this->p = p;
this->count = count;
pos = 0;
}
int next() {
int bpos = pos >> 1;
if (bpos >= count) return -1;
unsigned char b = ((unsigned char*)p)[bpos];
if (pos++ & 1) return (b >>= 4);
return b & 0xf;
}
};
struct Node {
Node *N[16];
__int64_t value;
Node() : N(), value(-1) { }
};
struct Trie16 {
Node root;
bool add(void *key, int count, __int64_t value) {
Node *C = &root;
BitScanner B(key, count);
while (true) {
int u = B.next();
if (u < 0) {
if (C->value == -1) {
C->value = value;
return true; // value added
}
C->value = value;
return false; // value replaced
}
Node *Q = C->N[u];
if (Q) {
C = Q;
} else {
C = C->N[u] = new Node;
}
}
}
Node* findNode(void *key, int count) {
Node *C = &root;
BitScanner B(key, count);
while (true) {
char u = B.next();
if (u < 0) return C;
// if (C->N[0] == C->N[1]);
C = C->N[0+u];
if (C == 0) return 0;
}
}
};
int main() {
int T = time1000();
Trie16 trie;
__int64_t STEPS = 100000, STEP = 500000000, key;
key = 0;
for (int i = 0; i < STEPS; i++) {
key += STEP;
bool ok = trie.add(&key, 8, key+222);
}
printf("insert time:%i\n",time1000() - T); T = time1000();
int err = 0;
key = 0;
for (int i = 0; i < STEPS; i++) {
key += STEP;
Node *N = trie.findNode(&key, 8);
if (N==0 || N->value != key+222) err++;
}
printf("find time:%i\n",time1000() - T); T = time1000();
printf("errors:%i\n", err);
}
This is largely a guess, but from what I have read about the CPU data prefetcher, it will only prefetch if it sees multiple accesses to the same memory region and those accesses match prefetch triggers, for example a scanning pattern. In your case, if there is only a single access to C->N the prefetcher is not interested; however, if there are multiple accesses and it can predict that the later accesses are further into the same bit of memory, it can prefetch more than one cache line ahead.
If the above is happening, then C->N[u] does not have to wait for memory to arrive from RAM and is therefore faster.
It looks like what you are doing is preventing processor stalls by delaying the execution of code until the data is available locally.
Doing it this way is very error-prone and unlikely to keep working consistently. The better way is to get the compiler to do this. By default, most compilers generate code for a generic processor family; but if you look at the available flags you can usually find flags for specifying your specific processor, so it can generate more specific code (like prefetches and stall code).
See GCC: how is march different from mtune? The second answer goes into some detail: https://stackoverflow.com/a/23267520/14065
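For illustration (a sketch, not from either answer): GCC and Clang also expose this hint directly as __builtin_prefetch, which states the intent explicitly instead of relying on a redundant compare. Applied to Trie16::findNode from the complete program above:
Node* findNode(void *key, int count) {
    Node *C = &root;
    BitScanner B(key, count);
    while (true) {
        char u = B.next();
        if (u < 0) return C;
        C = C->N[0 + u];
        if (C == 0) return 0;
        // Start fetching the new node's child array while B.next()
        // extracts the next nibble of the key.
        __builtin_prefetch(&C->N[0]);
    }
}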
A write operation is more costly than a read.
Here, if you look at C = C->N[u];, the CPU is executing a write to the variable C in each iteration.
But when you perform if (C->N[0] == C->N[1]) dummy++;, the write to dummy is executed only when C->N[0] == C->N[1]. So by using the if condition you save many CPU write instructions.

Huffman Decoding Sub-Table

I've been trying to implement a Huffman decoder, and my initial attempt suffered from low performance due to a sub-optimal choice of decoding algorithm.
I thought I would try to implement Huffman decoding using table lookups. However, I got a bit stuck generating the sub-tables and was hoping someone could point me in the right direction.
struct node
{
    node* children; // 0 right, 1 left
    uint8_t value;
    uint8_t is_leaf;
};

struct entry
{
    uint8_t next_table_index;
    std::vector<uint8_t> values;
    entry() : next_table_index(0) {}
};

void build_tables(node* nodes, std::vector<std::array<entry, 256>>& tables, int table_index);
void unpack_tree(void* data, node* nodes);

std::vector<uint8_t, tbb::cache_aligned_allocator<uint8_t>> decode_huff(void* input)
{
    // Initial setup
    CACHE_ALIGN node nodes[512] = {};
    auto data = reinterpret_cast<unsigned long*>(input);
    size_t table_size = *(data++);  // Size is first 32 bits.
    size_t result_size = *(data++); // Data size is second 32 bits.
    unpack_tree(data, nodes);
    auto huffman_data = reinterpret_cast<long*>(input) + (table_size + 32) / 32;
    size_t data_size = *(huffman_data++); // Size is first 32 bits.
    auto huffman_data2 = reinterpret_cast<char*>(huffman_data);

    // Build tables
    std::vector<std::array<entry, 256>> tables(1);
    build_tables(nodes, tables, 0);

    // Decode
    uint8_t current_table_index = 0;
    std::vector<uint8_t, tbb::cache_aligned_allocator<uint8_t>> result;
    while (result.size() < result_size)
    {
        auto& table = tables[current_table_index];
        uint8_t key = *(huffman_data2++);
        auto& values = table[key].values;
        result.insert(result.end(), values.begin(), values.end());
        current_table_index = table[key].next_table_index;
    }
    result.resize(result_size);
    return result;
}

void build_tables(node* nodes, std::vector<std::array<entry, 256>>& tables, int table_index)
{
    for (int n = 0; n < 256; ++n)
    {
        auto current = nodes;
        for (int i = 0; i < 8; ++i)
        {
            current = current->children + ((n >> i) & 1);
            if (current->is_leaf)
                tables[table_index][n].values.push_back(current->value);
        }
        if (!current->is_leaf)
        {
            if (current->value == 0)
            {
                current->value = tables.size();
                tables.push_back(std::array<entry, 256>());
                build_tables(current, tables, current->value);
            }
            tables[table_index][n].next_table_index = current->value;
        }
    }
}

void unpack_tree(void* data, node* nodes)
{
    node* nodes_end = nodes + 1;
    bit_reader table_reader(data);
    unsigned char n_bits = ((table_reader.next_bit() << 2) |
                            (table_reader.next_bit() << 1) |
                            (table_reader.next_bit() << 0)) & 0x7; // First 3 bits are n_bits-1.

    // Unpack huffman-tree
    std::stack<node*> stack;
    stack.push(&nodes[0]); // "nodes" is root
    while (!stack.empty())
    {
        node* ptr = stack.top();
        stack.pop();
        if (table_reader.next_bit())
        {
            ptr->is_leaf = 1;
            ptr->children = nodes[0].children;
            for (int n = n_bits; n >= 0; --n)
                ptr->value |= table_reader.next_bit() << n;
        }
        else
        {
            ptr->children = nodes_end;
            nodes_end += 2;
            stack.push(ptr->children + 0);
            stack.push(ptr->children + 1);
        }
    }
}
First off, avoid all those vectors. You can have pointers into a single preallocated buffer, but you don't want the scenario where vector allocates tiny, tiny buffers all over memory and your cache footprint goes through the roof.
Note also that the number of non-leaf states may be much less than 256. Indeed, it may be as low as 128. By assigning them low state IDs, we can avoid generating table entries for the entire set of state nodes (which may number as many as 511 in total). After all, after consuming input we never end up on a leaf node; if we do, we generate output, then head back to the root.
The first thing we should do, then, is reassign the states that correspond to internal nodes (i.e., ones with pointers out to non-leaves) to low state numbers. You can use this to also reduce memory consumption for your state transition table.
Once we've assigned these low state numbers, we can go through each possible non-leaf state and each possible input byte (i.e., a doubly-nested for loop). Traverse the tree as you would for a bit-based decoding algorithm, and record the set of output bytes, the final node ID you end up on (which must not be a leaf!), and whether you hit an end-of-stream mark.
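To make that concrete, here is a sketch of the table-filling pass (the function name and the internal-node list are hypothetical, and end-of-stream handling is omitted). It reuses the question's node and entry types and assumes non-leaf nodes have already been renumbered so that node::value holds a low state ID:
// 'internal' lists the non-leaf nodes in state-ID order; internal[0] is the root.
void build_tables_fixed(node* root, const std::vector<node*>& internal,
                        std::vector<std::array<entry, 256>>& tables)
{
    tables.resize(internal.size());
    for (std::size_t s = 0; s < internal.size(); ++s) {
        for (int n = 0; n < 256; ++n) {
            node* current = internal[s];
            entry& e = tables[s][n];
            for (int i = 0; i < 8; ++i) {
                current = current->children + ((n >> i) & 1);
                if (current->is_leaf) {
                    e.values.push_back(current->value); // emit a decoded symbol
                    current = root;                     // head back to the root
                }
            }
            e.next_table_index = current->value; // state ID; never a leaf here
        }
    }
}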

Iterative version of a recursive algorithm is slower

I'm trying to implement an iterative version of Tarjan's strongly connected components (SCC) algorithm, reproduced here for your convenience (source: http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm).
Input: Graph G = (V, E)

index = 0   // DFS node number counter
S = empty   // An empty stack of nodes
forall v in V do
    if (v.index is undefined)   // Start a DFS at each node
        tarjan(v)               // we haven't visited yet

procedure tarjan(v)
    v.index = index             // Set the depth index for v
    v.lowlink = index
    index = index + 1
    S.push(v)                   // Push v on the stack
    forall (v, v') in E do      // Consider successors of v
        if (v'.index is undefined)   // Was successor v' visited?
            tarjan(v')               // Recurse
            v.lowlink = min(v.lowlink, v'.lowlink)
        else if (v' is in S)         // Was successor v' in stack S?
            v.lowlink = min(v.lowlink, v'.lowlink)
    if (v.lowlink == v.index)   // Is v the root of an SCC?
        print "SCC:"
        repeat
            v' = S.pop
            print v'
        until (v' == v)
My iterative version uses the following Node struct.
struct Node {
int id; //Signed int up to 2^31 - 1 = 2,147,483,647
int index;
int lowlink;
Node *caller; //If you were looking at the recursive version, this is the node before the recursive call
unsigned int vindex; //Equivalent to the iterator in the for-loop in tarjan
vector<Node *> *nodeVector; //Vector of adjacent Nodes
};
Here's what I did for the iterative version:
void Graph::runTarjan(int out[]) { //You can ignore out. It's a 5-element array that keeps track of the largest 5 SCCs
int index = 0;
tarStack = new stack<Node *>();
onStack = new bool[numNodes];
for (int n = 0; n < numNodes; n++) {
if (nodes[n].index == unvisited) {
tarjan_iter(&nodes[n], index);
}
}
}
void Graph::tarjan_iter(Node *u, int &index) {
u->index = index;
u->lowlink = index;
index++;
u->vindex = 0;
tarStack->push(u);
u->caller = NULL; //Equivalent to the node from which the recursive call would spawn.
onStack[u->id - 1] = true;
Node *last = u;
while(true) {
if(last->vindex < last->nodeVector->size()) { //Equivalent to the check in the for-loop in the recursive version
Node *w = (*(last->nodeVector))[last->vindex];
last->vindex++; //Equivalent to incrementing the iterator in the for-loop in the recursive version
if(w->index == unvisited) {
w->caller = last;
w->vindex = 0;
w->index = index;
w->lowlink = index;
index++;
tarStack->push(w);
onStack[w->id - 1] = true;
last = w;
} else if(onStack[w->id - 1] == true) {
last->lowlink = min(last->lowlink, w->index);
}
} else { //Equivalent to the nodeSet iterator pointing to end()
if(last->lowlink == last->index) {
numScc++;
Node *top = tarStack->top();
tarStack->pop();
onStack[top->id - 1] = false;
int size = 1;
while(top->id != last->id) {
top = tarStack->top();
tarStack->pop();
onStack[top->id - 1] = false;
size++;
}
insertNewSCC(size); //Ranks the size among array of 5 elements
}
Node *newLast = last->caller; //Go up one recursive call
if(newLast != NULL) {
newLast->lowlink = min(newLast->lowlink, last->lowlink);
last = newLast;
} else { //We've seen all the nodes
break;
}
}
}
}
My iterative version runs and gives me the same output as the recursive version. The problem is that the iterative version is slower, and I'm not sure why. Can anyone give me some insight into my implementation? Is there a better way to implement the recursive algorithm iteratively?
A recursive algorithm uses the call stack as its storage area. In the iterative version, you use vectors, which rely on heap allocation. Stack-based allocation is known to be very fast, since it is only a matter of moving an end-of-stack pointer, whereas heap allocation may be substantially slower. That the iterative version is slower is not entirely surprising.
Generally speaking, if the problem at hand fits well within a stack-only recursive model then, by all means, recurse.
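One easy experiment along these lines (a sketch using the question's names, not tested against the original code): replace the heap-allocated std::stack, which is deque-backed and allocates in blocks, with a std::vector reserved once, so pushes and pops in the hot loop never touch the allocator.
#include <cstddef>
#include <vector>

struct Node; // as in the question

// Single up-front allocation; push_back/pop_back never allocate afterwards.
std::vector<Node *> make_tar_stack(std::size_t numNodes)
{
    std::vector<Node *> tarStack;
    tarStack.reserve(numNodes);
    return tarStack;
}

// usage: tarStack.push_back(w); Node *top = tarStack.back(); tarStack.pop_back();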