I've been trying to implement a Huffman decoder, and my initial attempt suffered from low performance due to a sub-optimal choice of decoding algorithm.
I thought I'd try to implement Huffman decoding using table lookups. However, I got a bit stuck on generating the subtables and was hoping someone could point me in the right direction.
struct node
{
    node* children; // 0 right, 1 left
    uint8_t value;
    uint8_t is_leaf;
};
struct entry
{
    uint8_t next_table_index;
    std::vector<uint8_t> values;
    entry() : next_table_index(0){}
};
void build_tables(node* nodes, std::vector<std::array<entry, 256>>& tables, int table_index);
void unpack_tree(void* data, node* nodes);
std::vector<uint8_t, tbb::cache_aligned_allocator<uint8_t>> decode_huff(void* input)
// Initial setup
CACHE_ALIGN node nodes[512] = {};
auto data = reinterpret_cast<unsigned long*>(input);
size_t table_size = *(data++); // Size is first 32 bits.
size_t result_size = *(data++); // Data size is second 32 bits.
unpack_tree(data, nodes);
auto huffman_data = reinterpret_cast<long*>(input) + (table_size+32)/32;
size_t data_size = *(huffman_data++); // Size is first 32 bits.
auto huffman_data2 = reinterpret_cast<char*>(huffman_data);
// Build tables
std::vector<std::array<entry, 256>> tables(1);
build_tables(nodes, tables, 0);
// Decode
uint8_t current_table_index = 0;
std::vector<uint8_t, tbb::cache_aligned_allocator<uint8_t>> result;
while(result.size() < result_size)
auto& table = tables[current_table_index];
uint8_t key = *(huffman_data2++);
auto& values = table[key].values;
result.insert(result.end(), values.begin(), values.end());
current_table_index = table[key].next_table_index;
return result;
void build_tables(node* nodes, std::vector<std::array<entry, 256>>& tables, int table_index)
for(int n = 0; n < 256; ++n)
auto current = nodes;
for(int i = 0; i < 8; ++i)
current = current->children + ((n >> i) & 1);
if(current->value == 0)
current->value = tables.size();
tables.push_back(std::array<entry, 256>());
build_tables(current, tables, current->value);
tables[table_index][n].next_table_index = current->value;
void unpack_tree(void* data, node* nodes)
node* nodes_end = nodes+1;
bit_reader table_reader(data);
unsigned char n_bits = ((table_reader.next_bit() << 2) | (table_reader.next_bit() << 1) | (table_reader.next_bit() << 0)) & 0x7; // First 3 bits are n_bits-1.
// Unpack huffman-tree
std::stack<node*> stack;
stack.push(&nodes[0]); // "nodes" is root
node* ptr =;
ptr->is_leaf = 1;
ptr->children = nodes[0].children;
for(int n = n_bits; n >= 0; --n)
ptr->value |= table_reader.next_bit() << n;
ptr->children = nodes_end;
nodes_end += 2;

First off, avoid all those vectors. You can have pointers into a single preallocated buffer, but you don't want the scenario where vector allocates these tiny, tiny buffers all over memory, and your cache footprint goes through the roof.
Note also that the number of non-leaf states might be much less than 256. Indeed, it might be as low as 128. By assigning them low state IDs, we can avoid generating table entries for the entire set of state nodes (which may be as high as 511 nodes in total). After all, after consuming input, we'll never end up on a leaf node; if we do, we generate output, then head back to the root.
The first thing we should do, then, is reassign those states that correspond to internal nodes (ie, ones with pointers out to non-leaves) to low state numbers. You can use this to also reduce memory consumption for your state transition table.
Once we've assigned these low state numbers, we can go through each possible non-leaf state, and each possible input byte (ie, a doubly-nested for loop). Traverse the tree as you would for a bit-based decoding algorithm, and record the set of output bytes, the final node ID you end up on (which must not be a leaf!), and whether you hit an end-of-stream mark.
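The steps above can be sketched as follows. This is a minimal sketch under assumptions, not the asker's code: TreeNode, Entry, DecodeTables and the flat "pool" buffer are hypothetical names, node 0 is assumed to be an internal root, bits are consumed least significant first (as in the question's build_tables), and the end-of-stream mark is replaced by truncating to a known result size.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tree layout: node 0 is the root; a leaf has child[0] == -1.
struct TreeNode {
    int child[2];   // indices of the children for bit 0 and bit 1
    uint8_t symbol; // only meaningful for leaves
};

struct Entry {
    uint16_t next_state; // state (internal node) reached after the byte
    uint8_t out_count;   // number of symbols emitted by this byte (0..8)
    uint32_t out_offset; // where those symbols start in the shared pool
};

struct DecodeTables {
    std::vector<std::array<Entry, 256>> rows; // one row per internal node
    std::vector<uint8_t> pool;                // all outputs, back to back
};

DecodeTables build_tables(const std::vector<TreeNode>& tree) {
    assert(!tree.empty() && tree[0].child[0] >= 0); // root must be internal
    // Give internal nodes dense state IDs; the root gets state 0.
    std::vector<int> state_of(tree.size(), -1), node_of;
    for (std::size_t i = 0; i < tree.size(); ++i)
        if (tree[i].child[0] >= 0) {
            state_of[i] = static_cast<int>(node_of.size());
            node_of.push_back(static_cast<int>(i));
        }
    DecodeTables t;
    t.rows.resize(node_of.size());
    for (std::size_t s = 0; s < node_of.size(); ++s)
        for (int byte = 0; byte < 256; ++byte) {
            Entry e{0, 0, static_cast<uint32_t>(t.pool.size())};
            int cur = node_of[s];
            for (int bit = 0; bit < 8; ++bit) { // least significant bit first
                cur = tree[cur].child[(byte >> bit) & 1];
                if (tree[cur].child[0] < 0) {   // leaf: emit, restart at root
                    t.pool.push_back(tree[cur].symbol);
                    ++e.out_count;
                    cur = 0;
                }
            }
            e.next_state = static_cast<uint16_t>(state_of[cur]);
            t.rows[s][byte] = e;
        }
    return t;
}

std::vector<uint8_t> decode(const DecodeTables& t, const std::vector<uint8_t>& in,
                            std::size_t result_size) {
    std::vector<uint8_t> out;
    out.reserve(result_size + 8);
    uint16_t state = 0;
    for (uint8_t byte : in) {
        const Entry& e = t.rows[state][byte];
        out.insert(out.end(), t.pool.begin() + e.out_offset,
                   t.pool.begin() + e.out_offset + e.out_count);
        state = e.next_state;
    }
    if (out.size() > result_size) out.resize(result_size); // drop padding output
    return out;
}
```

Every table entry refers to a slice of one shared buffer instead of owning its own vector, which is the point about avoiding many tiny allocations. Only internal nodes get a row, so the table has at most 255 rows for a 256-symbol alphabet.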


C++ permutation tree

I have tasks and I want to calculate the most profitable order to arrange them.
Instead of checking every permutation and doing n*n! calculations, I want to build a tree of permutations, that is, the number of children at each level decreases by 1, and at each node the sub-permutation that has already been calculated will be saved and not recalculated.
For example, if I have 4 tasks, the tree will look like this:
My attached code is incomplete. I don't know how to build the tree and give the nodes the indexes shown in the figure. I know how to deal with a binary tree, but not with a tree where the number of children is different at each level.
(The value of each task depends on its location.
I know how to do that, so I didn't include it in the question).
int n = 4;
struct node
int task_index = -1;
double value;
struct node **next;
void build_tree(node *current_node, int current_level = 0)
if (current_level < 1 || current_level >= n)
// current_node->task_index = ? ;
current_node->next = new node *[n - current_level];
for (int i = 0; i < n - current_level; i++)
build_tree(current_node->next[i], current_level + 1);
void print_tree(node *current_node, int current_level = 0)
// print indexes
void delete_tree(node *current_node, int current_level = 0)
// delete nodes
int main()
struct node *root = new node;
delete root;
return 0;
void build_tree(node *current_node, int current_level = 0)
if (current_level < 1 || current_level >= n)
// current_node->task_index = ? ;
current_node->next = new node *[n - current_level];
for (int i = 0; i < n - current_level; i++)
build_tree(current_node->next[i], current_level + 1);
When called with the default parameter of current_level = 0, as you illustrate in your code below, this function exits on the first line without doing anything. You need to decide whether you are indexing starting from 0 or from 1.
Other than that, the general outline of the algorithm looks okay, although I did not explicitly check for correctness.
Now, more broadly: is this an exercise to see if you can write a tree structure, or are you trying to get the job done? In the latter case you probably want to use a prebuilt data structure like the ones in the Boost Graph Library.
If it's an exercise in building a tree structure, is it specifically an exercise to see if you can write code dealing with raw pointers-to-pointers? If not, you should work with the correct C++ containers for the job. For instance you probably want to store the list of child nodes in a std::vector rather than have a pointer-to-pointer with the only way to tell how many child nodes exist being the depth of the node in the tree. (There may be some use case for such an extremely specialized structure if you are hyper-optimizing something for a very specific reason, but it doesn't look like that's what's going on here.)
From your explanation what you are trying to build is a data structure that reuses sub-trees for common permutations:
012 -> X
210 -> X
such that X is only instantiated once. This, of course, is recursive, seeing as
01 -> Y
10 -> Y
Y2 -> X
If you look at it closely, there are 2^n such subtrees, because any prefix can have any one of the n input tasks used or not. This means you can represent the subtree as an index into an array of size 2^n, with a total footprint O(n*2^n), which improves on the vastly larger >n! tree:
struct Edge {
    std::size_t task;
    std::size_t sub;
};
struct Node {
    std::vector<Edge> successor; // size in [0,n]
};
std::vector<Node> permutations; // size exactly 2^n
This will have this structure:
permutations: 0 1 2 3 4 ...
Where the node at, e.g., location 3 has both task 0 and 1 already used and "points" to all (n-2) subtrees.
Of course, building this is not entirely trivial, but it compresses the search space and allows you to reuse results for specific sub-trees.
You can build the table like this:
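for (std::size_t i = 0; i < size(permutations); ++i) {
    permutations[i].successor.reserve(n); // maybe better heuristic?
    for (std::size_t j = 0; j < n; ++j) {
        if (((std::size_t{1} << j) & i) == 0) { // task j not used yet in subset i
            permutations[i].successor.push_back(Edge{j, i | (std::size_t{1} << j)});
        }
    }
}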
Here is a live demo for n=4.
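The same 2^n indexing can also drive the search for the most profitable order directly. The sketch below is only an illustration under a loud assumption: the total profit is the sum of per-task values, and each value depends only on the task and the position it is placed at. The names best_order, Result and the value callback are hypothetical and not part of the question.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Result {
    double total;                   // best total value found
    std::vector<std::size_t> order; // task indices, first position first
};

// best[mask] is the best total for placing exactly the tasks in `mask`
// into the first popcount(mask) positions, in some order.
Result best_order(std::size_t n,
                  const std::function<double(std::size_t, std::size_t)>& value) {
    assert(n > 0 && n < 20); // the table has 2^n entries
    const std::size_t count = std::size_t{1} << n;
    std::vector<double> best(count, 0.0);
    std::vector<std::size_t> last(count, 0); // task placed last to reach mask
    for (std::size_t mask = 1; mask < count; ++mask) {
        std::size_t used = 0;
        for (std::size_t j = 0; j < n; ++j) used += (mask >> j) & 1;
        bool first = true;
        for (std::size_t j = 0; j < n; ++j) {
            if (((mask >> j) & 1) == 0) continue;
            // Sub-result for the same set without task j; computed earlier
            // because prev < mask.
            const std::size_t prev = mask & ~(std::size_t{1} << j);
            const double candidate = best[prev] + value(j, used - 1);
            if (first || candidate > best[mask]) {
                best[mask] = candidate;
                last[mask] = j;
                first = false;
            }
        }
    }
    // Walk back from the full set to recover one optimal order.
    Result r{best[count - 1], std::vector<std::size_t>(n)};
    std::size_t mask = count - 1;
    for (std::size_t pos = n; pos-- > 0;) {
        r.order[pos] = last[mask];
        mask &= ~(std::size_t{1} << last[mask]);
    }
    return r;
}
```

Each of the 2^n subsets is evaluated once with at most n candidates, so the work is O(n*2^n) rather than n*n!. If the value of a task depends on more than its position and the set of tasks before it, this shortcut no longer applies.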
The recursive way to generate permutations is this: if you have n items, then the permutations of those items are each of the n items followed by the permutations of the remaining n-1 items. In code this is easier to do if you pass around the collection of items.
Below I do it with a std::vector<int>. Once you are using a vector, it makes more sense to follow the "rule of zero" pattern and let the nodes hold vectors of children, so that nothing needs to be allocated manually:
#include <vector>
#include <algorithm>
#include <iostream>

struct node
{
    int task_index = -1;
    double value;
    std::vector<node> next;
};

// Returns a copy of items without item (item is assumed to occur exactly once).
std::vector<int> remove_item(int item, const std::vector<int>& items) {
    std::vector<int> output(items.size() - 1);
    std::copy_if(items.begin(), items.end(), output.begin(),
        [item](auto v) {return v != item; }
    );
    return output;
}

void build_tree(node& current_node, const std::vector<int>& tasks)
{
    auto n = static_cast<int>(tasks.size());
    for (auto curr_task : tasks) {
        node child{ curr_task, 0.0, {} };
        if (n > 1) {
            build_tree(child, remove_item(curr_task, tasks));
        }
        current_node.next.push_back(child);
    }
}

void print_tree(const node& current_node)
{
    std::cout << "( " << current_node.task_index << " ";
    for (const auto& child : current_node.next) {
        print_tree(child);
    }
    std::cout << " )";
}

int main()
{
    node root{ -1, 0.0, {} };
    build_tree(root, { 1, 2, 3 });
    print_tree(root);
    return 0;
}

Getting a floating point exception error while doing text frequency analysis?

So for a school project, we are being asked to do a word frequency analysis of a text file using dictionaries and bucket hashing. The output should be something like this:
$ ./stats < jabberwocky.txt
READING text from STDIN. Hit ctrl-d when done entering text.
HERE are the word statistics of that text:
There are 94 distinct words used in that text.
The top 10 ranked words (with their frequencies) are:
1. the:19, 2. and:14, 3. !:11, 4. he:7, 5. in:6, 6. .:5, 7.
through:3, 8. my:3, 9. jabberwock:3, 10. went:2
Among its 94 words, 57 of them appear exactly once.
Most of the code has been written for us, but there are four functions we need to complete to get this working:
increment(dict D, std::str w) which will increment the count of a word or add a new entry in the dictionary if it isn't there,
getCount(dict D, std::str w) which fetches the count of a word or returns 0,
dumpAndDestroy(dict D) which dumps the words and counts of those words into a new array by decreasing order of count and deletes D's buckets off the heap, and returns the pointer to that array,
rehash(dict D, std::str w) which rehashes the table when needed.
The structs used are here for reference:
// entry
// A linked list node for word/count entries in the dictionary.
struct entry {
std::string word; // The word that serves as the key for this entry.
int count; // The integer count associated with that word.
struct entry* next;
// bucket
// A bucket serving as the collection of entries that map to a
// certain location within a bucket hash table.
struct bucket {
entry* first; // It's just a pointer to the first entry in the
// bucket list.
// dict
// The unordered dictionary of word/count entries, organized as a
// bucket hash table.
struct dict {
bucket* buckets; // An array of buckets, indexed by the hash function.
int numIncrements; // Total count over all entries. Number of `increment` calls.
int numBuckets; // The array is indexed from 0 to numBuckets.
int numEntries; // The total number of entries in the whole
// dictionary, distributed amongst its buckets.
int loadFactor; // The threshold maximum average size of the
// buckets. When numEntries/numBuckets exceeds
// this loadFactor, the table gets rehashed.
I've written these functions, but when I try to run it with a text file, I get a Floating point exception error. I've emailed my professor for help, but he hasn't replied. This project is due very soon, so help would be much appreciated! My written functions for these are as below:
int getCount(dict* D, std::string w) {
int stringCount;
int countHash = hashValue(w, numKeys(D));
bucket correctList = D->buckets[countHash];
entry* current = correctList.first;
while (current != nullptr && current->word < w) {
if (current->word == w) {
stringCount = current->count;
current = current->next;
std::cout << "getCount working" << std::endl;
return stringCount;
void rehash(dict* D) {
int newSize = (D->numBuckets * 2) + 1;
bucket** newArray = new bucket*[newSize];
for (int i = 0; i < D->numBuckets; i++) {
entry *n = D->buckets->first;
while (n != nullptr) {
entry *tmp = n;
n = n->next;
int newHashValue = hashValue(tmp->word, newSize);
newArray[newHashValue]->first = tmp;
delete [] D->buckets;
D->buckets = *newArray;
std::cout << "rehash working" << std::endl;
void increment(dict* D, std::string w) {
int incrementHash = hashValue(w, numKeys(D));
entry* current = D->buckets[incrementHash].first;
if (current == nullptr) {
int originalLF = D->loadFactor;
if ((D->numEntries + 1)/(D->numBuckets) > originalLF) {
int incrementHash = hashValue(w, numKeys(D));
D->buckets[incrementHash].first->word = w;
while (current != nullptr && current->word < w) {
entry* follow = current;
current = current->next;
if (current->word == w) {
std::cout << "increment working" << std::endl;
entry* dumpAndDestroy(dict* D) {
entry* es = new entry[D->numEntries];
for (int i = 0; i < D->numEntries; i++) {
es[i].word = "foo";
es[i].count = 0;
for (int j = 0; j < D->numBuckets; j++) {
entry* current = D->buckets[j].first;
while (current != nullptr) {
es[j].word = current->word;
es[j].count = current->count;
current = current->next;
delete [] D->buckets;
std::cout << "dumpAndDestroy working" << std::endl;
return es;
A floating-point exception is usually caused by the code attempting to divide-by-zero (or attempting to modulo-by-zero, which implicitly causes a divide-by-zero). With that in mind, I suspect this line is the locus of your problem:
if ((D->numEntries + 1)/(D->numBuckets) > originalLF) {
Note that if D->numBuckets is equal to zero, this line will do a divide-by-zero. I suggest temporarily inserting a line like
std::cout << "about to divide by " << D->numBuckets << std::endl;
just before that line, and then re-running your program; that will make the problem apparent, assuming it is the problem. The solution, of course, is to make sure your code doesn't divide by zero (i.e. by setting D->numBuckets to the appropriate value, or alternatively by checking whether it is zero before trying to use it as a divisor).
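The second option, checking before dividing, could look like the sketch below. DictCounts and needs_rehash are hypothetical names that only mirror the three dict fields involved; treating an empty table as "needs rehash" is an assumption about how the rest of the program wants to behave.

```cpp
#include <cassert>

// Hypothetical stand-in for the fields of `dict` that matter here.
struct DictCounts {
    int numBuckets;
    int numEntries;
    int loadFactor;
};

// Returns true when adding one more entry would push the average bucket
// size above loadFactor. A table with no buckets always needs growing,
// so that case is answered without dividing.
bool needs_rehash(const DictCounts& d) {
    assert(d.numBuckets >= 0);
    if (d.numBuckets == 0) return true;
    return (d.numEntries + 1) / d.numBuckets > d.loadFactor;
}
```

The same kind of check applies before any modulo by the bucket count, since a modulo by zero fails in the same way.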

disruptor: claiming slots in the face of integer overflow

I am implementing the disruptor pattern for inter-thread communication in C++ for multiple producers. In the implementation from LMAX, the next(n) method uses signed integers (okay, it's Java), but the C++ port (disruptor--) uses signed integers as well. After a very (very) long time, overflow will result in undefined behavior.
Unsigned integers have multiple advantages:
correct behavior on overflow
no need for 64 bit integers
Here is my approach for claiming n slots (the source is attached at the end): next_ is the index of the next free slot that can be claimed, and tail_ is the last free slot that can be claimed (it is updated somewhere else). n is smaller than the buffer size. My approach is to normalize the next and tail positions for intermediate calculations by subtracting tail from next, giving norm. For the slots between next_ and next_+n to be claimed successfully, norm + n must not exceed the buffer size. It is assumed that norm + n will not overflow.
1) Is it correct, or does next_ get past tail_ in some cases? Does it work with smaller integer types like uint32_t or uint16_t, provided the buffer size and n are restricted, e.g. to 1/10 of the maximum value of these types?
2) If it is not correct then I would like to know the concrete case.
3) Is something else wrong or what can be improved? (I omitted the cacheline padding)
class msg_ctrl
inline msg_ctrl();
inline int claim(size_t n, uint64_t& seq);
inline int publish(size_t n, uint64_t seq);
inline int tail(uint64_t t);
std::atomic<uint64_t> next_;
std::atomic<uint64_t> head_;
std::atomic<uint64_t> tail_;
// Implementation -----------------------------------------
msg_ctrl::msg_ctrl() : next_(2), head_(1), tail_(0)
int msg_ctrl::claim(size_t n, uint64_t& seq)
uint64_t const size = msg_buffer::size();
if (n > 1024) // please do not try to reserve too much slots
return -1;
uint64_t curr = 0;
curr = next_.load();
uint64_t tail = tail_.load();
uint64_t norm = curr - tail;
uint64_t next = norm + n;
if (next > size)
std::this_thread::yield(); // todo: some wait strategy
else if (next_.compare_exchange_weak(curr, curr + n))
} while (true);
seq = curr;
return 0;
int msg_ctrl::publish(size_t n, uint64_t seq)
uint64_t tmp = seq-1;
uint64_t val = seq+n-1;
while (!head_.compare_exchange_weak(tmp, val))
tmp = seq-1;
return 0;
int msg_ctrl::tail(uint64_t t)
return 0;
Publishing to the ring buffer will look like:
size_t n = 15;
uint64_t seq = 0;
msg_ctrl->claim(n, seq);
//fill items in buffer
buffer[seq + 0] = an item
buffer[seq + 1] = an item
buffer[seq + n-1] = an item
msg_ctrl->publish(n, seq);
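The normalization step on its own can be tried out with a small unsigned type, where wraparound is quick to reach. This is only a sketch of that one calculation, not of the full lock-free claim loop; can_claim is a hypothetical helper, and it assumes the true distance between next and tail always fits in the type.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Returns true if n slots starting at `next` fit between `tail` and
// `tail + size`, using modulo-2^N arithmetic on the unsigned type T.
template <typename T>
bool can_claim(T next, T tail, T n, T size) {
    static_assert(std::is_unsigned<T>::value, "T must be unsigned");
    assert(n <= size);
    // For types narrower than int the subtraction is carried out in int and
    // may be negative, so cast back to T to get the wrapped distance.
    const T norm = static_cast<T>(next - tail);
    // Compare against the remaining room instead of computing norm + n,
    // which avoids relying on that sum not overflowing.
    return norm <= size && static_cast<T>(size - norm) >= n;
}
```

With uint16_t, a tail of 65530 and a wrapped next of 4 give a distance of 10, so a claim succeeds when n is at most size - 10. Note the explicit cast: without it, integer promotion makes the uint16_t difference a negative int rather than a wrapped value.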

Why does random extra code improve performance?

struct Node {
Node *N[SIZE];
int value;
struct Trie {
Node *root;
Node* findNode(Key *key) {
Node *C = root;
char u;
while (1) {
u = key->next();
if (u < 0) return C;
// if (C->N[0] == C->N[0]); // this line will speed up execution significantly
C = C->N[u];
if (C == 0) return 0;
void addNode(Key *key, int value){...};
In this implementation of a prefix tree (aka trie) I found out that 90% of the findNode() execution time is taken by a single operation: C=C->N[u];
In my attempt to speed up this code, I randomly added the line that is commented out in the snippet above, and the code became 30% faster! Why is that?
Here is the complete program.
#include "stdio.h"
#include "sys/time.h"
long time1000() {
timeval val;
gettimeofday(&val, 0);
val.tv_sec &= 0xffff;
return val.tv_sec * 1000 + val.tv_usec / 1000;
struct BitScanner {
void *p;
int count, pos;
BitScanner (void *p, int count) {
this->p = p;
this->count = count;
pos = 0;
int next() {
int bpos = pos >> 1;
if (bpos >= count) return -1;
unsigned char b = ((unsigned char*)p)[bpos];
if (pos++ & 1) return (b >>= 4);
return b & 0xf;
struct Node {
Node *N[16];
__int64_t value;
Node() : N(), value(-1) { }
struct Trie16 {
Node root;
bool add(void *key, int count, __int64_t value) {
Node *C = &root;
BitScanner B(key, count);
while (true) {
int u =;
if (u < 0) {
if (C->value == -1) {
C->value = value;
return true; // value added
C->value = value;
return false; // value replaced
Node *Q = C->N[u];
if (Q) {
C = Q;
} else {
C = C->N[u] = new Node;
Node* findNode(void *key, int count) {
Node *C = &root;
BitScanner B(key, count);
while (true) {
char u =;
if (u < 0) return C;
// if (C->N[0] == C->N[1]);
C = C->N[0+u];
if (C == 0) return 0;
int main() {
int T = time1000();
Trie16 trie;
__int64_t STEPS = 100000, STEP = 500000000, key;
key = 0;
for (int i = 0; i < STEPS; i++) {
key += STEP;
bool ok = trie.add(&key, 8, key+222);
printf("insert time:%i\n",time1000() - T); T = time1000();
int err = 0;
key = 0;
for (int i = 0; i < STEPS; i++) {
key += STEP;
Node *N = trie.findNode(&key, 8);
if (N==0 || N->value != key+222) err++;
printf("find time:%i\n",time1000() - T); T = time1000();
printf("errors:%i\n", err);
This is largely a guess, but from what I have read about CPU data prefetchers, one will only prefetch if it sees multiple accesses to the same area of memory and those accesses match its prefetch triggers, for example when they look like a scan. In your case, if there is only a single access to C->N, the prefetcher would not be interested; however, if there are multiple accesses and it can predict that the later one is further into the same area of memory, that can make it prefetch more than one cache line.
If that is what is happening, then C->N[u] would not have to wait for memory to arrive from RAM and would therefore be faster.
It looks like what you are doing is preventing processor stalls by delaying the execution of code until the data is available locally.
Doing it this way is very error-prone and unlikely to continue working consistently. The better way is to get the compiler to do this. By default most compilers generate code for a generic processor family, but if you look at the available flags you can usually find ones for specifying your specific processor, so the compiler can generate more specific code (like prefetches and stall code).
See: GCC: how is march different from mtune? The second answer goes into some detail.
A write operation is costlier than a read.
Here, if you look at
C = C->N[u]; it means the CPU is executing a write to the variable C in each iteration.
But when you perform if (C->N[0] == C->N[1]) dummy++; the write to dummy is executed only if C->N[0] == C->N[1]. So you have saved many CPU write instructions by using the if condition.

Why dynamic memory allocation is not linear in scale?

I am investigating data structures to satisfy O(1) get operations and came across a structure called Trie.
I have implemented the below simple Trie structure to hold numbers (digits only).
Ignore the memory leak - it is not the topic here :)
The actual storage in the Data class is not related as well.
#include <sstream>
#include <string>
struct Data
Data(): m_nData(0){}
int m_nData;
struct Node
Node(): m_pData(NULL)
for (size_t n = 0; n < 10; n++)
digits[n] = NULL;
void m_zAddPartialNumber(std::string sNumber)
if (sNumber.empty() == true) // last digit
m_pData = new Data;
m_pData->m_nData = 1;
size_t nDigit = *(sNumber.begin()) - '0';
if (digits[nDigit] == NULL)
digits[nDigit] = new Node;
digits[nDigit]->m_zAddPartialNumber(sNumber.substr(1, sNumber.length() - 1));
Data* m_pData;
Node* digits[10];
struct DB
DB() : root(NULL){}
void m_zAddNumber(std::string sNumber)
if (root == NULL)
root = new Node;
Node* root;
int main()
for (size_t nNumber = 0; nNumber <= 10000; nNumber++)
std::ostringstream convert;
convert << nNumber;
std::string sNumber = convert.str();
return 0;
My main function is simply inserting numbers into the data structure.
I've examined the overall memory allocated using Windows Task Manager and came across an interesting feature I can't explain, and I am seeking your advice.
I've re-executed my simple program with different numbers inserted to the structure (altering the for loop stop condition) - here is a table of the experiment results:
Plotting the numbers in a logarithmic scaled graph reveals:
As you can see, the graph is not linear.
My question is why?
I would expect the allocation to behave linearly across the range.
A linear relation of y on x is of the form y=a+bx. This is a straight line in a y vs x plot, but not in a log(y) vs log(x) plot, unless the constant a=0. So, I conjecture that your relation may still be (nearly) linear with a~340 kB.
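To see the effect numerically, the sketch below computes the local slope of log(y) against log(x) for y = a + b*x. The helper name loglog_slope is hypothetical, and any value of b you feed it is an illustrative assumption; only a ~ 340 kB comes from the conjecture above.

```cpp
#include <cassert>
#include <cmath>

// Local slope of log(y) against log(x) between x0 and x1 for y = a + b*x.
double loglog_slope(double a, double b, double x0, double x1) {
    assert(x0 > 0 && x1 > x0 && a + b * x0 > 0);
    const double y0 = a + b * x0;
    const double y1 = a + b * x1;
    return (std::log(y1) - std::log(y0)) / (std::log(x1) - std::log(x0));
}
```

With a = 340 and, say, b = 0.1, the slope is close to 0 for small x, where the constant dominates, and approaches 1 for large x, so the log-log plot bends even though y is exactly linear in x. With a = 0 the slope is 1 everywhere and the log-log plot is a straight line.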