array traversal vs pointer, cache efficiency aspect - c++

void foo(Node* p[], int size){
    _uint64 arr_of_values[_MAX_THREADS];
    for (int i = 0; i < size; i++){
        arr_of_values[i] = p[i]->....;
        // much code here
    }
}

void foo(Node* p[], int size){
    _uint64 arr_of_values[_MAX_THREADS];
    Node** p_end = p + size; // one past the last pointer in the array
    for (int i = 0; p != p_end; ++p, ++i){
        arr_of_values[i] = (*p)->.....;
        // much code here
    }
}
I created this function to demonstrate what I am asking:
Which is more efficient from the cache-efficiency standpoint: taking p[i] or using *p++?
(I'll never use p[i-x] in the rest of the code, but I may use p[i] or *p in the following calculation.)

The most important thing is to avoid false sharing in arr_of_values. Each thread writes into its own slot, but 8 or 16 slots share a cache line (depending on the CPU), causing a massive false sharing problem. Add padding between the slots to cache-align each thread's slot, or accumulate on the stack and write only once at the end:
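void foo(Node* p[], int size){
    _uint64 arr_of_values[_MAX_THREADS];
    _uint64 temp = 0; // local accumulator, lives on the stack
    Node** p_end = p + size;
    for ( ; p != p_end; ++p){
        temp = .....;
        // much code here
    }
    arr_of_values[i] = temp; // single write to this thread's slot
}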
The question of access by pointer or access by index is irrelevant with today's compilers.
Your next actions should be: grab a copy of The Software Optimization Cookbook. Read it. Measure. Fix the measured hotspot, not the guesstimates.
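The padding advice above could look roughly like the sketch below. It assumes 64-byte cache lines, and the names kCacheLineSize, kMaxThreads, PaddedSlot, g_slots, and accumulate are made up for illustration; they are not from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines; adjust for the target CPU.
constexpr std::size_t kCacheLineSize = 64;
constexpr int kMaxThreads = 8;  // hypothetical stand-in for _MAX_THREADS

// alignas pads every slot to a full cache line, so two threads never
// write to the same line.
struct alignas(kCacheLineSize) PaddedSlot {
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedSlot) == kCacheLineSize, "one slot per cache line");

PaddedSlot g_slots[kMaxThreads];

// Sums n values into a local variable and touches the shared slot once.
void accumulate(int thread_id, const std::uint64_t* data, int n) {
    assert(thread_id >= 0 && thread_id < kMaxThreads);
    std::uint64_t local = 0;  // stays in a register or on this thread's stack
    for (int i = 0; i < n; ++i) {
        local += data[i];
    }
    g_slots[thread_id].value = local;  // the only write to shared memory
}
```

This combines both suggestions (padding and a single final write); either one alone should already remove most of the false sharing, but measure before and after.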

The problem from a cache point of view isn't the way you are accessing the elements. In this case, using a pointer or an array index is equivalent.
BTW, Node* p[] is an array of pointers, so your Node objects may have been allocated in distant memory areas (for example, by several separate ptr = new Node() calls). You get the best cache performance if:
Your Node objects are stored contiguously in memory
Node size doesn't exceed the cache size.


C++ inserting (and shifting) data into an array

I am trying to insert data into a leaf node (an array) of a B-Tree. Here is the code I have so far:
void LeafNode::insertCorrectPosLeaf(int num)
{
  for (int pos=count; pos>=0; pos--) // goes through values in leaf node
  {
    if (num < values[pos-1]) // if inserting num < previous value in leaf node
      {continue;} // continue searching for correct place
    else // if inserting num >= previous value in leaf node
      values[pos] = num; // inserts in position
  }
} // insertCorrectPos()
Before the line values[pos] = num, I think I need to write some code that shifts the existing data instead of overwriting it. I am trying to use memmove but have a question about it. Its third parameter is the number of bytes to copy. If I am moving a single int on a 64-bit machine, does this mean I would put a "4" here? If I am going about this completely wrong, any help would be greatly appreciated. Thanks
The easiest way (and probably the most efficient) would be to use one of the standard library's predefined containers to implement "values". I suggest either list or vector, because both have an insert function that does the shifting for you. I suggest the vector class specifically because it has the same kind of interface that an array has. However, if you want to optimize for the speed of this action specifically, then I would suggest the list class because of the way it is implemented.
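As a rough sketch of the vector suggestion, assuming the leaf's values are kept sorted in ascending order (the helper name insertSorted is made up for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Inserts num into an ascending vector and keeps it sorted.
// std::upper_bound finds the first element greater than num, and
// vector::insert shifts the tail one slot to the right for us.
void insertSorted(std::vector<int>& values, int num) {
    assert(std::is_sorted(values.begin(), values.end()));
    values.insert(std::upper_bound(values.begin(), values.end(), num), num);
}
```

Calling insertSorted on {1, 3, 5} with 4 leaves the vector as {1, 3, 4, 5}.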
If you would rather do it the hard way, then here goes...
First, you need to make sure that you have the space to work in. You can either allocate dynamically:
int *values = new int[size];
or statically
int values[MAX_SIZE];
If you allocate statically, then you need to make sure that MAX_SIZE is some gigantic value that you will never ever exceed. Furthermore, you need to check the actual size of the array against the amount of allocated space every time you add an element.
if (size < MAX_SIZE-1)
// add an element
If you allocate dynamically, then you need to reallocate the whole array every time you add an element.
int *temp = new int[size+1];
for (int i = 0; i < size; i++)
temp[i] = values[i];
delete [] values;
values = temp;
temp = NULL;
// add the element
When you insert a new value, you need to shift every value over.
int temp = 0;
for (int i = 0; i < size+1; i++)
{
    if (values[i] > num || i == size)
    {
        // put num here and carry the displaced value forward
        temp = values[i];
        values[i] = num;
        num = temp;
    }
}
Keep in mind that this is not at all optimized. A truly magical implementation would combine the two allocation strategies by dynamically allocating more space than you need, then growing the array by blocks when you run out of space. This is exactly what the vector implementation does.
The list implementation uses a linked list, which has O(1) time for inserting a value once the insertion point is known, because of its structure. However, it is much less space efficient and has O(n) time for accessing the element at location n.
Also, this code was written on the fly... be careful when using it. There might be a weird edge case that I am missing in the last code segment.
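On the memmove question itself: the third argument is a byte count, so it should be the number of elements to move times sizeof(int), rather than a hard-coded 4 (int is commonly 4 bytes even on 64-bit machines, but sizeof avoids relying on that). A minimal sketch, assuming a sorted plain array with room for one more element; the function name and signature are made up for illustration:

```cpp
#include <cassert>
#include <cstring>

// Inserts num into the ascending array values[0..count) and returns the
// new element count. Assumption: the caller guarantees that the array
// has capacity for at least count + 1 elements.
int insertSorted(int* values, int count, int num) {
    assert(values != nullptr && count >= 0);
    int pos = 0;
    while (pos < count && values[pos] <= num) {
        ++pos;  // find the first element greater than num
    }
    // Shift values[pos..count) one slot to the right. memmove handles
    // the overlapping source and destination ranges correctly.
    std::memmove(values + pos + 1, values + pos, (count - pos) * sizeof(int));
    values[pos] = num;
    return count + 1;
}
```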

How can I improve the performance of my ring buffer code?

I am using a ringbuffer to hold samples for a streaming audio application. I copied the ringbuffer implementation from Ken Greenebaum's Audio Anecdotes 2 book.
After running Intel's VTune analyzer on my code, it tells me that most of the time is being spent in the functions getSamplesAvailable() and getSpaceAvailable().
Can anyone advise as to how I might optimise these functions?
unsigned int RingBuffer::getSamplesAvailable(void)
{
    int count = (mTail - mHead + mSize) % mSize;
    return count;
}

unsigned int RingBuffer::getSpaceAvailable(void)
{
    int free = (mHead - mTail + mSize - 1)%mSize;
    int underMark = mHighWaterMark - getSamplesAvailable();
    int spaceAvailable = min(underMark, free);
    return spaceAvailable;
}

int RingBuffer::push(int value)
{
    int status = 1;
    if(getSpaceAvailable()) {
        // next two operations do NOT have to be atomic!
        // do NOT have to worry about collision with _tail
        mBuffer[mTail] = value; // store value
        mTail = (mTail + 1) % mSize; // increment tail
    } else {
        status = 0;
    }
    return status;
}

int RingBuffer::pop(int *value)
{
    int status = 1;
    if(getSamplesAvailable()) {
        *value = mBuffer[mHead];
        mHead = (mHead + 1) % mSize; // increment head
    } else {
        status = 0;
    }
    return status;
}
If you can make mSize a power of two, you can replace
(mTail - mHead + mSize) % mSize
with
(mTail - mHead) & (mSize-1)
and
(mHead - mTail + mSize - 1) % mSize
with
(mHead - mTail - 1) & (mSize - 1)
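A small sketch to check that the masked forms agree with the modulo forms. It assumes unsigned 32-bit indices (so the subtraction wraps in a well-defined way) and a power-of-two size; the free-function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr bool isPowerOfTwo(std::uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Original modulo forms.
std::uint32_t samplesModulo(std::uint32_t head, std::uint32_t tail, std::uint32_t size) {
    return (tail - head + size) % size;
}
std::uint32_t spaceModulo(std::uint32_t head, std::uint32_t tail, std::uint32_t size) {
    return (head - tail + size - 1) % size;
}

// Masked forms; only valid when size is a power of two.
std::uint32_t samplesMasked(std::uint32_t head, std::uint32_t tail, std::uint32_t size) {
    assert(isPowerOfTwo(size));
    return (tail - head) & (size - 1);
}
std::uint32_t spaceMasked(std::uint32_t head, std::uint32_t tail, std::uint32_t size) {
    assert(isPowerOfTwo(size));
    return (head - tail - 1) & (size - 1);
}
```

The mask replaces an integer division with a single AND, which is where the saving comes from.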
I think the problem is not their complexity, they are just basic integer arithmetic, but how many times they are called.
Is there any possibility of doing "batch" (inserting or retrieving various values at once) updates on the buffer? That way you could save some calculations.
Using a power of two as Henrik proposed is the first thing to do. There is also the possibility of changing the way you code the mTail and mHead indexes. Instead of keeping them in the [0, mSize) range, you can let them run freely as uint32_t.
When accessing an element you will then need to take the index modulo mSize, which will slow down each access.
mBuffer[mTail % mSize] = value;
But it will simplify, for instance, the count of samples (even if your indexes wrap over the uint32_t max value):
int count = mTail - mHead;
It will also allow you to fully use the ring buffer, instead of losing one element to differentiate the cases where the buffer is full or empty.
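A minimal sketch of the free-running index idea follows. It assumes a power-of-two capacity: that keeps the masked index consistent when the counters wrap past the uint32_t maximum, and it turns the per-access modulo into a mask. The class name, kSize, and the start parameter are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Ring buffer whose head and tail counters run freely and are only
// reduced to a slot index at access time.
class FreeRunningRing {
public:
    static constexpr std::uint32_t kSize = 8;  // assumed power of two
    static_assert((kSize & (kSize - 1)) == 0, "kSize must be a power of two");

    // start only exists so a test can begin near the wrap-around point.
    explicit FreeRunningRing(std::uint32_t start = 0) : head_(start), tail_(start) {}

    // Unsigned subtraction wraps, so this stays correct across overflow.
    std::uint32_t count() const { return tail_ - head_; }

    bool push(int value) {
        if (count() == kSize) return false;  // full: all kSize slots are usable
        buffer_[tail_ & (kSize - 1)] = value;
        ++tail_;
        return true;
    }

    bool pop(int* value) {
        assert(value != nullptr);
        if (count() == 0) return false;  // empty
        *value = buffer_[head_ & (kSize - 1)];
        ++head_;
        return true;
    }

private:
    int buffer_[kSize] = {};
    std::uint32_t head_;
    std::uint32_t tail_;
};
```

Full and empty are distinguished by count() alone, so no slot has to be sacrificed.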
If speed is the most important thing for you, and you can live with the fact that it is a) non-portable (Windows only, although Linux has the same basic functionality, so it should work there as well) and b) only works in release builds (this has more to do with how VC++ allocates memory in debug mode; there is probably some compile flag for this), you can use the following:
DWORD size = 64 * 1024; // HAS to be a multiple of 64k due to how win allocates memory
HANDLE mapped_memory = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, size, NULL);
int *p1 = (int*)MapViewOfFile(mapped_memory, FILE_MAP_WRITE, 0, 0, size);
int *p2 = (int*)MapViewOfFile(mapped_memory, FILE_MAP_WRITE, 0, 0, size);
// p1 and p2 should be adjacent in memory, if not try again.. no idea if there's some
// better method under windows
Basically you now have two adjacent memory blocks in virtual memory that point to the same physical memory. I.e., if you write through p1 you'll see the changes in p2 and vice versa.
The advantage is that you can now read and write the buffer more efficiently, and in larger amounts than only one word at a time. You just have to decrement the pointers correctly, which shouldn't be too hard to implement.
Edit: I now see that there's even a POSIX implementation on wiki.

Branchless memory manager?

Anyone thought about how to write a memory manager (in C++) that is completely branch free? I've written a pool, a stack, a queue, and a linked list (allocating from the pool), but I am wondering how plausible it is to write a branch free general memory manager.
This is all to help make a really reusable framework for doing solid concurrent, in-order CPU, and cache friendly development.
Edit: by branchless I mean without doing direct or indirect function calls, and without using ifs. I've been thinking that I can probably implement something that first changes the requested size to zero for false calls, but haven't really got much more than that.
I feel that it's not impossible, but the other aspect of this exercise is then profiling it on said "unfriendly" processors to see if it's worth trying as hard as this to avoid branching.
While I don't think this is a good idea, one solution would be to have pre-allocated buckets of various log2 sizes, stupid pseudocode:
class Allocator {
public:
    void* malloc(size_t size) {
        int bucket = log2(size + sizeof(int));
        int* pointer = reinterpret_cast<int*>(m_buckets[bucket].back());
        m_buckets[bucket].pop_back(); //Remove the block from its free list
        *pointer = bucket; //Store which bucket this was allocated from
        return pointer + 1; //Dont overwrite header
    }
    void free(void* pointer) {
        int* temp = reinterpret_cast<int*>(pointer) - 1;
        m_buckets[*temp].push_back(temp); //Return the block to its bucket
    }
private:
    vector< vector<void*> > m_buckets;
};
(You would of course also replace the std::vector with a simple array + counter).
EDIT: In order to make this robust (i.e. handle the situation where the bucket is empty) you would have to add some form of branching.
EDIT2: Here's a small branchless log2 function:
//returns the smallest x such that value <= (1 << x)
int log2(int value) {
    union Foo {
        int x;
        float y;
    } foo;
    foo.y = value - 1;
    return ((foo.x & (0xFF << 23)) >> 23) - 126; //Extract exponent (base 2) of floating point number
}
This gives the correct result for allocations < 33554432 bytes. If you need larger allocations you'll have to switch to doubles.
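Reading a union member other than the one last written is formally undefined behavior in C++ (it is permitted in C), so a variant of the same trick using std::memcpy may be safer. The sketch below assumes a 32-bit IEEE 754 float and inputs in the range 2 <= value < 33554432; the name ceilLog2 is made up for illustration. Compilers typically turn the fixed-size memcpy into a plain register move, but check the generated code if branch-freedom matters.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the smallest x such that value <= (1 << x).
inline int ceilLog2(int value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    assert(value >= 2 && value < 33554432);
    float f = static_cast<float>(value - 1);
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the raw bit pattern of f
    // Bits 23..30 hold the biased exponent (bias 127); subtracting 126
    // instead of 127 rounds up to the next power of two.
    return static_cast<int>((bits >> 23) & 0xFF) - 126;
}
```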
Here's a link to how floating point numbers are represented in memory.
The only way I know to create a truly branchless allocator is to reserve all the memory it will potentially use in advance. Otherwise there's always going to be some hidden code somewhere to see if we're exceeding some current capacity whether it's in a hidden push_back in a vector checking if the size exceeds capacity used to implement it or something of that sort.
Here is one such crude example of a fixed alloc which has a completely branchless malloc and free method.
class FixedAlloc
{
public:
    FixedAlloc(int element_size, int num_reserve)
    {
        // every element must be able to hold a Chunk (the free-list link)
        element_size = max(element_size, static_cast<int>(sizeof(Chunk)));
        mem = new char[num_reserve * element_size];

        // thread all elements into a singly linked free list
        char* ptr = mem;
        free_chunk = reinterpret_cast<Chunk*>(ptr);
        free_chunk->next = 0;

        Chunk* last_chunk = free_chunk;
        for (int j=1; j < num_reserve; ++j)
        {
            ptr += element_size;
            Chunk* chunk = reinterpret_cast<Chunk*>(ptr);
            chunk->next = 0;
            last_chunk->next = chunk;
            last_chunk = chunk;
        }
    }

    ~FixedAlloc()
    {
        delete[] mem;
    }

    void* malloc()
    {
        assert(free_chunk && free_chunk->next && "Reserve memory exhausted!");
        Chunk* chunk = free_chunk;
        free_chunk = free_chunk->next;
        return chunk->mem;
    }

    void free(void* mem)
    {
        Chunk* chunk = static_cast<Chunk*>(mem);
        chunk->next = free_chunk;
        free_chunk = chunk;
    }

private:
    union Chunk
    {
        Chunk* next;
        char mem[1];
    };
    char* mem;
    Chunk* free_chunk;
};
Since it's totally branchless, it simply segfaults if you try to allocate more memory than initially reserved. It also has undefined behavior for trying to free a null pointer. I also avoided dealing with alignment for the sake of a simpler example.

lock free arena allocator implementation - correct?

For a simple pointer-increment allocator (do they have an official name?) I am looking for a lock-free algorithm. It seems trivial, but I'd like to get some feedback on whether my implementation is correct.
Not thread-safe implementation:
byte * head; // current head of remaining buffer
byte * end; // end of remaining buffer

void * Alloc(size_t size)
{
    if (end-head < size)
        return 0; // allocation failure

    void * result = head;
    head += size;
    return result;
}
My attempt at a thread safe implementation:
void * Alloc(size_t size)
{
    byte * current;
    do
    {
        current = head;
        if (end - current < size)
            return 0; // allocation failure
    } while (CMPXCHG(&head, current+size, current) != current);
    return current;
}
where CMPXCHG is an interlocked compare exchange with (destination, exchangeValue, comparand) arguments, returning the original value
Looks good to me - if another thread allocates between the get-current and cmpxchg, the loop attempts again. Any comments?
Your current code appears to work. It behaves the same as the code below, which is a simple pattern that you can use for implementing any lock-free algorithm that operates on a single word of data without side effects:
do
{
    original = *data; // Capture.
    result = DoOperation(original); // Attempt operation
} while (CMPXCHG(data, result, original) != original);
EDIT: My original suggestion of interlocked add won't quite work here, because you support trying to allocate and failing if there is not enough space left. With InterlockedAdd you would already have modified the pointer by the time you detect the failure, causing subsequent allocations to fail.
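For comparison, the same capture/compute/compare-exchange loop can be sketched with std::atomic from C++11 in place of the CMPXCHG intrinsic. This is an illustrative variant and not the asker's code; the Arena class and its member names are assumptions, and alignment is ignored as in the original.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

class Arena {
public:
    Arena(char* buffer, std::size_t capacity)
        : head_(buffer), end_(buffer + capacity) {}

    // Returns a block of size bytes, or nullptr if the arena is exhausted.
    void* alloc(std::size_t size) {
        char* current = head_.load(std::memory_order_relaxed);  // capture
        do {
            assert(current <= end_);
            if (static_cast<std::size_t>(end_ - current) < size) {
                return nullptr;  // allocation failure, head_ left untouched
            }
            // On failure compare_exchange_weak reloads current with the
            // latest head_ value, so the loop retries with fresh data.
        } while (!head_.compare_exchange_weak(current, current + size,
                                              std::memory_order_relaxed));
        return current;
    }

private:
    std::atomic<char*> head_;  // current head of the remaining buffer
    char* const end_;          // end of the buffer
};
```

Relaxed ordering is used here because only the pointer itself is shared; if the allocated memory's contents are handed to other threads, that hand-off needs its own synchronization.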

PushFront method for an array C++

I thought I'd post a little of my homework assignment. I'm so lost in it. I just have to be really efficient, without using the STL, Boost, and the like. With this post, I was hoping that someone could help me figure it out.
bool stack::pushFront(const int nPushFront)
{
    if ( count == maxSize ) // indicates a full array
    {
        return false;
    }
    else if ( count <= 0 )
    {
        items[top+1].n = nPushFront;
        return true;
    }
    for ( int i = 0; i < count - 1; i++ )
    {
        intBackPtr = intFrontPtr;
        *intBackPtr = *intFrontPtr;
    }
    items[top+1].n = nPushFront;
    return true;
}
I just cannot figure out for the life of me how to do this correctly! I hope I'm doing this right, what with the pointers and all:
int *intFrontPtr = &items[0].n;
int *intBackPtr = &items[capacity-1].n;
I'm trying to think of this pushFront method like shifting an array to the right by 'n' units... I can only seem to do that in an array that is full. Can someone out there please help me?
Firstly, I'm not sure why you have the line else if ( count <= 0 ) - the count of items in your stack should never be below 0.
Usually, you would implement a stack not by pushing to the front, but pushing and popping from the back. So rather than moving everything along, as it looks like you're doing, just store a pointer to where the last element is, and insert just after that, and pop from there. When you push, just increment that pointer, and when you pop, decrement it (you don't even have to delete it). If that pointer is at the end of your array, you're full (so you don't even have to store a count value). And if it's at the start, then it's empty.
If you're after a queue, look into Circular Queues. That's typically how you'd implement one in an array. Alternatively, rather than using an array, try a Linked List - that lets it be arbitrarily big (the only limit is your computer's memory).
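The push-and-pop-from-the-back approach described above could be sketched as follows. It uses an index rather than a raw pointer, and the class name and capacity are made up for illustration; adapt it to the assignment's own interface.

```cpp
#include <cassert>
#include <cstddef>

// Fixed-capacity stack that pushes and pops at the back of the array,
// so no elements ever need to be shifted.
class IntStack {
public:
    static constexpr std::size_t kCapacity = 4;  // hypothetical size

    bool push(int value) {
        if (top_ == kCapacity) return false;  // index at the end: full
        items_[top_++] = value;               // store, then advance
        return true;
    }

    bool pop(int* out) {
        assert(out != nullptr);
        if (top_ == 0) return false;          // index at the start: empty
        *out = items_[--top_];                // step back, then read
        return true;
    }

private:
    int items_[kCapacity] = {};
    std::size_t top_ = 0;  // index one past the last stored element
};
```

Both operations are O(1), compared with the O(n) shift needed to push at the front.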
You don't need any pointers to shift an array. Just use a simple for statement:
int *a; // Your array
int count; // Elements count in array
int length; // Length of array (maxSize)

bool pushFront(const int nPushFront)
{
    if (count == length) return false;
    for (int i = count - 1; i >= 0; --i)
        a[i + 1] = a[i]; // move each element one slot to the right
    a[0] = nPushFront; ++count;
    return true;
}
Without doing your homework for you let me see if I can give you some hints. Implementing a deque (double ended queue) is really quite easy if you can get your head around a few concepts.
Firstly, it is key to note that since we will be popping off the front and/or back, an efficient algorithm that uses contiguous storage needs to be able to pop from the front or back without shifting the entire array (which is what you currently do). A much better, and in my mind simpler, way is to track the front AND the back of the relevant data within your deque.
As a simple example of the above concept consider a static (cannot grow) deque of size 10:
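class Deque
{
public:
    Deque()
    : front(0)
    , count(0) {}
    // push_back, pop_back, push_front, pop_front declarations go here
private:
    size_t front;
    size_t count;
    enum { MAXSIZE = 10 };
    int data[MAXSIZE];
};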
You can of course implement this and allow it to grow in size etc. But for simplicity I'm leaving all that out. Now to allow a user to add to the deque:
void Deque::push_back(int value)
{
    if (count>=MAXSIZE) throw std::runtime_error("Deque full!");
    data[(front+count)%MAXSIZE] = value;
    ++count;
}
And to pop off the back:
int Deque::pop_back()
{
    if (count==0) throw std::runtime_error("Deque empty! Cannot pop!");
    int value = data[(front+(--count))%MAXSIZE];
    return value;
}
Now the key thing to observe in the above functions is how we are accessing the data within the array. By modding with MAXSIZE we ensure that we are not accessing out of bounds, and that we are hitting the right value. Also as the value of front changes (due to push_front, pop_front) the modulus operator ensures that wrap around is dealt with appropriately. I'll show you how to do push_front, you can figure out pop_front for yourself:
void Deque::push_front(int value)
{
    if (count>=MAXSIZE) throw std::runtime_error("Deque full!");
    // Determine where front should now be.
    if (front==0)
        front = MAXSIZE-1;
    else
        --front;
    data[front] = value;
    ++count;
}