Anyone thought about how to write a memory manager (in C++) that is completely branch free? I've written a pool, a stack, a queue, and a linked list (allocating from the pool), but I am wondering how plausible it is to write a branch free general memory manager.
This is all to help build a genuinely reusable framework for solid concurrent development that is friendly to in-order CPUs and to the cache.
Edit: by branchless I mean without direct or indirect function calls and without using ifs. I've been thinking I can probably implement something that first forces the requested size to zero for "false" calls, but I haven't really got much further than that.
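Something like this is roughly what I mean by zeroing the size (just a sketch of the idea, nothing more):
// Sketch: yields 'size' when doAlloc is true and 0 otherwise, without a branch
// (bool converts to exactly 0 or 1, so the multiply acts as a select).
inline size_t maskedSize(size_t size, bool doAlloc)
{
    return size * static_cast<size_t>(doAlloc);
}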
I feel that it's not impossible, but the other aspect of this exercise is then profiling it on said "unfriendly" processors to see if it's worth trying as hard as this to avoid branching.
While I don't think this is a good idea, one solution would be to have pre-allocated buckets of various log2 sizes, stupid pseudocode:
class Allocator {
    void* malloc(size_t size) {
        int bucket = log2(size + sizeof(int));
        int* pointer = reinterpret_cast<int*>(m_buckets[bucket].back());
        m_buckets[bucket].pop_back();
        *pointer = bucket;  //Store which bucket this was allocated from
        return pointer + 1; //Don't overwrite the header
    }
    void free(void* pointer) {
        int* temp = reinterpret_cast<int*>(pointer) - 1;
        m_buckets[*temp].push_back(temp);
    }
    vector< vector<void*> > m_buckets;
};
(You would of course also replace the std::vector with a simple array + counter).
EDIT: In order to make this robust (i.e. handle the situation where the bucket is empty) you would have to add some form of branching.
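For reference, the "simple array + counter" mentioned above might look roughly like this (the capacity is arbitrary, and the missing bounds checks are deliberate to keep it branch-free; an empty bucket is exactly the case the EDIT above warns about):
struct Bucket {
    void*  slots[1024];   // capacity picked arbitrarily for the sketch
    size_t count;         // number of free pointers currently stored; start it at 0

    void  push_back(void* p) { slots[count++] = p; }   // no capacity check: branch-free, but unsafe
    void* back() const       { return slots[count - 1]; }
    void  pop_back()         { --count; }
};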
EDIT2: Here's a small branchless log2 function:
//returns the smallest x such that value <= (1 << x)
int log2(int value) {
    union Foo {
        int x;
        float y;
    } foo;
    foo.y = value - 1;
    return ((foo.x & (0xFF << 23)) >> 23) - 126; //Extract exponent (base 2) of the floating point number
}
This gives the correct result for allocations < 33554432 bytes. If you need larger allocations you'll have to switch to doubles.
Here's a link to how floating point numbers are represented in memory.
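A quick sanity check of the helper (the values here are chosen arbitrarily):
#include <cassert>

int main() {
    assert(log2(16) == 4);        // 16 <= (1 << 4)
    assert(log2(17) == 5);        // 17 needs the next power of two
    assert(log2(33554431) == 25); // still inside the range discussed above
    return 0;
}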
The only way I know to create a truly branchless allocator is to reserve all the memory it will potentially use in advance. Otherwise there's always going to be some hidden branch somewhere checking whether we're exceeding the current capacity - whether that's the check buried in a vector's push_back or something of that sort.
Here is one such crude example: a fixed allocator with completely branchless malloc and free methods.
#include <algorithm> // std::max
#include <cassert>   // assert

class FixedAlloc
{
public:
FixedAlloc(int element_size, int num_reserve)
{
element_size = std::max(element_size, static_cast<int>(sizeof(Chunk)));
mem = new char[num_reserve * element_size];
char* ptr = mem;
free_chunk = reinterpret_cast<Chunk*>(ptr);
free_chunk->next = 0;
Chunk* last_chunk = free_chunk;
for (int j=1; j < num_reserve; ++j)
{
ptr += element_size;
Chunk* chunk = reinterpret_cast<Chunk*>(ptr);
chunk->next = 0;
last_chunk->next = chunk;
last_chunk = chunk;
}
}
~FixedAlloc()
{
delete[] mem;
}
void* malloc()
{
assert(free_chunk && free_chunk->next && "Reserve memory exhausted!");
Chunk* chunk = free_chunk;
free_chunk = free_chunk->next;
return chunk->mem;
}
void free(void* mem)
{
Chunk* chunk = static_cast<Chunk*>(mem);
chunk->next = free_chunk;
free_chunk = chunk;
}
private:
union Chunk
{
Chunk* next;
char mem[1];
};
char* mem;
Chunk* free_chunk;
};
Since it's totally branchless, it simply segfaults if you try to allocate more memory than initially reserved. It also has undefined behavior for trying to free a null pointer. I also avoided dealing with alignment for the sake of a simpler example.
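A quick usage sketch (the element size and count here are arbitrary):
FixedAlloc pool(64, 1024);  // 1024 fixed 64-byte elements, all reserved up front

void* a = pool.malloc();    // branchless: just unlinks the head of the free list
void* b = pool.malloc();
pool.free(a);               // branchless: pushes the chunk back onto the free list
pool.free(b);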
Related
I need to implement the function resize() :
void IntSet::resize(int new_capacity)
{
if (new_capacity < used)
new_capacity = used;
if (used == 0)
new_capacity = 1;
capacity = new_capacity;
int * newData = new int[capacity];
for (int i = 0; i < used; ++i)
newData[i] = data[i];
delete [] data;
data = newData;
}
and use it inside these functions:
IntSet IntSet::unionWith(const IntSet& otherIntSet) const
{
IntSet unionSet = otherIntSet;
for (int i = 0; i < used; i++)
{
if (unionSet.contains(data[i]))
unionSet.add(data[i]);
}
return unionSet;
}
and this one (NOTE: I already call resize() inside the add() function below, but I think it is incorrect):
bool IntSet::add(int anInt)
{
if (contains(anInt) == false)
{
if (used >= capacity)
resize(used++);
data[used++] = anInt;
return true;
}
return false;
}
The program compiles without errors, but it gives me a segmentation fault at runtime.
NOTE: The main thing is that I need help learning how to use the resize function to grow the capacity of the dynamic member data. Also, I know vectors would help here, but we are not allowed to use them yet.
Here are the Special Requirements from the professor:
>Special Requirement (You will lose points if you don't observe this.)
>When calling resize *(while implementing some of the member functions)* to
>increase the capacity of the dynamic arrays, use the following resizing
>rule (unless the new capacity has to be something else higher as dictated
>by other overriding factors):
>
>*"new capacity" is "roughly 1.5 x old capacity" and at least "old capacity + 1".*
>
>The latter *(at least "old capacity + 1")* is a simple way to take care
>of the subtle case where "1.5 x old capacity" evaluates (with truncation)
>to the same as "old capacity".
When you resize because of add, you increment used twice.
bool IntSet::add(int anInt)
{
if (contains(anInt) == false)
{
if (used >= capacity)
resize(used++); // Here And this is a post increment.
// resize will be called with used before the increment
// so you will wind up asking for a buffer the same size.
data[used++] = anInt; // and here.
return true;
}
return false;
}
So nothing goes into data[used] as intended: the post-increments skip a slot and write into the one after it. And since resize(used++) passed the old value of used, it didn't ask for any more space, so you wind up writing past the end of the allocated storage - which is what most likely triggers the segfault.
Solution
You don't want to increment anything in that resize call. You want to request one more than the current capacity, without touching used, so
resize(capacity +1);
Looks about right. However, what the instructions asked for is something more like:
int newcap = capacity * 1.5;
if (newcap == capacity) // newcap didn't change. eg: 1*1.5 = 1
{
newcap ++;
}
resize(newcap);
This is brute force though. There are smarter ways to do this.
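Putting the two fixes together, add() might end up looking roughly like this (a sketch, not the only valid way to apply the 1.5x rule):
bool IntSet::add(int anInt)
{
    if (contains(anInt) == false)
    {
        if (used >= capacity)
        {
            int newcap = capacity + capacity / 2; // roughly 1.5 x old capacity, no floating point
            if (newcap <= capacity)               // covers capacity 0 and 1: at least old capacity + 1
                newcap = capacity + 1;
            resize(newcap);
        }
        data[used++] = anInt; // increment used in exactly one place
        return true;
    }
    return false;
}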
I have a class with a std::vector<unsigned char> mPacket as a packet buffer (for sending UDP strings). There is a corresponding member variable mPacketNumber that keeps track of how many packets have been sent so far.
The first thing I do in the class is reserve space:
mPacket.reserve(400);
and then later, in a loop that runs while I want packets to get sent:
mPacket.clear(); //empty out the vector
long packetLength = 0; //keep track of packetLength for sending udp strings
memcpy(&mPacket[0], &mPacketNumber, 4); //4 bytes because it's a long
packetLength += 4; //add 4 bytes to the packet length
memcpy(&mPacket[packetLength], &data, dataLength);
packetLength += dataLength;
udp.send(mPacket.data(), packetLength);
Except I realized that nothing was getting sent! How peculiar.
So I dug a bit deeper, and found that mPacket.size() returns zero, while packetLength returns the size I think the packet should be.
I can't think of a reason for mPacket to have zero length -- even if I'm mishandling the data, the header with mPacketNumber should have been written just fine.
Can anyone suggest why I'm running into this problem?
thanks!
The storage you reserve is not there for normal use; the elements are only created when you resize the vector. While it might look as though it works here, the situation would be different with types that have constructors - you would see that the constructors were never called. This is undefined behaviour - you're accessing elements you aren't allowed to access in this situation.
The .reserve() operation is normally used together with .push_back() to avoid reallocations, but this is not the case here.
The .size() is not modified if you use .reserve(). You should use .resize() instead.
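A minimal sketch of that fix, reusing the sizes from the question:
mPacket.resize(4 + dataLength);            // the elements now exist, so indexing them is valid
memcpy(&mPacket[0], &mPacketNumber, 4);
memcpy(&mPacket[4], &data, dataLength);
udp.send(mPacket.data(), mPacket.size());  // size() now matches what was written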
Alternatively, you can use your copy operation together with .push_back() and .reserve(), but you need to drop the usage of memcpy, and instead use the std::copy together with std::back_inserter, which uses .push_back() to push the elements to the other container:
std::copy(reinterpret_cast<unsigned char*>(&mPacketNumber), reinterpret_cast<unsigned char*>(&mPacketNumber) + sizeof(mPacketNumber), std::back_inserter(mPacket));
std::copy(reinterpret_cast<unsigned char*>(&data), reinterpret_cast<unsigned char*>(&data) + dataLength, std::back_inserter(mPacket));
These reinterpret_casts are vexing, but the code still has one advantage - you won't get buffer overrun in case your estimate was too low.
A vector doesn't actually count its elements when you call size(). There's a counter variable inside the vector that holds that information, because the vector has plenty of memory allocated and can't really know where the end of your data is. It updates that counter as you add or remove elements through the vector's own methods, because those are programmed to do so.
You added data directly through its underlying array pointer, which doesn't go through any of the vector's methods, so the vector has no way to react. The data is there, but the vector doesn't acknowledge it, so the counter remains at 0 and size() returns 0.
You should either replace all size() calls with packetLength, or use the vector's methods to add/remove/read data, or use a dynamically allocated array instead of a vector, or create your own class that contains an array and manages it the way you like. To be honest, using a vector in a situation like this doesn't really make sense.
Vector is a conventional high-level object-oriented component, and in most cases it should be used that way.
Example of one's own Array class:
If you used your own dynamically allocated array, you'd have to remember its length all the time in order to use it. So let's create a class that cuts us some slack there. This example transfers elements with memcpy, and the [] notation works as expected. It has an initial maximum length, but extends itself when necessary.
Also, this is an inline class; certain IDEs may ask you to separate it into a header and a source file, so you may have to do that yourself.
Add more methods yourself if necessary. When applying this, do not use memcpy directly unless you're also going to change the arraySize attribute manually. You've got integrated addFrom and addBytesFrom methods that use memcpy inside (with the calling array as the destination) and increase arraySize accordingly. If you do want to use memcpy yourself, the setSize method can be used to force a new array size without modifying the array.
#include <cstring>
//this way you can easily change types during coding in case you change your mind
//more conventional object-oriented method would use templates and generic programming, but lets not complicate too much now
typedef unsigned char type;
class Array {
private:
type *array;
long arraySize;
long allocAmount; //number of allocated bytes
long currentMaxSize; //number of allocated elements
//private call that extends memory taken by the array
bool reallocMore()
{
//preserve old data
type *temp = new type[currentMaxSize];
memcpy(temp, array, allocAmount);
long oldAmount = allocAmount;
//calculate new max size and number of allocation bytes
currentMaxSize *= 16;
allocAmount = currentMaxSize * sizeof(type);
//reallocate array and copy its elements back into it
delete[] array;
array = new type[currentMaxSize];
memcpy(array, temp, oldAmount);
//we no longer need temp to take space in our heap
delete[] temp;
//check if space was successfully allocated
if(array) return true;
else return false;
}
public:
//constructor
Array(bool huge)
{
if(huge) currentMaxSize = 1024 * 1024;
else currentMaxSize = 1024;
allocAmount = currentMaxSize * sizeof(type);
array = new type[currentMaxSize];
arraySize = 0;
}
//copy elements from another array and add to this one, updating arraySize
bool addFrom(void *src, long howMany)
{
//predict new array size and extend if larger than currentMaxSize
long newSize = howMany + arraySize;
while(true)
{
if(newSize > currentMaxSize)
{
bool result = reallocMore();
if(!result) return false;
}
else break;
}
//add new elements
memcpy(&array[arraySize], src, howMany * sizeof(type));
arraySize = newSize;
return true;
}
//copy BYTES from another array and add to this one, updating arraySize
bool addBytesFrom(void *src, long byteNumber)
{
//predict new array size and extend if larger than currentMaxSize
int typeSize = sizeof(type);
long howMany = byteNumber / typeSize;
if(byteNumber % typeSize != 0) howMany++;
long newSize = howMany + arraySize;
while(true)
{
if(newSize > currentMaxSize)
{
bool result = reallocMore();
if(!result) return false;
}
else break;
}
//add new elements
memcpy(&array[arraySize], src, byteNumber);
arraySize = newSize;
return true;
}
//clear the array as if it's just been made
bool clear(bool huge)
{
//huge >>> 1MB, not huge >>> 1KB
if(huge) currentMaxSize = 1024 * 1024;
else currentMaxSize = 1024;
allocAmount = currentMaxSize * sizeof(type);
delete[] array;
array = new type[currentMaxSize];
arraySize = 0;
return true;
}
//if you modify this array out of class, you must manually set the correct size
bool setSize(long newSize) {
while(true)
{
if(newSize > currentMaxSize)
{
bool result = reallocMore();
if(!result) return false;
}
else break;
}
arraySize = newSize;
return true;
}
//current number of elements
long size() {
return arraySize;
}
//current size in bytes
long sizeInBytes() {
return arraySize * sizeof(type);
}
//this enables the usage of [] as in yourArray[i]
type& operator[](long i)
{
return array[i];
}
};
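A short usage sketch with the names from the earlier question (mPacketNumber, data, dataLength and udp are assumed to exist as in the original code):
Array packet(false);                          // not huge: starts with room for 1024 elements

packet.addBytesFrom(&mPacketNumber, sizeof(mPacketNumber));
packet.addBytesFrom(&data, dataLength);       // payload, exactly as in the question's memcpy
udp.send(&packet[0], packet.sizeInBytes());   // operator[] exposes the raw storage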
Call resize() (not just reserve()) before copying into the buffer; then size() reports what you actually wrote:
mPacket.clear();                        //empty out the vector
mPacket.resize(4 + dataLength);         //create the elements first, then copying into them is fine
long packetLength = 0;                  //keep track of packetLength for sending udp strings
memcpy(&mPacket[0], &mPacketNumber, 4); //4 bytes because it's a long
packetLength += 4;                      //add 4 bytes to the packet length
memcpy(&mPacket[packetLength], &data, dataLength);
packetLength += dataLength;
udp.send(mPacket.data(), packetLength);
I want to successfully allocate an Array in my Memory Manager. I am having a hard time getting the data setup successfully in my Heap. I don't know how to instantiate the elements of the array, and then set the pointer that is passed in to that Array. Any help would be greatly appreciated. =)
Basically to sum it up, I want to write my own new[#] function using my own Heap block instead of the normal heap. Don't even want to think about what would be required for a dynamic array. o.O
// Parameter 1: Pointer that you want to pointer to the Array.
// Parameter 2: Amount of Array Elements requested.
// Return: true if Allocation was successful, false if it failed.
template <typename T>
bool AllocateArray(T*& data, unsigned int count)
{
if((m_Heap.m_Pool == nullptr) || count <= 0)
return false;
unsigned int allocSize = sizeof(T)*count;
// If we have an array, pad an extra 16 bytes so that it will start the data on a 16 byte boundary and have room to store
// the number of items allocated within this pad space, and the size of the original data type so in a delete call we can move
// the pointer by the appropriate size and call a destructor(potentially a base class destructor) on each element in the array
allocSize += 16;
unsigned int* mem = (unsigned int*)(m_Heap.Allocate(allocSize));
if(!mem)
{
return false;
}
mem[2] = count;
mem[3] = sizeof(T);
T* iter = (T*)(&(mem[4]));
data = iter;
iter++;
for(unsigned int i = 0; i < count; ++i,++iter)
{
// I have tried a bunch of stuff, not sure what to do. :(
}
return true;
}
Heap Allocate function:
void* Heap::Allocate(unsigned int allocSize)
{
Header* HeadPtr = FindBlock(allocSize);
Footer* FootPtr = (Footer*)HeadPtr;
FootPtr = (Footer*)((char*)FootPtr + (HeadPtr->size + sizeof(Header)));
// Right Split Free Memory if there is enough to make another block.
if((HeadPtr->size - allocSize) >= MINBLOCKSIZE)
{
// Create the Header for the Allocated Block and Update it's Footer
Header* NewHead = (Header*)FootPtr;
NewHead = (Header*)((char*)NewHead - (allocSize + sizeof(Header)));
NewHead->size = allocSize;
NewHead->next = NewHead;
NewHead->prev = NewHead;
FootPtr->size = NewHead->size;
// Create the Footer for the remaining Free Block and update it's size
Footer* NewFoot = (Footer*)NewHead;
NewFoot = (Footer*)((char*)NewFoot - sizeof(Footer));
HeadPtr->size -= (allocSize + HEADANDFOOTSIZE);
NewFoot->size = HeadPtr->size;
// Turn new Header and Old Footer High Bits On
(NewHead->size |= (1 << 31));
(FootPtr->size |= (1 << 31));
// Return actual allocated memory's location
void* MemAddress = NewHead;
MemAddress = ((char*)MemAddress + sizeof(Header));
m_PoolSizeTotal = HeadPtr->size;
return MemAddress;
}
else
{
// Updating descriptors
HeadPtr->prev->next = HeadPtr->next;
HeadPtr->next->prev = HeadPtr->prev;
HeadPtr->next = NULL;
HeadPtr->prev = NULL;
// Turning Header and Footer High Bits On
(HeadPtr->size |= (1 << 31));
(FootPtr->size |= (1 << 31));
// Return actual allocated memory's location
void* MemAddress = HeadPtr;
MemAddress = ((char*)MemAddress + sizeof(Header));
m_PoolSizeTotal = HeadPtr->size;
return MemAddress;
}
}
Main.cpp
int* TestArray;
MemoryManager::GetInstance()->CreateHeap(1); // Allocates 1MB
MemoryManager::GetInstance()->AllocateArray(TestArray, 3);
MemoryManager::GetInstance()->DeallocateArray(TestArray);
MemoryManager::GetInstance()->DestroyHeap();
As far as these two specific points:
Instantiate the elements of the array
Set the pointer that is passed in to that Array.
For (1): there is no definitive notion of "initializing" the elements of the array in C++. There are at least two reasonable behaviors; which one you want depends on the semantics you're after. The first is to simply zero the array (see memset). The other would be to call the default constructor for each element of the array -- I would not recommend this option, as the default (zero-argument) constructor may not exist.
EDIT: Example initialization using placement new, with the names from your AllocateArray (requires <new>):
for (unsigned int i = 0; i < count; ++i)
    new (&data[i]) T();
For (2): It is not exactly clear what you mean by "and then set the pointer that is passed in to that Array." You could "set" the memory returned as data = reinterpret_cast<T*>(&mem[4]), which is essentially what you already do.
A few other words of caution (having written my own memory managers): be very careful about byte alignment (check reinterpret_cast<uintptr_t>(mem) % 16); you'll want to ensure you are returning pointers that are word (or even 16-byte) aligned. Also, I would recommend using inttypes.h and uint64_t to be explicit about sizing -- currently it looks like this allocator will break for >4GB allocations.
EDIT:
Speaking from experience -- writing a memory allocator is a very difficult thing to do, and it is even more painful to debug. As commenters have stated, a memory allocator is specific to the kernel -- so information about your platform would be very helpful.
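For the matching DeallocateArray, here is a rough sketch of what the header layout above implies. m_Heap.Free() is an assumption (the question only shows Heap::Allocate), so treat every name here as hypothetical:
template <typename T>
void DeallocateArray(T* data)
{
    if (data == nullptr)
        return;

    // Walk back over the 16-byte pad (4 unsigned ints, as in AllocateArray) to reach the header.
    unsigned int* mem = reinterpret_cast<unsigned int*>(data) - 4;
    unsigned int count = mem[2];  // element count stored at allocation time

    for (unsigned int i = count; i > 0; --i)
        data[i - 1].~T();         // destroy in reverse order of construction

    m_Heap.Free(mem);             // hypothetical counterpart to Heap::Allocate
}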
void foo(Node* p[], int size){
_uint64 arr_of_values[_MAX_THREADS];
for (int i=0 ; i < size ; i++ ){
arr_of_values[i] = p[i]->....;
// much code here
//
}
}
vs
void foo(Node* p[], int size){
_uint64 arr_of_values[_MAX_THREADS];
Node** p_end = p + size;
for ( ; p != p_end ; ){
arr_of_values[i] = (*p)->.....;
p++;
// much code here
//
}
}
I created this function to demonstrate what I am asking:
Which is more efficient from a cache standpoint: indexing with p[i] or walking a pointer with *p++?
(I'll never use p[i-x] in the rest of the code, but I may use p[i] or *p in the calculation that follows.)
The most important thing is to avoid false sharing in arr_of_values. Each thread writes into its own slot, but 8 or 16 slots share a cache line (depending on the CPU), which causes a massive false-sharing problem. Either add padding between the slots so that each thread's slot sits on its own cache line (see the sketch after the code below), or accumulate in a local on the stack and write to the array only once at the end:
void foo(Node* p[], int size){
    _uint64 arr_of_values[_MAX_THREADS];
    _uint64 temp = 0;          // local accumulator, lives in a register or on this thread's stack
    Node** p_end = p + size;
    for ( ; p != p_end ; ){
        temp = .....;
        p++;
        // much code here
        //
    }
    arr_of_values[i] = temp;   // i = this thread's slot; written exactly once
}
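If you prefer the padding route instead, one way to sketch it (64 bytes is an assumed cache-line size; alignas needs C++11):
struct alignas(64) PaddedSlot {
    _uint64 value;   // each slot now occupies its own cache line
};

PaddedSlot arr_of_values[_MAX_THREADS]; // threads no longer share a line when writing their slot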
The question of access by pointer versus access by index is irrelevant with today's compilers.
Your next actions should be: grab a copy of the Software Optimization Cookbook. Read it. Measure. Fix the measured hotspots, not the guesstimates.
From a cache point of view the problem isn't how you access the elements; in this case, using a pointer or an array index is equivalent.
BTW, Node* p[] is an array of pointers, so your Node objects could well have been allocated in distant memory areas (for example, via several separate ptr = new Node() calls). You get the best cache performance if:
your Nodes are stored contiguously in memory, and
the Node size doesn't exceed the cache size.
I am using a ringbuffer to hold samples for a streaming audio application. I copied the ringbuffer implementation from Ken Greenebaum's Audio Anecdotes 2 book.
After running Intel's Vtune analyzer on my code, it tells me that most of the time is being spent in the functions getSamplesAvailable() and getSpaceAvailable().
Can anyone advise as to how I might optimise these functions?
unsigned int RingBuffer::getSamplesAvailable(void)
{
int count = (mTail - mHead + mSize) % mSize;
return(count);
}
unsigned int RingBuffer::getSpaceAvailable(void)
{
int free = (mHead - mTail + mSize - 1)%mSize;
int underMark = mHighWaterMark - getSamplesAvailable();
int spaceAvailable = min(underMark, free);
return(spaceAvailable);
}
int RingBuffer::push(int value)
{
int status = 1;
if(getSpaceAvailable()) {
// next two operations do NOT have to be atomic!
// do NOT have to worry about collision with _tail
mBuffer[mTail] = value; // store value
mTail = (mTail + 1) % mSize; // increment tail
} else {
status = 0;
}
return(status);
}
int RingBuffer::pop(int *value)
{
int status = 1;
if(getSamplesAvailable()) {
*value = mBuffer[mHead];
mHead = (mHead + 1) % mSize; // increment head
} else {
status = 0;
}
return(status);
}
If you can make mSize a power of two, you can replace
(mTail - mHead + mSize) % mSize
by
(mTail - mHead) & (mSize-1)
and
(mHead - mTail + mSize - 1) % mSize
by
(mHead - mTail - 1) & (mSize - 1)
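With that change - and assuming the constructor enforces that mSize is a power of two - the two hot functions reduce to something like:
unsigned int RingBuffer::getSamplesAvailable(void)
{
    return (mTail - mHead) & (mSize - 1);
}

unsigned int RingBuffer::getSpaceAvailable(void)
{
    int free = (mHead - mTail - 1) & (mSize - 1);
    int underMark = mHighWaterMark - getSamplesAvailable();
    return min(underMark, free);
}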
I think the problem is not their complexity - they are just basic integer arithmetic - but how many times they are called.
Is there any possibility of doing "batch" updates (inserting or retrieving several values at once) on the buffer? That way you could save some of the calculations.
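For example, a hypothetical pushMany() (not part of the original class) pays the space check only once per batch:
int RingBuffer::pushMany(const int* values, int count)
{
    int n = min(count, (int)getSpaceAvailable()); // one capacity check for the whole batch
    for (int i = 0; i < n; ++i) {
        mBuffer[mTail] = values[i];
        mTail = (mTail + 1) % mSize;
    }
    return n; // number of samples actually written
}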
Using a power of two as Henrik proposed is the first thing to do. There is also the possibility of changing the way you handle the mTail and mHead indexes: instead of keeping them in the [0, mSize) range, you can let them run freely as uint32_t values.
When accessing an element you will then need to apply modulo mSize, which slows down each access slightly:
mBuffer[mTail % mSize] = value;
But it simplifies, for instance, the count of samples (and stays correct even when the indexes wrap past the uint32_t maximum):
int count = mTail - mHead;
It also lets you use the ring buffer fully, instead of losing one element to distinguish the full and empty cases.
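Combining both suggestions (a power-of-two mSize plus free-running uint32_t indexes), push() might look roughly like this:
int RingBuffer::push(int value)
{
    if (mTail - mHead == mSize)            // full: the count needs no modulo at all
        return 0;
    mBuffer[mTail & (mSize - 1)] = value;  // mask only when touching the storage
    ++mTail;                               // free-running; wraps naturally at the uint32_t limit
    return 1;
}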
If speed is the most important thing for you, and you can live with the fact that it is (a) non-portable (Windows only, although Linux has the same basic functionality, so it should work there as well) and (b) only works in release builds (which has more to do with how VC++ allocates memory in debug mode - probably there's some compile flag for it?), you can use the following:
DWORD size = 64 * 1024; // HAS to be a multiple of 64k due to how win allocates memory
HANDLE mapped_memory = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, size, NULL);
int *p1 = (int*)MapViewOfFile(mapped_memory, FILE_MAP_WRITE, 0, 0, size);
int *p2 = (int*)MapViewOfFile(mapped_memory, FILE_MAP_WRITE, 0, 0, size);
// p1 and p2 should be adjacent in memory, if not try again.. no idea if there's some
// better method under windows
Basically you now have two adjacent memory blocks in virtual memory that point to the same physical memory, i.e. if you write through p1 you'll see the changes through p2 and vice versa.
The advantage is that you can now read and write the buffer more efficiently, and in larger amounts than just one word at a time. You just have to move the pointers back (decrement them by the buffer size) when they run past the end of the first view - that shouldn't be too hard to implement.
Edit: I see now that there's even a POSIX implementation of this on the wiki.
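The payoff is that writing a contiguous run of bytes never needs a wrap-around check, roughly like this (a sketch on top of the mapping above, assuming the two views really did land adjacently):
// 'head' is a free-running byte offset into a ring of 'size' bytes (size is a power of two here).
// Because the second view mirrors the first, a write that runs past the end of the first view
// simply continues at the start of the buffer - no split memcpy needed.
void ring_write(char* p1, DWORD size, DWORD& head, const char* src, DWORD n) // requires n <= size
{
    memcpy(p1 + (head & (size - 1)), src, n);
    head += n;
}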