How can I improve the performance of my ring buffer code?

I am using a ringbuffer to hold samples for a streaming audio application. I copied the ringbuffer implementation from Ken Greenebaum's Audio Anecdotes 2 book.
After running Intel's Vtune analyzer on my code, it tells me that most of the time is being spent in the functions getSamplesAvailable() and getSpaceAvailable().
Can anyone advise as to how I might optimise these functions?
unsigned int RingBuffer::getSamplesAvailable(void)
{
int count = (mTail - mHead + mSize) % mSize;
return(count);
}
unsigned int RingBuffer::getSpaceAvailable(void)
{
int free = (mHead - mTail + mSize - 1)%mSize;
int underMark = mHighWaterMark - getSamplesAvailable();
int spaceAvailable = min(underMark, free);
return(spaceAvailable);
}
int RingBuffer::push(int value)
{
int status = 1;
if(getSpaceAvailable()) {
// next two operations do NOT have to be atomic!
// do NOT have to worry about collision with _tail
mBuffer[mTail] = value; // store value
mTail = (mTail + 1) % mSize; // increment tail
} else {
status = 0;
}
return(status);
}
int RingBuffer::pop(int *value)
{
int status = 1;
if(getSamplesAvailable()) {
*value = mBuffer[mHead];
mHead = (mHead + 1) % mSize; // increment head
} else {
status = 0;
}
return(status);
}

If you can make mSize a power of two, you can replace
(mTail - mHead + mSize) % mSize
by
(mTail - mHead) & (mSize-1)
and
(mHead - mTail + mSize - 1) % mSize
by
(mHead - mTail - 1) & (mSize - 1)
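As a rough illustration of the replacement above, here is a minimal sketch using free functions. The names isPowerOfTwo, samplesAvailable and slotsFree are hypothetical and not part of the original RingBuffer class, and the high-water-mark logic from getSpaceAvailable() is left out. It assumes unsigned 32-bit indices and a power-of-two size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not part of the original RingBuffer class.
// True if size is a non-zero power of two.
inline bool isPowerOfTwo(std::uint32_t size) {
    return size != 0 && (size & (size - 1)) == 0;
}

// Same result as (tail - head + size) % size when size is a power of two.
inline std::uint32_t samplesAvailable(std::uint32_t head, std::uint32_t tail,
                                      std::uint32_t size) {
    assert(isPowerOfTwo(size));
    return (tail - head) & (size - 1);  // unsigned wrap-around is well defined
}

// Same result as (head - tail + size - 1) % size when size is a power of two.
inline std::uint32_t slotsFree(std::uint32_t head, std::uint32_t tail,
                               std::uint32_t size) {
    assert(isPowerOfTwo(size));
    return (head - tail - 1) & (size - 1);
}
```

The mask replaces an integer division with a single AND, which is where the saving comes from. Whether it is measurable in a given application is something only a profiler run can confirm.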

I think the problem is not their complexity (they are just basic integer arithmetic) but how many times they are called.
Is there any possibility of doing "batch" updates on the buffer, i.e. inserting or retrieving several values at once? That way you could save some calculations.
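A batch interface could look roughly like the following. This is a single-threaded sketch under stated assumptions, not the book's implementation: the class name BatchRing and the methods pushMany and popMany are hypothetical, the buffer size is assumed to be at least 2, and one slot is kept empty to tell "full" from "empty", as in the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical single-threaded ring buffer with batch operations.
class BatchRing {
public:
    explicit BatchRing(std::size_t size) : buf_(size), head_(0), tail_(0) {}

    std::size_t samplesAvailable() const {
        return (tail_ + buf_.size() - head_) % buf_.size();
    }
    std::size_t spaceAvailable() const {
        return buf_.size() - 1 - samplesAvailable();  // one slot stays empty
    }

    // Copies up to n values in; the space check runs once per batch.
    std::size_t pushMany(const int* src, std::size_t n) {
        n = std::min(n, spaceAvailable());
        std::size_t first = std::min(n, buf_.size() - tail_);  // part before the wrap
        std::memcpy(&buf_[tail_], src, first * sizeof(int));
        std::memcpy(&buf_[0], src + first, (n - first) * sizeof(int));
        tail_ = (tail_ + n) % buf_.size();
        return n;  // number of values actually stored
    }

    // Copies up to n values out; the availability check runs once per batch.
    std::size_t popMany(int* dst, std::size_t n) {
        n = std::min(n, samplesAvailable());
        std::size_t first = std::min(n, buf_.size() - head_);  // part before the wrap
        std::memcpy(dst, &buf_[head_], first * sizeof(int));
        std::memcpy(dst + first, &buf_[0], (n - first) * sizeof(int));
        head_ = (head_ + n) % buf_.size();
        return n;  // number of values actually read
    }

private:
    std::vector<int> buf_;
    std::size_t head_;
    std::size_t tail_;
};
```

With this shape the modulo arithmetic is paid once per block of samples instead of once per sample, which fits audio code that usually works on whole frames anyway.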

Using a power of two as Henrik proposed is the first thing to do. You can also change the way you maintain the mTail and mHead indexes. Instead of keeping them in the [0, mSize) range, you can let them run freely as uint32_t.
When accessing an element you will then need to reduce the index modulo mSize, which adds a little work to each access:
mBuffer[mTail % mSize] = value;
But it will simplify, for instance, the count of samples (even when your indexes wrap past the uint32_t maximum, provided mSize is a power of two so that the wrap stays aligned with the buffer):
int count = mTail - mHead;
It will also allow you to use the ring buffer fully, instead of losing one element to differentiate the cases where the buffer is full or empty.
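A minimal single-threaded sketch of the free-running index idea follows. The class name FreeRunningRing is hypothetical, and the size is assumed to be a power of two so that a mask can stand in for the modulo.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical ring buffer whose indices are never reduced, only masked on access.
class FreeRunningRing {
public:
    // sizePow2 is assumed to be a power of two.
    explicit FreeRunningRing(std::uint32_t sizePow2)
        : buf_(sizePow2), mask_(sizePow2 - 1), head_(0), tail_(0) {}

    // Unsigned subtraction stays correct even after the indices wrap at 2^32.
    std::uint32_t samplesAvailable() const { return tail_ - head_; }

    std::uint32_t spaceAvailable() const {
        return static_cast<std::uint32_t>(buf_.size()) - samplesAvailable();
    }

    bool push(int value) {
        if (spaceAvailable() == 0) return false;  // full: all slots are in use
        buf_[tail_ & mask_] = value;
        ++tail_;
        return true;
    }

    bool pop(int* value) {
        if (samplesAvailable() == 0) return false;  // empty
        *value = buf_[head_ & mask_];
        ++head_;
        return true;
    }

private:
    std::vector<int> buf_;
    std::uint32_t mask_;
    std::uint32_t head_;
    std::uint32_t tail_;
};
```

Because "full" is tail_ - head_ == size and "empty" is tail_ - head_ == 0, no slot has to be sacrificed to tell the two apart.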

If speed is the most important thing for you, and you can live with the fact that this is a) non-portable (Windows only, although Linux has the same basic functionality, so it should work there as well) and b) only works in release builds (this has more to do with how VC++ allocates memory in debug mode; there is probably some compile flag for this), you can use the following:
DWORD size = 64 * 1024; // HAS to be a multiple of 64k due to how win allocates memory
HANDLE mapped_memory = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, size, NULL);
int *p1 = (int*)MapViewOfFile(mapped_memory, FILE_MAP_WRITE, 0, 0, size);
int *p2 = (int*)MapViewOfFile(mapped_memory, FILE_MAP_WRITE, 0, 0, size);
// p1 and p2 should be adjacent in memory, if not try again.. no idea if there's some
// better method under windows
Basically you now have two adjacent memory blocks in virtual memory that point to the same physical memory, i.e. if you write through p1 you'll see the changes through p2 and vice versa.
The advantage is that you can now read and write the buffer more efficiently, and in larger amounts than one word at a time, because an access that runs past the end of the first view continues into the second. You just have to wrap the pointers correctly, which shouldn't be too hard to implement.
Edit: Now see that - there's even a POSIX implementation on wiki.

Related

how to implement a memory allocator

I'm trying to implement the freelist algorithm to allocate memory. The two functions I'm trying to write can be described as shown below.
// allocates a block of memory of at least size words and returns the address of that memory or 0 if no memory could be allocated.
int64_t *mymalloc(int64_t size)
// deallocates the memory stored at addr. the address will either be one allocated by mymalloc or the value 0.
void myfree(int64_t *addr)
The implementations of these functions should only use memory returned by the function pool(), whose signature is described below. Thus it cannot use the functions new, delete, malloc, calloc, realloc, etc.
// pool is a function that returns the address of the beginning
// of a block of RAM that may be used for dynamic memory
// allocation. The size of the pool in bytes is stored in the
// first word, which can be assumed to be a multiple of 8.
// When pool() is called, the first word isn't always overwritten with its size.
// Each word is an int64_t, and so is 8 bytes.
// Assume this function works.
int64_t *pool();
I think defining some global variables like freelst, which points to the start of the freelst, may be helpful. It can be defined as
int64_t *freelst = pool();
I know that when allocating memory, there are some steps to follow:
The free list pointer should be updated accordingly.
The number of allocated blocks should be incremented.
The amount of memory allocated should be subtracted from the first word of the freelist, so that the first word always stores the size of memory available.
One needs to check if the current block of memory has been previously freed.
When deallocating memory, one needs to ensure addresses are inserted into the freelist in increasing order so that neighbours can be determined. If neighbours (which differ by 8) are free, they need to be merged, and as many times as necessary until no free neighbours are encountered to reduce fragmentation. Also, the second word of the freelst should be a pointer to the next word of the free
Below is some code I've come up with for this problem. It's incomplete, but the basic ideas are there.
#include <iostream>
#include <cstdint>
#include "pool.h" // place where pool is defined
const int NODE_SIZE = 8;
int64_t *freelst = pool();
int64_t *start_of_pool = freelst; // just keep this fixed I guess
// assume that the pool function works.
int64_t *mymalloc(int64_t size) {
int64_t *currentBlock = freelst;
while (currentBlock) {
if (*currentBlock >= size) { // if the currentBlock is large enough, set it to this value (we're doing first fit).
break;
}
currentBlock = currentBlock + 1; // since incrementing involves moving to the address that's one word past the current one.
}
// assuming we've found a large enough block, we now have to allocate it
if (currentBlock == 0) {
return 0; // I think this should occur because not enough memory was found
}
int64_t *prev_val = freelst; // save the previous value of the freelist
freelst = freelst + 1 + *currentBlock; // assuming *currentBlock is the size of currentBlock.
*freelst -= NODE_SIZE + *currentBlock; // update the size of the freelst here (though likely this was done incorrectly)
return currentBlock + 1; // return address one word after currentBlock
// is this all if we're trying to implement a linked list using raw pointers?
// I don't think so, but I'm not sure what else to add.
}
void myfree(int64_t *p) {
if (p == 0) {
return; // of course if we're freeing a nullptr, we should return 0.
}
// assume the freelst is already in ascending order of course.
// sort the freelst in linear time by positioning the currentBlock into the right place.
// the basic idea is to use insertion sort.
// find where the address p is in the free list.
// I think another method would be to update the prevBlock as the currentBlock is being updated.
int64_t *currentBlock = freelst;
int64_t *prevBlock = freelst;
while (currentBlock != 0 && currentBlock + 1 <= p) { // comparing addresses
prevBlock = currentBlock; // so it's set to the previous block
currentBlock = (int64_t *)*(currentBlock + 1); // set it to the next address
// as a linked list, I'm thinking of doing something like:
// prevBlock = currentBlock;
// currentBlock = currentBlock->next;
}
// after exiting, either currentBlock = 0, in which case p is the largest address,
// or currentBlock + 1 > p, so it's smaller than the current address.
if (currentBlock == 0) { // then p is the largest address
if ((int64_t *)*(prevBlock + 1) != currentBlock) throw std::invalid_argument("A likely error occurred as prevBlock + 1 != currentBlock.");
*(prevBlock + 1) = (int64_t)p;
*(p + 1) = 0;
// p->next = 0
// prevBlock->next = p;
} else {
if (prevBlock == currentBlock) { // in this case currentBlock was the start of the freelst
int64_t *temp = (int64_t *)*(currentBlock + 1);
*(prevBlock + 1) = (int64_t)p; // cast so it passes type-checking
*(p + 1) = (int64_t)temp;
// here I'm trying to mimic what's done for a linked list:
// int64_t *temp = currentBlock->next;
// prevBlock->next = p;
// p->next = temp;
} else {
*(prevBlock + 1) = (int64_t)p;
*(p + 1) = (int64_t)currentBlock;
// here's what I think might be the equivalent for a linked list:
// prevBlock->next = p;
// p->next = currentBlock;
}
}
if (currentBlock != 0) { // if it not null
if (currentBlock + 1 + *currentBlock == (int64_t *)(currentBlock + 1)) { // check if currentBlock is adjacent to prevBlock
*currentBlock += *(int64_t *)*(currentBlock + 1) + NODE_SIZE;
}
// link current block to next next block
*(currentBlock + 1) = (int64_t)((int64_t *)*(currentBlock + 1) + 1);
}
// assuming sorting was done correctly, check if addresses are adjacent
if (prevBlock + 1 + *prevBlock == currentBlock) { // if you add one word plus the size of the previous block to get the
// currentBlock
if (currentBlock == 0) throw std::invalid_argument("A likely error occurred. currentBlock was 0 even though it should have been defined.");
*prevBlock += *currentBlock + NODE_SIZE; // add the sizes of both the currentBlock and previous block,
// assuming they aren't null of course.
// so currentBlock->next->size + NODE_SIZE;
// link previous block to next block
*(prevBlock + 1) = (int64_t)(currentBlock + 1);
}
}
Any help as to how to implement these functions/cases to consider that I've missed with code that deals with them would be appreciated. I can also clarify things if necessary.
I tried looking at this website for some help too, but I'm still having issues.

how to implement a memory allocator
At high level, there are essentially two ways to acquire memory for a custom allocator:
Allocate memory using an implementation defined way. The exact details depend on the target system, so first step is to find out what system you are targeting.
Or allocate memory using a standard way (standard allocator, new, malloc, static storage, ...)
Once you've acquired the memory, you need some data structure to keep track of memory that has been allocated through the allocator. You seem to have roughly described the "free list" structure, which is commonly used for this purpose.
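To make the free-list idea concrete, here is a minimal sketch of a first-fit allocator that keeps its free blocks in address order and merges neighbours on release. It is an illustration under stated assumptions, not a solution to the assignment above: the names Block and FreeListAllocator are hypothetical, sizes are in bytes rather than words, the memory is supplied by the caller (it could come from static storage or from a function such as pool()), and there is no thread safety or error checking.

```cpp
#include <cstddef>
#include <new>

// Header stored in front of every block, free or allocated.
struct Block {
    std::size_t size;  // payload bytes that follow this header
    Block* next;       // next free block in address order (unused while allocated)
};

class FreeListAllocator {
public:
    // memory must be aligned for Block; bytes should be a multiple of sizeof(Block).
    FreeListAllocator(void* memory, std::size_t bytes)
        : head_(new (memory) Block{bytes - sizeof(Block), nullptr}) {}

    void* allocate(std::size_t bytes) {
        // Round up so that every header stays aligned.
        bytes = (bytes + sizeof(Block) - 1) / sizeof(Block) * sizeof(Block);
        Block** link = &head_;
        for (Block* b = head_; b != nullptr; link = &b->next, b = b->next) {
            if (b->size < bytes) continue;               // first fit: too small, keep looking
            if (b->size >= bytes + 2 * sizeof(Block)) {  // enough left over to split
                char* restAddr = reinterpret_cast<char*>(b + 1) + bytes;
                Block* rest = new (restAddr) Block{b->size - bytes - sizeof(Block), b->next};
                b->size = bytes;
                *link = rest;
            } else {
                *link = b->next;                         // hand out the whole block
            }
            return b + 1;                                // payload starts after the header
        }
        return nullptr;                                  // no block was large enough
    }

    void deallocate(void* p) {
        if (p == nullptr) return;
        Block* b = static_cast<Block*>(p) - 1;
        Block* prev = nullptr;
        Block* cur = head_;
        while (cur != nullptr && cur < b) {              // find the insertion point by address
            prev = cur;
            cur = cur->next;
        }
        b->next = cur;
        if (prev != nullptr) prev->next = b; else head_ = b;
        if (cur != nullptr && endOf(b) == cur) {         // merge with the following block
            b->size += sizeof(Block) + cur->size;
            b->next = cur->next;
        }
        if (prev != nullptr && endOf(prev) == b) {       // merge with the preceding block
            prev->size += sizeof(Block) + b->size;
            prev->next = b->next;
        }
    }

private:
    // Address one past the payload of b.
    static Block* endOf(Block* b) {
        return reinterpret_cast<Block*>(reinterpret_cast<char*>(b + 1) + b->size);
    }
    Block* head_;
};
```

Keeping the list sorted by address is what makes merging cheap: a released block only ever needs to be compared with its immediate predecessor and successor in the list.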

Allocating an Array in Memory Manager

I want to successfully allocate an Array in my Memory Manager. I am having a hard time getting the data setup successfully in my Heap. I don't know how to instantiate the elements of the array, and then set the pointer that is passed in to that Array. Any help would be greatly appreciated. =)
Basically to sum it up, I want to write my own new[#] function using my own Heap block instead of the normal heap. Don't even want to think about what would be required for a dynamic array. o.O
// Parameter 1: Pointer that you want to point to the Array.
// Parameter 2: Amount of Array Elements requested.
// Return: true if Allocation was successful, false if it failed.
template <typename T>
bool AllocateArray(T*& data, unsigned int count)
{
if((m_Heap.m_Pool == nullptr) || count <= 0)
return false;
unsigned int allocSize = sizeof(T)*count;
// If we have an array, pad an extra 16 bytes so that it will start the data on a 16 byte boundary and have room to store
// the number of items allocated within this pad space, and the size of the original data type so in a delete call we can move
// the pointer by the appropriate size and call a destructor(potentially a base class destructor) on each element in the array
allocSize += 16;
unsigned int* mem = (unsigned int*)(m_Heap.Allocate(allocSize));
if(!mem)
{
return false;
}
mem[2] = count;
mem[3] = sizeof(T);
T* iter = (T*)(&(mem[4]));
data = iter;
iter++;
for(unsigned int i = 0; i < count; ++i,++iter)
{
// I have tried a bunch of stuff, not sure what to do. :(
}
return true;
}
Heap Allocate function:
void* Heap::Allocate(unsigned int allocSize)
{
Header* HeadPtr = FindBlock(allocSize);
Footer* FootPtr = (Footer*)HeadPtr;
FootPtr = (Footer*)((char*)FootPtr + (HeadPtr->size + sizeof(Header)));
// Right Split Free Memory if there is enough to make another block.
if((HeadPtr->size - allocSize) >= MINBLOCKSIZE)
{
// Create the Header for the Allocated Block and Update its Footer
Header* NewHead = (Header*)FootPtr;
NewHead = (Header*)((char*)NewHead - (allocSize + sizeof(Header)));
NewHead->size = allocSize;
NewHead->next = NewHead;
NewHead->prev = NewHead;
FootPtr->size = NewHead->size;
// Create the Footer for the remaining Free Block and update its size
Footer* NewFoot = (Footer*)NewHead;
NewFoot = (Footer*)((char*)NewFoot - sizeof(Footer));
HeadPtr->size -= (allocSize + HEADANDFOOTSIZE);
NewFoot->size = HeadPtr->size;
// Turn new Header and Old Footer High Bits On
(NewHead->size |= (1 << 31));
(FootPtr->size |= (1 << 31));
// Return actual allocated memory's location
void* MemAddress = NewHead;
MemAddress = ((char*)MemAddress + sizeof(Header));
m_PoolSizeTotal = HeadPtr->size;
return MemAddress;
}
else
{
// Updating descriptors
HeadPtr->prev->next = HeadPtr->next;
HeadPtr->next->prev = HeadPtr->prev;
HeadPtr->next = NULL;
HeadPtr->prev = NULL;
// Turning Header and Footer High Bits On
(HeadPtr->size |= (1 << 31));
(FootPtr->size |= (1 << 31));
// Return actual allocated memory's location
void* MemAddress = HeadPtr;
MemAddress = ((char*)MemAddress + sizeof(Header));
m_PoolSizeTotal = HeadPtr->size;
return MemAddress;
}
}
Main.cpp
int* TestArray;
MemoryManager::GetInstance()->CreateHeap(1); // Allocates 1MB
MemoryManager::GetInstance()->AllocateArray(TestArray, 3);
MemoryManager::GetInstance()->DeallocateArray(TestArray);
MemoryManager::GetInstance()->DestroyHeap();

As far as these two specific points:
(1) Instantiate the elements of the array
(2) Set the pointer that is passed in to that Array.
For (1): there is no definitive notion of "initializing" the elements of an array in C++. There are at least two reasonable behaviors, depending on the semantics you want. The first is to simply zero the array (see memset). The other is to call the default constructor for each element of the array. I would not recommend this option as the default, since the default (zero-argument) constructor may not exist.
EDIT: Example initialization using placement new
for (unsigned int i = 0; i < len; i++)
new (&arr[i]) T();
For (2): It is not exactly clear what you mean by "and then set the pointer that is passed in to that Array." You could "set" the memory returned as data = reinterpret_cast<T*>(&mem[4]), which you already do.
A few other words of caution (having written my own memory managers): be very careful about byte alignment (reinterpret_cast<uintptr_t>(mem) % 16); you'll want to ensure you are returning pointers that are word (or even 16-byte) aligned. Also, I would recommend using inttypes.h and uint64_t to be explicit about sizing. Currently it looks like this allocator will break for >4GB allocations.
EDIT:
Speaking from experience: writing a memory allocator is a very difficult thing to do, and it is even more painful to debug. As commenters have stated, a memory allocator is specific to the kernel, so information about your platform would be very helpful.
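The placement-new loop mentioned above can be wrapped up roughly as follows. This is a sketch under assumptions: constructArray and destroyArray are hypothetical helper names, T is assumed to be default-constructible, and the raw storage is assumed to be large enough and suitably aligned for T.

```cpp
#include <cstddef>
#include <new>

// Hypothetical helper: default-constructs count objects of T in raw storage.
// raw must be suitably aligned and at least count * sizeof(T) bytes long.
template <typename T>
T* constructArray(void* raw, std::size_t count) {
    T* first = static_cast<T*>(raw);
    std::size_t built = 0;
    try {
        for (; built < count; ++built)
            new (first + built) T();  // placement new: construct in place, no allocation
    } catch (...) {
        while (built > 0) first[--built].~T();  // undo the elements already constructed
        throw;
    }
    return first;
}

// Hypothetical helper: runs the destructors in reverse order; the storage itself
// still has to be returned to the heap it came from.
template <typename T>
void destroyArray(T* first, std::size_t count) {
    while (count > 0) first[--count].~T();
}
```

In AllocateArray this would mean calling something like constructArray on the address just past the 16-byte pad and assigning the result to data, with the matching deallocation function reading the stored count back and calling destroyArray before releasing the block.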

Branchless memory manager?

Anyone thought about how to write a memory manager (in C++) that is completely branch free? I've written a pool, a stack, a queue, and a linked list (allocating from the pool), but I am wondering how plausible it is to write a branch free general memory manager.
This is all to help make a really reusable framework for doing solid concurrent, in-order CPU, and cache friendly development.
Edit: by branchless I mean without doing direct or indirect function calls, and without using ifs. I've been thinking that I can probably implement something that first changes the requested size to zero for false calls, but haven't really got much more than that.
I feel that it's not impossible, but the other aspect of this exercise is then profiling it on said "unfriendly" processors to see if it's worth trying as hard as this to avoid branching.

While I don't think this is a good idea, one solution would be to have pre-allocated buckets of various log2 sizes, stupid pseudocode:
class Allocator {
void* malloc(size_t size) {
int bucket = log2(size + sizeof(int));
int* pointer = reinterpret_cast<int*>(m_buckets[bucket].back());
m_buckets[bucket].pop_back();
*pointer = bucket; //Store which bucket this was allocated from
return pointer + 1; //Dont overwrite header
}
void free(void* pointer) {
int* temp = reinterpret_cast<int*>(pointer) - 1;
m_buckets[*temp].push_back(temp);
}
vector< vector<void*> > m_buckets;
};
(You would of course also replace the std::vector with a simple array + counter).
EDIT: In order to make this robust (i.e. handle the situation where the bucket is empty) you would have to add some form of branching.
EDIT2: Here's a small branchless log2 function:
//returns the smallest x such that value <= (1 << x)
int
log2(int value) {
union Foo {
int x;
float y;
} foo;
foo.y = value - 1;
return ((foo.x & (0xFF << 23)) >> 23) - 126; //Extract exponent (base 2) of floating point number
}
This gives the correct result for allocations < 33554432 bytes. If you need larger allocations you'll have to switch to doubles.
Here's a link to how floating point numbers are represented in memory.

The only way I know to create a truly branchless allocator is to reserve all the memory it will potentially use in advance. Otherwise there's always going to be some hidden code somewhere to see if we're exceeding some current capacity whether it's in a hidden push_back in a vector checking if the size exceeds capacity used to implement it or something of that sort.
Here is one such crude example of a fixed alloc which has a completely branchless malloc and free method.
class FixedAlloc
{
public:
FixedAlloc(int element_size, int num_reserve)
{
element_size = max(element_size, static_cast<int>(sizeof(Chunk))); // cast so both arguments have the same type
mem = new char[num_reserve * element_size];
char* ptr = mem;
free_chunk = reinterpret_cast<Chunk*>(ptr);
free_chunk->next = 0;
Chunk* last_chunk = free_chunk;
for (int j=1; j < num_reserve; ++j)
{
ptr += element_size;
Chunk* chunk = reinterpret_cast<Chunk*>(ptr);
chunk->next = 0;
last_chunk->next = chunk;
last_chunk = chunk;
}
}
~FixedAlloc()
{
delete[] mem;
}
void* malloc()
{
assert(free_chunk && free_chunk->next && "Reserve memory exhausted!");
Chunk* chunk = free_chunk;
free_chunk = free_chunk->next;
return chunk->mem;
}
void free(void* mem)
{
Chunk* chunk = static_cast<Chunk*>(mem);
chunk->next = free_chunk;
free_chunk = chunk;
}
private:
union Chunk
{
Chunk* next;
char mem[1];
};
char* mem;
Chunk* free_chunk;
};
Since it's totally branchless, it simply segfaults if you try to allocate more memory than initially reserved. It also has undefined behavior for trying to free a null pointer. I also avoided dealing with alignment for the sake of a simpler example.

lock free arena allocator implementation - correct?

For a simple pointer-increment allocator (do they have an official name?) I am looking for a lock-free algorithm. It seems trivial, but I'd like to get some feedback on whether my implementation is correct.
not threadsafe implementation:
byte * head; // current head of remaining buffer
byte * end; // end of remaining buffer
void * Alloc(size_t size)
{
if (end-head < size)
return 0; // allocation failure
void * result = head;
head += size;
return result;
}
My attempt at a thread safe implementation:
void * Alloc(size_t size)
{
byte * current;
do
{
current = head;
if (end - current < size)
return 0; // allocation failure
} while (CMPXCHG(&head, current+size, current) != current);
return current;
}
where CMPXCHG is an interlocked compare exchange with (destination, exchangeValue, comparand) arguments, returning the original value
Looks good to me - if another thread allocates between the get-current and cmpxchg, the loop attempts again. Any comments?

Your current code appears to work. Your code behaves the same as the below code, which is a simple pattern that you can use for implementing any lock-free algorithm that operates on a single word of data without side-effects
do
{
original = *data; // Capture.
result = DoOperation(original); // Attempt operation
} while (CMPXCHG(data, result, original) != original);
EDIT: My original suggestion of an interlocked add won't quite work here, because you need to support trying to allocate and failing when there is not enough space left. With InterlockedAdd you would already have modified the pointer by the time you detect the failure, causing subsequent allocations to fail as well.
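For reference, the same capture/compute/compare-exchange loop can be written portably with std::atomic instead of a platform-specific CMPXCHG. The following is a minimal sketch: the class name BumpArena is hypothetical, alignment is not handled, and relaxed ordering is assumed to be enough because only the pointer itself is shared.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical lock-free pointer-increment allocator over a caller-supplied buffer.
class BumpArena {
public:
    BumpArena(unsigned char* begin, std::size_t bytes)
        : head_(begin), end_(begin + bytes) {}

    void* alloc(std::size_t size) {
        unsigned char* current = head_.load(std::memory_order_relaxed);  // capture
        do {
            if (static_cast<std::size_t>(end_ - current) < size)
                return nullptr;  // allocation failure, head_ is left untouched
            // On failure compare_exchange_weak reloads current, so the check is redone.
        } while (!head_.compare_exchange_weak(current, current + size,
                                              std::memory_order_relaxed));
        return current;
    }

private:
    std::atomic<unsigned char*> head_;  // current head of the remaining buffer
    unsigned char* const end_;          // end of the buffer
};
```

Since the head only ever moves forward and memory is never returned individually, this loop does not suffer from the ABA problem that affects lock-free free lists.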

How to know if the value of an array is composed by zeros?

Hey, if you can think of a more descriptive title, please edit it.
I'm writing a little algorithm that involves checking values in a matrix.
Let's say:
char matrix[100][100];
char *ptr = &matrix[0][0];
Imagine I populate the matrix with a couple of values (5 or 6) of 1, like:
matrix[20][35]=1;
matrix[67][34]=1;
How can I know if the binary value of an interval of the matrix is zero, for example (in pseudo code)
if((the value from ptr+100 to ptr+200)==0){ ... // do something
I'm trying to pick up on c/c++ again. There should be a way of picking those one hundred bytes (which are all next to each other) and checking whether they are all zero without having to check them one by one (considering char is one byte).

You can use std::find_if.
bool not_0(char c)
{
return c != 0;
}
char *next = std::find_if(ptr + 100, ptr + 200, not_0);
if (next == ptr + 200)
// all 0's
You can also use binders to remove the free function (although I think binders are hard to read):
char *next = std::find_if(ptr + 100, ptr + 200,
std::bind2nd(std::not_equal_to<char>(), 0));
Dang, I just noticed the request not to do this byte by byte. find_if will still go byte by byte, although it's hidden. You will have to check the elements one by one, although using a larger type will help. Here's my final version.
template <class T>
bool all_0(const char *begin, const char *end, ssize_t cutoff = 10)
{
if (end - begin < cutoff)
{
const char *next = std::find_if(begin, end,
std::bind2nd(std::not_equal_to<char>(), 0));
return (next == end);
}
else
{
while ((begin < end) && ((reinterpret_cast<uintptr_t>(begin) % sizeof(T)) != 0))
{
if (*begin != '\0')
return false;
++begin;
}
while ((end > begin) && ((reinterpret_cast<uintptr_t>(end) % sizeof(T)) != 0))
{
--end;
if (*end != '\0')
return false;
}
const T *nbegin = reinterpret_cast<const T *>(begin);
const T *nend = reinterpret_cast<const T *>(end);
const T *next = std::find_if(nbegin, nend,
std::bind2nd(std::not_equal_to<T>(), 0));
return (next == nend);
}
}
What this does is first checks to see if the data is long enough to make it worth the more complex algorithm. I'm not 100% sure this is necessary but you can tune what is the minimum necessary.
Assuming the data is long enough it first aligns the begin and end pointers to match the alignment of the type used to do the comparisons. It then uses the new type to check the bulk of the data.
I would recommend using:
all_0<int>(); // 32 bit platforms
all_0<long>(); // 64 bit LP64 platforms (most (all?) Unix platforms)
all_0<long long>() // 64 bit LLP64 platforms (Windows)

There's no built-in language feature to do that, nor is there a standard library function to do it. memcmp() could work, but you'd need a second array of all zeroes to compare against; that array would have to be large, and you'd also eat up unnecessary memory bandwidth in doing the comparison.
Just write the function yourself; it's not that hard. If this truly is the bottleneck of your application (which you should only conclude from profiling), then rewrite that function in assembly.

You tagged this C++, so you can use a pointer as an iterator and use an STL algorithm such as std::max_element. Then see if the max is 0 or not (this assumes the values are never negative).

You could cast your pointer as an int * and then check four bytes at a time rather than one.
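A variant of this idea that avoids alignment and aliasing problems is to copy each group of bytes into an integer with memcpy rather than casting the pointer. The sketch below uses 8-byte words instead of int; the function name allZero is hypothetical, and the speed-up over a plain byte loop is an expectation, not a measurement.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if every byte in [begin, end) is zero.
inline bool allZero(const char* begin, const char* end) {
    std::size_t remaining = static_cast<std::size_t>(end - begin);
    // Check eight bytes at a time; memcpy makes the unaligned read well defined.
    while (remaining >= sizeof(std::uint64_t)) {
        std::uint64_t word;
        std::memcpy(&word, begin, sizeof(word));
        if (word != 0) return false;
        begin += sizeof(word);
        remaining -= sizeof(word);
    }
    // Check the leftover bytes (fewer than eight) one by one.
    for (; remaining > 0; --remaining, ++begin)
        if (*begin != 0) return false;
    return true;
}
```

For the example in the question this would be called as allZero(ptr + 100, ptr + 200). Compilers typically turn the fixed-size memcpy into a single load, so the copy itself should not cost anything extra.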

There's no way to tell whether an array has any value other than zero other than by checking all elements one by one. But if you start with an array that you know has all zeros, then you can maintain a flag that states the array's zero state.
std::vector<int> vec(SIZE);
bool allzeroes = true;
// ...
vec[SIZE/2] = 1;
allzeroes = false;
// ...
if( allzeroes ) {
// ...
}

Reserve element 0 of your array, to be set to all zeros.
Use memcmp to compare the corresponding ranges in the two elements.
