C++ pool allocator vs static allocation, cache performance

Given that I have two parallel and identically sized arrays of the following structs:
struct Matrix
{
    float data[16];
};
struct Vec4
{
    float data[4];
};
//Matrix arrM[256]; //for illustration
//Vec4 arrV[256];
Let's say I wish to iterate over the two arrays sequentially, as fast as possible. Let's say the function is something like:
for (int i = 0; i < 256; ++i)
{
    readonlyfunc(arrMPtr[i].data);
    readonlyfunc(arrVPtr[i].data);
}
Assume that my allocations are suitably aligned for each array, whether statically allocated or on the heap, and that my cache line size is 64 bytes.
Would I achieve the same cache locality and performance if I were to store my data as:
A)
//aligned
static Matrix arrM[256];
static Vec4 arrV[256];
Matrix* arrMPtr = &arrM[0];
Vec4* arrVPtr = &arrV[0];
vs
B)
//aligned
char* ptr = (char*) malloc(256*sizeof(Matrix) + 256*sizeof(Vec4));
Matrix* arrMPtr = (Matrix*) ptr;
Vec4* arrVPtr = (Vec4*) (ptr + 256*sizeof(Matrix));

How the memory is allocated (heap or statically allocated) makes no difference to the memory's ability to be cached. Since both of these arrays are fairly large (16384 and 4096 bytes, respectively), the exact alignment of the first and last elements probably doesn't matter either (but it does matter if you are using SSE instructions to access the content!).
Whether the memory is close together or not won't make a huge difference, as long as the allocation is small enough to easily fit in the cache, but big enough to take up multiple cache-lines.
You may find that a single interleaved structure holding all 20 float values (the matrix's 16 followed by the vector's 4) works out better if you are working sequentially through both arrays. But that only works if you don't ever need to do other things with the data where keeping the matrices and vectors in separate arrays makes more sense.
There may be a difference in the compiler's ability to translate the code so as to avoid an extra memory access. This will clearly depend on the actual code (e.g. whether the compiler inlines the function containing the for-loop, whether it inlines the readonlyfunc code, and so on). If it does, the static allocation can be translated from the pointer variant (which loads the value of the pointer to get the address of the data) into a constant address calculation. It probably doesn't make a huge difference in a loop as large as this.
As always when it comes to performance, small things can make big differences, so if this is really important, do some experiments using YOUR compiler and YOUR actual code. We can only give relatively speculative advice based on our experience. Different compilers do different things with the same code, and different processors do different things with the same machine code (both different instruction set architectures, e.g. ARM vs x86, and different implementations of an architecture, e.g. AMD Opteron vs Intel Atom, or ARM Cortex-A15 vs Cortex-M3). The memory configuration of your particular system will also affect things: how big the caches are, and so on.

It's impossible to say without knowing more about what you're doing and testing. It might be more efficient to interleave the data in a single array of combined structs:
struct MatrixVec
{
    float m[16];
    float v[4];
};
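For example, a minimal sketch reusing readonlyfunc, Matrix and Vec4 from the question, with the same 256-element count:

static MatrixVec arr[256];   // 80-byte elements: each matrix sits right next to its vector

for (int i = 0; i < 256; ++i)
{
    readonlyfunc(arr[i].m);  // 16 floats (64 bytes)
    readonlyfunc(arr[i].v);  // the 4 floats immediately following them in memory
}

Whether this wins over two separate arrays depends on your access pattern, so measure both.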
One important point is that malloc allocates memory from the heap, whereas local arrays are allocated on the stack (note that arrays declared static, as in option A, actually live in static storage rather than on the stack). The stack is likely already in the L1 cache, whereas the memory from the heap will have to be read in. You can also try a lesser-known function for dynamic memory allocation called alloca, which allocates memory on the stack. In your case you could try
char* ptr = (char*) alloca(256*sizeof(Matrix) + 256*sizeof(Vec4));
See Agner Fog's Optimizing software in C++, section "9.6 Dynamic memory allocation". Here are the advantages he lists for alloca compared to malloc:
There is very little overhead to the allocation process because the microprocessor has hardware support for the stack.
The memory space never becomes fragmented thanks to the first-in-last-out nature of the stack.
Deallocation has no cost because it goes automatically when the function returns. There is no need for garbage collection.
The allocated memory is contiguous with other objects on the stack, which makes data caching very efficient.
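For illustration, a minimal sketch of the alloca variant applied to the arrays from the question. It assumes the roughly 20 KB total comfortably fits on the thread's stack; alloca is non-standard (declared in <alloca.h> on most Unix-like systems, spelled _alloca in <malloc.h> on MSVC):

void process()
{
    // One stack allocation holding both arrays; it is released automatically on return.
    char* ptr = (char*) alloca(256*sizeof(Matrix) + 256*sizeof(Vec4));
    Matrix* arrMPtr = (Matrix*) ptr;
    Vec4* arrVPtr = (Vec4*) (ptr + 256*sizeof(Matrix));
    for (int i = 0; i < 256; ++i)
    {
        readonlyfunc(arrMPtr[i].data);
        readonlyfunc(arrVPtr[i].data);
    }
}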

Related

Does dynamic allocation store data in random locations in the heap?

I know that local variables will be stored on the stack in order.
But when I dynamically allocate variables on the heap in C++ like this:
int * a = new int{1};
int * a2 = new int{2};
int * a3 = new int{3};
int * a4 = new int{4};
Question 1: are these variables stored in contiguous memory locations?
Question 2: if not, is it because dynamic allocation stores variables at random locations in the heap?
Question 3: so does dynamic allocation increase the possibility of cache misses and have low spatial locality?
Part 1: Are separate allocations contiguous?
The answer is probably not. How dynamic allocation occurs is implementation dependent. If you allocate memory like in the above example, two separate allocations might be contiguous, but there is no guarantee of this happening (and it should never be relied on to occur).
Different implementations of C++ use different algorithms for deciding how memory is allocated.
Part 2: Is allocation random?
Somewhat; but not entirely. Memory doesn’t get allocated in an intentionally random fashion. Oftentimes memory allocators will try to allocate blocks of memory near each other in order to minimize page faults and cache misses, but it’s not always possible to do so.
Allocation happens in two stages:
The allocator asks for a large chunk of memory from the OS
It hands out pieces of that large chunk whenever you call new, until you ask for more memory than it has left, in which case it asks the OS for another large chunk.
This second stage is where an implementation can make attempts to give you memory that's near other recent allocations; however, it has little control over the first stage (and the OS usually just provides whatever memory is available, without any knowledge of other allocations by your program).
Part 3: avoiding cache misses
If cache misses are a bottleneck in your code,
Try to reduce the amount of indirection (by having arrays store objects by value, rather than by pointer);
Ensure that the memory you’re operating on is as contiguous as the design permits (so use a std::array or std::vector, instead of a linked list, and prefer a few big allocations to lots of small ones); and
Try to design the algorithm so that it has to jump around in memory as little as possible.
A good general principle is to just use a std::vector of objects unless you have a good reason to use something fancier. Because it has better cache locality, std::vector is faster at inserting and deleting elements than std::list, even for dozens or hundreds of elements.
Finally: try to take advantage of the stack. Unless there's a good reason for something to be a pointer, just declare it as a variable that lives on the stack. When possible,
Prefer to use MyClass x{}; instead of MyClass* x = new MyClass{};, and
Prefer std::vector<MyClass> instead of std::vector<MyClass*>.
By extension, if you can use static polymorphism (i.e., templates), use that instead of dynamic polymorphism.
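A minimal sketch of that difference (MyClass and the element count are just placeholders):

#include <vector>

struct MyClass { float x = 0, y = 0, z = 0; };

int main()
{
    // By value: elements sit in one contiguous block, so iteration streams through memory.
    std::vector<MyClass> byValue(1000);
    float sum = 0;
    for (const MyClass& c : byValue) sum += c.x;

    // By pointer: each element is a separate heap allocation, so every
    // iteration chases a pointer to a potentially distant address.
    std::vector<MyClass*> byPointer;
    for (int i = 0; i < 1000; ++i) byPointer.push_back(new MyClass{});
    for (const MyClass* c : byPointer) sum += c->x;

    for (MyClass* c : byPointer) delete c;   // the pointer version also needs manual cleanup
    return (int) sum;
}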
IMHO this is operating-system specific and C++ standard-library implementation specific.
new ultimately uses lower-level virtual memory allocation services, allocating several pages at once using system calls like mmap and munmap. The implementation of new could reuse previously freed memory space when relevant.
The implementation of new could use various and different strategies for "large" and "small" allocations.
In the example you gave, the first new results in a system call for memory allocation (usually several pages); the allocated chunk could be large enough that subsequent new calls result in contiguous allocations. But this depends on the implementation.
In short:
not at all (there is padding due to alignment, heap housekeeping data, allocated chunks may be reused, etc.),
not at all (AFAIK, heap algorithms are deterministic without any randomness),
generally yes (e.g., memory pooling might help here).
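To illustrate the last point, a minimal sketch of one simple form of pooling: putting the four ints from the question into a single std::vector so they are guaranteed to be contiguous, rather than calling new four times:

#include <vector>

int main()
{
    // One allocation backs all four values, so they sit next to each other in memory.
    std::vector<int> pool = {1, 2, 3, 4};
    int* a  = &pool[0];
    int* a2 = &pool[1];
    int* a3 = &pool[2];
    int* a4 = &pool[3];
    // Four separate new calls give no such guarantee about relative placement.
    return *a + *a2 + *a3 + *a4;
}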

Does allocation on the heap affect the access performance?

As is known:
ptr = malloc(size);
or in C++
ptr = new Klass();
will allocate size bytes on the heap. This is less efficient than allocating on the stack.
But after the allocation, when we access it:
foo(*ptr);
or
(*ptr)++;
Does it have the same performance as data on the stack, or still slower?
The only way to definitively answer this question is to code up both versions and measure their performance under multiple scenarios (different allocation sizes, different optimization settings, etc.). This sort of thing depends heavily on a lot of different factors, such as optimization settings, how the operating system manages memory, the size of the block being allocated, locality of accesses, etc. Never blindly assume that one method is more "efficient" than another in all circumstances.
Even then, the results will only be applicable for your particular system.
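A minimal sketch of such a measurement (the sizes, repetition count and use of std::chrono are arbitrary choices, and the numbers it prints only apply to the machine, compiler and flags you run it with):

#include <chrono>
#include <cstdio>
#include <cstdlib>

volatile long long sink = 0;   // keeps the optimizer from deleting the loops

int main()
{
    const int N = 1000, REPS = 100000;

    int stackBuf[N] = {};                               // storage on the stack
    int* heapBuf = (int*) std::calloc(N, sizeof(int));  // storage on the heap

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < N; ++i) sink += stackBuf[i];
    auto t1 = std::chrono::steady_clock::now();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < N; ++i) sink += heapBuf[i];
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("stack: %lld us, heap: %lld us\n",
                (long long) std::chrono::duration_cast<us>(t1 - t0).count(),
                (long long) std::chrono::duration_cast<us>(t2 - t1).count());

    std::free(heapBuf);
    return 0;
}

Once both buffers are warm in the cache, the two loops will typically read about the same; the differences show up in allocation cost and first-touch behaviour.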
It really depends on what you are comparing and how.
If you mean is
ptr = malloc(10 * sizeof(int));
slower than:
int arr[10];
ptr = arr;
and then using ptr to access the integers it points at?
Then no.
If you are referring to using arr[0] instead of *ptr in the second case, then possibly, as the compiler has to read the value in ptr to find the address of the actual variable. In many cases, however, it will "know" the value inside ptr, so it won't need to read the pointer itself.
If we are comparing foo(ptr) and foo(arr) it won't make any difference at all.
[There may be some penalty in actually allocating on the heap in that the memory has to be "committed" on the first use. But that is at most once for each 4KB, and we can probably ignore that in most cases].
Efficiency considerations are important when comparing an algorithm that runs in O(n^2) time versus O(nlogn), etc.
Comparing memory storage accesses, both approaches are O(n) or O(k), and it is usually NOT possible to measure any difference.
However, if you are writing some code for the kernel that is frequently invoked, then a small difference might become measurable.
In the context of this question, the real answer is that it really doesn't matter: use whichever storage makes your program easy to read and maintain, because in the long run the cost of paying humans to read your code is more than the cost of running a CPU for a few more or fewer instructions.
The stack is much faster than the heap, since allocation is as simple as moving the stack pointer, but the stack is of fixed size. In contrast, with the heap the user needs to manually allocate and de-allocate the memory.

Smaller per-allocation overhead in an allocators library

I'm currently writing a memory management library for C++ that is based around the concept of allocators. It's relatively simple; for now, all allocators implement these two member functions:
virtual void * alloc( std::size_t size ) = 0;
virtual void dealloc( void * ptr ) = 0;
As you can see, I do not support alignment in the interface but that's actually my next step :) and the reason why I'm asking this question.
I want the allocators to be responsible for alignment because each one can be specialized. For example, the block allocator can only return block-sized aligned memory so it can handle failure and return NULL if a different alignment is asked for.
Some of my allocators are in fact sub-allocators. For example, one of them is a linear/sequential allocator that just pointer-bumps on allocation. This allocator is constructed by passing in a char * pBegin and char * pEnd and it allocates from within that region in memory. For now, it works great but I get stuff that is 1-byte aligned. It works on x86 but I heard it can be disastrous on other CPUs (consoles?). It's also somewhat slower for reads and writes on x86.
The only sane way I know of implementing aligned memory management is to allocate an extra sizeof( void * ) + (alignment - 1) bytes and do pointer bit-masking to return the aligned address, while keeping the original allocated address in the bytes before the user data (the void * bytes, see above).
OK, my question...
That overhead, per allocation, seems big to me. For 4-byte alignment, I would have 7 bytes of overhead on a 32-bit CPU and 11 bytes on a 64-bit one. That seems like a lot.
First, is it a lot? Am I on par with other memory management libs you might have used in the past or are currently using? I've looked into malloc and it seems to have a minimum of 16-byte overhead, is that right?
Do you know of a better way, smaller overhead, of returning aligned memory to my lib's users?
You could store an offset, rather than a pointer, which would only need to be large enough to store the largest supported alignment. A byte might even be sufficient if you only support smallish alignments.
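A hedged sketch of that suggestion (the function names are mine; the over-allocate-and-mask part is the same idea as in the question, the only change being that a one-byte offset is stored just before the returned address instead of a full pointer; it supports power-of-two alignments up to 128 so the offset always fits in a byte):

#include <cstdint>
#include <cstdlib>

void* alloc_aligned(std::size_t size, std::size_t alignment)   // alignment: power of two, <= 128
{
    unsigned char* raw = (unsigned char*) std::malloc(size + alignment);  // worst-case padding
    if (!raw) return nullptr;
    std::uintptr_t aligned = ((std::uintptr_t) raw + alignment) & ~(std::uintptr_t)(alignment - 1);
    unsigned char* user = (unsigned char*) aligned;
    user[-1] = (unsigned char)(user - raw);   // offset is in [1, alignment], fits in one byte
    return user;
}

void dealloc_aligned(void* ptr)
{
    unsigned char* user = (unsigned char*) ptr;
    std::free(user - user[-1]);               // walk back by the stored offset
}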
How about implementing a buddy system, which can be x-byte aligned depending on your requirements?
General Idea:
When your lib is initialized, allocate a big chunk of memory. For our example let's assume 16B. (Only this block needs to be aligned; the algorithm will not require you to align any other block.)
Maintain lists of memory chunks with power-of-2 sizes, i.e. 4B, 8B, 16B, ..., 64KB, ..., 1MB, 2MB, ..., 512MB.
If a user asks for 8B of data, check the 8B list; if nothing is available, take a 16B block and split it into two 8B blocks. Give one back to the user and append the other to the 8B list.
If the user asks for 16B, check whether you have at least two 8B blocks available. If so, combine them and give the result back to the user. If not, the system does not have enough memory.
Pros:
No internal or external fragmentation.
No alignment required.
Fast access to memory chunks as they are pre-allocated.
If the list is an array, you get direct access to memory chunks of different sizes.
Cons:
Overhead for maintaining the lists of memory chunks.
If the list is a linked-list, traversal would be slow.

STL vector size on 64-bit machines

I have an application that will use millions of vectors.
It appears that most implementations of std::vector use 4 pointers (_First, _Last, _End, and _Alloc), which consume 32 bytes on 64-bit machines. For most "practical" use cases of vector, one could probably get away with a single pointer and two 'unsigned int' fields to store the current size and allocated size, respectively. Ignoring the potential challenge of supporting customized allocation (instead of assuming that allocations must go through the global new and delete operators), it seems that it is possible to build an STL-compliant vector class that uses only 16 bytes (or at worst 24 bytes to support the _Alloc pointer).
Before I start coding this up, 1) are there any pitfalls I should be aware of and 2) does an open source implementation exist?
You could accomplish something like this -- but it's not likely you'd gain all that much.
First, there's the performance aspect. You are trading time for memory consumption. Whatever memory you save is going to be offset by having to do an addition and a multiply on every call to end (okay, if it's a vector where sizeof(vector<t>::value_type) == 1 the multiply can be optimized out). Note that most handwritten looping code over vectors calls end on every loop iteration. The smaller footprint may still be a win on modern CPUs, because it allows the processor to keep more things in cache; unless those couple of extra instructions in an inner loop force the processor to swap things in and out of the instruction cache too often.
Moreover, the memory savings is likely to be small in terms of the overall memory use in the vector, for the following reasons:
Memory manager overhead. Each allocation from the memory manager (which vector of course needs) is going to add 16-24 bytes of overhead on its own in most memory manager implementations. (Assuming something like dlmalloc (UNIX/Linux/etc.) or RtlHeap (Windows).)
Overprovisioning load. In order to achieve amortized constant insertion and removal at the end, when vector resizes, it resizes to some multiple of the size of the data in the vector. This means that the typical memory capacity vector allocates is enough for 1.6 (MSVC++) or 2 (STLPort, libstdc++) times the number of elements actually stored in the vector.
Alignment restrictions. If you are putting those many vectors into an array (or another vector), then keep in mind that the first member of that vector is still a pointer to the allocated memory block. This pointer generally needs to be 8-byte aligned anyway -- so the 4 bytes you save are lost to structure padding in arrays.
I'd use the plain implementation of vector for now. If you run your code through a memory profiler and find that a significant saving would be made by getting rid of these couple of pointers, then you're probably better off implementing your own optimized class which meets your performance characteristics, rather than relying on the built-in vector implementation. (An example of one such optimized class is std::string on those platforms that implement the small string optimization.)
(Note: the only compiler of which I am aware that optimizes out the Alloc pointer is VC11, which isn't yet released. Though Nim says that the current prerelease version of libstdc++ does it as well...)
Unless these vectors are going to have contents that are extremely small, the difference between 16 vs. 32 bytes to hold the contents will be a small percentage of the total memory consumed by them. It will require lots of effort to reinvent this wheel, so be sure you're getting an adequate pay off for all that work.
BTW, there's value in education too, and you will learn a lot by doing this. If you choose to proceed, you might consider writing a test suite first, and exercise it on the current implementation and then on the one you invent.
To answer whether it is worth the effort, find or write a compatible implementation that fits your needs (maybe there are other things in std::vector that you do not need), and compare the performance with std::vector<your_type> on relevant platforms. Your suggestion can at least improve performance of the move constructor, as well as the move assignment operator:
typedef int32_t v4si __attribute__ ((vector_size (16)));
union
{
    v4si data;
    struct
    {
        T* pointer;
        uint32_t length;
        uint32_t capacity;
    } content;
} m_data;
This only covers "sane" T (noexcept move semantics). https://godbolt.org/g/d5yU3o
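For illustration, a hedged sketch of how the move operations benefit (SmallVec is a hypothetical wrapper name around the union above; the pointer, length and capacity move as a single 16-byte register copy, and this relies on the same GCC/Clang vector extension and union punning as the answer's snippet):

#include <cstdint>

template <typename T>
class SmallVec
{
    typedef int32_t v4si __attribute__ ((vector_size (16)));
    union
    {
        v4si data;               // the whole representation viewed as one 128-bit value
        struct
        {
            T* pointer;
            uint32_t length;
            uint32_t capacity;
        } content;
    } m_data;

public:
    SmallVec() { m_data.content = {nullptr, 0, 0}; }

    SmallVec(SmallVec&& other) noexcept
    {
        m_data.data = other.m_data.data;          // one 16-byte copy instead of three member copies
        other.m_data.content = {nullptr, 0, 0};   // leave the source empty
    }
};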

Access cost of dynamically created objects with dynamically allocated members

I'm building an application which will have dynamically allocated objects of type A, each with a dynamically allocated member (v), similar to the class below:
class A {
    int a;
    int b;
    int* v;
};
where:
The memory for v will be allocated in the constructor.
v will be allocated once when an object of type A is created and will never need to be resized.
The size of v will vary across all instances of A.
The application will potentially have a huge number of such objects and will mostly need to stream a large number of these objects through the CPU, but only needs to perform very simple computations on the member variables.
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Or are there any techniques to aid memory access, such as a pre-fetching scheme? For example, get an object of type A and operate on the other member variables whilst pre-fetching v.
If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
The target platforms are standard desktop machines with x86/AMD64 processors, Windows or Linux OSes and compiled using either GCC or MSVC compilers.
If you have a good reason to care about performance...
Could having v dynamically allocated mean that an instance of A and its member v
are not located together in memory?
If they are both allocated with 'new', then it is likely that they will be near one another. However, the current state of memory can drastically affect this outcome, it depends significantly on what you've been doing with memory. If you just allocate a thousand of these things one after another, then the later ones will almost certainly be "nearly contiguous".
If the A instance is on the stack, it is highly unlikely that its 'v' will be nearby.
If such fragmentation is a performance issue, are there any techniques that could
allow A and v to be allocated in a contiguous region of memory?
Allocate space for both, then placement new them into that space. It's dirty, but it should typically work:
// 'n' is a placeholder for however many ints v should hold
char* p = reinterpret_cast<char*>(malloc(sizeof(A) + n * sizeof(int)));
int* v = reinterpret_cast<int*>(p + sizeof(A));
A* a = new (p) A(v);   // placement new; requires <new>
// time passes
a->~A();
free(p);
Or are there any techniques to aid memory access such as pre-fetching scheme?
Prefetching is compiler and platform specific, but many compilers have intrinsics available to do it. Mind, it won't help a lot if you're going to try to access that data right away; for prefetching to be of any value you often need to do it hundreds of cycles before you want the data. That said, it can be a huge boost to speed. The intrinsic would look something like __pf(my_a->v);
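A hedged sketch with GCC/Clang's __builtin_prefetch (MSVC offers _mm_prefetch in <xmmintrin.h> instead); the look-ahead distance of 8 elements is an arbitrary placeholder you would have to tune, and the struct makes the question's members public for the sake of the example:

#include <cstddef>

struct A { int a; int b; int* v; };   // as in the question, but with public members

void process(A** objs, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + 8 < n)
            __builtin_prefetch(objs[i + 8]->v);   // start pulling a future object's v into cache

        A* obj = objs[i];
        // ... work on obj->a and obj->b while the prefetch (hopefully) completes,
        //     then touch obj->v ...
    }
}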
If the size of v or an acceptable maximum size could be known at compile time
would replacing v with a fixed sized array like int v[max_length] lead to better
performance?
Maybe. If the fixed size buffer is usually close to the size you'll need, then it could be a huge boost in speed. It will always be faster to access one A instance in this way, but if the buffer is unnecessarily gigantic and largely unused, you'll lose the opportunity for more objects to fit into the cache. I.e. it's better to have more smaller objects in the cache than it is to have a lot of unused data filling the cache up.
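For the fixed-size option, a minimal sketch (max_length is the hypothetical compile-time bound from the question; 64 is an arbitrary placeholder value):

constexpr int max_length = 64;   // hypothetical upper bound on the number of ints in v

struct AFixed
{
    int a;
    int b;
    int v[max_length];   // stored inline, so an AFixed and its data are always contiguous
};

An array of AFixed then streams as one contiguous block, at the cost of wasted space whenever fewer than max_length entries are actually used.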
The specifics depend on what your design and performance goals are. An interesting discussion about this, with a "real-world" specific problem on a specific bit of hardware with a specific compiler, see The Pitfalls of Object Oriented Programming (that's a Google Docs link for a PDF, the PDF itself can be found here).
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
Yes, that is likely.
What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
cachegrind, shark.
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Yes, you could allocate them together, but you should probably see if it's an issue first. You could use arena allocation, for example, or write your own allocators.
Or are there any techniques to aid memory access such as pre-fetching scheme? for example get an object of type A operate on the other member variables whilst pre-fetching v.
Yes, you could do this. The best thing to do would be to allocate regions of memory used together near each other.
If the size of v or an acceptable maximum size could be known at compile time would replacing v with a fixed sized array like int v[max_length] lead to better performance?
It might or might not. It would at least make v local with the struct members.
Write code.
Profile.
Optimize.
If you need to stream a large number of these through the CPU and do very little calculation on each one, as you say, why are we doing all this memory allocation?
Could you just have one copy of the structure, and one (big) buffer for v, read your data into it (in binary, for speed), do your very little calculation, and move on to the next one?
The program should spend almost 100% of time in I/O.
If you pause it several times while it's running, you should see it almost every time in the process of calling a system routine like FileRead. Some profilers might give you this information, except they tend to be allergic to I/O time.