Allocating aligned memory for larger arrays

In my program I want to allocate 32-byte-aligned memory for use with SSE/AVX. The amount I want to allocate is around 2000*1300*17*17*4 (large data set). I tried _aligned_malloc() and _mm_malloc(), but for sizes this large they fail to allocate and the program ends with an access violation exception. If the amount is smaller, around 512*320*4*17*17 (small data set), the code works fine.
For the large data set these functions return a null pointer; for the small one they work. If I use plain unaligned allocation with new, the code works for the large data set too.
Finally, can someone tell me whether there is any significant performance gain from using aligned memory with AVX?
Edit: After some research, according to this post, new allocates memory from the free store and malloc() allocates memory from the heap. I am exceeding the maximum heap size here, since _aligned_malloc() fails with errno 12, which means ENOMEM. Can someone suggest a workaround?

On memory allocation:
It seems you are actually trying to allocate 2000*1300*17*17*4 elements of 32 bytes each. That is about 96 GB, while your system has only 12 GB of memory.
Since new works but malloc does not, your local implementation of new seems to be able to allocate huge amounts of virtual memory. malloc allocates from the heap, which means it is usually limited to the amount of physical memory you have. That is why it fails.
As the data set is bigger than your main memory, you might want to allocate the memory using mmap, which maps a file into virtual memory and makes it accessible as if it were in physical memory (although only part of it will be cached in memory at any time). I'm not sure it's guaranteed, but mmap usually aligns on the page size boundary (almost always 4096 bytes).
Either way you will have a huge performance loss, because your disk is far slower than your RAM. The loss is so severe that AVX will probably not speed up anything at all.
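As a rough check of the numbers above, here is a minimal sketch that computes the request size in 64-bit arithmetic and attempts a 32-byte-aligned allocation that reports failure instead of crashing. It uses the C++17 aligned, non-throwing form of operator new rather than _aligned_malloc() so that it stays portable. The helper names (total_bytes, try_alloc_aligned32, free_aligned32) and the assumption of 32-byte elements are illustrative, not taken from the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Total bytes for `count` elements of `elem_size` bytes each, computed in
// 64 bits so that products like 2000*1300*17*17*4 cannot overflow an int.
constexpr std::uint64_t total_bytes(std::uint64_t count, std::uint64_t elem_size) {
    return count * elem_size;
}

// Hypothetical helper: request `bytes` bytes aligned to 32 bytes.
// Returns nullptr on failure instead of throwing, so the caller must check.
inline void* try_alloc_aligned32(std::size_t bytes) {
    return ::operator new(bytes, std::align_val_t(32), std::nothrow);
}

// Memory from try_alloc_aligned32 must be released with the matching
// aligned form of operator delete.
inline void free_aligned32(void* p) {
    ::operator delete(p, std::align_val_t(32));
}
```

The important part is checking the returned pointer before touching the memory; dereferencing a null result is what shows up as an access violation.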
On the performance loss of using unaligned memory:
On modern hardware (from Intel's Haswell onwards, I think) this depends on your access pattern. Unaligned access should have almost no overhead when iterating over the array in memory order (each cache line is still loaded only once). If you access it in random order, then you will often cross a 64-byte cache line boundary. This means the processor has to load two lines into the cache and evict two lines instead of one. While this can be a serious problem in some situations, in your case the disk will slow things down so much that you will barely notice it.
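To make the cache line argument concrete, here is a small sketch that tells whether a load of a given size at a given address touches more than one cache line. The 64-byte line size is an assumption (typical for current x86 CPUs), and crosses_cache_line is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// True if an access of `size` bytes starting at address `addr` spans
// more than one cache line.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 &&
           (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}
```

With 32-byte alignment a 32-byte AVX load can never straddle two lines, for example crosses_cache_line(32, 32) is false, while an unaligned load at offset 48 does (crosses_cache_line(48, 32) is true).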
Additional tips (or a shot in the dark):
The way you gave the size of the array (2000*1300*17*17*4) suggests that you are using a multidimensional array (e.g. auto x = new __m256[2000][1300][17][17][4]). So some tips on that:
Iterate through it mostly sequentially.
Check if it is sparse (meaning some of the memory will never be accessed) and shrink it if possible.
You could try to flatten the array and do the more complex index calculation yourself in order to reduce the amount of memory needed. If you get it to fit completely into your RAM, you can start to optimise your code (using AVX and/or aligned memory).
"Total paging file size for all drives is 15247MB" suggests that you are actually using only part of that 96 GB, so there might be a way to reduce your usage further.
In that case you might also want to ask another question on how to reduce the memory usage with more info on what you are doing.
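The flattening tip can be sketched as follows: one flat buffer plus a row-major index calculation, where the last index varies fastest so that iterating over it walks memory sequentially. Shape5 and its members are hypothetical names, and this is only one possible layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Extents of a 5-D array that is stored in one flat buffer.
struct Shape5 {
    std::array<std::size_t, 5> extent;

    // Total number of elements in the flat buffer.
    std::size_t size() const {
        std::size_t n = 1;
        for (std::size_t e : extent) n *= e;
        return n;
    }

    // Row-major offset of element idx[0..4]; the last index is contiguous.
    std::size_t offset(const std::array<std::size_t, 5>& idx) const {
        std::size_t off = 0;
        for (std::size_t d = 0; d < 5; ++d) off = off * extent[d] + idx[d];
        return off;
    }
};
```

For the question's extents the buffer would have Shape5{{2000, 1300, 17, 17, 4}}.size() elements, and an index calculation like this is also the place to skip regions that are known never to be accessed.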

Related

Writing a memory manager and defragmenting memory

The idea is to write a memory manager that allocates a bunch of memory at a time to minimize malloc and free calls. I've tried writing this myself twice, but both times I ran into the problem of defragmenting memory.
You could just check every so often whether a block is empty, and delete it if it is. But say your blocks are 100 bytes each. First you allocate 20 bytes, which creates a new 100-byte block because no blocks exist yet. Then you allocate 80 bytes, which fills the first block. Then you allocate another 20 bytes, which creates a second block because the first is full. Finally you free the 80-byte allocation. That leaves you with two blocks of which only the first 20 bytes each are used, so 100 bytes could be freed by moving the 20 bytes from the second block into the first and deleting the second block.
These are the problems I ran into:
you can't move the memory around because this means all pointers to that memory will have to be updated, and for that to happen you need to know their addresses, which you don't;
100 bytes is a very small block size, what if I want to store a very low-res (64,64) ARGB image in memory? This will use 16KB of memory and moving all of that might be even slower than just not writing a memory manager at all.
Is it even worth it writing a custom memory manager after all that?

Is it even worth it writing a custom memory manager after all that?
This is asking for an opinion, but I'll try to give a factual answer.
The memory allocators that come with most operating systems and language support libraries are generally very high quality and are designed to address the types of problems you encountered (fragmentation and performance) as well as others. They are about as good as general purpose memory allocators can be.
You can do (a little) better than the provided memory allocator if your application has a particular allocation pattern that can be exploited. That's rare, but you can generally take advantage of it by making something substantially simpler than a general purpose memory manager.
you can't move the memory around
True. Most modern systems don't even try to move memory around--they try to avoid fragmentation to begin with (typically by clustering similarly sized allocations).
Old systems (ones without virtual memory managers) sometimes used memory managers that had an extra layer of indirection. Instead of returning a pointer to the allocated memory, the allocator would return a "handle", which could be as simple as an index into a table maintained by the memory manager. When the user wanted to actually access the memory, they would "lock" it. The memory manager was free to move around memory that wasn't locked (e.g., to eliminate fragmentation) because the handles gave an extra level of indirection.
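A toy version of the handle idea might look like this. Everything here (HandleArena, lock, compact) is a hypothetical sketch, not a real API, and unlike the historical systems it does not track lock counts: a pointer obtained from lock() is simply assumed to be dropped before the next compact().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Clients hold handles (table indices) instead of pointers, so the arena
// is free to slide live blocks together and fix up only its own table.
class HandleArena {
public:
    using Handle = std::size_t;

    explicit HandleArena(std::size_t capacity) : storage_(capacity), top_(0) {}

    // Bump-allocate `size` bytes; returns false if the arena is full.
    bool allocate(std::size_t size, Handle& out) {
        if (size > storage_.size() - top_) return false;
        table_.push_back({top_, size, true});
        top_ += size;
        out = table_.size() - 1;
        return true;
    }

    // Mark a block as dead; its space is reclaimed by compact().
    void release(Handle h) { table_[h].live = false; }

    // Resolve a handle to an address, valid only until the next compact().
    unsigned char* lock(Handle h) { return storage_.data() + table_[h].offset; }

    // Move live blocks toward the start of the arena and update the table.
    // Table entries are in address order because allocation only appends.
    void compact() {
        std::size_t dst = 0;
        for (Entry& e : table_) {
            if (!e.live) continue;
            std::memmove(storage_.data() + dst, storage_.data() + e.offset, e.size);
            e.offset = dst;
            dst += e.size;
        }
        top_ = dst;
    }

    std::size_t used() const { return top_; }

private:
    struct Entry { std::size_t offset, size; bool live; };
    std::vector<unsigned char> storage_;
    std::vector<Entry> table_;
    std::size_t top_;
};
```

Replaying the 20/80/20 scenario from the question, releasing the 80-byte block and calling compact() leaves 40 bytes in use, and the second 20-byte block is still reachable through its handle.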
what if i want to store a very low-res (64,64) ARGB image
Most memory managers provide a range of sizes so a large allocation wouldn't be split across n smaller blocks. Most will punt very large allocations to the system allocator, which, on a virtual memory operating system, can generally solve the problem unless the process address space is overly fragmented.

Why the growth direction of heap address become opposite when allocating larger space?

I was doing some experiments on heap address growth, and something interesting happened.
(OS: CentOS)
I don't understand why this happened. Thanks!
This is what I did first:
double *ptr[1000];
for (int i=0;i<1000;i++){
    ptr[i] = new double[10000];
    cout << ptr[i] << endl;
}
The output is incremental (last few lines shown):
....
....
0x2481be0
0x2495470
0x24a8d00
0x24bc590
0x24cfe20
0x24e36b0
0x24f6f40
0x250a7d0
0x251e060
Then I changed 10000 to 20000:
double *ptr[1000];
for (int i=0;i<1000;i++){
    ptr[i] = new double[20000];
    cout << ptr[i] << endl;
}
The addresses became more like stack addresses (and decremental):
....
....
0x7f69c4d8a010
0x7f69c4d62010
0x7f69c4d3a010
0x7f69c4d12010
0x7f69c4cea010
0x7f69c4cc2010
0x7f69c4c9a010
0x7f69c4c72010
0x7f69c4c4a010
0x7f69c4c22010
0x7f69c4bfa010
0x7f69c4bd2010
0x7f69c4baa010
0x7f69c4b82010

Different environments/implementations allocate memory using different strategies, so there is no one correct rule. However, a common pattern is to use different allocation strategies for small objects vs. large objects.
Often, a runtime will have multiple heaps for objects of different sizes, which are optimized for different usage patterns. For example, small objects tend to be allocated often and deleted quickly, while large objects tend to be created rarely and have a long life.
If you use a single heap for everything, then a few small objects will be quickly peppered throughout your memory space, leaving lots of medium sized blocks available but few or no large blocks needed for larger objects. This is referred to as memory fragmentation, and can cause your allocation to fail even if nominally your app has tons of memory available.
Another reason to use different heaps is to use a different usage tracking method for different object sizes. For example, an implementation might request a new memory block from the OS for large objects, and for small objects, use a few smaller OS memory blocks with sub-allocations handled by the C runtime heap manager. Memory usage tracking mechanisms that are very effective for large objects can be very expensive for smaller ones because the memory used for tracking usage becomes a significant fraction of the actual memory used by each object.
In your case, my guess is that the runtime is allocating small objects at the beginning of the memory space, bottom-up, and larger ones near the end, top-down, to avoid fragmentation.

You're not going to get a great answer here, because the new function can choose any method it wants to allocate memory. My guess would be that the algorithm here broke the pool into small and large allocation pools, and the big allocation pool grows downward so they can meet in the middle (so as to not waste any space).

On UNIX, allocators use sbrk(2) and mmap(2) to get memory from the OS. The addresses returned by sbrk are well defined, but the addresses from mmap are "whatever is available". On Windows, allocators use VirtualAlloc() which is kind of like mmap.

Implementations are free to have hybrids of different allocation schemes. In C++, it's normal for there to be thousands, even millions, of relatively small objects, so it can make sense for the library's memory allocation routines to make sure they pack well and are very lightweight. Your allocations of 10000 doubles do that: they're 80016 bytes apart, 80000 for the 10000 8-byte variables and just 16 bytes of padding. Note specifically that this size has no relationship to powers of two, whereas when allocating 20000 doubles the addresses decrement by 163840 bytes each time... weirdly, exactly 10 * 2^14. That suggests to me that the former allocations are being satisfied from one heap designed to support efficient small-object allocation by the C++ new operator, while the latter have crossed a threshold and are probably being sent to malloc for memory coming from a distinct heap, with much more waste.
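The two spacings quoted here can be checked directly from the addresses printed in the question. The helper name stride is made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Distance in bytes between two consecutive addresses from the output,
// regardless of whether the addresses grow upward or downward.
constexpr std::uint64_t stride(std::uint64_t a, std::uint64_t b) {
    return a > b ? a - b : b - a;
}

// First run (10000 doubles): two consecutive lines of the output.
constexpr std::uint64_t small_stride = stride(0x2481be0, 0x2495470);
// Second run (20000 doubles): two consecutive lines of the output.
constexpr std::uint64_t large_stride = stride(0x7f69c4d8a010, 0x7f69c4d62010);
```

Here small_stride is 80016 (80000 bytes of data plus 16) and large_stride is 163840, which is 10 * 16384.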

You were lucky in the sense that the sizes of 10000 doubles and 20000 doubles happen to lie on opposite sides of a critical threshold called MMAP_THRESHOLD.
MMAP_THRESHOLD is 128KB by default. So 80KB requests (i.e., 10000 doubles) are serviced from the heap, whereas 160KB requests (20000 doubles) are serviced by anonymous memory mapping (through the mmap system call). (Note that using memory mapping for large allocations may incur additional penalties because of its different underlying handling mechanism. You may want to tune MMAP_THRESHOLD for optimal performance of your apps.)
From the Linux man page for malloc:
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).
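A rough sketch of how the threshold separates the two experiments, and of how it can be tuned with mallopt(3) on glibc. The predicate is only an approximation of glibc's real test, and the 16-byte header and 4096-byte page size used for the mapping size are assumptions chosen because they reproduce the 163840-byte spacing seen in the question. All helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#if defined(__GLIBC__)
#include <malloc.h>  // mallopt, M_MMAP_THRESHOLD (glibc-specific)
#endif

constexpr std::size_t kDefaultMmapThreshold = 128 * 1024;  // default per the man page
constexpr std::size_t kAssumedPageSize = 4096;             // assumption
constexpr std::size_t kAssumedHeader = 16;                 // assumption

// Approximate prediction: is a request of this size served by mmap?
constexpr bool likely_mmapped(std::size_t request,
                              std::size_t threshold = kDefaultMmapThreshold) {
    return request > threshold;
}

// Size of the mapping if request plus header is rounded up to whole pages.
constexpr std::size_t mapped_size(std::size_t request) {
    return (request + kAssumedHeader + kAssumedPageSize - 1)
           / kAssumedPageSize * kAssumedPageSize;
}

// Ask glibc to use a different threshold; returns false where unsupported.
inline bool set_mmap_threshold(int bytes) {
#if defined(__GLIBC__)
    return mallopt(M_MMAP_THRESHOLD, bytes) == 1;
#else
    (void)bytes;
    return false;
#endif
}
```

Under these assumptions, 10000 doubles (80000 bytes) stay below the threshold, 20000 doubles (160000 bytes) exceed it, and mapped_size(160000) comes out as 163840.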

why does dynamic memory allocation fail after 600MB?

I implemented a Bloom filter (bit table) using a three-dimensional char array. It works well until it reaches a point where it can no longer allocate memory and fails with a bad_alloc message. I get this error on the next expand request after allocating 600MB.
The Bloom filter (the array) is expected to grow as big as 8 to 10GB.
Here is the code I used to allocate (expand) the bit table.
unsigned char ***bit_table_=0;
unsigned int ROWS_old=5;
unsigned int EXPND_SIZE=5;
void expand_bit_table()
{
    FILE *temp;
    temp=fopen("chunk_temp","w+b");
    //copy old content
    for(int i=0;i<ROWS_old;++i)
        for(int j=0;j<ROWS;++j)
            fwrite(bit_table_[i][j],COLUMNS,1,temp);
    fclose(temp);
    //delete old table
    chunk_delete_bit_table();
    //create expanded bit table ==> add EXPND_SIZE more rows
    bit_table_=new unsigned char**[ROWS_old+EXPND_SIZE];
    for(int i=0;i<ROWS_old+EXPND_SIZE;++i)
    {
        bit_table_[i]=new unsigned char*[ROWS];
        for(int k=0;k<ROWS;++k)
            bit_table_[i][k]=new unsigned char[COLUMNS];
    }
    //copy back old content, one COLUMNS-sized buffer at a time to mirror the
    //fwrite loop (reading into bit_table_[i] would overwrite the row pointers)
    temp=fopen("chunk_temp","r+b");
    for(int i=0;i<ROWS_old;++i)
        for(int j=0;j<ROWS;++j)
            fread(bit_table_[i][j],COLUMNS,1,temp);
    fclose(temp);
    //set remaining content of bit_table_ to 0
    for(int i=ROWS_old;i<ROWS_old+EXPND_SIZE;++i)
        for(int j=0;j<ROWS;++j)
            for(int k=0;k<COLUMNS;++k)
                bit_table_[i][j][k]=0;
    ROWS_old+=EXPND_SIZE;
}
What is the maximum allowable size for an array, and if that is not the issue, what can I do about it?
EDIT:
It is developed on a 32-bit platform.
It is run on a 64-bit platform (server) with 8GB RAM.

A 32-bit program must allocate memory from its virtual address space, which also holds chunks of code and data; memory is allocated from the holes between them. The most you can hope for is around 650 megabytes, the largest available hole, and it goes down rapidly from there. You can solve it by making your data structure smarter, like a tree or list instead of one giant array.
You can get more insight in the virtual memory map of your process with the SysInternals' VMMap utility. You might be able to change the base address of a DLL so it doesn't sit plumb in the middle of an otherwise empty region of the address space. Odds that you'll get much beyond 650 MB are however poor.
There's a lot more breathing room on a 64-bit operating system: a 32-bit process gets a 4 gigabyte address space there, since the operating system components run in 64-bit mode. You have to use the /LARGEADDRESSAWARE linker option to allow the process to use it all. Still, that only works on a 64-bit OS; your program is still likely to bomb on a 32-bit OS. When you really need that much VM, the simplest approach is to make a 64-bit OS a prerequisite and build your program targeting x64.

A 32-bit machine gives you a 4GB address space.
The OS reserves some of this (half of it by default on Windows, giving you 2GB to yourself; I'm not sure about Linux, but I believe it reserves 1GB).
This means you have 2-3 GB for your own process.
Into this space, several things need to fit:
your executable (as well as all dynamically linked libraries) is memory-mapped into it
each thread needs a stack
the heap
and quite a few other nitty gritty bits.
The point is that it doesn't really matter how much memory you end up actually using. But a lot of different pieces have to fit into this memory space. And since they're not packed tightly into one end of it, they fragment the memory space. Imagine, for simplicity, that your executable is mapped into the middle of this memory space. That splits your 3GB into two 1.5GB chunks. Now say you load two dynamic libraries, and they subdivide those two chunks into four 750MB ones. Then you have a couple of threads, each needing further chunks of memory, splitting up the remaining areas further. Of course, in reality each of these won't be placed at the exact center of each contiguous block (that'd be a pretty stupid allocation strategy), but nevertheless, all these chunks of memory subdivide the available memory space, cutting it up into many smaller pieces.
You might have 600MB memory free, but you very likely won't have 600MB of contiguous memory available. So where a single 600MB allocation would almost certainly fail, six 100MB allocations may succeed.
There's no fixed limit on how big a chunk of memory you can allocate. The answer is "it depends". It depends on the precise layout of your process' memory space. But on a 32-bit machine, you're unlikely to be able to allocate 500MB or more in a single allocation.
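The point about several smaller allocations succeeding where one large one fails can be applied to a bit table by storing it as independent fixed-size chunks. This is a minimal sketch with hypothetical names (ChunkedBitTable, grow); it removes the need for one contiguous region and for copying on growth, but it does not raise the total address space available to a 32-bit process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Bit table made of separately allocated chunks of chunk_bytes each.
class ChunkedBitTable {
public:
    explicit ChunkedBitTable(std::size_t chunk_bytes) : chunk_bytes_(chunk_bytes) {}

    // Add one zero-filled chunk; existing chunks are neither copied nor moved.
    // Throws std::bad_alloc if even a single chunk cannot be allocated.
    void grow() {
        chunks_.push_back(std::make_unique<std::uint8_t[]>(chunk_bytes_));
    }

    std::uint64_t bit_capacity() const {
        return static_cast<std::uint64_t>(chunks_.size()) * chunk_bytes_ * 8;
    }

    void set(std::uint64_t bit) { byte_for(bit) |= mask(bit); }
    bool test(std::uint64_t bit) { return (byte_for(bit) & mask(bit)) != 0; }

private:
    static std::uint8_t mask(std::uint64_t bit) {
        return static_cast<std::uint8_t>(1u << (bit % 8));
    }

    // Locate the byte holding `bit`: first pick the chunk, then the offset.
    std::uint8_t& byte_for(std::uint64_t bit) {
        const std::uint64_t byte = bit / 8;
        return chunks_[byte / chunk_bytes_][byte % chunk_bytes_];
    }

    std::size_t chunk_bytes_;
    std::vector<std::unique_ptr<std::uint8_t[]>> chunks_;
};
```

Growing is then a single grow() call, with no temporary file and no reallocation of the data already stored.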

The maximum in-memory data a 32-bit process can access is 4GB in theory (in practice it will be somewhat smaller). So you cannot have 10GB data in memory at once (even with the OS supporting more). Also, even though you are allocating the memory dynamically, the free store available is further limited by the stack size.
The actual memory available to the process depends on the compiler settings used to generate the executable.
If you really do need that much, consider persisting (parts of) the data in the file system.

Memory calculation of objects inaccurate?

I'm creating a small cache daemon, and I want to limit its memory usage to approximately a specified amount. However, there seems to be an issue just trying to calculate how much memory is in use.
Every time a CacheEntry object is created, it adds the size of a CacheEntry object (apparently 64 bytes) plus the number of bytes used in internal arrays to the counter for how many bytes are in use. When the CacheEntry object is deleted, it subtracts that amount. I can confirm that the math, at least, is correct.
However, when run inside NetBeans, the memory profiler reports vastly different numbers. Roughly twice as high, to be specific. It is not a memory leak, and it is specifically related to the amount of CacheEntry objects currently in existence. Increasing the amount of data stored in the internal arrays actually brings the numbers closer together (as opposed to further apart, if that were being improperly calculated); from this, I have concluded that the overhead of having a CacheEntry object in memory is almost twice what sizeof() is reporting. It does not rise in steps or "chunks".
Is there some common reason why this might happen?
UPDATE: Just to check, I ran my tests without a profiler in place. Linux reports the same VmHWM/VmRSS either way, so the memory profiler is definitely not affecting the calculations.

Perhaps the profiler is adding reference objects to track the objects? Do you see the same results when you run the application in Release vs. Debug?

Is there some common reason why this might happen?
Yeah, that could be internal fragmentation and overhead of the memory manager. If your data type is small (e.g. sizeof(CacheEntry) is 8 bytes), allocating it with new might produce a bigger chunk of memory. The extra space is used partly for malloc's internal bookkeeping (it usually stores the size of the block somewhere) and partly for padding needed to align your data type on its natural boundary (e.g. 8 bytes data + 4 bytes bookkeeping + 4 bytes padding needed to align the whole thing on an 8-byte boundary).
You can solve it by allocating from a single contiguous array of CacheEntry (e.g. CacheEntry array[1000] takes exactly 1000*sizeof(CacheEntry) bytes). You'd have to track the usage of the individual elements in the array, but that should be doable without additional memory (e.g. by keeping a free list inside the unused entries).
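A minimal sketch of that idea: a fixed-capacity pool whose unused slots store the link to the next free slot inside their own storage, so the free list costs no extra memory per entry. The CacheEntry here is only a 64-byte stand-in (the real class is not shown in the question), and FixedPool is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Stand-in for the question's CacheEntry; the real layout is unknown.
struct CacheEntry {
    char payload[64];
};

template <typename T, std::size_t N>
class FixedPool {
    static_assert(N > 0, "pool needs at least one slot");
public:
    FixedPool() {
        // Chain every slot into the free list.
        for (std::size_t i = 0; i < N; ++i)
            slots_[i].next = (i + 1 < N) ? &slots_[i + 1] : nullptr;
        free_ = &slots_[0];
    }

    // Construct a T in the next free slot, or return nullptr if exhausted.
    T* create() {
        if (!free_) return nullptr;
        Slot* s = free_;
        free_ = s->next;
        return new (s->storage) T();
    }

    // Destroy the object and push its slot back onto the free list.
    void destroy(T* p) {
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }

private:
    // A slot is either a live T or a link in the free list, never both.
    union Slot {
        Slot* next;
        alignas(T) unsigned char storage[sizeof(T)];
    };
    Slot slots_[N];
    Slot* free_;
};
```

With this layout the cache's own byte counter can be made to match actual usage much more closely, since there is no per-object allocator header.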

This memory bloat is caused by use of new, specifically on relatively small objects. On Windows, dynamically allocated memory incurs a 16- or 24-byte overhead each time; I haven't found the exact numbers for Linux, but it's roughly the same. This is because each allocated chunk needs to record its location and size (possibly more than once) so that it can be accurately freed later.
As far as I'm aware, the running program also does not know exactly how much overhead is involved in this, at least in any way accessible to the programmer.
Generally speaking, large quantities of small objects should use a memory pool, both for speed and memory conservation.

Extreme memory usage for individual dynamic allocation

Here's a simple test I did with MSVC++ 2010 under Windows 7:
// A struct with sizeof(s) == 4, i.e. 4 bytes
struct s
{
    int x;
};
// Allocate 1 million structs
s* test1 = new s[1000000];
// Memory usage shows an increase of roughly 4 bytes * 1000000, as expected
// NOW! If I run this:
for (int i = 0; i < 1000000; i++)
    new s();
// The memory usage is disproportionately large. Divided by 1000000, it comes to 64 bytes per s!
Is this common knowledge, or am I missing something? Until now I always created objects on the fly when needed, for example new Triangle() for every triangle in a mesh.
Is there really an order-of-magnitude overhead for dynamically allocating individual instances?
Cheers
EDIT:
Just compiled and ran same program at work on Windows XP using g++:
Now the overhead is 16 bytes, not 64 as observed before! Very interesting.

Not necessarily, but the operating system will usually reserve memory on your behalf in whatever sized chunks it finds convenient; on your system, I'd guess it gives you multiples of 64 bytes per request.
There is an overhead associated with keeping track of the memory allocations, after all, and reserving very small amounts isn't worthwhile.

Is that for a debug build? In a debug build, MSVC allocates "guards" around objects to detect writes past the object boundary.

There is usually overhead with any single memory allocation. Now this is from my knowledge of malloc rather than new but I suspect it's the same.
A section of the memory arena, when carved out for an allocation of (say) 30 bytes, will typically have a header (e.g., 16 bytes; all figures here are examples only and may differ on your system) and may be padded to a multiple of 16 bytes for easier arena management.
The header is usually important to allow the section to be re-integrated into the free memory pool when you're finished with it.
It contains information about the size of the block at a bare minimum and may have memory guards as well (to detect corruption of the arena).
So, when you allocate your one million structure array, you'll find that it uses an extra 16 bytes for the header (four million and sixteen bytes). When you try to allocate one million individual structures, each and every one of them will have that overhead.
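The arithmetic can be sketched like this, using the example figures from above (a 16-byte header and padding to a multiple of 16 bytes). These are illustrative numbers only, the names are made up, and whether padding applies to the payload or to the whole block varies between allocators.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kHeader = 16;   // example header size, not a measured value
constexpr std::size_t kGranule = 16;  // example padding granularity

// Estimated footprint of one allocation: payload rounded up to the
// granule, plus one header.
constexpr std::size_t block_footprint(std::size_t payload) {
    return kHeader + (payload + kGranule - 1) / kGranule * kGranule;
}

// One array of a million 4-byte structs: a single header.
constexpr std::size_t one_array = block_footprint(4 * 1000000);
// A million separate 4-byte allocations: header and padding every time.
constexpr std::size_t a_million_blocks = 1000000 * block_footprint(4);
```

Under these assumptions the array costs 4000016 bytes while the individual allocations cost 32000000 bytes, eight times as much for the same payload.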
I answered a related question here with more details. I suspect there will be more required header information for C++ since it will probably have to store the number of items over and above the section size (for proper destructor calls) but that's just supposition on my part. It doesn't affect the fact that accounting information of some sort is needed per allocated item.
If you really want to see what the space is being used for, you'll need to dig through the MSVC runtime source code.

You should check the malloc implementation. Probably this will clear things up.
Not sure though if MSVC++'s malloc can be viewed somewhere. If not, look at some other implementation, they are probably similar to some degree.
Don't expect the malloc implementation to be simple. It needs to find free space in the already allocated virtual pages or allocate a new virtual page, and it must do this as fast as possible. It must also be thread-safe. Maybe your malloc implementation has some sort of bit vector in which it records which 64 bit chunks of a page are free, and it just takes the next free chunk.
