How can I create a very large array in C++?

Recently I have been working in C++ and I need to create an array[60000][60000]. However, I cannot create this array because it is too large. I tried float **array and even a static float array, but nothing works. Does anyone have any ideas?
Thanks for your help!

A matrix of size 60,000 x 60,000 has 3,600,000,000 elements.
You're using type float so it becomes:
60,000 * 60,000 * 4 bytes = 14,400,000,000 bytes ~= 13.4 GiB
Do you even have that much memory in your machine?
Note that the issue of stack vs heap doesn't even matter unless you have enough memory to begin with.
Here's a list of possible problems (a small test sketch follows the list):
You don't have enough memory.
If the matrix is declared globally, you'll exceed the maximum size of the binary.
If the matrix is declared as a local array, then you will blow your stack.
If you're compiling for 32-bit, you have far exceeded the 2GB/4GB addressing limit.
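A quick way to separate "not enough memory" from the stack/global problems listed above is to request the storage from the heap and catch std::bad_alloc. A minimal sketch (assuming a 64-bit build; the dimension is the one from the question):

#include <cstddef>
#include <iostream>
#include <new>
#include <vector>

int main() {
    const std::size_t n = 60000;
    try {
        // One flat heap allocation instead of a global or stack array.
        // Note: the vector zero-fills its elements, so this really does
        // need ~13.4 GiB of RAM + swap to succeed.
        std::vector<float> matrix(n * n);
        std::cout << "allocated " << n * n * sizeof(float) << " bytes\n";
        // element (i, j) is matrix[i * n + j]
    } catch (const std::bad_alloc&) {
        std::cout << "not enough memory for " << n * n * sizeof(float) << " bytes\n";
    }
    return 0;
}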

Does "60.000" actually mean "60000"? If so, the size of the required memory is 60000 * 60000 * sizeof(float), which is roughly 13.4 GB. A typical 32-bit process is limited to only 2 GB, so it is clear why it doesn't fit.
On the other hand, I don't see why you shouldn't be able to fit that into a 64-bit process, assuming your machine has enough RAM.
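If you are not sure what you are building for, a small compile-time guard (my addition, not from the answer) makes the 32-bit case fail loudly at build time instead of at runtime:

#include <cstddef>

// A 60000 x 60000 float matrix (~13.4 GiB) cannot fit in a 32-bit
// address space, so refuse to build a 32-bit binary at all.
static_assert(sizeof(void*) == 8 && sizeof(std::size_t) >= 8,
              "this program needs a 64-bit build");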

Allocate the memory at runtime -- consider using a memory mapped file as the backing. Like everyone says, 14 gigs is a lot of memory. But it's not unreasonable to find a computer with 14GB of memory, nor is it unreasonable to page the memory as necessary.
With a matrix of this size, you will likely become very curious about memory access performance. Remember to consider the cache granularity of your target architecture, and if your target has a TLB, you may be able to use larger pages to relieve some TLB pressure. Then again, if you don't have enough memory, you'll likely care only about how fast your storage I/O is.
If it's not already obvious, you'll need an architecture that supports a 64-bit address space in order to access this memory directly/conveniently.
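For the memory-mapped-file route, here is a minimal Linux-flavoured sketch (the file name is a placeholder, error handling is reduced to early returns, and the file itself will of course need ~13.4 GB of disk space):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const std::size_t n = 60000;
    const std::size_t bytes = n * n * sizeof(float);

    // The file is the backing store: the OS pages the matrix in and out
    // on demand instead of keeping ~13.4 GiB resident in RAM.
    int fd = open("matrix.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) return 1;

    void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    float *matrix = static_cast<float*>(p);

    matrix[0] = 1.0f;                      // element (0, 0)
    matrix[(n - 1) * n + (n - 1)] = 2.0f;  // element (59999, 59999)

    munmap(p, bytes);
    close(fd);
    return 0;
}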

To initialise the 2D array of floats that you want, you will need:
60000 * 60000 * 4 bytes = 14400000000 bytes
Which is approximately 14GB of memory. That's a LOT of memory. To even hold that theoretically, you will need to be running a 64-bit machine, not to mention one with quite a bit of RAM installed.
Furthermore, allocating this much memory is almost never necessary in most situations; are you sure no optimisations could be made here?
EDIT:
In light of new information from your comments on other answers: you only have 4GB of memory (RAM). Your operating system is therefore going to have to page at least 9GB out to the hard drive, in reality probably more. But you also only have 20GB of hard drive space. That is barely enough to page all that data, especially if the disk is fragmented. Finally (I could be wrong, because you haven't stated it explicitly), it is quite possible that you're running a 32-bit machine, which isn't really capable of handling more than 4GB of memory at a time.

I had this problem too. As a workaround, I chopped the array into sections (my biggest allowed array was float A_sub_matrix_20[62944560]). When I declared just one of these in main(), it seemed to be put in RAM, as I got a runtime exception as soon as main() started. I was able to declare 20 buffers of that size as global variables, which works (it looks like, in global form, they are backed by storage on the HDD - when I added A_sub_matrix_20[n] to the watch list in Visual Studio, it gave the message "reading from file").

Related

Can I create an array exceeding RAM, if I have enough swap memory?

Let's say I have 8 Gigabytes of RAM and 16 Gigabytes of swap memory. Can I allocate a 20 Gigabyte array there in C? If yes, how is it possible? What would that memory layout look like?
Yes, you can. Note that accessing swap is veerry slooww.
how is it possible
Allocate dynamic memory. The operating system handles the rest.
What would that memory layout look like?
On an amd64 system, you can have 256 TiB of address space. You can easily fit a contiguous block of 20 GiB in that space. The operating system divides the virtual memory into pages and copies the pages between physical memory and swap space as needed.
Modern operating systems use virtual memory. In Linux and most other OSes, each process has its own address space according to the abilities of the architecture. You can check the size of the virtual address space in /proc/cpuinfo. For example, you may see:
address sizes : 43 bits physical, 48 bits virtual
This means that virtual addresses use 48 bits. Half of that is reserved for the kernel, so you can only use 47 bits, or 128 TiB. Any memory you allocate will be placed somewhere in those 128 TiB of address space, as if you actually had that much memory.
Linux uses demand paging and by default overcommits memory. When you say
char *mem = (char*)malloc(1'000'000'000'000);
what happens is that Linux picks a suitable address and just records that you have allocated 1'000'000'000'000 bytes (rounded up to the nearest page) of memory starting at that point. (It does some sanity checks that the amount isn't totally bonkers, depending on the amount of physical memory that is free, the amount of swap that is free, and the overcommit setting. By default you can allocate a lot more than you have in memory and swap.)
Note that at this point no physical memory and no swap space is connected to your allocated block at all. This changes when you first write to the memory:
mem[4096] = 0;
At this point the program will page fault. Linux checks that the address is actually something your program is allowed to write to, finds a physical page, and maps it at &mem[4096]. Then it lets the program retry the write, and everything continues.
If Linux can't find a physical page, it will try to swap something out to make a physical page available for your program. If that also fails, your program will receive a SIGSEGV and likely die.
As a result, you can allocate basically unlimited amounts of memory, as long as you never write to more than the physical memory and swap can support. On the other hand, if you initialize the memory (explicitly, or implicitly using calloc()), the system will quickly notice if you try to use more than is available.
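A small sketch of that behaviour (whether the huge malloc succeeds depends on your overcommit settings; the 1 TB figure is just the one used above):

#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    // Reserve 1 TB of address space. With default overcommit, Linux will
    // usually grant this even on a machine without 1 TB of RAM + swap.
    std::size_t size = 1'000'000'000'000;
    char *mem = static_cast<char*>(std::malloc(size));
    if (!mem) {
        std::puts("malloc failed");
        return 1;
    }

    // Only the pages we actually write to get backed by physical memory.
    for (std::size_t i = 0; i < 100; ++i)
        mem[i * 4096] = 1;   // touches ~100 pages, i.e. a few hundred KiB

    std::puts("touched 100 pages of a 1 TB reservation");
    std::free(mem);
    return 0;
}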
You can, but not with a simple malloc. It's platform-dependent.
It requires an OS call to allocate swappable memory (it's VirtualAlloc on Windows, for example; on Linux it would be mmap and related functions).
Once that's done, the allocated memory is divided into pages, contiguous blocks of fixed size. You can lock a page, so that it is loaded into RAM and you can read and modify it freely. For old dinosaurs like me, it's exactly how EMS memory worked under DOS... You address your swappable memory with a kind of segment:offset method: first, you divide your linear address by the page size to find which page is needed, then you use the remainder to get the offset within that page.
Once unlocked, the page remains in memory until the OS needs memory: then an unlocked page will be flushed to disk, into swap, and discarded from RAM... until you lock (and load...) it again. But this operation may require freeing RAM, so another process may have its unlocked pages swapped out BEFORE your own page is loaded again. And this is damnably SLOW... even on an SSD!
So it's not always a good thing to use swap. A better way is to use memory-mapped files - perfect for reading very big files mostly sequentially, with few random accesses - if that suits your needs.
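A Linux sketch of that last suggestion, with mlock() standing in for the page locking described above (the file name is a placeholder, and mlock can fail if the locked size exceeds RLIMIT_MEMLOCK):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("huge_data.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;
    const std::size_t bytes = static_cast<std::size_t>(st.st_size);

    // Map the whole file read-only; pages are loaded lazily as they are touched.
    void *p = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char *data = static_cast<const char*>(p);

    // Optionally pin a "hot" 16 MiB window so it cannot be paged out.
    const std::size_t window = 16 * 1024 * 1024;
    const bool pinned = (bytes >= window) && (mlock(data, window) == 0);

    // ... read data[0 .. bytes-1] here ...

    if (pinned) munlock(data, window);
    munmap(p, bytes);
    close(fd);
    return 0;
}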

Allocating aligned memory for larger arrays

In my program I want to allocate 32-byte aligned memory to use SSE/AVX. The amount I want to allocate is somewhere around 2000*1300*17*17*4 (large data set). I tried using the functions _aligned_malloc() and _mm_malloc, but for larger sizes it doesn't allocate memory and results in an access violation exception. If the amount allocated is small, like around 512*320*4*17*17 (small data set), then the code works fine.
Here these functions return a null pointer when allocation is done for the large data set, but work fine when the input data size is small. Also, if I just use unaligned memory allocation with new, then the code works fine for the large data set too.
Finally, can someone tell me: is there any significant performance gain in using aligned memory for AVX?
Edit: After some research, according to this post, new allocates memory from the free store and malloc() allocates memory from the heap. Here I am exceeding the maximum heap size, as _aligned_malloc() returns errno 12, which means ENOMEM. In that case, can someone tell me a workaround for this?
On memory allocation:
It seems you are actually trying to allocate 2000*1300*17*17*4 32-byte elements. This means you are trying to allocate 96 GB while your system has only 12 GB of memory.
Since new is working but malloc is not, it seems your local implementation of new is able to allocate huge amounts of virtual memory. malloc allocates from the heap, which means it is usually limited to the physical amount of memory you've got. That's the reason it fails.
As the dataset is bigger than your main memory, you might want to allocate the memory using mmap, which maps a file into virtual memory, making it accessible as if it were in physical memory (but it will only partially be cached in RAM). I'm not sure if it's guaranteed, but mmap usually aligns on the optimal page-size boundary (almost always 4096 bytes).
Anyway, you will have a huge performance loss due to the fact that your disk is way slower than your RAM. This is so serious that using AVX will probably not speed up anything at all.
On the performance loss of using unaligned memory:
On modern hardware (say Intel's Haswell onwards, I think) this depends on your access patterns. Unaligned access should have almost no performance overhead when iterating over the array in memory order (each cache line will still be loaded only once). If you access it in random order, then you will often cross the 64-byte cache-line boundary. This means your processor will have to load 2 lines into the cache and evict 2 lines from the cache instead of only one. While this might be a serious problem in some situations, in your case the disk slows things down so much that you will barely notice it.
Additional tips (or a shot in the dark):
The way you gave the size of the array (2000*1300*17*17*4) suggests that you are using a multidimensional array (e.g. auto x = new __m256[2000][1300][17][17][4]). So some tips on that:
Iterate through it mostly sequentially.
Check if it is sparse (meaning some of the memory will never be accessed) and shrink it if possible.
You could try to flatten the array and do the more complex index calculation yourself in order to reduce the amount of memory needed; a sketch of this follows below. If you get it to fit completely into your RAM, you can start to optimise your code (using AVX and/or aligned memory).
"Total paging file size for all drives is 15247MB" suggests that you are actually using only part of those 96 GB, so there might be a way to further reduce your usage.
In that case you might also want to ask another question about how to reduce the memory usage, with more info on what you are doing.
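As a sketch of the flattening tip (the dimensions are the ones from the question; I am assuming float elements here, whereas the 96 GB figure above corresponds to 32-byte AVX vectors, and std::aligned_alloc requires C++17):

#include <cstddef>
#include <cstdlib>

// Dimensions from the question: [2000][1300][17][17][4].
constexpr std::size_t D0 = 2000, D1 = 1300, D2 = 17, D3 = 17, D4 = 4;

// Map a 5-D index onto one flat block (row-major order).
inline std::size_t idx(std::size_t a, std::size_t b, std::size_t c,
                       std::size_t d, std::size_t e) {
    return (((a * D1 + b) * D2 + c) * D3 + d) * D4 + e;
}

int main() {
    const std::size_t count = D0 * D1 * D2 * D3 * D4;   // ~3.0e9 elements
    std::size_t bytes = count * sizeof(float);           // ~11.2 GiB as floats
    bytes = (bytes + 31) / 32 * 32;                       // aligned_alloc wants a multiple of 32

    float *x = static_cast<float*>(std::aligned_alloc(32, bytes));
    if (!x) return 1;                                     // still huge: needs the RAM/swap for it

    x[idx(0, 0, 0, 0, 0)] = 1.0f;
    x[idx(D0 - 1, D1 - 1, D2 - 1, D3 - 1, D4 - 1)] = 2.0f;

    std::free(x);
    return 0;
}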

Allocate more than 2GB on the heap using c++ on a 32bit linux kernel

This seems to be a very common problem, but still I haven't found a definite answer.
I have access to a server which runs linux, has 16 GB of RAM and a 16-core (64bit) CPU
(/proc/cpuinfo gives "Intel(R) Xeon(R) CPU E5520 @ 2.27GHz"). However, the kernel
is 32bit (uname -m gives i686). Of course, I have no root access, so I cannot change
that.
I am running a C++ program I wrote, which does some memory-hungry calculations, so I
need a big heap - but whenever I try to allocate more than 2GB, I get a bad_alloc,
although ulimit returns "unlimited".
For simplicity, let's just say my program is this:
#include <iostream>
#include <vector>
int main() {
    int i = 0;
    std::vector<std::vector<int> > vv;
    for (;;) {
        ++i;
        vv.resize(vv.size() + 1);
        std::vector<int>* v = &(vv.at(vv.size() - 1));
        v->resize(1024 * 1024 * 128);
        std::cout << i * 512 << " MB.\n";
    }
    return 0;
}
After compiling with g++ (no flags), the output is:
512 MB.
1024 MB.
1536 MB.
2048 MB.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
As far as I understand, this is a limitation of 32-bit systems, apparently because a 32-bit pointer
can only hold 2^32 different addresses. (Am I right to suppose that if I compiled the same program
on the same server, but that server ran a 64-bit kernel, then the program could allocate more than
2GB?)
This is not a question of whether the allocated memory is contiguous; if I slice thinner in the
example program, the same problem occurs, and in my actual program, there are no big blocks of
memory, just many small ones.
So my question is:
Is there anything I can do? I cannot change the OS, but of course I could modify my source code,
use different compiler options etc. ...
but I do need more than 2GB of memory, all at once, and there is no obvious way to further optimise
the algorithm the program uses.
If the answer is a clear "no", it would still be good to know that.
Thanks,
Lukas
If you need the memory all at once, no, there is really no way to do it without changing to a 64-bit kernel (which, yes, would allow you to allocate more memory in a single process).
That said, if you don't need the memory all at once, but instead just with fast access, you can always offload parts of the memory storage to another process.
That could for example work by storing the data in the other process, and have your process temporarily map shared memory from that process into its own memory space when required. It will still be stored in-memory, but you'll have some overhead when switching memory range. If the overhead is acceptable or not would depend on your memory access patterns.
It's not a very straightforward approach, but without changing the kernel to give you a 64-bit address space, it sounds like you're in a bit of a bind.
EDIT: You may be able to raise the limit a little above 2GB by reconfiguring the kernel, but that only means you'll hit the hard limit there instead. Also, that would require root access.
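A rough POSIX sketch of that "keep the data elsewhere and map a window of it" idea, shown single-process for brevity (a second process would shm_open the same name; _FILE_OFFSET_BITS=64 is assumed so offsets beyond 4 GB work in a 32-bit build, and older glibc needs -lrt for shm_open):

#define _FILE_OFFSET_BITS 64      // 64-bit file offsets on a 32-bit build
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const unsigned long long total = 6ULL * 1024 * 1024 * 1024;  // 6 GB of data, > the 2 GB limit
    const std::size_t window = 256 * 1024 * 1024;                // map 256 MB at a time

    // Shared memory object (lives in RAM/swap); another process could open it too.
    int fd = shm_open("/big_dataset", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    if (ftruncate(fd, static_cast<off_t>(total)) != 0) return 1;

    // Map only one 256 MB slice into our small 32-bit address space.
    const off_t offset = static_cast<off_t>(3ULL * 1024 * 1024 * 1024);  // slice at the 3 GB mark
    void *p = mmap(nullptr, window, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (p == MAP_FAILED) return 1;

    static_cast<char*>(p)[0] = 42;   // work on this slice ...
    munmap(p, window);               // ... then unmap it and map a different slice

    close(fd);
    shm_unlink("/big_dataset");
    return 0;
}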
The answer is a clear "No". If you can't change the OS, you can't raise the limit. The limit applies for a whole process and obviously no single allocation can exceed the per-process limit.
Depending on options used to compile the kernel, the limit could be higher than 2GB, but will certainly be less than 4GB. But I guess recompiling the kernel also counts as "changing the OS".
See this discussion
And yes, on a 64-bit OS that limit would go away.

Why does dynamic memory allocation fail after 600MB?

I implemented a Bloom filter (bit table) using a three-dimensional char array. It works well until it reaches a point where it can no longer allocate memory and gives a bad_alloc message. It gives me this error on the next expand request after allocating 600MB.
The Bloom filter (the array) is expected to grow as big as 8 to 10GB.
Here is the code I used to allocate (expand) the bit table.
unsigned char ***bit_table_=0;
unsigned int ROWS_old=5;
unsigned int EXPND_SIZE=5;

void expand_bit_table()
{
    FILE *temp;
    temp=fopen("chunk_temp","w+b");
    //copy old content
    for(int i=0;i<ROWS_old;++i)
        for(int j=0;j<ROWS;++j)
            fwrite(bit_table_[i][j],COLUMNS,1,temp);
    fclose(temp);
    //delete old table
    chunk_delete_bit_table();
    //create expanded bit table ==> add EXPND_SIZE more rows
    bit_table_=new unsigned char**[ROWS_old+EXPND_SIZE];
    for(int i=0;i<ROWS_old+EXPND_SIZE;++i)
    {
        bit_table_[i]=new unsigned char*[ROWS];
        for(int k=0;k<ROWS;++k)
            bit_table_[i][k]=new unsigned char[COLUMNS];
    }
    //copy back old content (row by row, into the data buffers,
    //not into the arrays of row pointers)
    temp=fopen("chunk_temp","r+b");
    for(int i=0;i<ROWS_old;++i)
    {
        for(int j=0;j<ROWS;++j)
            fread(bit_table_[i][j],COLUMNS,1,temp);
    }
    fclose(temp);
    //set remaining content of bit_table_ to 0
    for(int i=ROWS_old;i<ROWS_old+EXPND_SIZE;++i)
        for(int j=0;j<ROWS;++j)
            for(int k=0;k<COLUMNS;++k)
                bit_table_[i][j][k]=0;
    ROWS_old+=EXPND_SIZE;
}
What is the maximum allowable size for an array, and if this is not the issue, what can I do about it?
EDIT:
It is developed on a 32-bit platform.
It is run on a 64-bit platform (server) with 8GB of RAM.
A 32-bit program must allocate memory from the virtual memory address space, which also stores chunks of code and data; memory is allocated from the holes between them. Yes, the maximum you can hope for is around 650 megabytes, the size of the largest available hole, and it goes rapidly down from there. You can solve it by making your data structure smarter, like a tree or a list instead of one giant array.
You can get more insight into the virtual memory map of your process with SysInternals' VMMap utility. You might be able to change the base address of a DLL so it doesn't sit plumb in the middle of an otherwise empty region of the address space. The odds that you'll get much beyond 650 MB are, however, poor.
There's a lot more breathing room on a 64-bit operating system; a 32-bit process has a 4-gigabyte address space, since the operating system components run in 64-bit mode. You have to use the /LARGEADDRESSAWARE linker option to allow the process to use it all. Still, that only works on a 64-bit OS, and your program is still likely to bomb on a 32-bit OS. When you really need that much VM, the simplest approach is to just make a 64-bit OS a prerequisite and build your program targeting x64.
A 32-bit machine gives you a 4GB address space.
The OS reserves some of this (half of it by default on Windows, giving you 2GB to yourself; I'm not sure about Linux, but I believe it reserves 1GB).
This means you have 2-3 GB to your own process.
Into this space, several things need to fit:
your executable (as well as all dynamically linked libraries) is memory-mapped into it
each thread needs a stack
the heap
and quite a few other nitty-gritty bits.
The point is that it doesn't really matter how much memory you end up actually using. But a lot of different pieces have to fit into this memory space. And since they're not packed tightly into one end of it, they fragment the memory space. Imagine, for simplicity, that your executable is mapped into the middle of this memory space. That splits your 3GB into two 1.5GB chunks. Now say you load two dynamic libraries, and they subdivide those two chunks into four 750MB ones. Then you have a couple of threads, each needing further chunks of memory, splitting up the remaining areas further. Of course, in reality each of these won't be placed at the exact center of each contiguous block (that'd be a pretty stupid allocation strategy), but nevertheless, all these chunks of memory subdivide the available memory space, cutting it up into many smaller pieces.
You might have 600MB memory free, but you very likely won't have 600MB of contiguous memory available. So where a single 600MB allocation would almost certainly fail, six 100MB allocations may succeed.
There's no fixed limit on how big a chunk of memory you can allocate. The answer is "it depends". It depends on the precise layout of your process' memory space. But on a 32-bit machine, you're unlikely to be able to allocate 500MB or more in a single allocation.
The maximum in-memory data a 32-bit process can access is 4GB in theory (in practice it will be somewhat smaller). So you cannot have 10GB of data in memory at once (even with the OS supporting more). Also, even though you are allocating the memory dynamically, the free store available is further limited by the stack size.
The actual memory available to the process depends on the compiler settings that generates the executable.
If you really do need that much, consider persisting (parts of) the data in the file system.
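Along the lines of that last suggestion, a sketch of pushing the table itself out to a file and touching only one byte at a time (purely illustrative and slow; the layout mirrors bit_table_[i][j][k], and offsets beyond 2 GB need _fseeki64/fseeko instead of plain fseek):

#include <cstdio>

class FileBitTable {
public:
    FileBitTable(const char *path, long long rows, long long cols)
        : rows_(rows), cols_(cols) { f_ = std::fopen(path, "w+b"); }
    ~FileBitTable() { if (f_) std::fclose(f_); }

    void set(long long i, long long j, long long k, unsigned char v) {
        seek(i, j, k);
        std::fwrite(&v, 1, 1, f_);
    }
    unsigned char get(long long i, long long j, long long k) {
        seek(i, j, k);
        unsigned char v = 0;                  // unwritten positions read back as 0
        std::fread(&v, 1, 1, f_);
        return v;
    }

private:
    void seek(long long i, long long j, long long k) {
        long long off = (i * rows_ + j) * cols_ + k;       // same layout as bit_table_[i][j][k]
        std::fseek(f_, static_cast<long>(off), SEEK_SET);  // use _fseeki64/fseeko past 2 GB
    }
    std::FILE *f_;
    long long rows_, cols_;
};

int main() {
    FileBitTable table("bit_table.bin", 5, 1024);   // hypothetical ROWS and COLUMNS
    table.set(2, 3, 100, 1);
    return table.get(2, 3, 100) == 1 ? 0 : 1;
}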

Memory calculation of objects inaccurate?

I'm creating a small cache daemon, and I want to limit its memory usage to approximately a specified amount. However, there seems to be an issue just trying to calculate how much memory is in use.
Every time a CacheEntry object is created, it adds the size of a CacheEntry object (apparently 64 bytes) plus the number of bytes used in internal arrays to the counter for how many bytes are in use. When the CacheEntry object is deleted, it subtracts that amount. I can confirm that the math, at least, is correct.
However, when run inside NetBeans, the memory profiler reports vastly different numbers - roughly twice as high, to be specific. It is not a memory leak, and it is specifically related to the number of CacheEntry objects currently in existence. Increasing the amount of data stored in the internal arrays actually brings the numbers closer together (as opposed to further apart, which is what you would expect if that part were being calculated incorrectly); from this, I have concluded that the overhead of having a CacheEntry object in memory is almost twice what sizeof() is reporting. It does not rise in steps or "chunks".
Is there some common reason why this might happen?
UPDATE: Just to check, I ran my tests without a profiler in place. Linux reports the same VmHWM/VmRSS either way, so the memory profiler is definitely not affecting the calculations.
Perhaps the profiler is adding reference objects to track the objects? Do you see the same results when you run the application in Release vs. Debug?
Is there some common reason why this might happen?
Yeah, that could be internal fragmentation and the overhead of the memory manager. If your data type is small (e.g. sizeof(CacheEntry) is 8 bytes), new-ing such a data type might produce a bigger chunk of memory. It is partly used for malloc's internal bookkeeping (it usually stores the size of the block somewhere), and partly for the padding needed to align your data type on its natural boundary (e.g. 8 bytes of data + 4 bytes of bookkeeping + 4 bytes of padding needed to align the whole thing on an 8-byte boundary).
You can solve it by allocating from a single contiguous array of CacheEntry (e.g. CacheEntry array[1000] takes exactly 1000*sizeof(CacheEntry) bytes). You'd have to track the usage of the individual elements in the array, but that should be doable without additional memory (e.g. by running a free list through the free entries).
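A minimal sketch of that free-list idea (CacheEntry here is just a 64-byte stand-in and the capacity is fixed, so this is illustrative rather than production code): free slots reuse their own storage to hold the index of the next free slot, so the pool needs no bookkeeping beyond one contiguous array.

#include <cstddef>
#include <vector>

struct CacheEntry {                 // stand-in for the real 64-byte type
    char payload[64];
};

class CacheEntryPool {
public:
    explicit CacheEntryPool(std::size_t capacity)
        : storage_(capacity), next_free_(0) {
        // Thread the free list through the (currently all free) slots.
        for (std::size_t i = 0; i < capacity; ++i)
            slot_index(i) = i + 1;
    }

    CacheEntry* allocate() {
        if (next_free_ >= storage_.size()) return nullptr;   // pool exhausted
        std::size_t slot = next_free_;
        next_free_ = slot_index(slot);
        return &storage_[slot];
    }

    void release(CacheEntry* e) {
        std::size_t slot = static_cast<std::size_t>(e - &storage_[0]);
        slot_index(slot) = next_free_;                        // push the slot back on the list
        next_free_ = slot;
    }

private:
    // Reinterpret the first bytes of a free slot as the "next free" index.
    std::size_t& slot_index(std::size_t slot) {
        return *reinterpret_cast<std::size_t*>(&storage_[slot]);
    }

    std::vector<CacheEntry> storage_;   // exactly capacity * sizeof(CacheEntry) bytes of entries
    std::size_t next_free_;
};

int main() {
    CacheEntryPool pool(1000);
    CacheEntry *e = pool.allocate();
    pool.release(e);
    return 0;
}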
This memory bloat is caused by the use of new, specifically on relatively small objects. On Windows, dynamically allocated memory incurs a 16- or 24-byte overhead each time; I haven't found the exact numbers for Linux, but it's roughly the same. This is because each allocated chunk needs to record its location and size (possibly more than once) so that it can be accurately freed later.
As far as I'm aware, the running program also does not know exactly how much overhead is involved in this, at least not in any way accessible to the programmer.
Generally speaking, large quantities of small objects should use a memory pool, both for speed and for memory conservation.