Our application allocates large std::vector<> of geometric coordinates -
it must be a vector (which means contiguous) because it is eventually sent to OpenGL to draw the model.
And OpenGL works with contiguous data.
At some point allocation fails: reserving memory throws a std::bad_alloc exception.
However, there is still a lot of free memory at that moment.
The problem is that a contiguous block of that size cannot be allocated.
So the first two questions are:
Is there any way to control how the CRT allocates memory? Or a way to defragment it (crazy idea)?
Is there a way to check whether the runtime can allocate a block of memory of a given size (without using try/catch)?
The problem above was partially solved by splitting the one large vector into several vectors and calling OpenGL once for each of them.
However, the question remains how to choose the size of each smaller vector: with many fairly small vectors we will almost surely fit in memory, but there will be many calls to OpenGL, which will slow down visualization.
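One possible way to approach the sizing question, sketched under assumptions: pick the largest chunk you expect to be able to allocate reliably (a number you would have to measure on your target machines) and then use as few chunks as possible. The function name plan_chunks and its parameters are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `total` elements into the fewest chunks such that no chunk holds
// more than `max_chunk` elements. Chunk sizes differ by at most one element.
// `max_chunk` is an assumed, measured upper bound, not something the
// runtime can tell you.
std::vector<std::size_t> plan_chunks(std::size_t total, std::size_t max_chunk) {
    assert(max_chunk > 0);
    std::vector<std::size_t> sizes;
    if (total == 0) return sizes;

    const std::size_t count = (total + max_chunk - 1) / max_chunk;  // ceil
    const std::size_t base = total / count;
    const std::size_t extra = total % count;  // first `extra` chunks get +1

    sizes.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        sizes.push_back(base + (i < extra ? 1 : 0));
    return sizes;
}
```

For example, plan_chunks(10, 4) yields three chunks of 4, 3 and 3 elements, so there would be three draw calls instead of one.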
You can't go beyond ~600MiB of contiguous memory in a 32-bit address space. Compile as 64-bit and run it on a 64-bit platform to get around this (hopefully forever).
That said, if you have such demanding memory requirements, you should look into a custom allocator. You can use a disk-backed allocation that will appear to the vector as memory-based storage. You can mmap the file for OpenGL.
If heap fragmentation really is your problem and you're running on Windows, then you might like to investigate the Low Fragmentation Heap options at http://msdn.microsoft.com/en-us/library/windows/desktop/aa366750(v=vs.85).aspx
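On the question of probing without try/catch: standard C++ offers the non-throwing form of operator new, which returns a null pointer instead of throwing. A minimal sketch follows; can_allocate is a hypothetical helper name. Note that the answer is only a snapshot: the block is released again right away, so a later allocation of the same size can still fail, and on systems that overcommit memory a successful probe does not prove the pages can actually be backed.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Returns true if a block of `bytes` could be obtained at this moment.
// The block is freed immediately, so this is a hint, not a guarantee.
bool can_allocate(std::size_t bytes) {
    void* p = ::operator new(bytes, std::nothrow);  // nullptr on failure
    if (p == nullptr) return false;
    ::operator delete(p);
    return true;
}
```

In practice it is usually more robust to keep the pointer from the non-throwing allocation and use it, rather than probing first and allocating again afterwards.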
Related
In an MPI PIC code I am writing, the array size I actually need for storing particles on a processor fluctuates with time, varying within [0.5n : 1.5n], where n is the average size.
Presently, I allocate arrays of the largest size, i.e. 1.5*n in this case, once on each processor and use them without changing their size afterward.
I am considering an alternative: re-allocating all the arrays each time step with their correct sizes, so that I can save memory. But I worry that re-allocating arrays is expensive and that this overhead will slow the code substantially.
Can this issue be verified only by actually profiling the code, or is there a simple principle indicating that the allocation operation is cheap enough that we do not need to worry about its overhead?
Someone said:
"ALLOCATE does not imply physical memory allocation. For example, you can ALLOCATE an array up to the size of your virtual memory limit, then use it as a sparse array, using physical memory pages only as the space is addressed."
Is this true in Fortran?
There is no single correct answer to this question. And a complete answer would need to explain how a typical Fortran memory allocator works, AND how typical virtual memory systems work. (That is too broad for a StackOverflow Q&A.)
But here are a couple of salient points.
When you reallocate an array you have the overhead of copying the data in the old array to the new array.
Reallocating an array doesn't necessarily reduce your process's actual memory usage. Memory is requested from the OS in large regions (memory segments), and the Fortran allocator then manages the memory it has been given and responds to the application's allocate and deallocate requests. When an array is deallocated, the memory often can't be handed back to the OS because there will most likely be other allocated arrays in the same region.
In fact, repeated allocation and deallocation of variable sized arrays can lead to fragmentation ... which further increases memory usage.
What does this mean for you?
That's not clear. It will depend on exactly what your application's memory usage patterns are. And it will depend on how your Fortran runtime's memory allocator works.
But my gut feeling is that you are probably better off NOT trying to dynamically resize arrays to (just) save memory.
Someone said: "ALLOCATE does not imply physical memory allocation. For example, you can ALLOCATE an array up to the size of your virtual memory limit, then use it as a sparse array, using physical memory pages only as the space is addressed."
That is true, but it is not the complete picture.
You also need to consider what happens when an application's virtual memory usage exceeds the physical memory pages available. In that scenario, when the application tries to access a virtual memory page that is not in physical memory the OS virtual memory system needs to "page" another VM page out of physical RAM and "page" in the VM page that the application wants. This will entail writing the existing page (if it is dirty) to the paging device and then reading in the new one. This is going to take a significant length of time, and it will impact on application performance.
If the ratio of available physical RAM to the application's VM working set is too out of balance, the entire system can go into "virtual memory thrashing" ... which can lead to the machine becoming non-responsive and even crashing.
In short if you don't have enough physical RAM, using virtual memory to implement huge sparse arrays can be disaster prone.
It is worth noting that the compute nodes on a large-scale HPC cluster will often be configured with ZERO backing storage for VM swapping. If an application then attempts to use more RAM than is present on the compute node it will error out. Immediately.
Is this true in Fortran?
Yes. Fortran doesn't have any special magic ...
Fortran is no different from, say, C, because Fortran's allocate typically does not call any low-level system functions directly but tends to be implemented using malloc() under the hood.
"Is this true in Fortran?"
The lazy allocation you describe is highly system dependent. It is indeed valid on modern Linux. However, that does not mean it is a good idea to allocate several 1 TB arrays and then use only certain sections of them. Even if it works in practice on one computer, it may very well fail on a different one, or on a different operating system or CPU family.
Re-allocation takes time, but it is the way to go to keep your programs standard conforming and free of undefined behaviour. Reallocating every time step may easily be too slow. But in your previous question we showed you that for continuously growing arrays you typically allocate in a geometric series, e.g. by doubling the size. That means the array will only be re-allocated logarithmically often if it grows linearly.
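To make the "logarithmically often" claim concrete, here is a small sketch that simply counts how many reallocations a doubling strategy needs. It is written in C++ rather than Fortran purely for illustration; the function name count_reallocations is made up, and the doubling factor is the one mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many times a buffer must be reallocated to grow from
// `initial_capacity` to at least `final_size` elements when the capacity
// is doubled on every reallocation.
// Assumes final_size is small enough that doubling does not overflow.
std::size_t count_reallocations(std::size_t initial_capacity,
                                std::size_t final_size) {
    assert(initial_capacity > 0);
    std::size_t capacity = initial_capacity;
    std::size_t reallocations = 0;
    while (capacity < final_size) {
        capacity *= 2;
        ++reallocations;
    }
    return reallocations;
}
```

Growing from 1 element to 1000 takes 10 reallocations, and growing to 1,000,000 takes 20, so a thousandfold increase in size only doubles the number of reallocations.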
There may be a concern about exceeding the system memory when allocating the new size and having two copies at the same time. This is only a concern when your consumption is high anyway. C has realloc() (which may not help anyway) but Fortran has nothing similar.
Regarding the title question, not every malloc takes the same time. There is internal bookkeeping involved and the implementations do differ. Some points are raised at https://softwareengineering.stackexchange.com/questions/319015/how-efficient-is-malloc-and-how-do-implementations-differ and also to some extent at Minimizing the amount of malloc() calls improves performance?
So I am currently trying to allocate dynamically a large array of elements in C++ (using "new"). Obviously, when "large" becomes too large (>4GB), my program crashes with a "bad_alloc" exception because it can't find such a large chunk of memory available.
I could allocate each element of my array separately and then store the pointers to these elements in a separate array. However, time is critical in my application so I would like to avoid as much cache misses as I can. I could also group some of these elements into blocks but what would be the best size for such a block?
My question is then: what is the best way (timewise) to dynamically allocate a large array of elements such that the elements do not have to be stored contiguously but must be accessible by index (using [])? This array is never going to be resized, and no elements are going to be inserted into or deleted from it.
I thought I could use std::deque for this purpose, knowing that the elements of an std::deque might or might not be stored contiguously in memory but I read there are concerns about the extra memory this container takes?
Thank you for your help on this!
If your problem is such that you actually run out of memory, allocating fairly small blocks (as deque does) is not going to help; the overhead of tracking the allocations will only make the situation worse. You need to re-think your implementation so that you can process the data in blocks that still fit in memory. For such problems, on x86 or x64 hardware, I would suggest blocks of at least 2 megabytes (the large page size).
Obviously, when "large" becomes too large (>4GB), my program crashes
with a "bad_alloc" exception because it can't find such a large chunk
of memory available.
You should be using a 64-bit CPU and OS at this point; allocating a huge contiguous chunk of memory should not be a problem unless you are actually running out of memory. It is possible that you are building a 32-bit program. In that case you won't be able to allocate more than 4 GB. You should build a 64-bit application.
If you want something better than plain operator new, then your question is OS-specific. Look at API provided by your OS: on POSIX system you should look for mmap and for VirtualAlloc on Windows.
There are multiple problems with large allocations:
For security reasons OS kernel never gives you pages filled with garbage values, instead all new memory will be zero initialized. This means you don't have to initialize that memory as long as zeroes are exactly what you want.
OS gives you real memory lazily on first access. If you are processing large array, you might waste a lot of time taking page faults. To avoid this you can use MAP_POPULATE on Linux. On Windows you can try PrefetchVirtualMemory (but I am not sure if it can do the job). This should make init allocation slower, but should decrease total time spent in kernel.
Working with large chunks of memory wastes slots in the Translation Lookaside Buffer (TLB). Depending on your memory access pattern, this can cause a noticeable slowdown. To avoid this you can try using large pages (mmap with MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB on Linux; VirtualAlloc with MEM_LARGE_PAGES on Windows). Using large pages is not easy, as they are usually not available by default. They also cannot be swapped out (always "locked in memory"), so using them requires privileges.
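A minimal POSIX-only sketch of the mmap route described in the points above: an anonymous private mapping is handed out zero-filled, and on Linux MAP_POPULATE asks the kernel to fault the pages in up front. The helper names map_zeroed and unmap_region are hypothetical, huge pages are deliberately left out, and this will not compile on Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Maps `bytes` of zero-filled, private, anonymous memory.
// If `prefault` is true and the platform defines MAP_POPULATE (Linux),
// the pages are populated at mapping time instead of on first access.
// Returns nullptr on failure.
void* map_zeroed(std::size_t bytes, bool prefault) {
    assert(bytes > 0);
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    if (prefault) flags |= MAP_POPULATE;
#else
    (void)prefault;  // no equivalent flag on this platform
#endif
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Releases a region obtained from map_zeroed. `bytes` must match.
void unmap_region(void* p, std::size_t bytes) {
    if (p != nullptr) munmap(p, bytes);
}
```

Whether prefaulting pays off depends on how much of the array you actually touch, so it is worth timing both variants.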
If you don't want to use OS-specific functions, the best you can find in C++ is std::calloc. Unlike std::malloc or operator new it returns zero-initialized memory, so you can probably avoid wasting time initializing that memory. Other than that, there is nothing special about that function. But this is the closest you can get while staying within standard C++.
There are no standard containers designed to handle large allocations; moreover, all standard containers are really bad at handling those situations.
Some OSes (like Linux) overcommit memory, others (like Windows) do not. Windows might refuse to give you memory if it knows it won't be able to satisfy your request later. To avoid this you might want to increase your page file. Windows needs to reserve that space on disk beforehand, but that does not mean it will use it (start swapping). As actual memory is given to programs lazily, there might be a lot of memory reserved for applications that will never actually be given to them.
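A small sketch of the std::calloc approach, wrapped in a std::unique_ptr so the block is released with std::free. The names FreeDeleter, ZeroedDoubles and make_zeroed_doubles are made up for this example, and it assumes (as on common IEEE 754 platforms) that all-zero bytes represent 0.0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Deleter that returns calloc'ed memory with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using ZeroedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Allocates `count` doubles of zero-filled memory.
// The result is null if the allocation fails; no exception is thrown.
ZeroedDoubles make_zeroed_doubles(std::size_t count) {
    return ZeroedDoubles(
        static_cast<double*>(std::calloc(count, sizeof(double))));
}
```

The caller checks the returned pointer for null instead of catching std::bad_alloc.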
If increasing page file is too inconvenient, you can try creating large file and map it into memory. That file will serve as a "page file" for your memory. See CreateFileMapping and MapViewOfFile.
The answer to this question is extremely application- and platform-dependent. These days, if you just need a small integer factor more than 4GB, you use a 64-bit machine, if possible. Sometimes reducing the size of the elements in the array is possible as well. (E.g. using 16-bit fixed-point or half-float instead of 32-bit float.)
Beyond this, you are either looking at sparse arrays or out-of-core techniques. Sparse arrays are used when you are not actually storing elements at all locations in the array. There are many possible implementations and which is best depends on both the distribution of the data and the access pattern of the algorithm. See Eigen for example.
Out-of-core involves explicitly reading and writing parts of the array to/from disk. This used to be fairly common, but people work pretty hard to avoid doing this now. Applications that really require such are often built on top of a database or similar to handle the data management. In scientific computing, one ends up needing to distribute the compute as well as the data storage so there's a lot of complexity around that as well. For important problems the entire design may be driven by having good locality of reference.
Any sparse data structure will have overhead in how much space it takes. This can be fairly low, but it means you have to be careful if you actually have a dense array and are simply looking to avoid memory fragmentation.
If your problem can be broken into smaller pieces that only access part of the array at a time, and the main issue is memory fragmentation making it hard to allocate one large block, then breaking the array into pieces, effectively adding an outer vector of pointers, is a good bet. If you have random access to an array larger than 4 gigabytes and no way to localize the accesses, 64-bit is the way to go.
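The "outer vector of pointers" idea could look roughly like the sketch below. The class name ChunkedArray and the choice of a power-of-two block size (so that indexing is a shift and a mask) are assumptions of this example, not something prescribed above. The array is fixed-size, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Fixed-size array stored as separately allocated blocks of
// 2^block_log2 elements each. Elements are value-initialized.
template <typename T>
class ChunkedArray {
public:
    ChunkedArray(std::size_t size, std::size_t block_log2)
        : size_(size),
          shift_(block_log2),
          mask_((std::size_t{1} << block_log2) - 1) {
        assert(block_log2 < sizeof(std::size_t) * 8);
        const std::size_t block = std::size_t{1} << shift_;
        const std::size_t count = (size + block - 1) / block;  // ceil
        blocks_.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            blocks_.push_back(std::make_unique<T[]>(block));
    }

    // High bits select the block, low bits select the slot inside it.
    T& operator[](std::size_t i) {
        assert(i < size_);
        return blocks_[i >> shift_][i & mask_];
    }
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return blocks_[i >> shift_][i & mask_];
    }

    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    std::size_t shift_;
    std::size_t mask_;
    std::vector<std::unique_ptr<T[]>> blocks_;
};
```

Each access costs one extra pointer load compared with a flat array. With large blocks, sequential scans still stay inside one contiguous block for long stretches, so the cache behaviour should be close to that of a single allocation, though that is worth measuring for your access pattern.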
Depending on what you need the memory for and your speed concerns, and if you're using Linux, you can always try using mmap and simulate a sort of swap. It might be slower, but you can map very large sizes. See Mmap() an entire large file
Short background:
I'm developing a system that should run for months and using dynamic allocations.
The question:
I've heard that memory fragmentation slows down new and malloc operators because they need to "find" a place in one of the "holes" I've left in the memory instead of simply "going forward" in the heap.
I've read the following question:
What is memory fragmentation?
But none of the answers mentioned anything regarding performance, only failure allocating large memory chunks.
So does memory fragmentation make new take more time to allocate memory?
If yes, by how much? How do I know if new is having a "Hard time" finding memory on the heap ?
I've tried to find out which data structures/algorithms GCC uses to find a "hole" in memory to allocate in, but couldn't find any decent explanation.
Memory allocation is platform specific.
I would say: yes, new takes time to allocate memory. How much time depends on many factors, such as the algorithm, level of fragmentation, processor speed, optimizations, etc.
The best answer for how much time is taken, is to profile and measure. Write a simple program that fragments the memory, then measure the time for allocating memory.
There is no direct method for a program to find out the difficulty of finding available memory locations. You may be able to read a clock, allocate memory, then read again. Another idea is to set a timer.
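The "read a clock, allocate, read again" idea can be sketched with std::chrono. The function name time_one_malloc is hypothetical, and a single measurement like this is noisy: in a real experiment you would fragment the heap first, repeat the measurement many times and look at the distribution.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Returns how long a single malloc(bytes) call took, measured with a
// monotonic clock. The block is freed again before returning.
std::chrono::nanoseconds time_one_malloc(std::size_t bytes) {
    assert(bytes > 0);
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    void* p = std::malloc(bytes);
    const auto stop = clock::now();

    // Write through a volatile pointer so the compiler cannot remove the
    // malloc/free pair as dead code.
    if (p != nullptr) {
        volatile char* touch = static_cast<char*>(p);
        touch[0] = 1;
    }
    std::free(p);
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```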
Note: in many embedded systems, dynamic memory allocation is frowned upon. In critical systems, fragmentation can be the enemy. So fixed-size arrays are used. Fixed-size memory allocations (at compile time) remove fragmentation as a source of defects.
Edit 1: The Search
Usually, memory allocation requires a call to a function. The impact of this is that the processor may have to reload its instruction cache or pipeline, consuming extra processing time. There also may be extra instructions for passing parameters such as the minimal size. Local variables and allocations at compile time usually don't need a function call for allocation.
Unless the allocation algorithm is linear (think array access), it will require steps to find an available slot. Some memory management algorithms use different strategies based on the requested size. For example, some memory managers may have separate pools for sizes of 64-bits or smaller.
If you think of a memory manager as having a linked list of blocks, the manager will need to find the first block greater than or equal in size to the request. If the block is larger than the requested size, it may be split and the left over memory is then created into a new block and added to the list.
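The linked-list picture above can be simulated in a few lines. This is a toy model that works on offsets rather than real memory, does no coalescing of neighbouring free blocks, and uses made-up names (FirstFitList, kNoSpace); real allocators are considerably more elaborate.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

// Toy first-fit allocator over an address range [0, total).
class FirstFitList {
public:
    explicit FirstFitList(std::size_t total) { free_.push_back({0, total}); }

    // Walks the list and takes the first block that is large enough.
    // A larger block is split; the remainder stays on the list.
    // Returns the offset of the allocation, or kNoSpace.
    std::size_t allocate(std::size_t n) {
        assert(n > 0);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->size < n) continue;
            const std::size_t offset = it->offset;
            if (it->size == n) {
                free_.erase(it);
            } else {
                it->offset += n;
                it->size -= n;
            }
            return offset;
        }
        return kNoSpace;
    }

    // Puts a block back on the list (no merging with neighbours).
    void release(std::size_t offset, std::size_t n) {
        free_.push_back({offset, n});
    }

    std::size_t free_bytes() const {
        std::size_t sum = 0;
        for (const Block& b : free_) sum += b.size;
        return sum;
    }

private:
    struct Block { std::size_t offset; std::size_t size; };
    std::list<Block> free_;
};
```

With a 90-byte range, three 30-byte allocations, and the first and third released again, 60 bytes are free but a 40-byte request fails because the middle block splits the free space. The search cost also grows with the number of holes, which is the performance effect being asked about.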
There is no standard algorithm for memory management. They differ based on the needs of the system. Memory managers for platforms with restricted (small) sizes of memory will be different than those that have large amounts of memory. Memory allocation for critical systems may be different than those for non-critical systems. The C++ standard does not mandate the behavior of a memory manager, only some requirements. For example, the memory manager is allowed to allocate from a hard drive, or a network device.
The significance of the impact depends on the memory allocation algorithm. The best path is to measure the performance on your target platform.
An object tries to allocate more memory than the allowed virtual address space (2Gb on win32). The std::bad_alloc is caught and the object released. Process memory usage drops and the process is supposed to continue; however, any subsequent memory allocation fails with another std::bad_alloc. Checking the memory usage with VMMap showed that the heap memory appears to be released but is actually marked as private, leaving no free space. The only thing to do seems to be to quit and restart. I would understand a fragmentation problem, but why can't the process have the memory back after the release?
The object is a QList of QLists. The application is multithreaded. I could make a small reproducer, but I could reproduce the problem only once, while most of the time the reproducer can reuse the memory that was freed.
Is Qt doing something sneaky? Or maybe is it win32 delaying the release?
As I understand your problem, you are allocating large amounts of memory from the heap, which fails at some point. Releasing the memory back to the process heap does not necessarily mean that the heap manager actually frees the virtual pages that contain only free blocks of the heap (for performance reasons). So, if you try to allocate virtual memory directly (VirtualAlloc or VirtualAllocEx), the attempt fails since nearly all memory is consumed by the heap manager, which has no chance of knowing about your direct allocation attempt.
So what can you do about this? You can create your own heap (HeapCreate) and limit its maximum size. That may be quite tricky, since you need to persuade Qt to use this heap.
When allocating large amounts of memory, I recommend using VirtualAlloc rather than heap functions. If the requested size is >= 512 KB, the heap manager actually uses VirtualAlloc to satisfy your request. However, I don't know if it actually releases the pages when you free the region, or whether it starts using it for satisfying other heap allocation requests.
The answer by Martin Drab put me on the right path. Investigating about the heap allocations I found this old message that clarifies what is going on:
The issue here is that the blocks over 512k are direct calls to
VirtualAlloc, and everything else smaller than this are allocated out
of the heap segments. The bad news is that the segments are never
released (entirely or partially) so ones you take the entire address
space with small blocks you cannot use them for other heaps or blocks
over 512 K.
The problem is not Qt-related but Windows-related; I could finally reproduce it with a plain std::vector of char arrays. The default heap allocator leaves the address space segments unaltered even after the corresponding allocation was explicitly released. The rationale is that the process might ask again for buffers of a similar size, and the heap manager will save time by reusing existing address segments instead of compacting older ones to create new ones.
Please note this has nothing to do with the amount of physical nor virtual memory available. It's only the address space that remains segmented, even though those segments are free. This is a serious problem on 32 bit architectures, where the address space is only 2Gb large (can be 3).
This is why the memory was marked as "private", even after being released, and apparently not usable by the same process for average-sized mallocs even though the committed memory was very low.
To reproduce the problem, just create a huge vector of chunks smaller than 512Kb (they must be allocated with new or malloc). After the memory is filled and then released (no matter if the limit is reached and an exception caught or the memory is just filled with no error), the process won't be able to allocate anything bigger than 512Kb. The memory is free, it's assigned to the same process ("private") but all the buckets are too small.
But there is worse news: there is apparently no way to force a compaction of the heap segments. I tried with this and this but had no luck; there is no exact equivalent of POSIX fork() (see here and here). The only solution is to do something more low level, like creating a private heap and destroying it after the small allocations (as suggested in the message cited above) or implementing a custom allocator (there might be some commercial solution out there). Both are quite infeasible for large, existing software, where the easiest solution is to close the process and restart it.
Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I wanted to add my 2 cents only because no one else pointed out that from your description it sounds like you are implementing a standard heap allocator (i.e what all of us already use every time when we call malloc() or operator new).
A heap is exactly such an object: it goes to the virtual memory manager and asks for a large chunk of memory (what you call "a pool"). Then it has all kinds of different algorithms for dealing with the most efficient way of allocating chunks of various sizes and freeing them. Furthermore, many people have modified and optimized these algorithms over the years. For a long time Windows came with an option called the low-fragmentation heap (LFH), which you used to have to enable manually. Starting with Vista, LFH is used for all heaps by default.
Heaps are not perfect and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for the "average" use. But if you have a requirement which is similar to the requirements for a regular heap (i.e. many objects, different size....) you should consider just using a heap and not reinventing it because chances are your implementation will be inferior to what OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
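To illustrate the trade-off described above without depending on Boost, here is a hand-written fixed-size block pool. It is not the Boost Pool API and not the extended allocator from this answer, just a sketch of the same idea: allocation and release are O(1) stack operations, and the backing memory is never returned until the pool is destroyed. The class name FixedBlockPool is made up; a per-size collection would hold one such pool for each size class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pool of `block_count` blocks of identical size, carved out of one
// contiguous buffer. The buffer lives as long as the pool does.
class FixedBlockPool {
public:
    FixedBlockPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        free_.reserve(block_count);
        // Push in reverse so the lowest address is handed out first.
        for (std::size_t i = block_count; i > 0; --i)
            free_.push_back(storage_.data() + (i - 1) * block_size_);
    }

    // O(1): pops a free block, or returns nullptr if the pool is exhausted.
    void* allocate() {
        if (free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // O(1): pushes the block back. `p` must come from this pool.
    void release(void* p) {
        assert(p != nullptr);
        free_.push_back(static_cast<unsigned char*>(p));
    }

    std::size_t available() const { return free_.size(); }

private:
    // Rounds the block size up to a multiple of the strictest fundamental
    // alignment, assuming the buffer itself starts suitably aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> free_;
};
```

Because every block has the same size, any freed block can satisfy any later request, so this pool cannot fragment internally. The price is the one named above: its memory footprint never shrinks.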
So which aspect(s) of normal memory allocation are you willing to give up in your app?
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place: if you allocate blocks in powers of 2 you have less chance of causing this kind of fragmentation. There are a couple of other ways around it, but if you ever reach this state then you just OOM at that point, because there are no delicate ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers of your data(ex: int **). Then you can rearrange memory beneath the program (thread safe I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state though that becomes a heavy overhead but should seldom be done).
There are also ways of "binning" memory so that you have contiguous pages for instance dedicate 1 page only to allocations of 512 and less, another for 1024 and less, etc... This makes it easier to make decisions about which bin to use and in the worst case you split from the next highest bin or merge from a lower bin which reduces the chance of fragmenting across multiple pages.
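The power-of-two and binning suggestions above boil down to mapping a requested size to a size class. A minimal sketch, where the smallest class of 16 bytes and the names size_class and bin_index are assumptions of this example:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

constexpr std::size_t kSmallestClass = 16;  // assumed minimum block size

// Rounds a request up to the next power of two, at least kSmallestClass.
std::size_t size_class(std::size_t bytes) {
    // Guard against overflow while doubling.
    assert(bytes <= std::numeric_limits<std::size_t>::max() / 2 + 1);
    std::size_t rounded = kSmallestClass;
    while (rounded < bytes) rounded *= 2;
    return rounded;
}

// Index of the bin serving this request: 0 for 16 bytes, 1 for 32, ...
std::size_t bin_index(std::size_t bytes) {
    const std::size_t target = size_class(bytes);
    std::size_t index = 0;
    for (std::size_t c = kSmallestClass; c < target; c *= 2) ++index;
    return index;
}
```

A request for 513 bytes lands in the 1024-byte class (bin 6), so up to half of a block can be wasted. That internal waste is what buys the reduced external fragmentation.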
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
Write the pool to operate as a list of allocations, which you can then extend and destroy as needed. This can reduce fragmentation.
And/or implement allocation transfer (or move) support so you can compact active allocations. The object/holder may need to assist you, since the pool may not know how to transfer types itself. If the pool is used with a collection type, compacting/transfers are far easier to accomplish.
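A rough sketch of how compaction without dangling pointers can work: consumers hold handles instead of raw pointers, and the pool is free to move the bytes underneath. Everything here (CompactingPool, Handle, kInvalidHandle) is hypothetical, it only handles raw bytes that are safe to memmove, it is not thread safe, and table entries of released blocks are never reused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

using Handle = std::size_t;
constexpr Handle kInvalidHandle = static_cast<Handle>(-1);

// Bump allocator over a fixed buffer that compacts live blocks when it
// runs out of room at the end.
class CompactingPool {
public:
    explicit CompactingPool(std::size_t capacity) : buffer_(capacity) {}

    // Returns a handle, or kInvalidHandle if even compaction cannot
    // free enough contiguous space.
    Handle allocate(std::size_t n) {
        if (n > buffer_.size() - top_) compact();
        if (n > buffer_.size() - top_) return kInvalidHandle;
        table_.push_back({top_, n, true});
        top_ += n;
        return table_.size() - 1;
    }

    void release(Handle h) {
        assert(h < table_.size() && table_[h].live);
        table_[h].live = false;
    }

    // The pointer is only valid until the next allocate() call,
    // because compaction may move the block.
    unsigned char* data(Handle h) {
        assert(h < table_.size() && table_[h].live);
        return buffer_.data() + table_[h].offset;
    }

private:
    struct Entry { std::size_t offset; std::size_t size; bool live; };

    // Slides every live block toward the start of the buffer. Entries are
    // stored in address order, so each block only ever moves downward.
    void compact() {
        std::size_t dst = 0;
        for (Entry& e : table_) {
            if (!e.live) continue;
            std::memmove(buffer_.data() + dst, buffer_.data() + e.offset,
                         e.size);
            e.offset = dst;
            dst += e.size;
        }
        top_ = dst;
    }

    std::vector<unsigned char> buffer_;
    std::vector<Entry> table_;
    std::size_t top_ = 0;
};
```

The cost is an extra indirection on every access plus the rule that raw pointers must not be kept across allocations, which is why this approach is usually reserved for cases where fragmentation is a measured problem.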