How to make a block of memory allocated by malloc() or new:
immediately swapped out,
or lazily initialized.
In fact, I'm trying to reserve an address space. How to accomplish this?
PS: How can I verify, from user space, whether a memory block is swapped out?
malloc is often implemented using mmap, so if you were to use malloc, you'd get the behavior you're after anyway. After all, why should allocating memory force other pages out of cache when there's no guarantee that the new pages will be initialized immediately? I know that OpenBSD implements malloc this way, and GNU's C library uses mmap if your allocation is larger than some limit; I think it's just a couple of pages.
I don't know how Windows goes about all of this, but check the VirtualAlloc docs to see whether they are specific about its purpose. If they document that Windows' malloc caches its pages, then you have your answer and you should use VirtualAlloc.
To reserve a chunk of address space:
On unix, sbrk() or mmap().
On Windows, VirtualAlloc().
On Windows, you can do this with the VirtualAlloc function.
I don't know of any way to do it on Linux or OS X.
On Linux, BSD, or OS X, use malloc. I think the popular "jemalloc" implementation on FreeBSD uses a dedicated mmap for every region of 1 MiB or larger. The smaller regions are still backed by mmap, so they still give most of the same behavior, but freeing a small region won't automatically unmap it. The glibc allocator used on Linux (ptmalloc2, a derivative of dlmalloc) also uses a dedicated mmap for large allocations; the threshold defaults to 128 KiB (tunable via mallopt with M_MMAP_THRESHOLD), and smaller regions come from the sbrk-grown heap. Mac OS X's malloc also uses mmap, but I am not sure about the particular parameters.
A pointer that you get from a large malloc will point to a shared page in RAM filled with zero bytes. As soon as you write to a page in that region, a new page in physical RAM will be allocated and filled with zero bytes. So you see, the default behavior of malloc is already lazy. It's not that the pages are swapped out to start with, it's that they aren't even there to begin with.
If you are done with the data in a region, you can call madvise with MADV_FREE (available on the BSDs and OS X, and on Linux since kernel 4.5; MADV_DONTNEED is the older Linux near-equivalent). This tells the kernel that it can simply drop the backing pages instead of swapping them out. The addresses remain valid, and as soon as you write to them they turn back into normal pages, although the old contents may be gone. It's roughly like calling free and then malloc.
Summary: Just use malloc. It does what you want.
Related
I have an app that takes about 20 MB of RAM. In a seldom-used algorithm it temporarily allocates 250 MB (via std::vector). After the deallocation, the system monitor still shows this usage. How can I release the memory back to the system?
You can't, and shouldn't.
Virtual memory allocation is complicated, and cannot be sufficiently understood by simply watching a number in System Monitor. It may appear as if a process is using more memory than it should, but this is just an artefact of the way virtual memory addressing works.
Rest assured, if you have freed this memory properly, and the OS really needed it back, it would be reassigned.
The only real actionable point here is to stop using System Monitor as if it were an accurate measure of physical RAM in use by your process!
Use mmap() or VirtualAlloc() to allocate and release memory. This returns it to the OS immediately.
In order to use this with std::vector, you'll need to provide a custom std::allocator. You might find it easier to hand-roll your own vector with placement new and direct destructor invocation.
Normally the system heap allocators handle this correctly for you; however it looks like you found a case where they do not.
I have a C++ application where I sometimes require a large buffer of POD types (e.g. an array of 25 billion floats) to be held in memory at once in a contiguous block. This particular memory organization is driven by the fact that the application makes use of some C APIs that operate on the data. Therefore, a different arrangement (such as a list of smaller chunks of memory like std::deque uses) isn't feasible.
The application has an algorithm that is run on the array in a streaming fashion; think something like this:
std::vector<float> buf(<very_large_size>);
for (size_t i = 0; i < buf.size(); ++i) do_algorithm(buf[i]);
This particular algorithm is the conclusion of a pipeline of earlier processing steps that have been applied to the dataset. Therefore, once my algorithm has passed over the i-th element in the array, the application no longer needs it.
In theory, therefore, I could free that memory in order to reduce my application's memory footprint as it chews through the data. However, doing something akin to a realloc() (or a std::vector<T>::shrink_to_fit()) would be inefficient because my application would have to spend its time copying the unconsumed data to the new spot at reallocation time.
My application runs on POSIX-compliant operating systems (e.g. Linux, OS X). Is there any interface by which I could ask the operating system to free only a specified region from the front of the block of memory? This would seem to be the most efficient approach, as I could just notify the memory manager that, for example, the first 2 GB of the memory block can be reclaimed once I'm done with it.
If your entire buffer has to be in memory at once, then you probably will not gain much from freeing it partially later.
The main point on this post is basically to NOT tell you to do what you want to do, because the OS will not unnecessarily keep your application's memory in RAM if it's not actually needed. This is the difference between "resident memory usage" and "virtual memory usage". "Resident" is what is currently used and in RAM, "virtual" is the total memory usage of your application. And as long as your swap partition is large enough, "virtual" memory is pretty much a non-issue. [I'm assuming here that your system will not run out of virtual memory space, which is true in a 64-bit application, as long as you are not using hundreds of terabytes of virtual space!]
If you still want to do that, and want to have some reasonable portability, I would suggest building a "wrapper" that behaves kind of like std::vector and allocates lumps of some megabytes (or perhaps a couple of gigabytes) of memory at a time, and then something like:
for (size_t i = 0; i < buf.size(); ++i) {
do_algorithm(buf[i]);
buf.done(i);
}
The done method will simply check whether the value of i is (one element) past the end of the current chunk, and free that chunk. [This should inline nicely and produce very little overhead in the average loop iteration, assuming elements are actually consumed in linear order, of course.]
I'd be very surprised if this gains you anything, unless do_algorithm(buf[i]) takes quite some time (certainly many seconds, probably many minutes or even hours). And of course, it's only going to help if you actually have something else useful to do with that memory. And even then, the OS will reclaim memory that isn't actively used by swapping it out to disk, if the system is short of memory.
In other words, if you allocate 100GB, fill it, and then leave it sitting untouched, it will eventually ALL be on the hard disk rather than in RAM.
Further, it is not at all unusual for the heap in the application to retain freed memory, with the OS not getting it back until the application exits; and certainly, if only part of a larger allocation is freed, the runtime will not release it until the whole block has been freed. So, as stated at the beginning, I'm not sure how much this will actually help your application.
As with everything regarding "tuning" and "performance improvements", you need to measure and compare a benchmark, and see how much it helps.
Is it possible to partially free dynamically-allocated memory on a POSIX system?
You can not do it using malloc()/realloc()/free().
However, you can do it in a semi-portable way using mmap() and munmap(). The key point is that if you munmap() some page, malloc() can later use that page:
create an anonymous mapping using mmap();
subsequently call munmap() for regions that you don't need anymore.
The portability issues are:
POSIX doesn't specify anonymous mappings. Some systems provide MAP_ANONYMOUS or MAP_ANON flag. Other systems provide special device file that can be mapped for this purpose. Linux provides both.
I don't think that POSIX guarantees that when you munmap() a page, malloc() will be able to use it. But I think it'll work on all systems that have mmap()/munmap().
Update
If your memory region is so large that most pages surely will be written to swap, you will not lose anything by using file mappings instead of anonymous mappings. File mappings are specified in POSIX.
If you can do without the convenience of std::vector (which won't give you much in this case anyway because you'll never want to copy / return / move that beast anyway), you can do your own memory handling. Ask the operating system for entire pages of memory (via mmap) and return them as appropriate (using munmap). You can tell mmap via its first argument and the optional MAP_FIXED flag to map the page at a particular address (which you must ensure is not otherwise occupied, of course) so you can build up an area of contiguous memory. If you allocate the entire memory upfront, then this is not an issue and you can do it with a single mmap and let the operating system choose a convenient place to map it. In the end, this is what malloc does internally. For platforms that don't have sys/mman.h, it's not difficult to fall back to using malloc if you can live with the fact that on those platforms, you won't return memory early.
I'm suspecting that if your allocation sizes are always multiples of the page size, realloc will be smart enough not to copy any data. You'd have to try this out and see if it works (or consult your malloc's documentation) on your particular target platform, though.
mremap is probably what you need. As long as you're shifting whole pages, you can do a super fast realloc (actually the kernel would do it for you).
In an application I have to allocate two buffers of 480 MB each. Memory allocation is done using the HeapAlloc function. The application works fine on systems where few other applications are running. But on systems where other applications are also running, the allocation fails because no contiguous block is available, even though enough (non-contiguous) memory is free.
I need help allocating two buffers of 480 MB each even when only non-contiguous memory is available.
The situation you describe is not possible in a full featured OS which gives each process its own address space. It doesn't matter how many other applications are running, they won't affect contiguity of the free address space in your process. And virtual memory can map discontiguous physical memory addresses to a contiguous range in virtual address space.
Only in an embedded system without a memory management unit could the existence of other tasks cause your program to suffer memory fragmentation.
HeapAlloc() suggests Windows, which does give a separate address space to each process. The most likely explanation there is that your private address space is fragmented by libraries (DLLs) loading in scattered locations. You can rebase the libraries you use to avoid this and provide larger contiguous blocks of address space.
You can use VirtualAlloc with flAllocationType including MEM_LARGE_PAGES. This enables large-page support; note that you must check GetLargePageMinimum to ensure that the system supports large pages.
Also note that this is likely to be slow, as this page details:
Large-page memory regions may be difficult to obtain after the system has been running for a long time because the physical space for each large page must be contiguous, but the memory may have become fragmented. Allocating large pages under these conditions can significantly affect system performance. Therefore, applications should avoid making repeated large-page allocations and instead allocate all large pages one time, at startup.
Use VirtualAlloc. The physical memory that backs the virtual pages need not be contiguous, and you always have your full virtual address space to work with (2 GB on a 32-bit system; several terabytes on x64 Windows, I can't remember the exact figure). The heap behind HeapAlloc can become fragmented (through your own process's use, not that of others), and your address space can also become fragmented, so try making the allocation early in your application's lifetime. I actually don't recommend HeapAlloc for anything; you can just use new and delete (which call malloc and free), and for large blocks like yours, malloc will call VirtualAlloc on Windows.
My Windows/C++ application allocates ~1 GB of data in memory with operator new and processes this data. After processing, the data is deleted.
I noticed that if I run the processing again without exiting the application, the second call to operator new to allocate ~1 GB of data fails.
I would expect Windows to deliver the memory back. Could this be managed in a better way with some other Win32 calls etc.?
I don't think this is a Windows problem. Check if you used delete or delete[] correctly. Perhaps it would help if you post the code that is allocating/freeing the memory.
In most runtime environments memory allocated to an application from the operating system remains in the application, and is seldom returned back to the operating system. Freeing a memory block allows you to reuse the block from within the application, but does not free it to the operating system to make it available to other applications.
Microsoft's C runtime library tries to return memory back to the operating system by having _heapmin_region call _heap_free_region or _free_partial_region which call VirtualFree to release data to the operating system. However, if whole pages in the corresponding region are not empty, then they will not be freed. A common cause of this is the bookkeeping information and storage caching of C++ containers.
This could be due to memory fragmentation (in reality, address space fragmentation), where various factors have contributed to your program address space not having a 1gb contiguous hole available. In reality, I suspect a bug in your memory management (sorry) - have you run your code through leak detection?
Since you are using very large memory blocks, you should consider using VirtualAlloc() and VirtualFree(), as they allow you to allocate and free pages directly, without the overhead (in memory and time) of interacting with a heap manager.
Since you are using C++, it is worth noting that you can construct C++ objects in the memory you allocate this way by using placement new.
This problem is almost certainly memory fragmentation. On 32-bit Windows the largest contiguous region you can allocate is about 1.1 GB, because the various DLLs loaded into your process prevent a larger contiguous range from being available. If, after your deallocation, a new allocation (or a DLL load, or a memory-mapped file) lands in the middle of your previous 1 GB region, there will no longer be a 1 GB region available for your next call to new. Thus it will fail.
You can visualize this process using VM Validator.