Is it possible to partially free dynamically-allocated memory on a POSIX system? - c++

I have a C++ application where I sometimes require a large buffer of POD types (e.g. an array of 25 billion floats) to be held in memory at once in a contiguous block. This particular memory organization is driven by the fact that the application makes use of some C APIs that operate on the data. Therefore, a different arrangement (such as a list of smaller chunks of memory like std::deque uses) isn't feasible.
The application has an algorithm that is run on the array in a streaming fashion; think something like this:
std::vector<float> buf(<very_large_size>);
for (size_t i = 0; i < buf.size(); ++i) do_algorithm(buf[i]);
This particular algorithm is the conclusion of a pipeline of earlier processing steps that have been applied to the dataset. Therefore, once my algorithm has passed over the i-th element in the array, the application no longer needs it.
In theory, therefore, I could free that memory in order to reduce my application's memory footprint as it chews through the data. However, doing something akin to a realloc() (or a std::vector<T>::shrink_to_fit()) would be inefficient because my application would have to spend its time copying the unconsumed data to the new spot at reallocation time.
My application runs on POSIX-compliant operating systems (e.g. Linux, OS X). Is there any interface by which I could ask the operating system to free only a specified region from the front of the block of memory? This would seem to be the most efficient approach, as I could just notify the memory manager that, for example, the first 2 GB of the memory block can be reclaimed once I'm done with it.

If your entire buffer has to be in memory at once, then you probably will not gain much from freeing it partially later.
The main point of this post is basically to tell you NOT to do what you want to do, because the OS will not unnecessarily keep your application's memory in RAM if it's not actually needed. This is the difference between "resident memory usage" and "virtual memory usage". "Resident" is what is currently used and in RAM, "virtual" is the total memory usage of your application. And as long as your swap partition is large enough, "virtual" memory is pretty much a non-issue. [I'm assuming here that your system will not run out of virtual memory space, which is true in a 64-bit application, as long as you are not using hundreds of terabytes of virtual space!]
If you still want to do that, and want to have some reasonable portability, I would suggest building a "wrapper" that behaves kind of like std::vector and allocates lumps of some megabytes (or perhaps a couple of gigabytes) of memory at a time, and then something like:
for (size_t i = 0; i < buf.size(); ++i) {
    do_algorithm(buf[i]);
    buf.done(i);
}
The done method will simply check if the value of i is (one element) past the end of the current buffer, and free it. [This should inline nicely, and produce very little overhead on the average loop - assuming elements are actually used in linear order, of course].
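A minimal sketch of what such a wrapper could look like, assuming elements are consumed strictly in order and that the consuming code can tolerate the buffer being split into lumps (the class and all names here are illustrative, not a drop-in implementation):

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative sketch only: the buffer is split into separately allocated
// lumps, and each lump is freed as soon as the linear scan has passed it.
class chunked_buffer {
public:
    explicit chunked_buffer(std::size_t n) : size_(n) {
        for (std::size_t off = 0; off < n; off += chunk_)
            chunks_.push_back(static_cast<float*>(
                std::malloc(std::min(chunk_, n - off) * sizeof(float))));
    }
    float& operator[](std::size_t i) { return chunks_[i / chunk_][i % chunk_]; }
    std::size_t size() const { return size_; }

    // Free the lump containing element i once i is its last element.
    void done(std::size_t i) {
        if ((i + 1) % chunk_ == 0 || i + 1 == size_) {
            std::free(chunks_[i / chunk_]);
            chunks_[i / chunk_] = nullptr;
        }
    }
    ~chunked_buffer() { for (float* c : chunks_) std::free(c); }

private:
    static constexpr std::size_t chunk_ = 1 << 20;  // floats per lump (~4 MB)
    std::size_t size_;
    std::vector<float*> chunks_;
};

Note that this trades away the contiguity the question asks for; to keep one contiguous block you would combine this idea with the mmap()/munmap() approach described in the next answer.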
I'd be very surprised if this gains you anything, unless do_algorithm(buf[i]) takes quite some time (certainly many seconds, probably many minutes or even hours). And of course, it's only going to help if you actually have something else useful to do with that memory. And even then, the OS will reclaim memory that isn't actively used by swapping it out to disk, if the system is short of memory.
In other words, if you allocate 100 GB, fill it, and leave it sitting untouched, it will eventually ALL be on the hard disk rather than in RAM.
Further, it is not at all unusual that the heap in the application retains freed memory, and that the OS does not get the memory back until the application exits - and certainly, if only part of a larger allocation is freed, the runtime will not release it until the whole block has been freed. So, as stated at the beginning, I'm not sure how much this will actually help your application.
As with everything regarding "tuning" and "performance improvements", you need to measure and compare a benchmark, and see how much it helps.

Is it possible to partially free dynamically-allocated memory on a POSIX system?
You cannot do it using malloc()/realloc()/free().
However, you can do it in a semi-portable way using mmap() and munmap(). The key point is that if you munmap() some page, malloc() can later use that page:
create an anonymous mapping using mmap();
subsequently call munmap() for regions that you don't need anymore.
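A minimal sketch of those two steps, assuming Linux-style MAP_ANONYMOUS and with all error handling omitted:

#include <sys/mman.h>
#include <cstddef>

int main() {
    const std::size_t page  = 4096;        // assume 4 KB pages
    const std::size_t bytes = 1ull << 32;  // e.g. a 4 GB buffer

    // Step 1: create an anonymous, private mapping.
    float* buf = static_cast<float*>(mmap(nullptr, bytes,
        PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

    // ... stream over the front of the data ...

    // Step 2: hand a page-aligned, already-consumed prefix back to the OS.
    std::size_t consumed = (1ull << 31) / page * page;  // e.g. the first 2 GB
    munmap(buf, consumed);
    // The unmapped prefix must never be touched again; the rest of the
    // buffer is still valid and still contiguous in the address space.
}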
The portability issues are:
POSIX doesn't specify anonymous mappings. Some systems provide a MAP_ANONYMOUS or MAP_ANON flag. Other systems provide a special device file that can be mapped for this purpose. Linux provides both.
I don't think that POSIX guarantees that when you munmap() a page, malloc() will be able to use it. But I think it'll work on all systems that have mmap()/munmap().
Update
If your memory region is so large that most pages surely will be written to swap, you will not lose anything by using file mappings instead of anonymous mappings. File mappings are specified in POSIX.
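A sketch of the file-backed variant (the scratch-file path is illustrative; error handling omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

void* file_backed_buffer(std::size_t bytes) {
    // Back the buffer with a scratch file instead of anonymous memory;
    // file mappings, unlike MAP_ANONYMOUS, are specified by POSIX.
    int fd = open("/tmp/scratch.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, static_cast<off_t>(bytes));  // size the backing file
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping keeps the data reachable
    return p;   // munmap() page-aligned prefixes exactly as before
}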

If you can do without the convenience of std::vector (which won't give you much in this case anyway, because you'll never want to copy / return / move that beast), you can do your own memory handling. Ask the operating system for entire pages of memory (via mmap) and return them as appropriate (using munmap). You can tell mmap via its first argument and the optional MAP_FIXED flag to map a page at a particular address (which you must ensure is not otherwise occupied, of course), so you can build up an area of contiguous memory. If you allocate the entire memory upfront, then this is not an issue and you can do it with a single mmap and let the operating system choose a convenient place to map it. In the end, this is what malloc does internally. For platforms that don't have sys/mman.h, it's not difficult to fall back to using malloc if you can live with the fact that on those platforms you won't return memory early.
I suspect that if your allocation sizes are always multiples of the page size, realloc will be smart enough not to copy any data. You'd have to try this out and see if it works (or consult your malloc's documentation) on your particular target platform, though.
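A quick way to test that suspicion on a given platform (whether the pointer survives a page-multiple shrink is entirely up to the allocator):

#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t page = 4096;
    void* p = std::malloc(1024 * page);
    void* q = std::realloc(p, 512 * page);  // shrink by whole pages
    // If the allocator shrinks in place, no data was copied.
    std::printf(q == p ? "shrunk in place\n" : "moved (data copied)\n");
    std::free(q);
}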

mremap is probably what you need. As long as you're shifting whole pages, you can do a super fast realloc (actually the kernel would do it for you).
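A sketch of that (Linux-only; mremap() is not part of POSIX):

// Linux-only: shrink a mapping in place. The kernel simply drops the tail
// pages, so no data is copied.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <cstddef>

void* shrink_mapping(void* p, std::size_t old_bytes, std::size_t new_bytes) {
    // Both sizes should be multiples of the page size; returns MAP_FAILED
    // on error.
    return mremap(p, old_bytes, new_bytes, 0);
}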

Related

What part of the process virtual memory does Windows Task Manager display

My question is a bit naive. I'm willing to have an overview as simple as possible and couldn't find any resource that made it clear to me. I am a developer and I want to understand what exactly the memory displayed in the "memory" column by default in Windows Task Manager is.
To make things a bit simpler, let's forget about the memory the process shares with other processes, and imagine the shared memory is negligible. Also, I'm focused on the big picture and mainly care about things at the GB level.
As far as I know, the memory reserved by the process, called "virtual memory", is partly stored in main memory (RAM), partly on disk. The system decides what goes where. The system basically keeps in RAM the parts of the virtual memory that are accessed sufficiently frequently by the process. A process can reserve more virtual memory than the RAM available in the computer.
From a developer point of view, the virtual memory may only be partially allocated by the program through its own memory manager (with malloc() or new X() for example). I guess the system has no awareness of what part of the virtual memory is allocated since this is handled by the process in a "private" way and depends on the language, runtime, compiler... Q: Is this correct?
My hypothesis is that the memory displayed by the task manager is essentially the part of the virtual memory being stored in RAM by the system. Q: Is it correct? And is there a simple way to know the total virtual memory reserved by the process?
Memory on Windows is... extremely complicated, and asking "how much memory does my process use?" is effectively a nonsensical question. To answer your questions, let's get a little background first.
Memory on Windows is allocated via ptr = VirtualAlloc(..., MEM_RESERVE, ...) and committed later with VirtualAlloc(ptr+n, MEM_COMMIT, ...).
Any reserved memory just uses up address space and so isn't interesting. Windows will let you MEM_RESERVE terabytes of memory just fine. Committing the memory does use up resources, but not in the way you'd think. When you call commit, Windows does a few sums, basically works out (total physical RAM + total swap - current commit), and lets you allocate memory if there's enough free. BUT the Windows memory manager doesn't actually give you physical RAM until you actually use it.
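A small illustration of that two-step dance (error handling omitted):

#include <windows.h>

int main() {
    // Reserve 1 TB of address space: costs almost nothing, no RAM or swap.
    char* base = static_cast<char*>(
        VirtualAlloc(nullptr, 1ull << 40, MEM_RESERVE, PAGE_NOACCESS));

    // Commit the first 1 MB: this counts against the commit charge, but a
    // physical page is only assigned when the memory is first touched.
    VirtualAlloc(base, 1 << 20, MEM_COMMIT, PAGE_READWRITE);
    base[0] = 42;  // first touch: the OS hands over a physical page here

    VirtualFree(base, 0, MEM_RELEASE);  // give the whole reservation back
}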
Later, however, if Windows is tight on physical RAM, it'll swap some of your RAM out to disk (it may compress it, throw away unused pages, throw away anything directly mapped from a file, and apply other optimisations). This means your total commit and total physical RAM usage for your program may be wildly different. Both numbers are useful, depending on what you're measuring.
There's one last large caveat - memory that is shared. When you load DLLs the code, the read-only memory [and even maybe the read/write section but this is COW'd] can be shared with other programs. This means that your app requires that memory but you cannot count that memory against just your app - after all it can be shared and so doesn't take up as much physical memory as a naive count would think.
(If you are writing a game or similar you also need to count GPU memory but I'm no expert here)
All of the above goodness is normally wrapped up by the heap the application uses, and you see none of this - you ask for and use memory. And it's about as optimal as possible.
You can see this by going to the Details tab and looking at the various options - commit-size and working-set are really useful. If you just look at the main window in Task Manager and it has a single value, I'd hope you understand now that a single value for memory used has to be some kind of compromise, as it's not a question that makes sense.
Now to answer your questions
Firstly, the OS knows exactly how much memory your app has reserved and how much it has committed. What it doesn't know is whether the heap implementation you (or more likely the CRT) are using has kept some freed memory around that it hasn't released back to the operating system. Heaps often do this as an optimisation - asking for memory from the OS and freeing it back to the OS is a fairly expensive operation (and can only be done in large chunks known as pages), so most of them keep some around.
Second question: don't use that value; go to Details and use the values there, as only you know what you actually want to ask.
EDIT:
For your comment: yes, but this depends on the size of the allocation. If you allocate a large block of memory (say >= 1 MB), the heap in the CRT generally defers the allocation directly to the operating system, so freeing individual ones will actually free them. For small allocations, the heap in the CRT asks for pages of memory from the operating system and then subdivides those to give out in allocations. So if you then free every other one of those, you'll be left with holes - and the heap cannot give those holes back to the OS, as the OS generally only works in whole pages. So anything you see in Task Manager will show that all the memory is still used. Remember, this memory isn't lost or leaked, it's just effectively pooled and will be used again if allocations ask for that size. If you care about this memory, you can use the CRT heap statistics family of functions to keep an eye on it - specifically _CrtMemDumpStatistics.
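For example (debug CRT builds only; these functions don't exist in release builds):

#include <crtdbg.h>

void report_heap_delta() {
    _CrtMemState before, after, delta;
    _CrtMemCheckpoint(&before);
    // ... allocations and frees under test ...
    _CrtMemCheckpoint(&after);
    if (_CrtMemDifference(&delta, &before, &after))
        _CrtMemDumpStatistics(&delta);  // block counts and byte totals
}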

Efficient way to truncate a std::vector<char> to length N - to free memory

I have several large std::vectors of chars (bytes loaded from binary files).
When my program runs out of memory, I need to be able to cull some of the memory used by these vectors. These vectors are almost the entirety of my memory usage, and they're just caches for local and network files, so it's safe to just grab the largest one and chop it in half or so.
Only thing is, I'm currently using vector::resize and vector::shrink_to_fit, but this seems to require more memory (I imagine for a reallocation at the new size) and then a bunch of time (for destruction of the now-discarded elements, which I thought would be free?) and then copying the remaining data to the new vector. Note, this is on the Windows platform, in a debug build, so the elements might not be destroyed in the Release build or on other platforms.
Is there something I can do to just say "C++, please tell the OS that I no longer need the memory located past location N in this vector"?
Alternatively, is there another container I'd be better off using? I do need to have random access though, or put effort into designing a way to easily keep iterators pointing at the place I'll want to read next, which would be possible, just not easy, so I'd prefer not using a std::list.
resize and shrink_to_fit are your best bets as long as we are talking about standard C++, but these, as you noticed, may not help at all if you are in a low-memory situation to begin with: since the allocator interface does not provide a realloc-like operation, vector is forced to allocate a new block, copy the data into it, and deallocate the old block.
Now, I see essentially four easy ways out:
drop whole vectors, not just parts of them, possibly using an LRU or stuff like that; working with big vectors, the C++ allocator normally just forwards to the OS's memory management calls, so the memory should go right back to the OS;
write your own container which uses malloc/realloc, or OS-specific functionality;
use std::deque instead; you lose guaranteed contiguousness of data, but, since deques normally allocate the space for data in distinct chunks, doing a resize+shrink_to_fit should be quite cheap - all the unused blocks at the end are simply freed, with no need for massive reallocations (see the sketch after this list);
just leave this job to the OS. As already stated in the comments, the OS already has a file cache, and in normal cases it can handle it better than you or me, even just for the fact that it has a better vision of how much physical memory is left for that, what files are "hot" for most applications, and so on. Also, since you are working in a virtual address space you cannot even guarantee that your memory will actually stay in RAM; the very moment that the machine goes in memory pressure and you aren't using some memory pages so often, they get swapped to disk, so all your performance gain is lost (and you wasted space on the paging file for stuff that is already found on disk).
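A sketch of the deque option above (how much memory a deque actually releases is implementation-dependent, so measure on your target platform):

#include <cstddef>
#include <deque>

// Truncate the cache to n elements; deques typically free their now-empty
// fixed-size blocks here instead of reallocating everything.
void truncate_cache(std::deque<char>& cache, std::size_t n) {
    cache.resize(n);         // drop everything past index n
    cache.shrink_to_fit();   // non-binding request to release spare blocks
}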
An additional way may be to just use memory mapped files - the system will do its own caching as usual, but you avoid any syscall overhead as long as the file remains in memory.
std::vector::shrink_to_fit() cannot result in more memory being used; if it does, it's a bug.
C++11 defines shrink_to_fit() as follows:
void shrink_to_fit(); Remarks: shrink_to_fit is a non-binding request to reduce capacity() to size(). [ Note: The request is non-binding to allow latitude for implementation-specific optimizations. — end note ]
As the note indicates, shrink_to_fit() may, but does not have to, actually free memory, and the standard gives C++ implementations a free hand to recycle and optimize memory usage internally, as they see fit. C++ does not make it mandatory for shrink_to_fit(), and the like, to result in memory actually being released to the operating system, and in many cases the C++ runtime library may not actually be able to, as I'll get to in a moment. The C++ runtime library is allowed to take the freed memory, stash it away internally, and reuse it automatically for future memory allocation requests (explicit new expressions, or container growth).
Most modern operating systems are not designed to allocate and release memory blocks of arbitrary sizes. Details differ, but typically an operating system allocates and deallocates memory in even chunks, typically 4 KB or larger, at even memory page addresses. If you allocate a new object that's only a few hundred bytes long, the C++ library will request an entire page of memory to be allocated, take the first hundred bytes of it for the new object, then keep the spare amount of memory for future new requests.
Similarly, even if shrink_to_fit(), or delete, frees up a few hundred bytes, that memory can't go back to the operating system immediately, but only when an entire 4 KB contiguous memory range (or whatever allocation page size the operating system uses) -- suitably aligned -- is completely unused. Only then can a process release that page back to the operating system. Until then, the library keeps track of freed memory ranges, to be used for future new requests, without asking the operating system to allocate more pages of memory to the process.

Dynamic memory allocation and memory block metadata

I have a question about the low-level mechanics of dynamic memory allocation.
I understand that there may be different implementations, but I need to understand the fundamental ideas.
So,
when a modern OS memory allocator or the equivalent allocates a block of memory, this block eventually needs to be freed.
But before that happens, some system needs to exist to control the allocation process.
I need to know:
How does this system keep track of allocated and unallocated memory? I mean, the system needs to know which blocks have already been allocated and what their sizes are, to use this information during allocation and deallocation.
Is this process supported by modern hardware, like allocation bits or something like that?
Or is some kind of data structure used to store allocation information.
If there is a data structure, how much memory does it use compared to the allocated memory?
Is it better to allocate memory in big chunks rather than small ones and why?
Any answer that can help reveal fundamental implementation details is appreciated.
If there is a need for code examples, C or C++ will be just fine.
"How this system keeps track of allocated and unallocated memory." For non-embedded systems with operating systems, a virtual page table, which the OS is in charge of organizing (with hardware TLB support of course), tracks the memory usage of programs.
As far as I know (and the community will surely yell at me if I'm mistaken), tracking individual malloc() sizes and locations has a good number of implementations and is runtime-library dependent. Generally speaking, whenever you call malloc(), the size and location are stored in a table. Whenever you call free(), the table entry for the provided pointer is looked up. If it is found, that entry is removed. If it is not found, the free() is ignored (which also indicates a possible memory leak).
When all malloc() entries in a virtual page are freed, that virtual page is then released back to the OS (this also implies that free() does not always release memory back to the OS since the virtual page may still have other malloc() entries in it). If there is not enough space within a given virtual page to support another malloc() of a specified size, another virtual page is requested from the OS.
Embedded processors usually don't have operating systems, virtual page tables, nor multiple processes. In this case, virtual memory is not used. Instead, the entire memory of the embedded processor is treated like one large virtual page (although the addresses are actually physical addresses) and memory management follows a similar process as previously described.
Here is a similar stack overflow question with more in-depth answers.
"Is it better to allocate memory in big chunks rather than small ones and why?" Allocate as much memory as you need, no more and no less. Compiler optimizations are very smart, and memory will almost always be managed more efficiently (i.e. reducing memory fragmentation) than the programmer can manually do. This is especially true in a non-embedded environment.
Here is a similar stack overflow question with more in-depth answers (note that it pertains to C and not C++, however it is still relevant to this discussion).
Well, there are more than one way to achieve that.
I once had to write a malloc() (and free()) implementation for educational purposes.
This is from my experience, and real-world implementations surely vary.
I used a double linked list.
Memory chunks returned to the user from malloc() were in fact prefixed by a struct containing the information relevant to my implementation (i.e. the next and prev pointers, and an is_used byte).
So when a user requested N bytes, I allocated N + sizeof(my_struct) bytes, hiding the next and prev pointers at the beginning of the chunk, and returning what's left to the user.
Surely, this is a poor design for a program that makes a lot of small allocations (because each allocation takes up N bytes + 2 pointers + 1 byte).
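Roughly, the layout described above (names are illustrative; a size field, which any real implementation also needs, is included here for completeness):

#include <cstddef>

// malloc(N) reserves sizeof(chunk_header) + N bytes and returns the
// address just past the header; free() steps back to find it again.
struct chunk_header {
    chunk_header* prev;     // previous chunk in the doubly linked list
    chunk_header* next;     // next chunk
    std::size_t   size;     // usable bytes following this header
    unsigned char is_used;  // nonzero while handed out to the user
};

inline void* user_ptr(chunk_header* h)  { return h + 1; }
inline chunk_header* header_of(void* p) { return static_cast<chunk_header*>(p) - 1; }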
For a real-world implementation, you can take a look at the code of good and well-known memory allocators.
Normally there exist two different layers.
One layer lives at application level, usually as part of the C standard library. This is what you call through functions like malloc and free (or operator new in C++, which in turn usually calls malloc). This layer takes care of your allocations, but does not know about memory or where it comes from.
The other layer, at OS level, does not know and does not care anything about your allocations. It only maintains a list of fixed-size memory pages that have been reserved, allocated, and accessed, and, with each page, information such as where it maps to.
There are many different implementations for either layer, but in general it works like this:
When you allocate memory, the allocator (the "application level part") checks whether it has a matching block somewhere in its books that it can give to you (some allocators will split a larger block in two, if need be).
If it doesn't find a suitable block, it reserves a new block (usually much larger than what you ask for) from the operating system. sbrk or mmap on Linux, or VirtualAlloc on Windows would be typical examples of functions it might use for that effect.
This does very little apart from showing intent to the operating system, and generating some page table entries.
The allocator then (logically, in its books) splits up that large area into smaller pieces according to its normal mode of operation, finds a suitable block, and returns it to you. Note that this returned memory does not necessarily even exist as physical memory (though most allocators write some metadata into the first few bytes of each allocated unit, so they necessarily pre-fault the pages).
In the meantime, invisibly, a background task zeroes out memory pages that were once in use by some process but have been freed. This happens all the time, on a tentative basis, since sooner or later someone will ask for memory (often, this is what the idle task does).
Once you access an address in the page that contains your allocated block for the first time, you generate a fault. The page table entry of this yet non-existent page (it logically exists, just not physically) is replaced with a reference to a page from the pool of zero pages. In the uncommon case that there is none left, for example if huge amounts of memory are being allocated all the time, the OS swaps out a page which it believes will not be accessed any time soon, zeroes it, and returns this one.
Now the page becomes part of your working set, it corresponds to actual physical memory, and it counts towards your process' quota. While your process is running, pages may be moved in and out of your working set, or may be paged out and in, as you exceed certain limits, and according to how much memory is needed and how it is accessed.
Once you call free, the allocator puts the freed area back into its books. It may tell the OS that it does not need the memory any more instead, but usually this does not happen as it is not really necessary and it is more efficient to keep around a little extra memory and reuse it. Also, it may not be easy to free the memory because usually the units that you allocate/deallocate do not directly correspond with the units the OS works with (and, in the case of sbrk they'd need to happen in the correct order, too).
When the process ends, the OS simply throws away all page table entries and adds all pages to the list of pages that the idle task will zero out. So the physical memory becomes available to the next process asking for some.

Advantages of anonymous mmap over malloc under memory pressure

I am running some large array processing code (on a Pentium running Linux). The sizes of the arrays are large enough for the process to swap. So far it is working, probably because I try to keep my reads and writes contiguous. However, I will soon need to handle larger arrays. In this scenario, would switching over to anonymous mmapped blocks help?
If so would you please explain why.
In my shallow understanding, mmap implements a memory-mapped file mounted from a tmpfs partition, which under memory pressure would fall back to the swapping mechanism. What I would like to understand is how mmap does it better than the standard malloc (for the sake of argument I am assuming that it is indeed better; I do not know if it is so).
Note: Please do not suggest getting a 64 bit and more RAM. That, unfortunately, is not an option.
The memory that backs your malloc() allocations is handled by the kernel in much the same way as the memory that backs private anonymous mappings created with mmap(). In fact, for large allocations malloc() will create an anonymous mapping with mmap() to back it anyway, so you are unlikely to see much difference by explicitly using mmap() yourself.
At the end of the day, if your working set exceeds the physical memory size then you will need to use swap, and whether you use anonymous mappings created with mmap() or malloc() is not going to change that. The best thing you can do is to try and refactor your algorithm so that it has good locality of reference, which will reduce the extent to which swap hurts you.
You can also try to give the kernel some hints about your usage of the memory with the madvise() system call.
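For example (Linux semantics; on Linux, MADV_DONTNEED on anonymous memory immediately frees the backing pages, which then read back as zeros, while other systems treat it as a weaker hint):

#include <cstddef>
#include <sys/mman.h>

// Hints for a big buffer that is scanned once, front to back.
// buf must be page-aligned (e.g. obtained from mmap()).
void advise_streaming(char* buf, std::size_t total, std::size_t consumed) {
    madvise(buf, total, MADV_SEQUENTIAL);   // expect linear access
    madvise(buf, consumed, MADV_DONTNEED);  // done with this prefix
}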
The key difference here is that with malloc(3)-ed input buffers you ask the kernel to copy the data from file-mapped pages that are already in memory, while with mmap(2) you just use those pages. The first approach doubles the amount of physical memory required to back both your buffers and the in-kernel ones, while the second approach shares that physical memory and only increases the number of virtual mappings for the userland process.
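The two approaches side by side (sketch; error handling omitted):

#include <cstddef>
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

void compare(int fd, std::size_t size) {
    // read(): the kernel copies from its page cache into your buffer,
    // so the data exists twice in physical memory.
    char* copied = static_cast<char*>(std::malloc(size));
    read(fd, copied, size);

    // mmap(): the mapping points straight at the page-cache pages,
    // so there is only one physical copy.
    char* shared = static_cast<char*>(
        mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));
    (void)shared;
}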

What kind of book keeping does the OS do when we use new to allocate memory?

Besides remembering the address held by the pointer to the object, I think the OS also needs to record how large the memory is, so that when we use delete, the OS will know how much memory to free.
Can anyone tell me more details about this? What other information is recorded? And where is that information stored? What does the OS do after you delete the memory?
As already noted, new is a library function, not an OS feature.
The general case is approximately like this:
The C++ compiler translates the new keyword into function calls to malloc() (or equivalent)
The allocator keeps a list of free blocks of memory and searches it for the best match.
Typically, the 'best' match is bigger than the amount your program asked for. If so, the allocator splits the block, marks one part with the size (and maybe a few other pieces of metadata), puts the rest back into the free list, and returns the allocated block to your program.
If no appropriate free block is found, the allocator asks for a chunk of memory from the OS. There are several ways to do this, but it's typically considered a slow operation, so the allocator asks in bigger steps (at least one page at a time, usually 4 KB). When it gets the new free block, it splits it into the requested size, and the rest is put in the free list.
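A toy version of that split step (real allocators are far more sophisticated; this assumes blk has already been unlinked from the free list):

#include <cstddef>

struct free_block {
    std::size_t size;  // usable bytes after this header
    free_block* next;  // next block in the free list
};

// Take a free block larger than needed: carve the tail off as a new free
// block, push it back on the list, and hand the front to the program.
void* split_and_take(free_block* blk, free_block*& free_list, std::size_t want) {
    if (blk->size >= want + sizeof(free_block) + 16) {  // worth splitting?
        auto* rest = reinterpret_cast<free_block*>(
            reinterpret_cast<char*>(blk + 1) + want);
        rest->size = blk->size - want - sizeof(free_block);
        rest->next = free_list;
        free_list  = rest;
        blk->size  = want;
    }
    return blk + 1;  // user memory starts just past the header
}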
The OS is the one controlling the MMU (Memory Management Unit) of the processor. This unit is the one that translates the linear addresses as seen by the currently running process into the physical addresses of RAM pages. This allows the OS the flexibility it needs to allocate and deallocate RAM pages to each process.
Each process has a different memory map, that allows each one to 'see' a linear memory space while at the same time keeping each process isolated from the others. The OS is the one who loads and unloads the map into the MMU at each process switch. Allocating a new page to a process ultimately means adding it to the memory map of the process.
can anyone tell me more details about this?
These are all details that are highly dependent on the OS and compiler and I can only provide fairly broad answers (but you really shouldn't have to worry about this, unless you are thinking of going into that line of work, of course).
what other information is recorded?
Generally, freestore memory is commonly referred to as the "heap". This is basically because it is implemented as a heap, which is a priority-queue implementation that builds a tree of nodes (blocks of free memory). The actual implementation and the algorithms used are very complex because they need to perform very well (i.e. free blocks need to be found very quickly when requesting new memory, and newly freed memory has to be merged to the heap very quickly as well) and then, there are all the fragmentation issues and so on. Look at the wiki on Buddy memory allocation for a very simplified version.
So, there is quite a bit more than just the list of allocated pointers and their corresponding memory-block sizes. In fact, the list of available memory blocks is far more important.
and where is that information stored?
The heap is a part of the running program (in fact, each process, and even each loaded module, will have one or more such structures). The heap does not reside in the OS (and I'm not even sure it even needs to be provided by the OS, at least not for all OSes). However, the heap obviously asks the OS to provide it with large blocks of memory that it can incorporate into its tree of free memory blocks, whenever its capacity is exhausted by the memory requests of your program.
what does the OS do after you delete the memory?
Generally, when you delete memory, it simply gets added to the list of free (or available) memory blocks, and possibly gets merged with adjacent free memory blocks if present. There is no "filling the memory with zeros" or any other actual operation performed on the memory; that would be unnecessary and a waste of processing time.