How to hint the OS which blocks of memory should not be swapped to disk? - c++

When system memory is being exhausted, the OS begins to swap unused memory regions to disk. I'm wondering whether a developer can influence this process.
For example, I have 2 blocks of memory, both of which have not been used for some time. But I don't want the first block to be swapped to disk, because the application is waiting for something, and this block should be processed as soon as possible. The other block is not that important, so it could be swapped to disk without a doubt.
There might be no cross-platform way, but maybe there are OS-specific (Windows, Linux, etc.) ways or hacky tricks to prioritize swapping and "mark" certain blocks of memory that should be swapped last?

On POSIX systems, posix_madvise with the POSIX_MADV_WILLNEED flag provides this sort of advice. It's only advice, so it's up to the OS how it interprets it, but in my experience it typically behaves as follows:
Page in the memory range in bulk if it's currently paged out
Don't page it out unless operating under severe memory pressure
mlock can be used to say "never swap", but at that point it's no longer advice: you've told the OS never to swap those pages out, even under severe memory pressure. If too many processes do this, you can trigger out-of-memory errors or broad performance degradation, as less important memory is forced to stay resident at the expense of more important memory.

Related

What part of the process virtual memory does Windows Task Manager display

My question is a bit naive. I'm willing to have an overview as simple as possible and couldn't find any resource that made it clear to me. I am a developer and I want to understand what exactly is the memory displayed in the "memory" column by default in Windows Task Manager:
To make things a bit simpler, let's forget about the memory the process shares with other processes, and imagine the shared memory is negligible. Also, I'm focused on the big picture and mainly care about things at the GB level.
As far as I know, the memory reserved by the process, called "virtual memory", is partly stored in main memory (RAM) and partly on disk. The system decides what goes where. The system basically keeps in RAM the parts of the virtual memory that are accessed sufficiently frequently by the process. A process can reserve more virtual memory than the RAM available in the computer.
From a developer point of view, the virtual memory may only be partially allocated by the program through its own memory manager (with malloc() or new X() for example). I guess the system has no awareness of what part of the virtual memory is allocated since this is handled by the process in a "private" way and depends on the language, runtime, compiler... Q: Is this correct?
My hypothesis is that the memory displayed by the task manager is essentially the part of the virtual memory being stored in RAM by the system. Q: Is it correct? And is there a simple way to know the total virtual memory reserved by the process?
Memory on Windows is... extremely complicated, and asking 'how much memory does my process use' is effectively a nonsensical question. To answer your questions, let's get a little background first.
Memory on Windows is allocated via ptr = VirtualAlloc(..., MEM_RESERVE, ...) and committed later with VirtualAlloc(ptr+n, MEM_COMMIT, ...).
Any reserved memory just uses up address space and so isn't interesting. Windows will let you MEM_RESERVE terabytes of memory just fine. Committing the memory does use up resources, but not in the way you'd think. When you commit, Windows does a few sums, basically works out (total physical RAM + total swap - current commit), and lets you allocate the memory if there's enough free. But the Windows memory manager doesn't actually give you physical RAM until you actually use it.
Later, however, if Windows is tight on physical RAM, it'll swap some of your RAM out to disk (it may compress it, throw away unused pages, throw away anything directly mapped from a file, and apply other optimisations). This means the total commit and total physical RAM usage for your program may be wildly different. Both numbers are useful, depending on what you're measuring.
There's one last large caveat - memory that is shared. When you load DLLs the code, the read-only memory [and even maybe the read/write section but this is COW'd] can be shared with other programs. This means that your app requires that memory but you cannot count that memory against just your app - after all it can be shared and so doesn't take up as much physical memory as a naive count would think.
(If you are writing a game or similar you also need to count GPU memory but I'm no expert here)
All of the above goodness is normally wrapped up by the heap the application uses, and you see none of it: you ask for memory and use it, and it's about as optimal as it can be.
You can see this by going to the Details tab and looking at the various columns: commit size and working set are really useful. If you just look at the main window in Task Manager with its single value, I hope you understand now that a single number for memory used has to be some kind of compromise, as it's not a question that makes sense.
Now to answer your questions
Firstly, the OS knows exactly how much memory your app has reserved and how much it has committed. What it doesn't know is whether the heap implementation you (or, more likely, the CRT) are using has kept some freed memory around that it hasn't released back to the operating system. Heaps often do this as an optimisation: asking the OS for memory and freeing it back is a fairly expensive operation (and can only be done in large chunks known as pages), so most of them keep some around.
Second question: don't use that value; go to the Details tab and use the values there, as only you know what you actually want to measure.
EDIT:
For your comment: yes, but this depends on the size of the allocation. If you allocate a large block of memory (say >= 1 MB), the heap in the CRT generally defers the allocation directly to the operating system, so freeing individual blocks will actually release them. For small allocations, the heap in the CRT asks the operating system for pages of memory and then subdivides them to hand out as allocations. If you then free every other one of those, you'll be left with holes, and the heap cannot give those holes back to the OS, as the OS generally only works in whole pages. So Task Manager will show that all the memory is still used. Remember this memory isn't lost or leaked; it's just effectively pooled and will be used again for allocations of that size. If you care about this memory, you can use the CRT heap statistics family of functions to keep an eye on it, specifically _CrtMemDumpStatistics.

What would be the disadvantage of creating an array of really big size on 64 bit systems?

Operating systems like Linux work on the principle of copy-on-write, so even if you allocate an array of, say, 100 GB but only use up to 10 GB, you would only be using 10 GB of memory. So, what would be the disadvantage of creating such a big array? I can see an advantage, though, which is that you won't have to worry about using a dynamic array, which would have the cost of reallocations.
The main disadvantage is that by doing this you are making a strong assumption about how exactly the standard library allocators1 and the underlying Linux allocators work. In fact, the allocators and underlying system do not always work as you mention.
Now, you mentioned "copy on write", but what you are likely really referring to is the combination of lazy page population and overcommit. Depending on the configuration, it means that any memory you allocate but don't touch may not count against memory limits and may not occupy physical memory.
The problem is that this often may not work. For example:
Many allocators have modes where they touch the allocated memory, e.g., in debug mode to fill it out with a known pattern to help diagnose references to uninitialized memory. Most allocators touch at least a few bytes before your allocated region to store metadata that can be used on deallocation. So you are making a strong assumption about allocator behavior that is likely to break.
The Linux overcommit behavior is configurable. In practice, many server-side Linux users disable it in order to reduce uncertainty and unrecoverable problems related to the OOM killer. So your claim that Linux behaves lazily is only true for some overcommit configurations and false for others.
You might assume that memory is being committed in 4K chunks and adjust your algorithm around that. However, systems have different page sizes: 16K and 64K are not uncommon as base page sizes, and x86 Linux systems by default have transparent huge pages enabled, so you may actually be getting 2,048K pages without realizing it! In this case you may end up committing nearly the entire array, depending on your access pattern.
As mentioned in the comments, the "failure mode" for this type of use is pretty poor. You think you'll only use a small portion of the array, but if you do end up using more than the system can handle, at best you may get a signal to your application on some random access to a new page, but more likely the OOM killer will just kill some other random process on your machine.
1 Here I'm assuming you are using something like malloc or new to allocate the array, since you didn't mention mmaping it directly or anything.
Real-world operating systems don't simply allow your program to access all memory available - they enforce quotas. So a 64-bit operating system, running on hardware with enough physical memory, will simply refuse to allocate all that memory to any program. This is even more true if your operating system is virtualised (e.g. some hypervisor hosts two or more operating systems on the same physical platform - the hypervisor enforces quotas for each hosted operating system, and one of them will enforce quotas for your program).
Attempting to allocate a large amount of memory is therefore, practically, an effective way to maximise likelihood that the operating system will not allow your program the memory it needs.
While, yes, it is possible for an administrator to increase quotas, that has consequences as well. If you don't have administrative access, you need to convince an administrator to increase those quotas (which isn't necessarily easy unless your machine only has one user). A program that consumes a large amount of memory can cause other programs to be starved of memory - which becomes a problem if those other programs are needed by yourself or other people. In extreme cases, your program can starve the operating system itself of resources, which causes it and all programs it hosts to slow down, and compromises system stability. These sort of concerns are why systems enforce quotas in the first place - often by default.
There are also problems that can arise because operating systems can be configured to over-commit. Loosely speaking, this means that when a program requests memory, the operating system tells the program the allocation has succeeded, even if the operating system hasn't allocated it. Subsequently, when the program USES that memory (typically, writes data to it), the operating system is suddenly required to ACTUALLY make the memory available. If the operating system cannot do this for any reason, that becomes a problem for the program (which believes it has access to memory, but the operating system prevents access). This typically results in some error condition affecting program execution (and often results in program termination). While the problems associated with over-committing can affect any program, the odds are markedly increased when the program allocates large amounts of memory.

Is it possible to partially free dynamically-allocated memory on a POSIX system?

I have a C++ application where I sometimes require a large buffer of POD types (e.g. an array of 25 billion floats) to be held in memory at once in a contiguous block. This particular memory organization is driven by the fact that the application makes use of some C APIs that operate on the data. Therefore, a different arrangement (such as a list of smaller chunks of memory like std::deque uses) isn't feasible.
The application has an algorithm that is run on the array in a streaming fashion; think something like this:
std::vector<float> buf(<very_large_size>);
for (size_t i = 0; i < buf.size(); ++i) do_algorithm(buf[i]);
This particular algorithm is the conclusion of a pipeline of earlier processing steps that have been applied to the dataset. Therefore, once my algorithm has passed over the i-th element in the array, the application no longer needs it.
In theory, therefore, I could free that memory in order to reduce my application's memory footprint as it chews through the data. However, doing something akin to a realloc() (or a std::vector<T>::shrink_to_fit()) would be inefficient because my application would have to spend its time copying the unconsumed data to the new spot at reallocation time.
My application runs on POSIX-compliant operating systems (e.g. Linux, OS X). Is there any interface by which I could ask the operating system to free only a specified region from the front of the block of memory? This would seem to be the most efficient approach, as I could just notify the memory manager that, for example, the first 2 GB of the memory block can be reclaimed once I'm done with it.
If your entire buffer has to be in memory at once, then you probably will not gain much from freeing it partially later.
The main point on this post is basically to NOT tell you to do what you want to do, because the OS will not unnecessarily keep your application's memory in RAM if it's not actually needed. This is the difference between "resident memory usage" and "virtual memory usage". "Resident" is what is currently used and in RAM, "virtual" is the total memory usage of your application. And as long as your swap partition is large enough, "virtual" memory is pretty much a non-issue. [I'm assuming here that your system will not run out of virtual memory space, which is true in a 64-bit application, as long as you are not using hundreds of terabytes of virtual space!]
If you still want to do that, and want to have some reasonable portability, I would suggest building a "wrapper" that behaves kind of like std::vector and allocates lumps of some megabytes (or perhaps a couple of gigabytes) of memory at a time, and then something like:
for (size_t i = 0; i < buf.size(); ++i) {
do_algorithm(buf[i]);
buf.done(i);
}
The done method will simply check whether the value of i is (one element) past the end of the current lump and, if so, free that lump. [This should inline nicely and produce very little overhead in the average loop iteration, assuming elements are actually consumed in linear order, of course.]
I'd be very surprised if this gains you anything, unless do_algorithm(buf[i]) takes quite some time (certainly many seconds, probably many minutes or even hours). And of course, it's only going to help if you actually have something else useful to do with that memory. And even then, the OS will reclaim memory that isn't actively used by swapping it out to disk, if the system is short of memory.
In other words, if you allocate 100GB, fill it, leave it sitting without touching, it will eventually ALL be on the hard-disk rather than in RAM.
Further, it is not at all unusual for the heap in the application to retain freed memory, with the OS not getting the memory back until the application exits; and certainly, if only part of a larger allocation is freed, the runtime will not release it until the whole block has been freed. So, as stated at the beginning, I'm not sure how much this will actually help your application.
As with everything regarding "tuning" and "performance improvements", you need to measure and compare a benchmark, and see how much it helps.
Is it possible to partially free dynamically-allocated memory on a POSIX system?
You cannot do it using malloc()/realloc()/free().
However, you can do it in a semi-portable way using mmap() and munmap(). The key point is that if you munmap() some page, malloc() can later use that page:
create an anonymous mapping using mmap();
subsequently call munmap() for regions that you don't need anymore.
The portability issues are:
POSIX doesn't specify anonymous mappings. Some systems provide MAP_ANONYMOUS or MAP_ANON flag. Other systems provide special device file that can be mapped for this purpose. Linux provides both.
I don't think POSIX guarantees that when you munmap() a page, malloc() will be able to use it. But I think it'll work on all systems that have mmap()/munmap().
Update
If your memory region is so large that most pages will surely be written to swap, you will not lose anything by using file mappings instead of anonymous mappings. File mappings are specified in POSIX.
If you can do without the convenience of std::vector (which won't give you much in this case anyway, because you'll never want to copy / return / move that beast), you can do your own memory handling. Ask the operating system for entire pages of memory (via mmap) and return them as appropriate (using munmap). You can tell mmap via its first argument and the optional MAP_FIXED flag to map the page at a particular address (which you must ensure is not otherwise occupied, of course), so you can build up an area of contiguous memory. If you allocate the entire memory upfront, then this is not an issue and you can do it with a single mmap and let the operating system choose a convenient place to map it. In the end, this is what malloc does internally. For platforms that don't have sys/mman.h, it's not difficult to fall back to using malloc if you can live with the fact that on those platforms, you won't return memory early.
I'm suspecting that if your allocation sizes are always multiples of the page size, realloc will be smart enough not to copy any data. You'd have to try this out and see if it works (or consult your malloc's documentation) on your particular target platform, though.
mremap is probably what you need. As long as you're shifting whole pages, you can do a super fast realloc (actually the kernel would do it for you).

Allocating memory that can be freed by the OS if needed

I'm writing a program that generates thumbnails for every page in a large document. For performance reasons I would like to keep the thumbnails in memory for as long as possible, but I would like the OS to be able to reclaim that memory if it decides there is another more important use for it (e.g. the user has started running a different application.)
I can always regenerate the thumbnail later if the memory has gone away.
Is there any cross-platform method for flagging memory as can-be-removed-if-needed? The program is written in C++.
EDIT: Just to clarify, rather than being notified when memory is low or regularly monitoring the system's amount of memory, I'm thinking more along the lines of allocating memory and then "unlocking" it when it's not in use. The OS can then steal unlocked memory if needed (even for disk buffers if it thinks that would be a better use of the memory) and all I have to do as a programmer is just "lock" the memory again before I intend to use it. If the lock fails I know the memory has been reused for something else so I need to regenerate the thumbnail again, and if the lock succeeds I can just keep using the data from before.
The reason is I might be displaying maybe 20 pages of a document on the screen, but I may as well keep thumbnails of the other 200 or so pages in case the user scrolls around a bit. But if they go do something else for a while, that memory might be better used as a disk cache or for storing web pages or something, so I'd like to be able to tell the OS that it can reuse some of my memory if it wants to.
Having to monitor the amount of free system-wide memory may not achieve the goal (my memory will never be reclaimed to improve disk caching), and getting low-memory notifications will only help in emergencies. I was hoping that by having a lock/unlock method, this could be achieved in more of a lightweight way and benefit the performance of the system in a non-emergency situation.
Is there any cross-platform method for flagging memory as can-be-removed-if-needed? The program is written in C++
For Windows, at least, you can register for a memory resource notification.
HANDLE WINAPI CreateMemoryResourceNotification(
_In_ MEMORY_RESOURCE_NOTIFICATION_TYPE NotificationType
);
NotificationType can be one of:
LowMemoryResourceNotification: available physical memory is running low.
HighMemoryResourceNotification: available physical memory is high.
Just be careful responding to both events. You might create a feedback loop (memory is low, release the thumbnails! and then memory is high, make all the thumbnails!).
In AIX, there is a signal SIGDANGER that is sent to applications when available memory is low. You may handle this signal and free some memory.
There is a discussion among Linux people to implement this feature into Linux. But AFAIK it is not yet implemented in Linux. Maybe they think that application should not care about low level memory management, and it could be transparently handled in OS via swapping.
The POSIX standard includes a function, posix_madvise, that may be used to mark an area of memory as less important. The advice POSIX_MADV_DONTNEED specifies that the application expects it will not access the given range in the near future.
Unfortunately, the current Linux implementation will immediately free the memory range when posix_madvise is called with this advice.
So there's no portable solution to your question.
However, on almost every OS you are able to read the current available memory via some OS interface. So you can routinely read such value and manually free memory when available memory in OS is low.
There's nothing special you need to do. The OS will remove things from memory if they haven't been used recently automatically. Some OSes have platform-specific ways to improve this, but generally, nothing special is needed.
This question is very similar and has answers that cover things not covered here.
Allocating "temporary" memory (in Linux)
This shouldn't be too hard to do because this is exactly what the page cache does, using unused memory to cache the hard disk. In theory, someone could write a filesystem such that when you read from a certain file, it calculated something, and the page cache would cache it automatically.
All the basics of automatically freed cache space are already there in any OS with a disk cache, and it's hard to imagine there not being an API for something that would make a huge difference, especially in things like mobile web browsers.

How to guarantee that when a process calls malloc(), it will allocate physical memory immediately?

I am looking for a way to pre-allocate memory to a process (PHYSICAL memory), so that it will be absolutely guaranteed to be available to the C++ heap when I call new/malloc. I need this memory to be available to my process regardless of what other processes are trying to do with the system memory. In other words, I want to reserve physical memory to the C++ heap, so that it will be available immediately when I call malloc().
Here are the details:
I am developing a real-time system. The system is composed of several memory-hungry processes. Process A is the mission-critical process and it must survive and be immune to bad behavior of any other processes. It usually fits in 0.5 GB of memory, but it sometimes needs as much as 2.5 GB. The other processes attempt to use any amount of memory.
My concern is that the other processes may allocate lots of memory, exhausting the physical memory reserves in the system. Then, when Process A needs more memory FAST, it's not available, and the system will have to swap pages, which would take a long time.
It is critical that Process A get all the memory it needs without delay, whereas I'm fine with the other processes failing.
I'm running on Windows 7 64-bit.
Edit:
Would SetProcessWorkingSetSize work? Meaning: Would calling this for a big enough amount of memory protect my process A from any other process in the system.
VirtualLock is what you're looking for. It will force the OS to keep the pages in memory, as long as they fit within the working set size (which is set by the function linked to by MK in his answer). However, there is no way to feed this memory to malloc/new; you'll have to implement your own memory allocator.
I think this question is weird because Windows 7 is not exactly the OS of choice for realtime applications. That said, there appears to be an interface that might help you:
AllocateUserPhysicalPages