I'm writing a high-performance server application (on Linux) and I'm trying to get a fast critical path. I'm concerned about memory paging and having memory swapped to disk (latency on the order of milliseconds) during my operations.
My question is: if the server has a lot of memory (say 16 GB), my memory utilization stays at around 6-10 GB, and I know there are no other processes on the same box, can I be guaranteed not to have page misses after the application has started up and is warm?
This is not guaranteed. Linux's default behavior is to use spare RAM to cache files, which can improve performance for some workloads. This means that memory pages can sometimes be swapped out even when RAM isn't fully used.
You can use mlock/mlockall to lock the process's pages in memory. See man 2 mlock for more information.
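For example, a minimal sketch that locks everything up front (note that on Linux this usually requires CAP_IPC_LOCK or a sufficiently large RLIMIT_MEMLOCK):

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        // MCL_CURRENT locks every page already mapped; MCL_FUTURE also locks
        // pages mapped from now on (heap growth, new mmaps, thread stacks).
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        // ... latency-critical work runs here, with no pages swapped out ...
        munlockall();  // optional; the lock is released at exit anyway
        return 0;
    }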
When system memory is nearly exhausted, the OS begins to swap unused memory regions to disk. I'm wondering if a developer can control this process.
For example, I have 2 blocks of memory, both unused for some time. But I don't want the first block to be swapped to disk, because the application is waiting for something and that block should be processed as soon as possible. The other block is not that important, so it could be swapped to disk without a doubt.
There might be no cross-platform way, but maybe there are OS-specific (Windows, Linux, etc) ways or hacky tricks to prioritize swapping and "mark" certain blocks of memory that should be swapped last?
On POSIX systems, posix_madvise with the POSIX_MADV_WILLNEED flag provides this sort of advice. It's only advice, so it's up to the OS how it interprets it, but in my experience, it typically behaves as:
- Page in the memory range in bulk if it's currently paged out
- Don't page it out unless operating under severe memory pressure
mlock can be used to say "never swap", but it's no longer advice at that point; you've told the OS to never swap it out, even under severe memory pressure. (If too many processes do this, you can trigger out-of-memory errors or broad performance degradation, as less important memory is forced to stay resident at the expense of more important memory.)
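A minimal sketch of the advice pattern, assuming two anonymous, page-aligned blocks obtained from mmap; note that glibc currently implements POSIX_MADV_DONTNEED as a no-op, which is permitted precisely because it is only advice:

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const size_t len = 1 << 20;  // 1 MiB per block (arbitrary)
        void *hot  = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *cold = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (hot == MAP_FAILED || cold == MAP_FAILED) { perror("mmap"); return 1; }

        // Hints only: the kernel may honor or ignore them.
        posix_madvise(hot,  len, POSIX_MADV_WILLNEED);   // keep this resident
        posix_madvise(cold, len, POSIX_MADV_DONTNEED);   // reclaim this first
        return 0;
    }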
If I fully memory map a file larger than system memory and write to it faster than disk IO, what will happen?
Will I run out of system memory or will the writes to the memory mapped memory IO block?
This is OS-dependent, but it's perfectly possible for this to work out correctly. When the OS memory-maps a file, it doesn't have to eagerly load the contents into memory and can lazily fetch the pages when a read or write occurs in that region. In other words, any time you try to access bytes from the file, the OS can page in that region and page out other parts of the file (or pages from other programs) to make it look as though the data was there all along. This might lead to some program slowdown due to paging, but it doesn't have to crash or lock up the system.
Hope this helps!
Assuming MapViewOfFile will let you map a view of the file larger than available memory, I would expect that some of your reads and writes will stall until the page fault can be resolved. Thus you will be limited to the available I/O bandwidth.
You're mistakenly assuming that you can "write faster than disk I/O".
Since your mapping exceeds RAM, there will be pages not in RAM. Writing to them will cause a non-fatal fault. The OS handles the fault by suspending your thread, freeing memory by writing out another page and then paging in the relevant part of the file. Only when the I/O finishes is your thread resumed.
So, your thread cannot outrun disk I/O as it's constantly waiting for disk I/O to finish.
Of course, if memory is available, you benefit from not having to free dirty pages first. But you still block on the reads. After all, memory mapping works on a page basis, and the CPU can't write a whole page in one instruction. Thus every write is necessarily a read-modify-write operation on a page.
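A Linux sketch of the scenario, assuming a hypothetical 64 GiB file big.dat that exceeds physical RAM:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const size_t size = 64ULL << 30;  // 64 GiB, assumed larger than RAM
        int fd = open("big.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, static_cast<off_t>(size)) != 0) {
            perror("setup");
            return 1;
        }

        char *p = static_cast<char *>(
            mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // Touching a non-resident page faults: the kernel reads that page of
        // the file in (and may first write out another dirty page to make
        // room), so this loop is throttled to disk bandwidth rather than
        // exhausting memory.
        for (size_t i = 0; i < size; i += 4096)
            p[i] = 1;

        munmap(p, size);
        close(fd);
        return 0;
    }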
I am looking for a way to pre-allocate PHYSICAL memory to a process, so that it is absolutely guaranteed to be available to the C++ heap when I call new/malloc. I need this memory to be available to my process regardless of what other processes are trying to do with the system memory. In other words, I want to reserve physical memory for the C++ heap, so that it will be available immediately when I call malloc().
Here are the details:
I am developing a real-time system. The system is composed of several memory-hungry processes. Process A is the mission-critical process and it must survive and be immune to bad behavior of any other processes. It usually fits in 0.5 GB of memory, but it sometimes needs as much as 2.5 GB. The other processes attempt to use any amount of memory.
My concern is that the other processes may allocate lots of memory, exhausting the physical memory reserves in the system. Then, when Process A needs more memory FAST, it's not available, and the system will have to swap pages, which would take a long time.
It is critical that Process A get all the memory it needs without delay, whereas I'm fine with the other processes failing.
I'm running on Windows 7 64-bit.
Edit:
Would SetProcessWorkingSetSize work? Meaning: would calling this for a big enough amount of memory protect my process A from any other process in the system?
VirtualLock is what you're looking for. It will force the OS to keep the pages in memory, as long as they fit within the working set size (which is what the function linked to by MK in his answer controls). However, there is no way to feed this memory to malloc/new; you'll have to implement your own memory allocator on top of it.
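A minimal sketch of that combination, assuming the 2.5 GB figure from the question (the slack added to the working-set limits is an arbitrary illustration):

    #include <windows.h>
    #include <cstdio>

    int main() {
        const SIZE_T size = 2560ULL * 1024 * 1024;  // ~2.5 GB

        // VirtualLock can only lock pages that fit in the working set, so
        // raise the minimum/maximum working set sizes first.
        if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                      size + (64 << 20), size + (128 << 20))) {
            std::printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
            return 1;
        }

        void *pool = VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT,
                                  PAGE_READWRITE);
        if (!pool || !VirtualLock(pool, size)) {
            std::printf("alloc/lock failed: %lu\n", GetLastError());
            return 1;
        }
        // pool is now pinned in RAM; a custom allocator has to carve it up,
        // since malloc/new won't draw from this region on their own.
        return 0;
    }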
I think this question is weird because Windows 7 is not exactly the OS of choice for realtime applications. That said, there appears to be an interface that might help you:
AllocateUserPhysicalPages
I'm writing an application in C++ which uses some external open source libraries. I looked at the Ubuntu System Monitor for information about how my process uses resources, and I noticed that resident memory keeps growing to very large values (over 100 MiB). This application is meant to run on an embedded device, so I have to be careful.
I started to think there must be one or more memory leaks, so I'm using valgrind. Unfortunately, valgrind doesn't seem to report any significant leaks, only some minor issues in the libraries I'm using, nothing more.
So, do I have to conclude that my algorithm really uses that much memory? It seems very strange to me... Or maybe I'm misunderstanding the meaning of the columns of the System Monitor? Can someone clarify the meaning of "Virtual Memory", "Resident Memory", "Writable Memory" and "Memory" in the System Monitor when related to software profiling? Should I expect those values to immediately represent how much memory my process is taking in RAM?
In the past I've used tools that were able to tell me where I was using memory, like Apple Profiling Tools. Is there anything similar I can use in Linux as well?
Thanks!
Another tool you can try is the /lib/libmemusage.so library:
$ LD_PRELOAD=/lib/libmemusage.so vim
Memory usage summary: heap total: 4643025, heap peak: 997580, stack peak: 26160
             total calls   total memory   failed calls
     malloc|      42346        4528378              0
    realloc|         52           7988              0  (nomove:26, dec:0, free:0)
     calloc|         34         106659              0
       free|      28622        3720100
    Histogram for block sizes:
        0-15          14226  33% ==================================================
       16-31           8618  20% ==============================
       32-47           1433   3% =====
       48-63           4174   9% ==============
       64-79           4736  11% ================
       80-95            313  <1% =
...
(I quit vim immediately after startup.)
Maybe the histogram of block sizes will give you enough information to tell where leaks may be happening.
valgrind is very configurable; --leak-check=full --show-reachable=yes might be a good starting point, if you haven't tried it yet.
"Virtual Memory", "Resident Memory", "Writable Memory" and "Memory"
Virtual memory is the address space that your application has allocated. If you run malloc(1024*1024*100);, the malloc(3) library function will request 100 megabytes of storage from the operating system (or hand it out from its free lists). The 100 megabytes will be allocated with mmap(..., MAP_ANONYMOUS), which reserves address space but doesn't actually allocate any physical memory yet. (See the rant at the end of the malloc(3) page for details.) The OS will provide physical memory the first time each page is written.
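A small sketch that makes this visible on Linux by reading the resident-set size (second field, in pages) from /proc/self/statm; the 100-megabyte figure mirrors the example above:

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    static long resident_pages() {
        long size = 0, resident = 0;
        FILE *f = std::fopen("/proc/self/statm", "r");
        if (f) { std::fscanf(f, "%ld %ld", &size, &resident); std::fclose(f); }
        return resident;
    }

    int main() {
        std::printf("before malloc: %ld pages resident\n", resident_pages());

        char *p = static_cast<char *>(std::malloc(100 * 1024 * 1024));
        std::printf("after malloc:  %ld pages resident\n", resident_pages());

        std::memset(p, 1, 100 * 1024 * 1024);  // first write faults pages in
        std::printf("after memset:  %ld pages resident\n", resident_pages());

        std::free(p);
        return 0;
    }

The resident count barely moves after the malloc and jumps after the memset, because each page gains physical backing only on first write.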
Virtual memory accounts for all the libraries and executable objects that are mapped into your process, as well as your stack space.
Resident memory is the amount of memory that is actually in RAM. You might link against the entire 1.5 megabyte C library, but only use the 100k (wild guess) of the library required to support the Standard IO interface. The rest of the library will be demand paged in from disk when it is needed. Or, if your system is under memory pressure and some less-recently-used data is paged out to swap, it will no longer count against Resident memory.
Writable memory is the amount of address space that your process has allocated with write privileges. (Check the output of pmap(1) command: pmap $$ for the shell, for example, to see which pages are mapped to which files, anonymous space, the stack, and the privileges on those pages.) This is a reasonable indication of how much swap space the program might require in a worst-case swapping scenario, when everything must be paged to disk, or how much memory the process is using for itself.
Because there are probably 50-100 processes on your system at a time, and almost all of them are linked against the standard C library, all the processes get to share the read-only memory mappings for the library. (They also get to share all the copy-on-write private writable mappings for any files opened with mmap(..., MAP_PRIVATE|PROT_WRITE), until the process writes to the memory.) The top(1) tool will report the amount of memory that can be shared among processes in the SHR column. (Note that the memory might not be shared, but some of it (libc) definitely is shared.)
Memory is very vague. I don't know what it means.
I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On Linux, the underlying system call for achieving this is mlock. The man page of mlock states that locking the page prevents it from being swapped out:
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;
In my tests, I had a few gigs of free memory on my system, so there was never any risk that the memory pages could've been swapped out, yet I still observed the speedup. Can anyone explain what's really going on here? Any insight or info is much appreciated.
The CUDA driver checks whether the memory range is locked and then uses a different code path. Locked memory is stored in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs a list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily stored in RAM (e.g. it can be in swap), so the driver needs to access every page of the non-locked memory, copy it into a pinned buffer, and pass it to DMA (a synchronous, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.
I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc documentation says that the CUDA driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be used with DMA, because its pages may reside on disk.
If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to GPU through DMA.
So by using pinned memory you save the time it takes to copy from pageable host memory to page-locked host memory.
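A minimal sketch of the pinned path using the CUDA runtime API (the 256 MiB buffer size is an arbitrary assumption):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 256 << 20;  // 256 MiB
        float *h_pinned = nullptr, *d_buf = nullptr;

        // cudaHostAlloc returns page-locked host memory; the driver records
        // the range, so later copies from it can go straight to DMA.
        cudaHostAlloc(reinterpret_cast<void **>(&h_pinned), bytes,
                      cudaHostAllocDefault);
        cudaMalloc(reinterpret_cast<void **>(&d_buf), bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // With pinned memory this copy is a single DMA transfer; with
        // pageable memory the driver would stage it through an internal
        // page-locked buffer, page by page.
        cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
        return 0;
    }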
If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.
A verbose note on copying non-locked pages to locked pages.
It can be extremely expensive if non-locked pages are swapped out by the OS on a busy system with limited CPU RAM: a page fault is then triggered to load the pages back into CPU RAM through expensive disk I/O operations.
Pinning pages can also cause virtual memory thrashing on a system where CPU RAM is scarce. If thrashing happens, CPU throughput can degrade significantly.