mmap - controlling memory mapped file virtual memory usage with cgroups - c++

I have an application that opens a file with mmap() and does stuff to it (long story short, makes calls to gdb to parse a coredump file and then 7z to compress the dump). What I am trying to achieve is setting a limit on how much resident memory (a.k.a. actual RAM) can be used by this application, while letting it use as much total virtual memory as it wants.
There are two main suggestions I've seen to achieve this: ulimit and cgroups.
mmap: an observation
Before moving forward, a note on mmap: my understanding is that the whole point of using it is to minimize the total amount of memory used to read a file. This works because the mmap'ed pages are backed by the file itself, not by swap or RAM. However, when I start my application (which uses mmap) and look at the output from top, it still reports the application as using a large amount of virtual memory... just a bit under the size of the file being opened with mmap. So a 15GB file might show 0.5GB of RAM usage and 14.5GB of virtual memory usage. Does this mean mmap needs to load the entire file into (virtual) memory, or is this just a quirk of the way Linux reports memory usage for mmap (as in, it "counts" the space on disk where the file lives as virtual memory)?
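To make the observation concrete, here is a minimal sketch of what my program does (POSIX/C++; the coredump.bin path is just a placeholder): mapping the file bumps VIRT by roughly the file size right away, while RES only grows for pages that actually get touched.

// Sketch: map a file read-only and touch only part of it.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("coredump.bin", O_RDONLY);          // placeholder path
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // The whole file is mapped into the virtual address space, so top's VIRT
    // grows by roughly st.st_size as soon as mmap returns...
    char* data = static_cast<char*>(mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // ...but RES only grows for the pages we actually fault in.
    long sum = 0;
    for (off_t i = 0; i < st.st_size && i < 64 * 1024 * 1024; i += 4096)
        sum += data[i];                                // touch at most ~64MB of pages

    printf("checksum %ld\n", sum);
    munmap(data, st.st_size);
    close(fd);
    return 0;
}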
ulimit
ulimit only supports setting a limit for virtual memory as a whole. There is no way to specify a limit for only resident memory, which is what I'm interested in. Since mmap appears to use roughly the same amount of virtual memory as the size of the file it is opening (as described above), this doesn't work for me. Set ulimit -v to anything less, and my application crashes.
cgroups
cgroups lets us set a specific limit for resident memory with memory.limit_in_bytes. I tried creating a cgroup and running my application with it. Here I saw a phenomenon that's left me stumped: on a machine with only 4GB of RAM and 2 CPUs, the cgroup seems to respect the RAM usage limit I set, with limit_in_bytes set to only 100MB. However, on a machine with 500GB of RAM and 60 CPUs, the exact same file and exact same application (same executable, not rebuilt on the new machine or anything) crashes under the same 100MB limit. Only when I set the limit back to around the size of the file being mmap'd can it run successfully.
So there are two questions here:
Does mmap need to load the whole file into virtual memory to work or not? My evidence points to yes after trying ulimit... and no after my experiment with cgroups, on the 4GB machine.
Any suggestions on what other factors could explain why the 4GB machine is able to work with the cgroup limit, but the 500GB machine is not?

Related

mmap versus memory allocated with new

I have a BitVector class that can either allocate memory dynamically using new or it can mmap a file. There isn't a noticeable difference in performance when using it with small files, but when using a 16GB file I have found that the mmap file is far slower than the memory allocated with new. (Something like 10x slower or more.) Note that my machine has 64GB of RAM.
The code in question is loading values from a large disk file and placing them into a Bloom filter which uses my BitVector class for storage.
At first I thought this might be because the backing for the mmap file was on the same disk as the file I was loading from, but this didn't seem to be the issue. I put the two files on two physically different disks, and there was no change in performance. (Although I believe they are on the same controller.)
Then, I used mlock to try to force everything into RAM, but the mmap implementation was still really slow.
So, for the time being I'm just allocating the memory directly. The only thing I'm changing in the code for this comparison is a flag in the BitVector constructor.
Note that to measure performance I'm both looking at top and watching how many states I can add into the Bloom filter per second. The CPU usage doesn't even register on top when using mmap - although jbd2/sda1-8 starts to move up (I'm running on an Ubuntu server), which looks to be a process that is dealing with journaling for the drive. The input and output files are stored on two HDDs.
Can anyone explain this huge difference in performance?
Thanks!
Just to start with, mmap is a system call, or interface, provided to access the virtual memory of the system.
Now, in Linux (I hope you are working on *nix) a lot of performance improvement is achieved by lazy loading (demand paging), which is closely related to copy-on-write.
For mmap as well, this kind of lazy loading is implemented.
What happens is this: when you call mmap on a file, the kernel does not immediately allocate main-memory pages for the mapping. Instead, it waits for the program to read or write a not-yet-loaded page; at that point a page fault occurs, and the fault handler actually loads the part of the file that fits in that page frame (the page table is also updated, so that the next time you read or write the same page, it points to a valid frame).
Now, you can control this behavior with mlock, madvise, MAP_POPULATE flag with mmap etc.
The MAP_POPULATE flag to mmap tells the kernel to map the file into memory pages before the call returns, rather than page faulting every time you access a new page. So, until the file is loaded, the call will block.
From the Man Page:
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. For a file
mapping, this causes read-ahead on the file. Later accesses
to the mapping will not be blocked by page faults.
MAP_POPULATE is supported for private mappings only since
Linux 2.6.23.
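A minimal sketch of using the flag (the filename is only illustrative):

// Sketch: ask the kernel to pre-fault the whole mapping up front.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("bitvector.dat", O_RDONLY);          // illustrative filename
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // MAP_POPULATE makes mmap read ahead and populate the page tables before
    // returning, so later accesses should not block on page faults.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // ... use the mapping ...
    munmap(p, st.st_size);
    close(fd);
    return 0;
}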

Single Process Maximum Possible Memory in x64 Linux

Is there any memory limit for a single process in x64 Linux?
We are running a Linux server with 32GB of RAM, and I'm wondering if I can allocate most of it for a single process I'm coding which requires lots of RAM!
Certain kernels have different limits, but on any modern 64-bit Linux the single-process limit is still far over 32GB (assuming the process is a 64-bit executable). Various distributions may also set per-process limits using sysctl, so you'll want to check your local environment to make sure there aren't arbitrarily low limits set (also check ipcs -l on RPM-based systems).
The Debian port documentation for the AMD64 port specifically mentions that the per-process virtual address space limit is 128TiB (twice the physical memory limit), so that should be the reasonable upper bound you're working with.
The resource limits are set using the setrlimit syscall. You can change them with a shell builtin (e.g. ulimit in bash, limit in zsh).
The practical limit is also related to RAM size and swap size. The free command shows these. (Some systems overcommit memory, but that is risky.)
A process doesn't actually use RAM directly; it consumes virtual memory using system calls like mmap (which may be called by malloc). You could even map a portion of a file into memory with that call.
To learn about the memory map of a process 1234, look at the /proc/1234/maps file. From your own application, read /proc/self/maps. You also have /proc/1234/smaps and /proc/self/smaps. Try the command cat /proc/self/maps to understand the memory map of the process running that cat.
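For example, a small sketch that prints its own map from C++:

// Sketch: dump this process's own memory map (the same data "cat /proc/self/maps" shows).
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream maps("/proc/self/maps");
    std::string line;
    while (std::getline(maps, line))
        std::cout << line << '\n';   // address range, permissions, offset, backing file
    return 0;
}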
On a 32GB RAM machine, you can usually run a process with 31GB of process space (assuming no other big processes exist). If you also had 64GB of swap, you could run a process of at least 64GB, but that would be unbearably slow (most of the time would be spent swapping to disk). You can add swap space (e.g. by swapping to a file, initialized with dd then mkswap, and activated with swapon).
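For example (the /swapfile path and the 16GB size are only illustrative):
dd if=/dev/zero of=/swapfile bs=1M count=16384
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile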
If coding a server, be very careful about memory leaks. The valgrind tool is helpful for hunting such bugs. You could also consider using Boehm's garbage collector.
The current 64-bit Linux kernel has a limit of 64TB of physical RAM and 128TB of virtual memory (see RHEL limits and the Debian port page). Current x86_64 CPUs (i.e. what we have in PCs) have a virtual address limit of 2^48 = 256TB, because only 48 bits of the address are actually implemented (the remaining bits of a page-table entry are used for flags like ReadOnly, Writable, ExecuteDisable, PagedToDisc, etc.), but the specification allows a future switch to a true 64-bit address mode, reaching the maximum of 2^64 = 16EB (exabytes). However, the motherboard and the CPU die do not have enough pins to deliver all 48 bits of a memory address to the RAM chips over the address bus, so the limit for physical RAM is lower (and depends on the manufacturer), while the virtual address space can by nature exceed the amount of RAM you could fit on the motherboard, up to the virtual memory limit mentioned above.
The per-process limits arise from how the virtual address space of the process is laid out: there are various sizes for the stack, the mmap() area (and dynamic libraries), and the program code itself, and the kernel is also mapped into the process space. Some of these settings can be changed by passing arguments to the linker, sometimes by a special directive in the source code, or by modifying the binary directly (the binary is in ELF format). There are also limits set by the administrator of the machine (root) or by the user (see the output of the command "ulimit -a"). These limits can be soft or hard, and the user cannot raise a hard limit.
The Linux kernel can also be set to allow memory overcommit. In that case the program is allowed to allocate a huge amount of RAM and then actually use only a few pages (see sparse arrays, sparse matrices); see the Linux kernel documentation. The program then fails only after filling the requested memory with data, not at the time of allocation.
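The policy is chosen with the vm.overcommit_memory sysctl (/proc/sys/vm/overcommit_memory): 0 is the heuristic default, 1 always overcommits, and 2 enforces a strict commit limit.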

how to increase the size of virtual memory page file for the application data

Our application will allocate more and more application data as time goes by; is there any way to increase the virtual memory page file automatically based on the application's calculations?
OS: windows 32 bit
Thanks
There is no point in doing this, since you are on 32-bit Windows (x86), which limits each application to 2GB of user-mode address space by default.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778(v=vs.85).aspx
Extend the page file size to the maximum that you'll ever need instead. And having the app consume memory linearly over time should force you to rethink your design, since no page file will save your program from dying of memory exhaustion.
Just set it to System Managed size in the Virtual Memory options.
You probably want to stay away from trying to adjust the system pagefile from within your program, because the system pagefile needs to provide virtual memory not just to your process but to every other process running at the same time (including the OS). You might be better off creating your own memory-mapped file and using it for virtual memory.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556%28v=vs.85%29.aspx

Limit physical memory per process

I am writing an algorithm to perform some external memory computations, i.e. where your input data does not fit into main memory and you have to consider the I/O complexity.
Since for my tests I do not always want to use real inputs, I want to limit the amount of memory available to my process. What I have found is that I can set the mem kernel parameter to limit the physical memory used by all processes (is that correct?)
Is there a way to do the same, but with a per-process limit? I have seen ulimit, but it only limits the virtual memory per process. Any ideas (maybe I can even set it programmatically from within my C++ code)?
You can try with 'cgroups'.
To use them type the following commands, as root.
# mkdir /dev/cgroups
# mount -t cgroup -omemory memory /dev/cgroups
# mkdir /dev/cgroups/test
# echo 10000000 > /dev/cgroups/test/memory.limit_in_bytes
# echo 12000000 > /dev/cgroups/test/memory.memsw.limit_in_bytes
# echo <PID> > /dev/cgroups/test/tasks
Where <PID> is the PID of the process you want to add to the cgroup. Note that the limit applies to the sum of all the processes assigned to this cgroup.
From this moment on, the processes are limited to 10MB of physical memory and 12MB of physical+swap.
There are other tunable parameters in that directory, but the exact list will depend on the kernel version you are using.
You can even make hierarchies of limits, just creating subdirectories.
The cgroup is inherited when you fork/exec, so if you add the shell from where your program is launched to a cgroup it will be assigned automatically.
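For example, to put the current shell (and everything it launches afterwards) into the group (my_program is just a placeholder name):
# echo $$ > /dev/cgroups/test/tasks
# ./my_program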
Note that you can mount the cgroups in any directory you want, not just /dev/cgroups.
I can't provide a direct answer, but for this kind of thing I usually write my own memory management system so that I have full control over the memory area and how much I allocate. This is usually applicable when you're writing for microcontrollers as well. Hope it helps.
I would use setrlimit with the RLIMIT_AS parameter to set the limit on virtual memory (this is what ulimit does) and then have the process call mlockall(MCL_CURRENT|MCL_FUTURE) to force the kernel to fault in and lock into physical RAM all of the process's pages, so that the amount of virtual memory equals the amount of physical memory for this process.
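A rough sketch of that approach (the 512MB cap is only an example figure; note that mlockall needs CAP_IPC_LOCK or a sufficiently large RLIMIT_MEMLOCK, otherwise it fails):

// Sketch: cap the virtual address space, then lock everything so virtual ~= physical.
#include <sys/resource.h>
#include <sys/mman.h>
#include <cstdio>

int main() {
    struct rlimit rl;
    rl.rlim_cur = 512ull * 1024 * 1024;   // example: 512MB address-space cap
    rl.rlim_max = 512ull * 1024 * 1024;
    if (setrlimit(RLIMIT_AS, &rl) != 0) { perror("setrlimit"); return 1; }

    // Fault in and lock all current and future pages into physical RAM.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) { perror("mlockall"); return 1; }

    // ... the rest of the program now allocates within the 512MB cap ...
    return 0;
}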
Have you considered trying your code in some kind of virtual environment? A virtual machine might be too much for your needs, but something like User-Mode Linux could be a good fit. This runs a Linux kernel as a single process inside your regular operating system. Then you can provide a separate mem= kernel setting, as well as a separate swap space, to make controlled experiments.
The kernel mem= boot parameter limits how much memory the OS will use in total.
This is almost never what user wants.
For physical memory there is the RSS rlimit, RLIMIT_RSS (not RLIMIT_AS, which limits virtual address space); note that modern Linux kernels do not actually enforce RLIMIT_RSS.
As other posters have indicated already, setrlimit is the most probable solution; it controls the limits of all configurable aspects of a process environment. Use this command to see these individual settings on your shell process:
ulimit -a
The ones most pertinent to your scenario in the resulting output are as follows:
data seg size (kbytes, -d) unlimited
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
virtual memory (kbytes, -v) unlimited
Check out the manual page for setrlimit ("man setrlimit"); it can be invoked programmatically from your C/C++ code. I have used it to good effect in the past for controlling stack size limits. (By the way, there is no dedicated man page for ulimit; it's actually a bash builtin, so it's documented in the bash man page.)

largest memory map allocation size?

My System:
Physical memory: 3gb
Windows XP Service Pack 3 (32bit)
Swap file size: 30gb
Goal: To find the largest possible memory map size I can allocate on my machine.
When I run the following code to allocate a 2GB memory-mapped file, the call fails.
handle=CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE|SEC_COMMIT,0,INT_MAX,NULL);
I've been very puzzled by this, because I can allocate memory-mapped files up to the system swap file size of 30GB by repeatedly calling CreateFileMapping with 100MB at a time.
After restarting the machine and re-running the application that requests a 2GB memory-mapped file from CreateFileMapping, it works and returns a valid handle. This leaves me a bit confused about what the hell is going on under the hood in Windows.
So the situation is this: I can create many small memory-mapped files using up all of the system page file (30GB), but when asking for a single 2GB allocation the call fails. When I restart the machine and run the same application, the call succeeds!
Some notes:
1) The memory-mapped file is not being loaded into the process's virtual address space; there is no view of the file yet.
2) The OS can allocate small 100MB memory-mapped files up to the 30GB of the system's page file!
Right now the only conclusion I can come to is that the Windows XP SP3 (32-bit) virtual memory manager cannot successfully reserve the requested 2GB in the system page file, and fails due to system memory fragmentation (it seems like it needs to reserve a contiguous allocation, even though pages are only 4KB each). After a restart I assume there is less fragmentation, allowing the same call to succeed and allocate a 2GB memory-mapped file.
I've run some experiments. After running the machine for a day, I started a small application that would allocate a 300MB memory-mapped file and then release it. It would then increase the size by 1MB and try again. Finally it stopped at 700MB and reported "insufficient system resources". I then went through and closed down each application, which in turn stopped the error messages, and it eventually went on to allocate a memory-mapped file of 3.5GB in size!
So my question is: what is going on here? There must be some type of memory fragmentation happening internally in the virtual memory manager, because allocating 100MB memory-mapped files will consume up to the 30GB of the system page file (the commit limit).
Update
The conclusion is that if you're going to create a large memory-mapped file backed by the system page file with INVALID_HANDLE_VALUE, then the system page file (swap file) needs to be resized to the required size and be in a non-fragmented state for large allocations > 2GB. Even then, under heavy IO load it can still fail. To get around all these problems you can create your own file of the needed size (I used 1TB) and memory map that file instead.
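A rough sketch of that workaround (Win32/C++; the backing.bin name and the 4GB size are only illustrative, my actual file was much larger):

// Sketch: back a large mapping with your own pre-sized file instead of the page file.
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE file = CreateFileA("backing.bin", GENERIC_READ | GENERIC_WRITE,
                              0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) { printf("CreateFile failed: %lu\n", GetLastError()); return 1; }

    // The size is passed as a high/low DWORD pair; the file is grown to match.
    ULONGLONG size = 4ull * 1024 * 1024 * 1024;   // 4GB, illustrative
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE,
                                        (DWORD)(size >> 32), (DWORD)size, NULL);
    if (!mapping) { printf("CreateFileMapping failed: %lu\n", GetLastError()); return 1; }

    // Map a 100MB view at a time; a 32-bit process cannot map 4GB at once anyway.
    void* view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 100 * 1024 * 1024);
    if (view) UnmapViewOfFile(view);

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}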
Final Update
I ran the same tests on a Windows 7 box, and to my surprise it works every single time (up to the system page file size) without touching anything. So I guess this is just a bug: large memory allocations fail more often on Windows XP than on Windows 7.
The problem is file fragmentation. Physical memory (RAM) has nothing to do with anything here. In a virtual memory system, 'memory' is allocated from the file system. Physical memory is just an optimization to speed access to memory.
When you request a memory-mapped file with write access, the system must have a file with contiguous pages free. The system swap file is often fragmented. If your disk drive is nicely defragmented, you should be able to create a large memory-mapped file using a file of your choice (not the system page file).
So if you really have to have a 2GB memory-mapped file, you need to create one on the drive at installation. This shifts the problem of creating a contiguous 2GB file to installation, but once created, you should be ok.
So my question is what is going on here? There must be some type of memory fragmentation happening internally with the virtual memory manager, because allocating 100MB memory-mapped files will consume up to the 30GB of the system page file (commit limit).
Sounds about right. If you don't need large contiguous chunks of memory, don't ask for them if you can get the same amount of memory in smaller chunks.
To find the largest possible memory map size I can allocate on my machine.
Try it with size X.
If that fails, try with size X/2 and repeat.
This gets you a chunk at runtime, maybe not the exact largest possible chunk, but within a factor of 2.
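A rough sketch of that halving loop, using the same page-file-backed CreateFileMapping call as in the question:

// Sketch: halve the requested size until a page-file-backed mapping succeeds.
#include <windows.h>
#include <cstdio>

int main() {
    ULONGLONG size = 1ull << 31;   // start at the 2GB request that failed
    while (size >= 1024 * 1024) {
        HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                     PAGE_READWRITE | SEC_COMMIT,
                                     (DWORD)(size >> 32), (DWORD)size, NULL);
        if (h != NULL) {
            printf("largest successful mapping: %u MB\n", (unsigned)(size >> 20));
            CloseHandle(h);
            return 0;
        }
        size /= 2;                 // halve and retry
    }
    printf("even a 1MB mapping failed\n");
    return 0;
}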
Let's take the position of a Windows developer.
Assume a user performs the following steps:
Create a memory mapping.
Populate some memory with sensitive data.
Unmap the file.
Continue using the memory.
Windows may need to unload these pages for more critical tasks.
Resolution: mapped memory must be eligible for swapping, i.e. the page file has to be able to back it. But that doesn't mean the mapping will actually be swapped out.