Single Process Maximum Possible Memory in x64 Linux

Is there any memory limit for a single process in x64 Linux?
We are running a Linux server with 32GB of RAM, and I'm wondering if I can allocate most of it to a single process I'm coding, which requires lots of RAM.

Certain kernels have different limits, but on any modern 64-bit linux the single-process limit is still far over 32GB (assuming that process is a 64-bit executable). Various distributions may also have set per-process limits using sysctl, so you'll want to check your local environment to make sure that there aren't arbitrarily low limits set (also check ipcs -l on RPM-based systems).
The Debian port documentation for the AMD64 port specifically mentions that the per-process virtual address space limit is 128TiB (twice the physical memory limit), so that should be the reasonable upper bound you're working with.

Resource limits are set with the setrlimit syscall. You can change them with a shell builtin (e.g. ulimit in bash, limit in zsh).
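To see which limits a process actually runs under, it can query them itself with getrlimit. The following is only a minimal sketch, assuming a POSIX system; the helper names (read_limit, print_limit, report_address_space_limit) are made up for illustration. RLIMIT_AS is the address-space limit, which is what ulimit -v controls in bash.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/resource.h>

// Reads the soft and hard limits for one resource; returns false on failure.
bool read_limit(int resource, struct rlimit& out) {
    return getrlimit(resource, &out) == 0;
}

// Prints one limit value, showing RLIM_INFINITY as "unlimited".
void print_limit(const char* label, rlim_t value) {
    assert(label != nullptr);
    if (value == RLIM_INFINITY)
        std::printf("%s: unlimited\n", label);
    else
        std::printf("%s: %llu bytes\n", label,
                    static_cast<unsigned long long>(value));
}

// Hypothetical helper: reports the address-space limit of this process.
bool report_address_space_limit() {
    struct rlimit lim{};
    if (!read_limit(RLIMIT_AS, lim)) return false;
    print_limit("soft RLIMIT_AS", lim.rlim_cur);
    print_limit("hard RLIMIT_AS", lim.rlim_max);
    return true;
}
```

Calling report_address_space_limit() from main prints both values; a process may raise its soft limit up to the hard limit with setrlimit, but not beyond it.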
The practical limit is also related to RAM size and swap size. The free command shows these. (Some systems overcommit memory, but that is risky.)
A process doesn't use RAM directly; it consumes virtual memory, obtained through system calls like mmap (which may get called by malloc). You can even map a portion of a file into memory with that call.
To learn about the memory map of process 1234, look into the /proc/1234/maps file. From your own application, read /proc/self/maps. There are also /proc/1234/smaps and /proc/self/smaps. Try the command cat /proc/self/maps to see the memory map of the process running that cat.
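Reading the map from inside your own program could look roughly like this. It is a sketch that assumes a Linux /proc filesystem; the function name read_memory_map is invented for the example.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Returns the lines of /proc/<pid>/maps; "self" means the calling process.
// The result is empty if the file cannot be opened (e.g. not on Linux).
std::vector<std::string> read_memory_map(const std::string& pid = "self") {
    assert(!pid.empty());
    std::vector<std::string> lines;
    std::ifstream maps("/proc/" + pid + "/maps");
    for (std::string line; std::getline(maps, line); )
        lines.push_back(line);
    return lines;
}
```

Each returned line starts with an address range of the form start-end, followed by the permissions and, for file-backed mappings, the path of the mapped file.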
On a machine with 32GB of RAM, you can usually run a process with 31GB of process space (assuming no other big process exists). If you also had 64GB of swap, you could run a process of at least 64GB, but that would be unbelievably slow (most of the time would be spent swapping to disk). You can add swap space (e.g. by swapping to a file, initialized with dd then mkswap, and activated with swapon).
If coding a server, be very careful about memory leaks. The valgrind tool is helpful to hunt such bugs. You could also consider using Boehm's garbage collector.

Current 64-bit Linux kernels are limited to 64TB of physical RAM and 128TB of virtual memory (see RHEL limits and Debian port). Current x86_64 CPUs (i.e. what we have in the PC) have a virtual address limit of 2^48 = 256TB because the CPU does not use all 64 bits of an address (in the page table entries, the spare bits hold page flags like ReadOnly, Writable, ExecuteDisable, PagedToDisc etc.), but the specification allows switching to a true 64-bit address mode, reaching the maximum at 2^64 = 16EB (exabytes). However, the motherboard and CPU die do not have enough pins to deliver all 48 bits of the memory address to the RAM chips through the address bus, so the limit for physical RAM is lower (and depends on the manufacturer). The virtual address space, by its nature, can still be larger than the amount of RAM on the motherboard, up to the virtual memory limit mentioned above.
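The sizes quoted above are plain powers of two, which can be checked at compile time. This is just the arithmetic, not anything taken from the kernel; the function name address_space_tib is made up. Reading the 128TB figure as the lower, user-space half of the 48-bit space is my assumption about where that number comes from.

```cpp
#include <cassert>
#include <cstdint>

// Size in TiB of an address space using `bits` address bits (40 <= bits <= 64).
// 1 TiB = 2^40 bytes, so 2^bits bytes = 2^(bits - 40) TiB.
constexpr std::uint64_t address_space_tib(unsigned bits) {
    return std::uint64_t{1} << (bits - 40);
}

static_assert(address_space_tib(48) == 256, "48 address bits: 256 TiB");
static_assert(address_space_tib(47) == 128, "half of the 48-bit space: 128 TiB");
static_assert(address_space_tib(46) == 64, "46 address bits: 64 TiB");
static_assert(address_space_tib(64) == 16ull * 1024 * 1024,
              "64 address bits: 16 EiB = 16 * 1024 * 1024 TiB");
```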
The per-process limit is determined by how the virtual address space of the process is laid out, because there can be various sizes for the stack, the mmap() area (and dynamic libraries), and the program code itself; the kernel is also mapped into the process space. Some of these settings can be changed by passing arguments to the linker, sometimes by a special directive in the source code, or by modifying the program's binary file directly (the binary has ELF format). There are also limits that the administrator of the machine (root) or the user has set (see the output of the command "ulimit -a"). These limits can be soft or hard, and the user cannot exceed a hard limit.
The Linux kernel can also be set to allow memory overcommit. In this case, a program is allowed to allocate a huge amount of RAM and then use only a few pages (see sparse arrays, sparse matrices); see the Linux kernel documentation. The program will then fail only once it actually fills the requested memory with data, not at the time of allocation.
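The "allocate a lot, touch a little" pattern can be sketched with an anonymous mmap. This assumes Linux; touch_sparse is a hypothetical helper name, and whether a very large request is accepted depends on the vm.overcommit_memory setting and on any RLIMIT_AS in effect.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Reserves `bytes` of anonymous memory and writes to only `pages` pages,
// spread evenly across the region. Returns the number of pages written,
// or -1 if the kernel refused the mapping.
long touch_sparse(std::size_t bytes, std::size_t pages) {
    const std::size_t page_size =
        static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    assert(pages > 0 && bytes >= pages * page_size);

    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) return -1;

    // Distance between touched pages, rounded down to a page multiple.
    const std::size_t stride = (bytes / pages) / page_size * page_size;
    char* base = static_cast<char*>(p);
    for (std::size_t i = 0; i < pages; ++i)
        base[i * stride] = 1;  // the first write makes the kernel back this page

    munmap(p, bytes);
    return static_cast<long>(pages);
}
```

Only the touched pages ever become resident, so the virtual size of the process grows by the full `bytes` while its resident size grows by roughly `pages` pages.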

Related

mmap - controlling memory mapped file virtual memory usage with cgroups

I have an application that opens a file with mmap() and does stuff to it (long story short, makes calls to gdb to parse a coredump file and then 7z to compress the dump). What I am trying to achieve is setting a limit on how much resident memory (a.k.a. actual RAM) can be used by this application, while letting it use as much total virtual memory as it wants.
There are two main suggestions I've seen to achieve this: ulimit and cgroups.
mmap: an observation
Before moving forward, a note on mmap: my understanding is that the whole point of using it is to minimize the total amount of memory used to read a file. This works by having the mmap'ed pages backed by the file itself, not by swap or RAM. However, when I start my application (that uses mmap) and look at the output from top, I notice it still reports the application as using a large amount of virtual memory: just a bit under the size of the file that is being opened with mmap. So a 15GB file might report 0.5GB of RAM usage and 14.5GB of virtual memory usage. Does this mean mmap needs to load the entire file into (virtual) memory, or is this just a quirk of the way Linux reports memory usage for mmap (as in, it "counts" the space on the hard drive where the file is located as virtual memory)?
ulimit
ulimit only supports setting a limit for virtual memory as a whole. There is no way to specify a limit for only resident memory, which is what I'm interested in. Since mmap appears to use roughly the same amount of virtual memory as the size of the file it is opening (as described above), this doesn't work for me. Set ulimit -v to anything less, and my application crashes.
cgroups
cgroups lets us set a specific limit for resident memory with memory.limit_in_bytes. I tried creating a cgroup and running my application in it. Here I saw a phenomenon that has left me stumped: on a machine with only 4GB of RAM and 2 CPUs, the cgroup seems to respect the RAM usage limit I set, with limit_in_bytes set to only 100MB. However, on a machine with 500GB of RAM and 60 CPUs, with the exact same file and the exact same application (same executable, not rebuilt on the new machine or anything), setting the same 100MB limit leads to the application crashing. Only when I set the limit back to around the size of the file being mmap'd can it run successfully.
So there are two questions here:
Does mmap need to load the whole file into virtual memory to work or not? My evidence points to yes after trying ulimit... and no after my experiment with cgroups, on the 4GB machine.
Any suggestions on what other factors could explain why the 4GB is able to successfully work with the cgroup limit, but not the 500GB machine?

Allocate extra space to process

can I provide extra space to the process other than provided by the operating system.
Can extra detachable memory be used for such purposes.

can I provide extra space to the process other than provided by the
operating system.
No, you can't. For every piece of memory, you have to ask the OS. malloc(), new and the other memory-allocating functions and operators ultimately resolve to system calls that request memory from the OS on behalf of the program.

Every process has a definite maximum memory space allocated to it, that depends on the machine architecture. On a 32-bit machine, the maximum addressable space is 2^32 bytes ~= 4GB. Hence a process should be able to address 4 GB of memory typically. But this space is divided into two parts, 1. Kernel Space and 2. Process Space. Kernel space is used for OS drivers etc while Process space is the space where your data can be allocated. Hence the memory available to you is just the Process space.
On a typical Windows XP machine, it is divided equally, i.e. 2 GB for process space (however, there are ways to modify this, for example the /3GB boot option). Any allocation beyond 2 GB gives an out-of-memory error. The process space becomes larger when you move from a 32-bit application to a 64-bit application. This is one of the major incentives for moving to 64-bit applications.
So to answer your question, there is a maximum memory available to a process beyond which the OS denies memory allocations to the process.

There are some obscure ways. E.g. if you attached a Windows CE device to a Windows PC, the memory of that device could be accessed via the "RAPI" interface. The Windows OS wouldn't be aware of this device memory; this was handled via the ActiveSync service. It wasn't very quick memory, though.

allocate more than 1 GB memory on 32 bit XP

I've run into an odd problem: my process cannot allocate more than what seems to be slightly below 1 GiB. The Windows Task Manager "Mem Usage" column shows values close to 1 GiB when my software throws a bad_alloc exception. Yes, I've checked that the value passed to the memory allocation is sensible (no race condition or corruption exists that would make this fail). Yes, I need all this memory and there is no way around it (it's a buffer for images, which cannot be compressed any further).
I'm not trying to allocate the whole 1 GiB of memory in one go; there are a few allocations of around 300 MiB each. Would this cause problems? (I'll try to see if making more, smaller allocations works any better.) Is there some compiler switch or something else that I must set in order to get past 1 GiB? I've seen others complaining about the 2 GiB limit, which would be fine for me... I just need a little bit more :). I'm using VS 2005 with SP1, running on 32-bit XP, and it's in C++.

On a 32-bit OS, a process has a 4GB address space in total.
On Windows, half of this is off-limits, so your process has 2GB.
This is 2GB of contiguous memory. But it gets fragmented. Your executable is loaded in at one address, each DLL is loaded at another address, then there's the stack, and heap allocations and so on. So while your process probably has enough free address space, there are no contiguous blocks large enough to fulfill your requests for memory. So making smaller allocations will probably solve it.
If your application is compiled with the LARGEADDRESSAWARE flag, it will be allowed to use as much of the remaining 2GB as Windows can spare. How much that is depends on your platform and environment:
for 32-bit code running on a 64-bit OS, you'll get a full 4-GB address space
for 32-bit code running on a 32-bit OS without the /3GB boot switch, the flag means nothing at all
for 32-bit code running on a 32-bit OS with the /3GB boot switch, you'll get 3GB of address space.
So really, setting the flag is always a good idea if your application can handle it (it's basically a capability flag. It tells Windows that we can handle more memory, so if Windows can too, it should just go ahead and give us as large an address space as possible), but you probably can't rely on it having an effect. Unless you're on a 64-bit OS, it's unlikely to buy you much. (The /3GB boot switch is necessary, and it has been known to cause problems with drivers, especially video drivers)

Allocating big chunks of contiguous memory is always a problem.
You are much more likely to get more memory in smaller chunks.
You should redesign your memory structures.

You are right to suspect the larger 300MB allocations. Your process will be able to get close to 2GB (3 if you use the /3GB boot.ini switch and LARGEADDRESSAWARE link flag), but not as a large contiguous block.
Typical solutions for this are to break up the requests into tiles or strips of fixed size (say 256x256x4 bytes) and write an intermediate class to hide this representation detail.
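An intermediate class of that kind might look roughly like the sketch below. The class name TiledImage, its interface and the lazy allocation are my own assumptions for illustration; only the 256x256x4 tile size comes from the suggestion above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical tiled image: pixels live in many small, independently
// allocated tiles instead of one huge contiguous block.
class TiledImage {
public:
    static constexpr std::size_t kTile = 256;    // tile edge in pixels
    static constexpr std::size_t kChannels = 4;  // bytes per pixel

    TiledImage(std::size_t width, std::size_t height)
        : width_(width), height_(height),
          tiles_x_((width + kTile - 1) / kTile),
          tiles_((height + kTile - 1) / kTile * tiles_x_) {}

    // Returns a pointer to the kChannels bytes of pixel (x, y),
    // allocating the containing tile (zero-filled) on first access.
    std::uint8_t* pixel(std::size_t x, std::size_t y) {
        assert(x < width_ && y < height_);
        auto& tile = tiles_[(y / kTile) * tiles_x_ + (x / kTile)];
        if (!tile)
            tile.reset(new std::uint8_t[kTile * kTile * kChannels]());
        return tile.get() + ((y % kTile) * kTile + (x % kTile)) * kChannels;
    }

    std::size_t tile_count() const { return tiles_.size(); }

private:
    std::size_t width_, height_, tiles_x_;
    std::vector<std::unique_ptr<std::uint8_t[]>> tiles_;
};
```

Each tile is a 256 KiB allocation, which the allocator can place in any free gap of the address space, so callers go through pixel(x, y) and never depend on the image being contiguous.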
You can quickly verify the fragmentation by writing a small allocation loop that allocates blocks of different sizes.
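Such a loop could be as simple as the following sketch. The function name largest_single_block and the halving strategy are my own choices; it uses plain malloc so that it is not tied to one platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Returns the largest block size, found by repeated halving from `start`,
// that a single malloc call can satisfy; returns 0 if even `min_size` fails.
std::size_t largest_single_block(std::size_t start, std::size_t min_size) {
    assert(min_size > 0 && start >= min_size);
    for (std::size_t size = start; size >= min_size; size /= 2) {
        void* p = std::malloc(size);
        if (p != nullptr) {
            // Touch the block so the compiler cannot elide the malloc/free pair.
            *static_cast<volatile char*>(p) = 0;
            std::free(p);
            return size;
        }
    }
    return 0;
}
```

In a 32-bit process you would start near 2 GiB and compare the result against the total free memory reported by the system. Keep in mind that on systems that overcommit, a successful malloc only shows that address space was available.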

You could also check this function from MSDN. 1GB rings a bell from here:
This parameter must be greater than or equal to 13 pages (for example,
53,248 on systems with a 4K page size), and less than the system-wide
maximum (number of available pages minus 512 pages). The default size
is 345 pages (for example, this is 1,413,120 bytes on systems with a
4K page size).
Here they mention that the default maximum for a process is 345 pages. Note that with a 4K page size this is 1,413,120 bytes (about 1.35 MiB), as the quote itself states, which is far below 1GB.

When I have a few big allocs like that to do, I use the Windows function VirtualAlloc, to avoid stressing the default allocator.
Another way forward might be to use nedmalloc in your project.

Finding amount of RAM using C++

How would I find out the amount of RAM and details about my system, like CPU type, speed, amount of physical memory available, amount of stack and heap memory in RAM, and number of processes running?
Also, is there any way to determine how long it takes your computer to execute an instruction, fetch a word from memory (with and without a cache miss), read consecutive words from disk, and seek to a new location on disk?
Edit: I want to accomplish this on my Linux system using the g++ compiler. Are there any built-in functions for this? Also tell me if such things are possible on a Windows system.
I just got this question out of curiosity when I was learning some memory management stuff in C++. Please guide me through this step by step, or maybe online tutorials will do great. Thanks.

With Linux and GCC, you can use the sysconf function included using the <unistd.h> header.
There are various arguments you can pass to get hardware information. For example, to get the amount of physical RAM in your machine you would need to do:
sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);
See the man page for all possible usages.
You can get the maximum stack size of a process using the getrlimit system call along with the RLIMIT_STACK argument, included using the <sys/resource.h> header.
To find out how many processes are running on the current machine, you can check the /proc directory. Each running process is represented by a subdirectory there, named after its process ID number.
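Put together, the three lookups described in this answer might be sketched like this. It assumes Linux with glibc; the function names are invented for the example, and _SC_PHYS_PAGES is a common extension rather than something POSIX guarantees.

```cpp
#include <cassert>
#include <cctype>
#include <dirent.h>
#include <sys/resource.h>
#include <unistd.h>

// Physical RAM in bytes, or 0 if sysconf cannot report it.
unsigned long long physical_ram_bytes() {
    const long pages = sysconf(_SC_PHYS_PAGES);
    const long page_size = sysconf(_SC_PAGESIZE);
    if (pages <= 0 || page_size <= 0) return 0;
    return static_cast<unsigned long long>(pages) *
           static_cast<unsigned long long>(page_size);
}

// Soft stack-size limit in bytes; RLIM_INFINITY means unlimited.
rlim_t stack_soft_limit() {
    struct rlimit lim{};
    const int rc = getrlimit(RLIMIT_STACK, &lim);
    assert(rc == 0);
    (void)rc;
    return lim.rlim_cur;
}

// Counts the entries in /proc whose names consist only of digits
// (one per process). Returns -1 if /proc cannot be opened.
int count_processes() {
    DIR* dir = opendir("/proc");
    if (dir == nullptr) return -1;
    int count = 0;
    while (const dirent* entry = readdir(dir)) {
        const char* name = entry->d_name;
        bool numeric = (*name != '\0');
        for (const char* c = name; *c != '\0'; ++c)
            if (!std::isdigit(static_cast<unsigned char>(*c))) numeric = false;
        if (numeric) ++count;
    }
    closedir(dir);
    return count;
}
```

The process count is only a snapshot; processes can start or exit while the directory is being read.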

For Windows - GetPhysicallyInstalledSystemMemory for installed RAM, GetSystemInfo for CPUs, Process Status API for process enumeration. Heap and stack usage can be gotten only by the local process for itself. Remember stack usage is per-thread, and in Windows a process can have multiple heaps (use GetProcessHeaps to enumerate them). Memory usage per process in externally visible usage can be retrieved for each process using GetProcessMemoryInfo.
I'm not aware of Win32 APIs for the second paragraph's list. Probably have to do this at the device driver level (kernel mode) I would think, if it's even possible. Instruction fetch and execution depend on the processor, cache size and instruction itself (they are not all the same in complexity). Memory access speed will depend on RAM, CPU and the motherboard FSB speed. Disk access likewise is totally dependent on the system characteristics.

On Windows Vista and Windows 7, the Windows System Assessment Tool can provide a lot of info. Supposedly it can be programmatically accessed via the WEI API.

Can you allocate a very large single chunk of memory ( > 4GB ) in c or c++?

With very large amounts of ram these days I was wondering, it is possible to allocate a single chunk of memory that is larger than 4GB? Or would I need to allocate a bunch of smaller chunks and handle switching between them?
Why???
I'm working on processing some openstreetmap xml data and these files are huge. I'm currently streaming them in since I can't load them all in one chunk but I just got curious about the upper limits on malloc or new.

Short answer: Not likely
In order for this to work, you absolutely would have to use a 64-bit processor.
Secondly, it would depend on the Operating System support for allocating more than 4G of RAM to a single process.
In theory, it would be possible, but you would have to read the documentation for the memory allocator. You would also be more susceptible to memory fragmentation issues.
There is good information on Windows memory management.

A Primer on physical and virtual memory layouts
You would need a 64-bit CPU and O/S build and almost certainly enough memory to avoid thrashing your working set. A bit of background:
A 32 bit machine (by and large) has registers that can store one of 2^32 (4,294,967,296) unique values. This means that a 32-bit pointer can address any one of 2^32 unique memory locations, which is where the magic 4GB limit comes from.
Some 32 bit systems such as the SPARCV8 or Xeon have MMU's that pull a trick to allow more physical memory. This allows multiple processes to take up memory totalling more than 4GB in aggregate, but each process is limited to its own 32 bit virtual address space. For a single process looking at a virtual address space, only 2^32 distinct physical locations can be mapped by a 32 bit pointer.
I won't go into the details but This presentation (warning: powerpoint) describes how this works. Some operating systems have facilities (such as those described Here - thanks to FP above) to manipulate the MMU and swap different physical locations into the virtual address space under user level control.
The operating system and memory mapped I/O will take up some of the virtual address space, so not all of that 4GB is necessarily available to the process. As an example, Windows defaults to taking 2GB of this, but can be set to only take 1GB if the /3GB switch is invoked on boot. This means that a single process on a 32 bit architecture of this sort can only build a contiguous data structure of somewhat less than 4GB in memory.
This means you would have to explicitly use the PAE facilities on Windows or equivalent facilities on Linux to manually swap in the overlays. This is not necessarily that hard, but it will take some time to get working.
Alternatively you can get a 64-bit box with lots of memory and these problems more or less go away. A 64 bit architecture with 64 bit pointers can build a contiguous data structure with as many as 2^64 (18,446,744,073,709,551,616) unique addresses, at least in theory. This allows larger contiguous data structures to be built and managed.

The advantage of memory mapped files is that you can open a file much bigger than 4GB (almost infinite on NTFS!) and have multiple <4GB memory windows into it.
It's much more efficient than opening a file and reading it into memory; on most operating systems it uses the built-in paging support.
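On POSIX systems, a "window" into a large file can be sketched with mmap and a file offset. This is a minimal illustration, not a tuned implementation: read_window is a made-up helper, and it copies the bytes out only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Copies `length` bytes starting at `offset` out of the file at `path` by
// mapping only a window around that range. Returns an empty string on failure.
std::string read_window(const char* path, off_t offset, std::size_t length) {
    assert(path != nullptr && offset >= 0);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return std::string();

    std::string result;
    struct stat st{};
    if (fstat(fd, &st) == 0 && length > 0 &&
        offset + static_cast<off_t>(length) <= st.st_size) {
        // mmap offsets must be page-aligned: round down and skip the slack.
        const off_t page = sysconf(_SC_PAGESIZE);
        const off_t aligned = offset - (offset % page);
        const std::size_t slack = static_cast<std::size_t>(offset - aligned);
        void* p = mmap(nullptr, length + slack, PROT_READ, MAP_PRIVATE,
                       fd, aligned);
        if (p != MAP_FAILED) {
            result.assign(static_cast<const char*>(p) + slack, length);
            munmap(p, length + slack);
        }
    }
    close(fd);
    return result;
}
```

A real program would keep the mapping open and work on the mapped bytes in place, moving the window by unmapping and mapping again at a new offset.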

This shouldn't be a problem with a 64-bit OS (and a machine that has that much memory).
If malloc can't cope then the OS will certainly provide APIs that allow you to allocate memory directly. Under Windows you can use the VirtualAlloc API.

It depends on which C compiler you're using, and on what platform (of course), but there's no fundamental reason why you cannot allocate the largest chunk of contiguously available memory, which may be less than you need. And of course you may have to be using a 64-bit system to address that much RAM...
See Malloc for history and details.
Call HeapMax in alloc.h to get the largest available block size.

Have you considered using memory mapped files? Since you are loading in really huge files, it would seem that this might be the best way to go.

It depends on whether the OS will give you virtual address space that allows addressing memory above 4GB and whether the compiler supports allocating it using new/malloc.
For 32-bit Windows you won't be able to get a single chunk bigger than 4GB, as the pointer size is 32-bit, thus limiting your virtual address space to 4GB. (You could use Physical Address Extension to get more than 4GB of memory; however, I believe you have to map that memory into the 4GB virtual address space yourself.)
For 64-bit Windows, the VC++ compiler supports 64-bit pointers with theoretical limit of the virtual address space to 8TB.
I suspect the same applies for Linux/gcc - 32-bit does not allow you, whereas 64-bit allows you.

As Rob pointed out, VirtualAlloc for Windows is a good option for this, as is an anonymous file mapping. However, specifically with respect to your question of whether C or C++ can allocate this, the answer is NO, THIS IS NOT SUPPORTED EVEN ON WIN7 RC 64.
In the PE/COFF specification for exe files, the fields which specify the heap reserve and heap commit are 32-bit quantities. This is in line with the size limitations of the current heap implementation in the Windows CRT, which is just short of 4GB. So, there is no way to allocate more than 4GB from C/C++ (technically, the OS support facilities of CreateFileMapping and VirtualAlloc/VirtualAllocNuma etc. are not C or C++).
Also, BE AWARE that there are underlying x86 or amd64 ABI constructs known as the page tables. These WILL in effect do what you are concerned about, allocating smaller chunks for your larger request. Even though this happens in kernel memory, there is an effect on the overall system, as these tables are finite.
If you are allocating memory in such grandiose proportions, you would be well advised to allocate based on the allocation granularity (which VirtualAlloc enforces) and also to identify optional flags or methods to enable larger pages.
4KB pages were the initial page size for the 386; subsequently the Pentium added 4MB pages. Today, AMD64 (see the Software Optimization Guide for AMD Family 10h Processors) has a maximum page size of 1GB. This means that in your case, say for a 4GB allocation, it would require only 4 unique entries in the kernel's directory to locate, assign and set permissions for your process's memory.
Microsoft has also released this manual that articulates some of the finer points of application memory and its use for the Vista/2008 platform and newer.
Contents
Introduction
About the Memory Manager
Virtual Address Space
Dynamic Allocation of Kernel Virtual Address Space
Details for x86 Architectures
Details for 64-bit Architectures
Kernel-Mode Stack Jumping in x86 Architectures
Use of Excess Pool Memory
Security: Address Space Layout Randomization
Effect of ASLR on Image Load Addresses
Benefits of ASLR
How to Create Dynamically Based Images
I/O Bandwidth
Microsoft SuperFetch
Page-File Writes
Coordination of Memory Manager and Cache Manager
Prefetch-Style Clustering
Large File Management
Hibernate and Standby
Advanced Video Model
NUMA Support
Resource Allocation
Default Node and Affinity
Interrupt Affinity
NUMA-Aware System Functions for Applications
NUMA-Aware System Functions for Drivers
Paging
Scalability
Efficiency and Parallelism
Page-Frame Number and PFN Database
Large Pages
Cache-Aligned Pool Allocation
Virtual Machines
Load Balancing
Additional Optimizations
System Integrity
Diagnosis of Hardware Errors
Code Integrity and Driver Signing
Data Preservation during Bug Checks
What You Should Do
For Hardware Manufacturers
For Driver Developers
For Application Developers
For System Administrators
Resources

If size_t is greater than 32 bits on your system, you've cleared the first hurdle. But the C and C++ standards aren't responsible for determining whether any particular call to new or malloc succeeds (except malloc with a 0 size). That depends entirely on the OS and the current state of the heap.
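Both points can be checked from code. The sketch below uses hypothetical helper names and the nothrow form of new so that a failed request is reported as a null pointer rather than as an exception.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <new>

// True if size_t can represent sizes above 4 GiB (wider than 32 bits).
constexpr bool size_t_exceeds_32_bits() {
    return std::numeric_limits<std::size_t>::digits > 32;
}

// Attempts one allocation with nothrow new and reports whether it succeeded.
bool can_allocate(std::size_t bytes) {
    assert(bytes > 0);
    char* p = new (std::nothrow) char[bytes];
    if (p == nullptr) return false;
    // Touch the block so the allocation is not optimized away.
    *static_cast<volatile char*>(p) = 0;
    delete[] p;
    return true;
}
```

Whether can_allocate(std::size_t{5} << 30) returns true on a given machine is exactly the part the standards leave to the OS and the state of the heap.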

Like everyone else said, getting a 64-bit machine is the way to go. But even on a 32-bit Intel machine, you can address areas of memory bigger than 4GB if your OS and your CPU support PAE. Unfortunately, 32-bit WinXP does not do this (does 32-bit Vista?). Linux lets you do this by default, but you will be limited to 4GB areas, even with mmap(), since pointers are still 32-bit.
What you should do though, is let the operating system take care of the memory management for you. Get in an environment that can handle that much RAM, then read the XML file(s) into (a) data structure(s), and let it allocate the space for you. Then operate on the data structure in memory, instead of operating on the XML file itself.
Even in 64bit systems though, you're not going to have a lot of control over what portions of your program actually sit in RAM, in Cache, or are paged to disk, at least in most instances, since the OS and the MMU handle this themselves.
