How to query amount of allocated memory on Linux (and OSX)? - c++

While this might look like a duplicate from other questions, let me explain why it's not.
I am looking to get a specific part of my application to degrade gracefully when a certain memory limit has been reached. I could have used a criterion based on remaining available physical memory, but this wouldn't be safe, because the OS could start paging out memory used by my application before the criterion is reached; my application would think there is still some physical memory left and keep allocating, and so on. For the same reason, I can't use the amount of physical memory currently used by the process: as soon as the OS started swapping me out, that number would stop growing even though I kept allocating.
For this reason, I chose a criteria based on the amount of memory allocated by my application, i.e. very close to virtual memory size.
This question (How to determine CPU and memory consumption from inside a process?) provides great ways of querying the amount of virtual memory used by the current process, which I THOUGHT was what I needed.
On Windows, I'm using GetProcessMemoryInfo() and the PrivateUsage field, which works great.
On Linux, I tried several things (listed below) that did not work. The reason why virtual memory usage does not work for me is because of something that happens with OpenCL context creation on NVidia hardware on Linux. The driver reserves a region of the virtual memory space big enough to hold all RAM, all swap and all video memory. My guess is it does so for unified address space and everything. But it also means that the process reports using enormous amounts of memory. On my system for instance, top will report 23.3 Gb in the VIRT column (12 Gb of RAM, 6 Gb of swap, 2 Gb of video memory, which gives 20 Gb reserved by the NVidia driver).
On OSX, by using task_info() and the virtual_size field, I also get a bigger than expected number (a few Gb for an app that takes not even close to 1 Gb on Windows), but not as big as Linux.
So here is the big question: how can I get the amount of memory allocated by my application? I know that this is a somewhat vague question (what does "allocated memory" mean?), but I'm flexible:
I would prefer to include the application static data, code section and everything, but I can live without.
I would prefer to include the memory allocated for stacks, but I can live without.
I would prefer to include the memory used by shared libraries, but I can live without.
I don't really care for mmap stuff, I can do with or without at that point.
Etc.
What is really important is that the number grows with dynamic allocation (new, malloc, anything) and shrinks when the memory is released (which I know can be implementation-dependent).
Things I have tried
Here are a couple of solutions I have tried and/or thought of but that would not work for me.
Read from /proc/self/status
This is the approach suggested by how-to-determine-cpu-and-memory-consumption-from-inside-a-process. However, as stated above, this returns the amount of virtual memory, which does not work for me.
Read from /proc/self/statm
Very slightly worse: according to http://kernelnewbies.kernelnewbies.narkive.com/iG9xCmwB/proc-pid-statm-doesnt-match-with-status, which refers to Linux kernel code, the only difference between those two values is that the second one does not subtract reserved_vm from the amount of virtual memory. I would have HOPED that reserved_vm would include the memory reserved by the OpenCL driver, but it does not.
Use mallinfo() and the uordblks field
This does not seem to include all the allocations (I'm guessing the ones made with new are missing), since for a +2 Gb growth in virtual memory space (after doing some memory-heavy work and still holding the memory), I'm only seeing about a 0.1 Gb growth in the number returned by mallinfo().
Read the [heap] section size from /proc/self/smaps
This value started at around 336,760 Kb and peaked at 1,019,496 Kb for work that grew virtual memory space by +2 Gb, and then it never went down, so I'm not sure I can really rely on this number...
Monitor all memory allocations in my application
Yes, in an ideal world, I would have control over everybody who allocates memory. However, this is a legacy application, using tons of different allocators, some mallocs, some news, some OS-specific routines, etc. There are some plug-ins that could do whatever they want, they could be compiled with a different compiler, etc. So while this would be great to really control memory, this does not work in my context.
Read the virtual memory size before and after the OpenCL context initialization
While this could be a "hacky" way to solve the problem (and I might have to fall back on it), I would really wish for a more reliable way to query memory, because the OpenCL context could be initialized somewhere out of my control, and other similar but non-OpenCL-specific issues could creep in without my knowing about it.
So that's pretty much all I've got. There is one more thing I have not tried yet, because it only works on OSX, but it is to use the approach described in Why does mstats and malloc_zone_statistics not show recovered memory after free?, i.e. use malloc_get_all_zones() and malloc_zone_statistics(), but I think this might be the same problem as mallinfo(), i.e. not take all allocations into account.
So, can anyone suggest a way to query memory usage (as vague of a term as this is, see above for precision) of a given process in Linux (and also OSX even if it's a different method)?

You can try and use information returned by getrusage():
#include <sys/time.h>
#include <sys/resource.h>
int getrusage(int who, struct rusage *usage);
struct rusage {
    struct timeval ru_utime;    /* user CPU time used */
    struct timeval ru_stime;    /* system CPU time used */
    long   ru_maxrss;           /* maximum resident set size */
    long   ru_ixrss;            /* integral shared memory size */
    long   ru_idrss;            /* integral unshared data size */
    long   ru_isrss;            /* integral unshared stack size */
    long   ru_minflt;           /* page reclaims (soft page faults) */
    long   ru_majflt;           /* page faults (hard page faults) */
    long   ru_nswap;            /* swaps */
    long   ru_inblock;          /* block input operations */
    long   ru_oublock;          /* block output operations */
    long   ru_msgsnd;           /* IPC messages sent */
    long   ru_msgrcv;           /* IPC messages received */
    long   ru_nsignals;         /* signals received */
    long   ru_nvcsw;            /* voluntary context switches */
    long   ru_nivcsw;           /* involuntary context switches */
};
If the memory information does not fit your purpose, observing the page-fault counts can help you monitor memory stress, which is what you intend to detect.

Have you tried a shared library interposer for Linux for section (5) above? So long as your application is not statically linking the malloc functions, you can interpose a new function between your program and the kernel malloc. I've used this tactic many times to collect stats on memory usage.
It does require setting LD_PRELOAD before running the program, but no source or binary changes. It is an ideal answer in many cases.
Here is an example of a malloc interposer:
http://www.drdobbs.com/building-library-interposers-for-fun-and/184404926
You probably will also want to do calloc and free. Calls to new generally end up as a call to malloc, so C++ is covered as well.
OS X seems to have similar capabilities but I have not tried it.
http://tlrobinson.net/blog/2007/12/overriding-library-functions-in-mac-os-x-the-easy-way-dyld_insert_libraries/
--Matt

Here is what I ended up using. I scan /proc/self/maps and sum the size of all the address ranges meeting my criteria, which is:
Only include ranges from inode 0 (i.e. no devices, no mapped file, etc.)
Only include ranges that are at least one of readable, writable or executable
Only include private memory
In my experiments I did not see instances of shared memory from inode 0. Maybe with inter-process shared memory...?
Here is the code for my solution:
size_t getValue()
{
    FILE* file = fopen("/proc/self/maps", "r");
    if (!file)
    {
        assert(0);
        return 0;
    }
    size_t value = 0;
    char line[1024];
    while (fgets(line, 1024, file) != NULL)
    {
        ptrdiff_t start_address, end_address;
        char perms[4];
        ptrdiff_t offset;
        int dev_major, dev_minor;
        unsigned long int inode;
        const int nb_scanned = sscanf(
            line, "%16tx-%16tx %c%c%c%c %16tx %02x:%02x %lu",
            &start_address, &end_address,
            &perms[0], &perms[1], &perms[2], &perms[3],
            &offset, &dev_major, &dev_minor, &inode
        );
        if (10 != nb_scanned)
        {
            assert(0);
            continue;
        }
        if ((inode == 0) &&
            (perms[0] != '-' || perms[1] != '-' || perms[2] != '-') &&
            (perms[3] == 'p'))
        {
            assert(dev_major == 0);
            assert(dev_minor == 0);
            value += (end_address - start_address);
        }
    }
    fclose(file);
    return value;
}
Since this is looping through all the lines in /proc/self/maps, querying memory that way is significantly slower than using "Virtual Memory currently used by current process" from How to determine CPU and memory consumption from inside a process?.
However, it provides an answer much closer to what I need.

Related

Avoid memory fragmentation when memory pools are a bad idea

I am developing a C++ application where the program runs endlessly, allocating and freeing millions of strings (char*) over time. RAM usage is a serious consideration in the program, and it keeps getting higher and higher over time. I think the problem is heap fragmentation, and I really need to find a solution.
You can see in the image that after millions of allocations and frees in the program, the usage just keeps increasing. And the way I am testing it, I know for a fact that the data it stores is not increasing. You will probably ask, "How are you sure of that? How are you sure it's not just a memory leak?" Well:
This test ran much longer. I call malloc_trim(0) whenever possible in my program, and with it the application can finally return the unused memory to the OS; usage drops almost to zero (the actual size of the data my program currently holds). This implies the problem is not a memory leak. But I can't rely on this behavior: the allocation and freeing pattern of my program is random, and what if it never releases the memory?
I said memory pools are a bad idea for this project in the title. Of course I don't have absolute knowledge. But the strings I am allocating can be anything between 30 and 4000 bytes, which makes many optimizations and clever ideas much harder. Memory pools are one of them.
I am using GCC 11 / G++ 11 as the compiler, so if some old versions had bad allocators, I shouldn't have that problem.
How am I getting memory usage? The Python psutil module: proc.memory_full_info()[0], which gives me RSS.
Of course, you don't know the details of my program; it is still a valid question whether this is indeed heap fragmentation. What I can say is that I keep up-to-date counts of how many allocations and frees have taken place, and I know the element count of every container in my program. But if you still have some ideas about the causes of the problem, I am open to suggestions.
I can't just allocate, say 4096 bytes for all the strings so it would become easier to optimize. That's the opposite I am trying to do.
So my question is: what do programmers do (what should I do) in an application where millions of allocs and frees of different sizes take place over time, so that memory pools are hard to use efficiently? I can't change what the program does, I can only change implementation details.
Bounty Edit: When trying to utilize memory pools, isn't it possible to make multiple of them, to the extent that there is a pool for every possible byte count? For example my strings can be anything between 30 and 4000 bytes, so couldn't somebody make 4000 - 30 + 1 = 3971 memory pools, one for each possible allocation size of the program? Isn't this applicable? All pools could start small (so as not to lose much memory), then grow, balancing performance against memory. I am not trying to use a memory pool's ability to reserve big spaces beforehand; I am just trying to effectively reuse freed space, given the frequent allocs and frees.
Last edit: It turns out that the memory growth appearing in the graphs was actually from an HTTP request queue in my program. I failed to see that the hundreds of thousands of tests I ran bloated this queue (something like a webhook). The reasonable explanation of figure 2 is that I finally got banned by the server (or couldn't open a connection anymore for some reason), the queue emptied, and the RAM issue resolved itself. So anyone reading this question later: consider every possibility. It would never have crossed my mind that it was something like this. Not a memory leak, but an implementation detail. Still, I think @Hajo Kirchhoff deserves the bounty; his answer was really enlightening.
If everything really is/works as you say it does and there is no bug you have not yet found, then try this:
malloc and other allocators usually work in chunks that are multiples of 16 bytes anyway, even if the actual requested size is not a multiple of 16. So you only need about 4000/16 - 30/16 ~ 250 different memory pools.
const int chunk_size = 16;
memory_pools pool[250]; // 250 memory pools, pool[idx] managing chunks of (idx+1)*chunk_size bytes

char* reserve_mem(size_t sz)
{
    // round up, then shift to a zero-based pool index
    size_t pool_idx_to_use = (sz + chunk_size - 1) / chunk_size - 1;
    char* rc = pool[pool_idx_to_use].allocate();
    return rc;
}
IOW, you have 250 memory pools. pool[0] allocates and manages chunks with a length of 16 bytes, pool[99] manages chunks of 1600 bytes, etc...
If you know the length distribution of your strings in advance, you can reserve initial memory for the pools based on this knowledge. Otherwise I'd probably just reserve memory for the pools in 4096-byte increments.
Because while the malloc C heap usually allocates memory in multiples of 16 bytes, it will (at least under Windows, but I am guessing Linux is similar here) ask the OS for memory, which usually works with 4K pages. IOW, the "outer" memory heap managed by the operating system reserves and frees memory in 4096-byte units.
So increasing your own internal memory pool in 4096-byte increments means no fragmentation in the OS app heap. This 4096 page size (or a multiple of it) comes from the processor architecture: Intel processors have a built-in page size of 4K (or multiples thereof). I don't know about other processors, but I suspect similar architectures there.
So, to sum it up:
Use chunks of multiple of 16 bytes for your strings per memory pool.
Use chunks of multiple of 4K bytes to increase your memory pool.
That will align the memory use of your application with the memory management of the OS and avoid fragmentation as much as possible.
From the OS point of view, your application will only increment memory in 4K chunks. That's very easy to allocate and release. And there is no fragmentation.
From the internal (lib) C heap management point of view, your application will use memory pools and waste at most 15 bytes per string. Also all similar length allocations will be heaped together, so also no fragmentation.
Here is a solution which might help, if your application has the following traits:
Runs on Windows
Has episodes where the working set is large, but all of that data is released entirely at around the same point in time (e.g. the data is processed in some kind of batch mode, and when the work is done, the related data is freed).
This approach uses a (unique?) Windows feature, called custom heaps. Possibly, you can create yourself something similar on other OS.
The functions, you need are in a header called <heapapi.h>.
Here is how it would look like:
At the start of a memory intensive phase of your program, use HeapCreate() to have a heap, where all your data will go.
Perform the memory intensive tasks.
At the end of the memory intensive phase, Free your data and call HeapDestroy().
Depending on the detailed behavior of your application (e.g. whether or not this memory intensive computation runs in 1 or multiple threads), you can configure the heap accordingly and possibly even gain a little speed (If only 1 thread uses the data, you can give HeapCreate() the HEAP_NO_SERIALIZE flag, so it would not take a lock.) You can also give an upper bound of what the heap will allow to store.
And once, your computation is complete, destroying the heap also prevents long term fragmentation, because when the time comes for the next computation phase, you start with a fresh heap.
Here is the documentation of the heap api.
In fact, I found this feature so useful, that I replicated it on FreeBSD and Linux for an application I ported, using virtual memory functions and my own heap implementation. Was a few years back and I do not have the rights for that piece of code, so I cannot show or share.
You could also combine this approach with fixed element size pools, using one heap for one dedicated block size and then expect less/no fragmentation within each of those heaps (because all blocks have same size).
struct EcoString {
    size_t heapIndex;
    size_t capacity;
    char* data;
    char* get() { return data; }
};

struct FixedSizeHeap {
    const size_t SIZE;
    HANDLE m_heap;
    explicit FixedSizeHeap(size_t size)
        : SIZE(size)
        , m_heap(HeapCreate(0, 0, 0))
    {
    }
    ~FixedSizeHeap() {
        HeapDestroy(m_heap);
    }
    bool allocString(size_t capacity, EcoString& str) {
        assert(capacity <= SIZE);
        str.capacity = SIZE; // we alloc SIZE bytes anyway...
        str.data = (char*)HeapAlloc(m_heap, 0, SIZE);
        if (nullptr != str.data)
            return true;
        str.capacity = 0;
        return false;
    }
    void freeString(EcoString& str) {
        HeapFree(m_heap, 0, str.data);
        str.data = nullptr;
        str.capacity = 0;
    }
};
struct BucketHeap {
    using Buckets = std::vector<FixedSizeHeap>; // or std::array
    Buckets buckets;
    /*
    (loop for i from 0
          for size = 40 then (+ size (ash size -1))
          while (< size 80000)
          collecting (cons i size))
    ((0 . 40) (1 . 60) (2 . 90) (3 . 135) (4 . 202) (5 . 303) (6 . 454) (7 . 681)
     (8 . 1021) (9 . 1531) (10 . 2296) (11 . 3444) (12 . 5166) (13 . 7749)
     (14 . 11623) (15 . 17434) (16 . 26151) (17 . 39226) (18 . 58839))
    Init Buckets with index (first item) and SIZE (rest item)
    from the above item list.
    */
    // allocate(nbytes) looks for the correct bucket (linear or binary
    // search) and calls allocString() on that bucket. It stores the
    // bucket's index in the EcoString.heapIndex field (so free has an
    // easier time).
    // free(EcoString& str) uses heapIndex to find the right bucket and
    // then calls freeString on that bucket.
};
What you are trying to do is called a slab allocator and it is a very well studied problem with lots of research papers.
You don't need every possible size. Usually slabs come in power of 2 sizes.
Don't reinvent the wheel, there are plenty of opensource implementations of a slab allocator. The Linux kernel uses one.
Start by reading the Wikipedia page on Slab Allocation.
I don't know the details of your program. There are many patterns for working with memory, each of which leads to different results and problems. However, you can take a look at the .NET garbage collector. It uses several generations of objects (similar to pools of memory). Each object in a generation is movable (its address in memory changes when the generation is compacted). This trick allows the runtime to compact memory and reduce fragmentation. You could implement a memory manager with similar semantics; it's not necessary to implement a full garbage collector.

Fast synchronized access to shared array with changing base address (in C11)

I am currently designing a user space scheduler in C11 for a custom co-processor under Linux (user space, because the co-processor does not run its own OS, but is controlled by software running on the host CPU). It keeps track of all the tasks' states with an array. Task states are regular integers in this case. The array is dynamically allocated and each time a new task is submitted whose state does not fit into the array anymore, the array is reallocated to twice its current size. The scheduler uses multiple threads and thus needs to synchronize its data structures.
Now, the problem is that I very often need to read entries in that array, since I need to know the states of tasks for scheduling decisions and resource management. If the base address was guaranteed to always be the same after each reallocation, I would simply use C11 atomics for accessing it. Unfortunately, realloc obviously cannot give such a guarantee. So my current approach is wrapping each access (reads AND writes) with one big lock in the form of a pthread mutex. Obviously, this is really slow, since there is locking overhead for each read, and the read is really small, since it only consists of a single integer.
To clarify the problem, I give some code here showing the relevant passages:
Writing:
// pthread_mutex_t mut;
// size_t len_arr;
// int *array, idx, x;
pthread_mutex_lock(&mut);
if (idx >= len_arr) {
    len_arr *= 2;
    array = realloc(array, len_arr*sizeof(int));
    if (array == NULL)
        abort();
}
array[idx] = x;
pthread_mutex_unlock(&mut);
Reading:
// pthread_mutex_t mut;
// int *array, idx;
pthread_mutex_lock(&mut);
int x = array[idx];
pthread_mutex_unlock(&mut);
I have already used C11 atomics for efficient synchronization elsewhere in the implementation and would love to use them to solve this problem as well, but I could not find an efficient way to do so. In a perfect world, there would be an atomic accessor for arrays which performs address calculation and memory read/write in a single atomic operation. Unfortunately, I could not find such an operation. But maybe there is a similarly fast or even faster way of achieving synchronization in this situation?
EDIT:
I forgot to specify that I cannot reuse slots in the array when tasks terminate. Since I guarantee access to the state of every task ever submitted since the scheduler was started, I need to store the final state of each task until the application terminates. Thus, static allocation is not really an option.
Do you need to be so economical with virtual address space? Can't you just set a very big upper limit and allocate enough address space for it (maybe even a static array, or dynamic if you want the upper limit to be set at startup from command-line options).
Linux does lazy memory allocation so virtual pages that you never touch aren't actually using any physical memory. See Why is iterating though `std::vector` faster than iterating though `std::array`? that show by example that reading or writing an anonymous page for the first time causes a page fault. If it was a read access, it gets the kernel to CoW (copy-on-write) map it to a shared physical zero page. Only an initial write, or a write to a CoW page, triggers actual allocation of a physical page.
Leaving virtual pages completely untouched avoids even the overhead of wiring them into the hardware page tables.
If you're targeting a 64-bit ISA like x86-64, you have boatloads of virtual address space. Using up more virtual address space (as long as you aren't wasting physical pages) is basically fine.
Practical example of allocating more virtual address space than you can use:
If you allocate more memory than you could ever practically use (touching it all would definitely segfault or invoke the kernel's OOM killer), that will be as large or larger than you could ever grow via realloc.
To allocate this much, you may need to globally set /proc/sys/vm/overcommit_memory to 1 (no checking) instead of the default 0 (heuristic which makes extremely large allocations fail). Or use mmap(MAP_NORESERVE) to allocate it, making that one mapping just best-effort growth on page-faults.
The documentation says you might get a SIGSEGV on touching memory allocated with MAP_NORESERVE, which is different than invoking the OOM killer. But I think once you've already successfully touched memory, it is yours and won't get discarded. I think it's also not going to spuriously fail unless you're actually running out of RAM + swap space. IDK how you plan to detect that in your current design (which sounds pretty sketchy if you have no way to ever deallocate).
Test program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void) {
    size_t sz = 1ULL << 46; // 2**46 = 64 TiB = max power of 2 for x86-64 with 48-bit virtual addresses
    // in practice 1ULL << 40 (1 TiB) should be more than enough;
    // the smaller you pick, the less impact if multiple things use this trick in the same program

    //int *p = aligned_alloc(64, sz); // doesn't use NORESERVE so it will be limited by overcommit settings
    int *p = mmap(NULL, sz, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    madvise(p, sz, MADV_HUGEPAGE); // for good measure to reduce page-faults and TLB misses, since you're using large contiguous chunks of this array
    p[1000000000] = 1234; // or sz/sizeof(int) - 1 will also work; this is only touching 1 page somewhere in the array.
    printf("%p\n", p);
}
$ gcc -Og -g -Wall alloc.c
$ strace ./a.out
... process startup
mmap(NULL, 70368744177664, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x15c71ef7c000
madvise(0x15c71ef7c000, 70368744177664, MADV_HUGEPAGE) = 0
... stdio stuff
write(1, "0x15c71ef7c000\n", 15) = 15
0x15c71ef7c000
exit_group(0) = ?
+++ exited with 0 +++
My desktop has 16GiB of RAM (a lot of it in use by Chromium and some big files in /tmp) + 2GiB of swap. Yet this program allocated 64 TiB of virtual address space and touched 1 int of it nearly instantly. Not measurably slower than if it had only allocated 1MiB. (And future performance from actually using that memory should also be unaffected.)
The largest power-of-2 you can expect to work on current x86-64 hardware is 1ULL << 46. The total lower canonical range of the 48-bit virtual address space is 47 bits (user-space virtual address space on Linux), and some of that is already allocated for stack/code/data. Allocating a contiguous 64 TiB chunk of that still leaves plenty for other allocations.
(If you do actually have that much RAM + swap, you're probably waiting for a new CPU with 5-level page tables so you can use even more virtual address space.)
Speaking of page tables, the larger the array the more chance of putting some other future allocations very very far from existing blocks. This can have a minor cost in TLB-miss (page walk) time, if your actual in-use pages end up more scattered around your address space in more different sub-trees of the multi-level page tables. That's more page-table memory to keep in cache (including cached within the page-walk hardware).
The allocation size doesn't have to be a power of 2 but it might as well be. There's also no reason to make it that big. 1ULL << 40 (1TiB) should be fine on most systems. IDK if having more than half the available address space for a process allocated could slow future allocations; bookkeeping is I think based on extents (ptr + length) not bitmaps.
Keep in mind that if everyone starts doing this for random arrays in libraries, that could use up a lot of address space. This is great for the main array in a program that spends a lot of time using it. Keep it as small as you can while still being big enough to always be more than you need. (Optionally make it a config parameter if you want to avoid a "640kiB is enough for everyone" situation). Using up virtual address space is very low-cost, but it's probably better to use less.
Think of this as reserving space for future growth but not actually using it until you touch it. Even though by some ways of looking at it, the memory already is "allocated". But in Linux it really isn't. Linux defaults to allowing "overcommit": processes can have more total anonymous memory mapped than the system has physical RAM + swap. If too many processes try to use too much by actually touching all that allocated memory, the OOM killer has to kill something (because the "allocate" system calls like mmap have already returned success). See https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
(With MAP_NORESERVE, it's only reserving address space which is shared between threads, but not reserving any physical pages until you touch them.)
You probably want your array to be page-aligned: #include <stdalign.h> so you can use something like
alignas(4096) struct entry process_array[MAX_LEN];
Or for non-static, allocate it with C11 aligned_alloc().
Give back early parts of the array when you're sure all threads are done with it
Page alignment makes it easy do the calculations to "give back" a memory page (4kiB on x86) if your array's logical size shrinks enough. madvise(addr, 4096*n, MADV_FREE); (Linux 4.5 and later). This is kind of like mmap(MAP_FIXED) to replace some pages with new untouched anonymous pages (that will read as zeroes), except it doesn't split up the logical mapping extents and create more bookkeeping for the kernel.
Don't bother with this unless you're returning multiple pages, and leave at least one page unfreed above the current top to avoid page faults if you grow again soon. Like maybe maintain a high-water mark that you've ever touched (without giving back) and a current logical size. If high_water - logical_size > 16 pages, give back all pages from 4 past the logical size up to the high-water mark.
If you will typically be actually using/touching at least 2MiB of your array, use madvise(MADV_HUGEPAGE) when you allocate it to get the kernel to prefer using transparent hugepages. This will reduce TLB misses.
(Use strace to see return values from your madvise system calls, and look at /proc/PID/smaps, to see if your calls are having the desired effect.)
If up-front allocation is unacceptable, RCU (read-copy-update) might be viable if it's read-mostly. https://en.wikipedia.org/wiki/Read-copy-update. But copying a gigantic array every time an element changes isn't going to work.
You'd want a different data-structure entirely where only small parts need to be copied. Or something other than RCU; like your answer, you might not need the read side being always wait-free. The choice will depend on acceptable worst-case latency and/or average throughput, and also how much contention there is for any kind of ref counter that has to bounce around between all threads.
Too bad there isn't a realloc variant that attempts to grow without copying so you could attempt that before bothering other threads. (e.g. have threads with idx>len spin-wait on len in case it increases without the array address changing.)
So, I came up with a solution:
Reading:
while (true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
int x = array[idx];
cnt--;
Writing:
if (idx == len) {
    wait = true;
    while (cnt > 0); // busy wait to minimize latency of reallocation
    array = realloc(array, 2*len*sizeof(int));
    if (!array) abort(); // shit happens
    len *= 2; // must not be updated before reallocation completed
    wait = false;
}
// this is why len must be updated after realloc:
// it serves for synchronization with other writers
// exceeding the current length limit
while (idx > len) { yield(); }
while (true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
array[idx] = x;
cnt--;
wait is an atomic bool initialized as false, cnt is an atomic int initialized as zero.
This only works because I know that task IDs are assigned in ascending order without gaps, and that no task state is read before it is initialized by the write operation. So I can always rely on exactly one thread pulling an ID that exceeds the current array length by 1. New tasks created concurrently will block their threads until the responsible thread has performed the reallocation. Hence the busy wait, since the reallocation should happen quickly and the other threads should not have to wait for long.
This way, I eliminate the big bottleneck lock. Array accesses can be made concurrently at the cost of two atomic additions. Since reallocation occurs seldom (due to the exponential growth), array access is practically free of blocking.
EDIT:
After taking a second look, I noticed that one has to be careful about reordering of stores around the length update. Also, the whole thing only works if concurrent writes always use different indices. This is the case for my implementation, but might not generally be. Thus, this is not as elegant as I thought and the solution presented in the accepted answer should be preferred.

Allocating more memory than there exists using malloc

This code snippet will allocate 2 GB every time it reads the letter 'u' from stdin, and will initialize all the allocated chars once it reads 'a'.
#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <vector>
const size_t bytes = 2147483648ULL; // 2 GB
using namespace std;
int main()
{
    char input[8] = {0};
    vector<char *> activate;
    while (input[0] != 'q')
    {
        // gets() is unsafe (and overruns a 1-byte buffer); use fgets() instead
        if (!fgets(input, sizeof input, stdin)) break;
        if (input[0] == 'u')
        {
            char *m = (char *)malloc(bytes);
            if (m == NULL) cout << "can't allocate mem" << endl;
            else
            {
                cout << "ok" << endl;
                activate.push_back(m);
            }
        }
        else if (input[0] == 'a')
        {
            for (size_t i = 0; i < activate.size(); i++)
            {
                char *m = activate[i];
                for (size_t j = 0; j < bytes; j++)
                    m[j] = 'a';
            }
        }
    }
    return 0;
}
I am running this code on a Linux virtual machine that has 3 GB of RAM. While monitoring system resource usage with the htop tool, I noticed that the malloc operation is not reflected in the resource usage.
For example, when I input 'u' only once (i.e. allocate 2 GB of heap memory), I don't see the memory usage increase by 2 GB in htop. Only when I input 'a' (i.e. initialize) do I see the memory usage increase.
As a consequence, I am able to "malloc" more heap memory than there exists. For example, I can malloc 6 GB (which is more than my RAM plus swap) and malloc allows it (i.e. NULL is not returned). But when I try to initialize the allocated memory, I can see the memory and swap filling up until the process is killed.
My questions:
1. Is this a kernel bug?
2. Can someone explain to me why this behavior is allowed?
It is called memory overcommit. You can disable it by running as root:
echo 2 > /proc/sys/vm/overcommit_memory
and it is not a kernel feature that I like (so I always disable it). See malloc(3) and mmap(2) and proc(5)
NB: echo 0 instead of echo 2 often -but not always- works also. Read the docs (in particular proc man page that I just linked to).
from man malloc (online here):
By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available.
So when you merely allocate too much, malloc "lies" to you; when you actually use the allocated memory, the kernel tries to find enough physical memory for it, and the process may crash (or be killed) if it can't.
No, this is not a kernel bug. You have discovered something known as late paging (or overcommit).
Until you write a byte to the address allocated with malloc (...) the kernel does little more than "reserve" the address range. This really depends on the implementation of your memory allocator and operating system of course, but most good ones do not incur the majority of kernel overhead until the memory is first used.
The Hoard allocator is one big offender that comes to mind immediately; in extensive testing I have found it almost never takes advantage of a kernel that supports late paging. You can always mitigate the effects of late paging in any allocator by zero-filling the entire memory range immediately after allocation.
Real-time operating systems like VxWorks will never allow this behavior because late paging introduces serious latency. Technically, all it does is put the latency off until a later indeterminate time.
For a more detailed discussion, you may be interested to see how IBM's AIX operating system handles page allocation and overcommitment.
This is a result of what Basile mentioned: overcommit memory. However, the explanation is kind of interesting.
Basically when you attempt to map additional memory in Linux (POSIX?), the kernel will just reserve it, and will only actually end up using it if your application accesses one of the reserved pages. This allows multiple applications to reserve more than the actual total amount of ram / swap.
This is desirable behavior on most Linux environments unless you've got a real-time OS or something where you know exactly who will need what resources, when and why.
Otherwise somebody could come along, malloc all the RAM (without actually doing anything with it) and OOM your apps.
Another example of this lazy allocation is mmap(), where you get a virtual mapping that the file you're mapping can fit inside, but only a small amount of real memory is dedicated to the effort. This allows you to mmap() huge files (larger than your available RAM) and use them like normal file handles, which is nifty.
Initializing / working with the memory should work:
memset(m, 0, bytes);
Also, you could use calloc, which not only allocates memory but also fills it with zeros for you:
char* m = (char*) calloc(1, bytes);
1. Is this a kernel bug?
No.
2. Can someone explain to me why this behavior is allowed?
There are a few reasons:
Mitigates the need to know the eventual memory requirement - it's often convenient for an application to be able to allocate an amount of memory that it considers an upper limit on what it might actually need. For example, if it's preparing some kind of report, either an initial pass just to calculate the eventual size of the report or realloc()ing successively larger areas (with the risk of having to copy) may significantly complicate the code and hurt performance, whereas multiplying some maximum length of each entry by the number of entries can be very quick and easy. If you know virtual memory is relatively plentiful as far as your application's needs are concerned, then making a larger allocation of virtual address space is very cheap.
Sparse data - if you have virtual address space to spare, being able to use a sparse array with direct indexing, or to allocate a hash table with a generous capacity()-to-size() ratio, can lead to a very high-performance system. Both work best (in the sense of low overhead/waste and efficient use of memory caches) when the data element size is a multiple of the memory page size, or, failing that, much larger than it or a small integral fraction of it.
Resource sharing - consider an ISP offering a "1 gigabit per second" connection to 1000 consumers in a building. They know that if all the consumers use it simultaneously, each will get about 1 megabit, but they rely on real-world experience that, though people ask for 1 gigabit and want a good fraction of it at specific times, there's inevitably some lower maximum and a much lower average for concurrent usage. The same insight applied to memory allows operating systems to support more applications than they otherwise could, with reasonable average success at satisfying expectations. Much as the shared Internet connection degrades in speed as more users make simultaneous demands, paging from swap on disk may kick in and reduce performance. But unlike an Internet connection, there's a hard limit on swap, and if all the apps really do try to use the memory concurrently such that that limit is exceeded, some will start getting signals/interrupts/traps reporting memory exhaustion. In summary, with this memory overcommit behavior enabled, simply checking that malloc()/new returned a non-NULL pointer is not sufficient to guarantee that physical memory is actually available, and the program may still receive a signal later as it attempts to use the memory.

How and why an allocation memory can fail?

This is a question I asked myself when I was a student; failing to get a satisfying answer, I got it little by little out of my mind... till today.
I know I can deal with a memory allocation error either by checking whether the returned pointer is NULL or by handling the bad_alloc exception.
OK, but I wonder: how and why can a call to new fail? To my knowledge, a memory allocation can fail if there is not enough space in the free store. But does this situation really occur nowadays, with several GB of RAM (at least on a regular computer; I am not talking about embedded systems)? Are there other situations where a memory allocation failure may occur?
Although you've gotten a number of answers about why/how memory could fail, most of them are sort of ignoring reality.
In reality, on real systems, most of these arguments don't describe how things really work. Although they're right from the viewpoint that these are reasons an attempted memory allocation could fail, they're mostly wrong from the viewpoint of describing how things are typically going to work in reality.
Just for example, on Linux, if you try to allocate more memory than the system has available, your allocation will not fail (i.e., you won't get a null pointer or a std::bad_alloc exception). Instead, the system will "overcommit", so you get what appears to be a valid pointer -- but when/if you attempt to use all that memory, you'll get a fault, and/or the OOM killer will run, trying to free memory by killing processes that use a lot of memory. Unfortunately, this may just as easily kill the program making the request as other programs (in fact, many of the examples given that attempt to cause allocation failure by repeatedly allocating big chunks of memory should probably be among the first to be killed).
Windows works a little closer to how the C and C++ standards envision things (but only a little). Windows is typically configured to expand the swap file if necessary to meet a memory allocation request. This means that as you allocate more memory, the system will go semi-crazy swapping memory around, creating bigger and bigger swap files to meet your request.
That will eventually fail, but on a system with lots of drive space, it might run for hours (most of it madly shuffling data around on disk) before that happens. At least on a typical client machine where the user is actually... well, using the computer, they'll notice that everything has ground to a halt and do something to stop it well before the allocation fails.
So, to get a memory allocation that truly fails, you're typically looking for something other than a typical desktop machine. A few examples include a server that runs unattended for weeks at a time, and is so lightly loaded that nobody notices that it's thrashing the disk for, say, 12 hours straight, or a machine running MS-DOS or some RTOS that doesn't supply virtual memory.
Bottom line: you're basically right, and they're basically wrong. While it's certainly true that if you allocate more memory than the machine supports, that something's got to give, it's generally not true that the failure will necessarily happen in the way prescribed by the C++ standard -- and, in fact, for typical desktop machines that's more the exception (pardon the pun) than the rule.
Apart from the obvious "out of memory", memory fragmentation can also cause this. Imagine a program that does the following:
until main memory is almost full:
    allocate 1020 bytes
    allocate 4 bytes
free all the 1020-byte blocks
If the memory manager places all of these sequentially in memory in the order they are allocated, we now have plenty of free memory, but any allocation larger than 1020 bytes will not find a contiguous space to fit in, and will fail.
Usually on modern machines an allocation will fail due to scarcity of virtual address space: if you have a 32-bit process that tries to allocate more than 2 or 3 GB of memory¹, even if there is physical RAM (or paging file) to satisfy the allocation, there simply won't be space in the virtual address space to map the newly allocated memory.
Another (similar) situation happens when the virtual address space is heavily fragmented, and thus the allocation fails because there's not enough contiguous addresses for it.
Also, actually running out of memory can happen; in fact, I got into such a situation last week. But several operating systems (notably Linux) don't return NULL in this case: Linux will happily give you a pointer to an area of memory that isn't committed yet, and actually allocate it when the program tries to write to it. If at that moment there's not enough memory, the kernel will try to kill some memory-hogging processes to free memory (an exception to this behavior seems to be when you try to allocate more than the whole capacity of the RAM and swap partition - in that case you get NULL up front).
Another cause of getting NULL from malloc may be resource limits enforced by the OS on the process; for example, trying to run this code
#include <cstdlib>
#include <iostream>
#include <limits>

void mallocbsearch(std::size_t lower, std::size_t upper)
{
    std::cout << "[" << lower << ", " << upper << "]\n";
    if (upper - lower <= 1)
    {
        std::cout << "Found! " << lower << "\n";
        return;
    }
    std::size_t mid = lower + (upper - lower) / 2;
    void *ptr = std::malloc(mid);
    if (ptr)
    {
        std::free(ptr);
        mallocbsearch(mid, upper);
    }
    else
        mallocbsearch(lower, mid);
}

int main()
{
    mallocbsearch(0, std::numeric_limits<std::size_t>::max());
    return 0;
}
on Ideone you find that the maximum allocation size is about 530 MB, which is probably a limit enforced by setrlimit (similar mechanisms exist on Windows).
¹ It varies between OSes and can often be configured; the total virtual address space of a 32-bit process is 4 GB, but on all current mainstream OSes a big chunk of it (the upper 2 GB, for 32-bit Windows with default settings) is reserved for kernel data.
The amount of memory available to the given process is finite. If the process exhausts its memory, and tries to allocate more, the allocation would fail.
There are other reasons why an allocation could fail. For example, the heap could get fragmented and not have a single free block large enough to satisfy the allocation request.

Can CreateThread interfere with VirtualAlloc usage?

Is it possible for stack space allocated by CreateThread to interfere with the usage of VirtualAlloc? I can't find any discussion or documentation explaining precisely where stack space is allowed to be allocated...
The following more precisely illustrates my question:
uint8_t *baseA = (uint8_t*)VirtualAlloc(NULL,1,MEM_RESERVE,PAGE_NOACCESS);
// Create a thread with the default stack size
HANDLE hThread = CreateThread(NULL, 0, SomeThreadProc, NULL, 0, NULL);
// Possibly create even more threads here.
// Can this ever fail in the absence of other allocators? It doesn't here...
uint8_t *baseB = (uint8_t*)VirtualAlloc(NULL,1,MEM_RESERVE,PAGE_NOACCESS);
// Furthermore, in this test, baseB-baseA == 65536 (unless the debugger did something),
// so nothing appeared between baseA and baseB... not even enough space for the
// full 64kb of wastage, as baseA points to 4096 bytes by itself
If it does in fact use some analogue of VirtualAlloc, is there a way to change how Windows allocates stack space in a given process?
Stack space can be allocated anywhere in the address space of the process. There is no documentation on this now and it is unlikely that such documentation will appear in the future.
You can safely assume that thread creation and VirtualAlloc are independent. If this were not the case, a lot of things would be broken. The allocator cannot give out overlapping address ranges; that is unthinkable. The problem is somewhere else.
The only thing that might look like a correlation is the amount of memory used and the resulting virtual address space fragmentation. In that case the latest request will simply fail.
I have worked on memory analysis utilities.
[Image: distribution of the number of virtual allocations per allocation size.]
[Image: example of the address space contents for a 32-bit process (blue - committed, magenta - reserved, green - free memory).]
What I write here is based on real experience.
The Windows NT kernel performs memory allocation operations in a thread-safe manner: only one thread of a process can allocate memory at a time, which makes all allocations thread-safe (in theory).
There shouldn't be any interference between stack allocation and virtual allocation.
Also, you should keep in mind that you can allocate 1 GB of address space while your program still uses only 2 MB of RAM.
That's because Windows reserves virtual space but doesn't commit physical memory to it until you actually use it (write to it).
Actually, the memory management is a lot more complicated, but for now you can be sure that no allocation operations should ever interfere, since the allocator serializes your process's allocation requests, delaying all other threads' requests while one allocation is being processed (lock contention, not a true deadlock).
EDIT: That also means that allocation and deallocation can become a performance problem if you allocate millions of small blocks. It's usually better to allocate/deallocate larger memory areas because of this locking behavior.