Fast synchronized access to shared array with changing base address (in C11) - concurrency

I am currently designing a user space scheduler in C11 for a custom co-processor under Linux (user space, because the co-processor does not run its own OS, but is controlled by software running on the host CPU). It keeps track of all the tasks' states with an array. Task states are regular integers in this case. The array is dynamically allocated and each time a new task is submitted whose state does not fit into the array anymore, the array is reallocated to twice its current size. The scheduler uses multiple threads and thus needs to synchronize its data structures.
Now, the problem is that I very often need to read entries in that array, since I need to know the states of tasks for scheduling decisions and resource management. If the base address was guaranteed to always be the same after each reallocation, I would simply use C11 atomics for accessing it. Unfortunately, realloc obviously cannot give such a guarantee. So my current approach is wrapping each access (reads AND writes) with one big lock in the form of a pthread mutex. Obviously, this is really slow, since there is locking overhead for each read, and the read is really small, since it only consists of a single integer.
To clarify the problem, I give some code here showing the relevant passages:
Writing:
// pthread_mutex_t mut;
// size_t len_arr;
// int *array, idx, x;
pthread_mutex_lock(&mut);
if (idx >= len_arr) {
    len_arr *= 2;
    array = realloc(array, len_arr*sizeof(int));
    if (array == NULL)
        abort();
}
array[idx] = x;
pthread_mutex_unlock(&mut);
Reading:
// pthread_mutex_t mut;
// int *array, idx;
pthread_mutex_lock(&mut);
int x = array[idx];
pthread_mutex_unlock(&mut);
I have already used C11 atomics for efficient synchronization elsewhere in the implementation and would love to use them to solve this problem as well, but I could not find an efficient way to do so. In a perfect world, there would be an atomic accessor for arrays which performs address calculation and memory read/write in a single atomic operation. Unfortunately, I could not find such an operation. But maybe there is a similarly fast or even faster way of achieving synchronization in this situation?
EDIT:
I forgot to specify that I cannot reuse slots in the array when tasks terminate. Since I guarantee access to the state of every task ever submitted since the scheduler was started, I need to store the final state of each task until the application terminates. Thus, static allocation is not really an option.

Do you need to be so economical with virtual address space? Can't you just set a very big upper limit and allocate enough address space for it (maybe even a static array, or dynamic if you want the upper limit to be set at startup from command-line options).
Linux does lazy memory allocation, so virtual pages that you never touch aren't actually using any physical memory. See Why is iterating though `std::vector` faster than iterating though `std::array`?, which shows by example that reading or writing an anonymous page for the first time causes a page fault. If it was a read access, it gets the kernel to CoW (copy-on-write) map it to a shared physical zero page. Only an initial write, or a write to a CoW page, triggers actual allocation of a physical page.
Leaving virtual pages completely untouched avoids even the overhead of wiring them into the hardware page tables.
If you're targeting a 64-bit ISA like x86-64, you have boatloads of virtual address space. Using up more virtual address space (as long as you aren't wasting physical pages) is basically fine.
Practical example of allocating more virtual address space than you can use:
If you allocate more memory than you could ever practically use (touching it all would definitely segfault or invoke the kernel's OOM killer), that allocation will be as large as or larger than anything you could ever grow to via realloc.
To allocate this much, you may need to globally set /proc/sys/vm/overcommit_memory to 1 (no checking) instead of the default 0 (a heuristic which makes extremely large allocations fail). Or use mmap(MAP_NORESERVE) to allocate it, making just that one mapping best-effort, with physical pages supplied on demand at page-fault time.
The documentation says you might get a SIGSEGV on touching memory allocated with MAP_NORESERVE, which is different from invoking the OOM killer. But I think once you've already successfully touched memory, it is yours and won't get discarded. I think it's also not going to spuriously fail unless you're actually running out of RAM + swap space. IDK how you plan to detect that in your current design (which sounds pretty sketchy if you have no way to ever deallocate).
Test program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void) {
    size_t sz = 1ULL << 46; // 2**46 = 64 TiB = max power of 2 for x86-64 with 48-bit virtual addresses
    // in practice 1ULL << 40 (1 TiB) should be more than enough.
    // the smaller you pick, the less impact if multiple things use this trick in the same program
    //int *p = aligned_alloc(64, sz); // doesn't use NORESERVE so it will be limited by overcommit settings
    int *p = mmap(NULL, sz, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { // mmap reports failure with MAP_FAILED, not NULL
        perror("mmap");
        return 1;
    }
    madvise(p, sz, MADV_HUGEPAGE); // for good measure to reduce page-faults and TLB misses, since you're using large contiguous chunks of this array
    p[1000000000] = 1234; // or sz/sizeof(int) - 1 will also work; this is only touching 1 page somewhere in the array.
    printf("%p\n", p);
    return 0;
}
$ gcc -Og -g -Wall alloc.c
$ strace ./a.out
... process startup
mmap(NULL, 70368744177664, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x15c71ef7c000
madvise(0x15c71ef7c000, 70368744177664, MADV_HUGEPAGE) = 0
... stdio stuff
write(1, "0x15c71ef7c000\n", 15) = 15
0x15c71ef7c000
exit_group(0) = ?
+++ exited with 0 +++
My desktop has 16GiB of RAM (a lot of it in use by Chromium and some big files in /tmp) + 2GiB of swap. Yet this program allocated 64 TiB of virtual address space and touched 1 int of it nearly instantly. Not measurably slower than if it had only allocated 1MiB. (And future performance from actually using that memory should also be unaffected.)
The largest power-of-2 you can expect to work on current x86-64 hardware is 1ULL << 46. The total lower canonical range of the 48-bit virtual address space is 47 bits (user-space virtual address space on Linux), and some of that is already allocated for stack/code/data. Allocating a contiguous 64 TiB chunk of that still leaves plenty for other allocations.
(If you do actually have that much RAM + swap, you're probably waiting for a new CPU with 5-level page tables so you can use even more virtual address space.)
Speaking of page tables, the larger the array the more chance of putting some other future allocations very very far from existing blocks. This can have a minor cost in TLB-miss (page walk) time, if your actual in-use pages end up more scattered around your address space in more different sub-trees of the multi-level page tables. That's more page-table memory to keep in cache (including cached within the page-walk hardware).
The allocation size doesn't have to be a power of 2 but it might as well be. There's also no reason to make it that big. 1ULL << 40 (1TiB) should be fine on most systems. IDK if having more than half the available address space for a process allocated could slow future allocations; bookkeeping is I think based on extents (ptr + length) not bitmaps.
Keep in mind that if everyone starts doing this for random arrays in libraries, that could use up a lot of address space. This is great for the main array in a program that spends a lot of time using it. Keep it as small as you can while still being big enough to always be more than you need. (Optionally make it a config parameter if you want to avoid a "640kiB is enough for everyone" situation). Using up virtual address space is very low-cost, but it's probably better to use less.
Think of this as reserving space for future growth but not actually using it until you touch it. Even though by some ways of looking at it, the memory already is "allocated". But in Linux it really isn't. Linux defaults to allowing "overcommit": processes can have more total anonymous memory mapped than the system has physical RAM + swap. If too many processes try to use too much by actually touching all that allocated memory, the OOM killer has to kill something (because the "allocate" system calls like mmap have already returned success). See https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
(With MAP_NORESERVE, it's only reserving address space which is shared between threads, but not reserving any physical pages until you touch them.)
You probably want your array to be page-aligned: #include <stdalign.h> so you can use something like
alignas(4096) struct entry process_array[MAX_LEN];
Or for non-static, allocate it with C11 aligned_alloc().
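One caveat if you go the aligned_alloc() route: C11 requires the size passed to aligned_alloc() to be a multiple of the alignment, so round up to a whole page first. A minimal sketch (the helper name is just for illustration):

#include <stdlib.h>

void *alloc_page_aligned(size_t nbytes)
{
    size_t page = 4096;                                 // x86 page size
    size_t rounded = (nbytes + page - 1) & ~(page - 1); // round up to a whole page
    return aligned_alloc(page, rounded);                // NULL on failure
}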
Give back early parts of the array when you're sure all threads are done with it
Page alignment makes it easy to do the calculations to "give back" a memory page (4kiB on x86) if your array's logical size shrinks enough. madvise(addr, 4096*n, MADV_FREE); (Linux 4.5 and later). This is kind of like mmap(MAP_FIXED) to replace some pages with new untouched anonymous pages (that will read as zeroes), except it doesn't split up the logical mapping extents and create more bookkeeping for the kernel.
Don't bother with this unless you're returning multiple pages, and leave at least one page unfreed above the current top to avoid page faults if you grow again soon. Like maybe maintain a high-water mark of pages you've ever touched (without giving back) and a current logical size. If high_water - logical_size > 16 pages, give back all pages from 4 past the logical size up to the high-water mark.
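In code, that bookkeeping could look something like this (a minimal sketch; the names high_water and logical_size and the 16/4-page thresholds are just the illustration above, counted in pages):

#include <sys/mman.h>
#include <stddef.h>

#define PAGE 4096UL

// base: page-aligned start of the array; high_water and logical_size are in pages
void maybe_give_back(char *base, size_t *high_water, size_t logical_size)
{
    if (*high_water > logical_size + 16) {       // only bother for multiple pages
        size_t keep = logical_size + 4;          // leave slack above the logical top
        madvise(base + keep * PAGE, (*high_water - keep) * PAGE,
                MADV_FREE);                      // Linux 4.5+: pages become lazily reclaimable
        *high_water = keep;
    }
}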
If you will typically be actually using/touching at least 2MiB of your array, use madvise(MADV_HUGEPAGE) when you allocate it to get the kernel to prefer using transparent hugepages. This will reduce TLB misses.
(Use strace to see return values from your madvise system calls, and look at /proc/PID/smaps, to see if your calls are having the desired effect.)
If up-front allocation is unacceptable, RCU (read-copy-update) might be viable if it's read-mostly. https://en.wikipedia.org/wiki/Read-copy-update. But copying a gigantic array every time an element changes isn't going to work.
You'd want a different data structure entirely, where only small parts need to be copied. Or something other than RCU; as in your answer, you might not need the read side to always be wait-free. The choice will depend on acceptable worst-case latency and/or average throughput, and also on how much contention there is for any kind of ref counter that has to bounce around between all threads.
Too bad there isn't a realloc variant that attempts to grow without copying so you could attempt that before bothering other threads. (e.g. have threads with idx>len spin-wait on len in case it increases without the array address changing.)
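(The closest thing on Linux is mremap() on an mmap-backed array: without MREMAP_MAYMOVE it grows the mapping in place or fails instead of moving it. A hedged sketch, assuming the array was allocated with mmap rather than realloc:)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

// Returns the same pointer on success, NULL if the kernel would have to move
// the mapping, so the caller can fall back to the slow path that copies.
int *try_grow_in_place(int *arr, size_t old_bytes, size_t new_bytes)
{
    void *p = mremap(arr, old_bytes, new_bytes, 0); // no MREMAP_MAYMOVE: in place or fail
    return p == MAP_FAILED ? NULL : (int *)p;
}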

So, I came up with a solution:
Reading:
while (true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
int x = array[idx];
cnt--;
Writing:
if (idx == len) {
    wait = true;
    while (cnt > 0); // busy wait to minimize latency of reallocation
    array = realloc(array, 2*len*sizeof(int));
    if (!array) abort(); // shit happens
    len *= 2; // must not be updated before reallocation completed
    wait = false;
}
// this is why len must be updated after realloc,
// it serves for synchronization with other writers
// exceeding the current length limit
while (idx > len) { yield(); }
while (true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
array[idx] = x;
cnt--;
wait is an atomic bool initialized as false, cnt is an atomic int initialized as zero.
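In C11 those declarations might look like this (a minimal sketch; I am assuming yield() stands for something like sched_yield(), and per the EDIT below the base pointer and length are best made atomic too):

#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>                      // sched_yield(), assumed for yield()

static atomic_bool wait_flag = false;   // "wait" in the snippets above
static atomic_int  cnt       = 0;
static _Atomic(int *) array;            // the base pointer itself is shared across threads
static atomic_size_t  len;              // writers spin-check it, so atomic as well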
This only works because I know that task IDs are assigned in ascending order without gaps, and that no task state is read before it is initialized by the write operation. So I can always rely on the one thread that pulls the ID exceeding the current array length by exactly 1. Threads creating new tasks concurrently will block until the responsible thread has performed the reallocation. Hence the busy wait: the reallocation should happen quickly so the other threads do not have to wait for too long.
This way, I eliminate the bottlenecking big lock. Array accesses can be made concurrently at the cost of two atomic additions. Since reallocation occurs seldom (due to exponential growth), array accesses practically never block.
EDIT:
After taking a second look, I noticed that one has to be careful about reordering of stores around the length update. Also, the whole thing only works if concurrent writes always use different indices. This is the case for my implementation, but might not generally be. Thus, this is not as elegant as I thought and the solution presented in the accepted answer should be preferred.

Related

Measuring overhead of page-table locks in C/C++

There are two locks on the PMD and PTE levels of the page table in the Linux kernel. Each time a thread/process allocates/maps memory, it must hold one of these locks to update the page table accordingly. Obviously, as the number of threads increases, contention for these locks increases as well. This may degrade memory-mapping throughput, since many threads end up contending for the same spinlock.
What I'd like to measure for a task, is the worst-case overhead of these locks on memory mapping throughput but I have no idea how to measure it.
I tried using malloc in an infinite loop as I increase the number of threads running the same loop. I check /proc/{pid}/maps for each set of running threads to count the number of mapped regions. But I'm not sure if this is the correct way. Besides, this method consumes a lot of memory.
Is there any efficient way to measure the worst-case overhead of these locks?
A lot of the comments are correct, however I thought I might try my hand at a full response.
Firstly, using malloc will not give you explicit control over page mappings; as a comment said, the malloc implementation in stdlib will actually request a huge chunk of memory from the kernel after the first allocation and serve later requests from it.
Secondly, when creating a new thread, the same address space is used, so no additional mappings are created.
I'm going to assume you want to do this from user space, because from kernel space, you can do a lot of things to make this exploration somewhat degenerate (for example you can just try and map pages to the same location).
Instead you want to allocate anonymous pages using mmap.
Mmap is an explicit call to create a Virtual Memory Entry so that when that particular page is accessed for the first time, the kernel can actually put some blank physical memory at that location.
It is the first access to that location that causes the fault, and that first access which will actually use the locks in the PTE and PUD.
Ensuring Good Benchmarking Procedure:
If you are just trying to stress the page tables, you might also want to turn off Transparent Huge Pages within that process (the syscall to look into is prctl with the flag PR_SET_THP_DISABLE). Run this before spawning any child processes.
Pin Threads to cores using cpuset.
You want to explicitly control your region of interest, so you want to pick specific addresses for each thread that all share the same page table. This way you ensure that the maximum number of locks is used.
Use a pseudo-random function with a different seed for each thread to pick the locations written to.
Compare with a baseline that does the exact same thing but stresses very different parts of the address space.
Make sure that as little as possible differs between the baseline and the contended workload.
Do not over-subscribe the processor; that adds overhead from context switches, which are notoriously hard to root out.
Make sure to start capturing timing after the threads are created and stop it before they are destroyed.
What does this translate to in each thread:
address = <per-thread address>
total = 0;
for (int i = 0; i < N; i++)
{
    uint64_t* x = (uint64_t*) mmap((void*) address, 4096, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); // maps one page anonymously
    // note: without MAP_FIXED the address is only a hint; MAP_ANONYMOUS also
    // requires MAP_PRIVATE (or MAP_SHARED) to be valid
    assert(x != MAP_FAILED);  // mmap signals failure with MAP_FAILED, not NULL
    *x ^= pseudo_rand();      // accesses the page and causes the allocation
    total += *x;              // for fun
    int res = munmap((void*) x, 4096); // deallocates the page (similar locks)
    assert(!res);
}
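For the checklist items about pinning and THP, a minimal Linux sketch (the prctl flag is PR_SET_THP_DISABLE; the helper name is illustrative; call prctl in the parent before spawning workers):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/prctl.h>

// Pin the calling thread to one core so the scheduler cannot migrate it.
void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

// In main(), before creating the worker threads:
//   prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);  // opt this process out of THP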
The big takeaways are:
Use mmap and explicitly access the allocated location to actually control individual page allocation.
The compactness of addresses determines what locks are acquired.
Measuring kernel and virtual memory things requires strict discipline in benchmark procedure.

Avoid memory fragmentation when memory pools are a bad idea

I am developing a C++ application where the program runs endlessly, allocating and freeing millions of strings (char*) over time. And RAM usage is a serious consideration in the program. This results in RAM usage getting higher and higher over time. I think the problem is heap fragmentation. And I really need to find a solution.
You can see in the image that after millions of allocations and frees in the program, the usage just keeps increasing. And the way I am testing it, I know for a fact that the data it stores is not increasing. I can guess that you will ask, "How are you sure of that?", "How are you sure it's not just a memory leak?", Well.
This test ran much longer. I run malloc_trim(0) whenever possible in my program. And it seems the application can finally return the unused memory to the OS, and it goes almost to zero (the actual data size my program currently has). This implies the problem is not a memory leak. But I can't rely on this behavior; the allocation and freeing pattern of my program is random. What if it never releases the memory?
I said memory pools are a bad idea for this project in the title. Of course I don't have absolute knowledge. But the strings I am allocating can be anything between 30-4000 bytes. Which makes many optimizations and clever ideas much harder. Memory pools are one of them.
I am using GCC 11 / G++ 11 as the compiler, so if some old versions have bad allocators, I shouldn't have that problem.
How am I getting memory usage? The Python psutil module: proc.memory_full_info()[0], which gives me RSS.
Of course, you don't know the details of my program. It is still a valid question whether this is indeed because of heap fragmentation. Well, what I can say is, I am keeping up-to-date counts of how many allocations and frees took place. And I know the element counts of every container in my program. But if you still have some ideas about the causes of the problem, I am open to suggestions.
I can't just allocate, say, 4096 bytes for all the strings so it would become easier to optimize. That's the opposite of what I am trying to do.
So my question is: what do programmers do (what should I do) in an application where millions of allocs and frees take place over time, and they are of different sizes, so memory pools are hard to use efficiently? I can't change what the program does, I can only change implementation details.
Bounty Edit: When trying to utilize memory pools, isn't it possible to make multiple of them, to the extent that there is a pool for every possible byte count? For example, my strings can be anything between 30-4000 bytes. So couldn't somebody make 4000 - 30 + 1 = 3971 memory pools, one for each and every possible allocation size of the program? Isn't this applicable? All pools could start small (to not lose much memory), then grow, in a balance between performance and memory. I am not trying to make use of memory pools' ability to reserve big spaces beforehand. I am just trying to effectively reuse freed space, because of frequent allocs and frees.
Last edit: It turns out that the memory growth appearing in the graphs was actually from an http request queue in my program. I failed to see that the hundreds of thousands of tests I ran bloated this queue (something like a webhook). And the reasonable explanation of figure 2 is: I finally got DDOS banned from the server (or couldn't open a connection anymore for some reason), the queue emptied, and the RAM issue resolved itself. So anyone reading this question later in the future, consider every possibility. It would have never crossed my mind that it was something like this. Not a memory leak, but an implementation detail. Still, I think @Hajo Kirchhoff deserves the bounty; his answer was really enlightening.
If everything really is/works as you say it does and there is no bug you have not yet found, then try this:
malloc and other memory allocators usually work in chunks of 16 bytes anyway, even if the actual requested size is smaller than 16 bytes. So you only need about 4000/16 - 30/16 ≈ 250 different memory pools.
const int chunk_size = 16;
memory_pools pool[250]; // 250 memory pools; pool[idx] manages chunks of (idx+1)*chunk_size bytes

char* reserve_mem(size_t sz)
{
    size_t pool_idx_to_use = (sz - 1) / chunk_size; // sz 1..16 -> pool[0], 17..32 -> pool[1], ...
    char *rc = pool[pool_idx_to_use].allocate();
    return rc;
}
IOW, you have 250 memory pools. pool[0] allocates and manages chunks with a length of 16 bytes, pool[99] manages chunks with 1600 bytes, etc...
If you know the length distribution of your strings in advance, you can reserve initial memory for the pools based on this knowledge. Otherwise I'd probably just reserve memory for the pools in 4096-byte increments.
Because while the malloc C heap usually allocates memory in multiples of 16 bytes, it will (at least under Windows, but I am guessing Linux is similar here) ask the OS for memory - which usually works with 4K pages. IOW, the "outer" memory heap managed by the operating system reserves and frees memory in units of 4096 bytes.
So growing your own internal memory pool in 4096-byte increments means no fragmentation in the OS app heap. This 4096 page size (or a multiple of it) comes from the processor architecture: Intel processors have a built-in page size of 4K (or multiples thereof). I don't know about other processors, but I suspect similar architectures there.
So, to sum it up:
Use chunks of multiple of 16 bytes for your strings per memory pool.
Use chunks of multiple of 4K bytes to increase your memory pool.
That will align the memory use of your application with the memory management of the OS and avoid fragmentation as much as possible.
From the OS point of view, your application will only increment memory in 4K chunks. That's very easy to allocate and release. And there is no fragmentation.
From the internal (lib) C heap management point of view, your application will use memory pools and waste at most 15 bytes per string. Also all similar length allocations will be heaped together, so also no fragmentation.
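To make the idea concrete, here is a minimal sketch of one such fixed-size pool growing in 4096-byte slabs (illustrative only: no locking, no shrinking, and the chunk size must be at least sizeof(char *)):

#include <stdlib.h>

typedef struct pool {
    size_t chunk;       // fixed chunk size, a multiple of 16
    char  *free_list;   // the first bytes of a free chunk store the next-pointer
} pool;

char *pool_alloc(pool *p)
{
    if (!p->free_list) {                            // out of chunks: add one slab
        char *slab = aligned_alloc(4096, 4096);     // grow by exactly one OS page
        if (!slab)
            return NULL;
        for (size_t off = 0; off + p->chunk <= 4096; off += p->chunk) {
            *(char **)(slab + off) = p->free_list;  // push chunk onto the free list
            p->free_list = slab + off;
        }
    }
    char *rc = p->free_list;                        // pop the head chunk
    p->free_list = *(char **)rc;
    return rc;
}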
Here is a solution which might help, if your application has the following traits:
Runs on Windows
Has episodes where the working set is large, but all that data is released entirely at around the same point in time. (If the data is processed in some kind of batch mode and the work is done when it's done, and then the related data is freed.)
This approach uses a (unique?) Windows feature, called custom heaps. Possibly, you can create yourself something similar on other OS.
The functions, you need are in a header called <heapapi.h>.
Here is how it would look:
At the start of a memory-intensive phase of your program, use HeapCreate() to create a heap where all your data will go.
Perform the memory-intensive tasks.
At the end of the memory-intensive phase, free your data and call HeapDestroy().
Depending on the detailed behavior of your application (e.g. whether or not this memory intensive computation runs in 1 or multiple threads), you can configure the heap accordingly and possibly even gain a little speed (If only 1 thread uses the data, you can give HeapCreate() the HEAP_NO_SERIALIZE flag, so it would not take a lock.) You can also give an upper bound of what the heap will allow to store.
And once your computation is complete, destroying the heap also prevents long-term fragmentation, because when the time comes for the next computation phase, you start with a fresh heap.
Here is the documentation of the heap api.
In fact, I found this feature so useful, that I replicated it on FreeBSD and Linux for an application I ported, using virtual memory functions and my own heap implementation. Was a few years back and I do not have the rights for that piece of code, so I cannot show or share.
You could also combine this approach with fixed element size pools, using one heap for one dedicated block size and then expect less/no fragmentation within each of those heaps (because all blocks have same size).
struct EcoString {
    size_t heapIndex;
    size_t capacity;
    char* data;
    char* get() { return data; }
};

struct FixedSizeHeap {
    const size_t SIZE;
    HANDLE m_heap;
    explicit FixedSizeHeap(size_t size)
        : SIZE(size)
        , m_heap(HeapCreate(0, 0, 0))
    {
    }
    ~FixedSizeHeap() {
        HeapDestroy(m_heap);
    }
    bool allocString(size_t capacity, EcoString& str) {
        assert(capacity <= SIZE);
        str.capacity = SIZE; // we alloc SIZE bytes anyway...
        str.data = (char*)HeapAlloc(m_heap, 0, SIZE);
        if (nullptr != str.data)
            return true;
        str.capacity = 0;
        return false;
    }
    void freeString(EcoString& str) {
        HeapFree(m_heap, 0, str.data);
        str.data = nullptr;
        str.capacity = 0;
    }
};
struct BucketHeap {
    using Buckets = std::vector<FixedSizeHeap>; // or std::array
    Buckets buckets;
    /*
    (loop for i from 0
          for size = 40 then (+ size (ash size -1))
          while (< size 80000)
          collecting (cons i size))
    ((0 . 40) (1 . 60) (2 . 90) (3 . 135) (4 . 202) (5 . 303) (6 . 454) (7 . 681)
     (8 . 1021) (9 . 1531) (10 . 2296) (11 . 3444) (12 . 5166) (13 . 7749)
     (14 . 11623) (15 . 17434) (16 . 26151) (17 . 39226) (18 . 58839))
    Init buckets with index (first item) and SIZE (rest item)
    from the above item list.
    */
    // allocate(nbytes) looks for the correct bucket (linear or binary
    // search) and calls allocString() on that bucket. It stores the
    // bucket's index in the EcoString.heapIndex field (so free has an
    // easier time).
    // free(EcoString& str) uses heapIndex to find the right bucket and
    // then calls free on that bucket.
};
What you are trying to do is called a slab allocator and it is a very well studied problem with lots of research papers.
You don't need every possible size. Usually slabs come in power of 2 sizes.
Don't reinvent the wheel, there are plenty of opensource implementations of a slab allocator. The Linux kernel uses one.
Start by reading the Wikipedia page on Slab Allocation.
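For illustration, the size-class mapping such allocators typically use looks like this (a sketch of the idea, not the kernel's actual implementation):

#include <stddef.h>

// Smallest power-of-2 class that fits the request, with a 16-byte minimum.
size_t size_class(size_t sz)
{
    size_t c = 16;
    while (c < sz)
        c <<= 1;
    return c;
}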
I don't know the details of your program. There are many patterns for working with memory, each of which leads to different results and problems. However, you can take a look at the .Net garbage collector. It uses several generations of objects (similar to pools of memory). Each object in a generation is movable (its address in memory changes while the generation is being compacted). Such a trick allows it to "compress/compact" memory and reduce fragmentation. You can implement a memory manager with similar semantics. It's not necessary to implement a garbage collector at all.

How to query amount of allocated memory on Linux (and OSX)?

While this might look like a duplicate from other questions, let me explain why it's not.
I am looking to get a specific part of my application to degrade gracefully when a certain memory limit has been reached. I could have used a criterion based on remaining available physical memory, but this wouldn't be safe, because the OS could start paging out memory used by my application before that criterion is reached; my application would think there is still some physical memory left and keep allocating, etc. For the same reason, I can't use the amount of physical memory currently used by the process: as soon as the OS started swapping me out, I would keep allocating while the OS pages memory out, so that number would not grow anymore.
For this reason, I chose a criteria based on the amount of memory allocated by my application, i.e. very close to virtual memory size.
This question (How to determine CPU and memory consumption from inside a process?) provides great ways of querying the amount of virtual memory used by the current process, which I THOUGHT was what I needed.
On Windows, I'm using GetProcessMemoryInfo() and the PrivateUsage field, which works great.
On Linux, I tried several things (listed below) that did not work. The reason why virtual memory usage does not work for me is because of something that happens with OpenCL context creation on NVidia hardware on Linux. The driver reserves a region of the virtual memory space big enough to hold all RAM, all swap and all video memory. My guess is it does so for unified address space and everything. But it also means that the process reports using enormous amounts of memory. On my system for instance, top will report 23.3 Gb in the VIRT column (12 Gb of RAM, 6 Gb of swap, 2 Gb of video memory, which gives 20 Gb reserved by the NVidia driver).
On OSX, by using task_info() and the virtual_size field, I also get a bigger than expected number (a few Gb for an app that takes not even close to 1 Gb on Windows), but not as big as Linux.
So here is the big question: how can I get the amount of memory allocated by my application? I know that this is a somewhat vague question (what does "allocated memory" mean?), but I'm flexible:
I would prefer to include the application static data, code section and everything, but I can live without.
I would prefer to include the memory allocated for stacks, but I can live without.
I would prefer to include the memory used by shared libraries, but I can live without.
I don't really care for mmap stuff, I can do with or without at that point.
Etc.
What is really important is that the number grows with dynamic allocation (new, malloc, anything) and shrinks when the memory is released (which I know can be implementation-dependent).
Things I have tried
Here are a couple of solutions I have tried and/or thought of but that would not work for me.
Read from /proc/self/status
This is the approach suggested by how-to-determine-cpu-and-memory-consumption-from-inside-a-process. However, as stated above, this returns the amount of virtual memory, which does not work for me.
Read from /proc/self/statm
Very slightly worse: according to http://kernelnewbies.kernelnewbies.narkive.com/iG9xCmwB/proc-pid-statm-doesnt-match-with-status, which refers to Linux kernel code, the only difference between those two values is that the second one does not subtract reserved_vm from the amount of virtual memory. I would have HOPED that reserved_vm would include the memory reserved by the OpenCL driver, but it does not.
Use mallinfo() and the uordblks field
This does not seem to include all the allocations (I'm guessing the allocations made with new are missing), since for a +2Gb growth in virtual memory space (after doing some memory-heavy work and still holding the memory), I'm only seeing about 0.1Gb growth in the number returned by mallinfo().
Read the [heap] section size from /proc/self/smaps
This value started at around 336,760 Kb and peaked at 1,019,496 Kb for work that grew virtual memory space by +2Gb, and then it never goes down, so I'm not sure I can really rely on this number...
Monitor all memory allocations in my application
Yes, in an ideal world, I would have control over everybody who allocates memory. However, this is a legacy application, using tons of different allocators, some mallocs, some news, some OS-specific routines, etc. There are some plug-ins that could do whatever they want, they could be compiled with a different compiler, etc. So while this would be great to really control memory, this does not work in my context.
Read the virtual memory size before and after the OpenCL context initialization
While this could be a "hacky" way to solve the problem (and I might have to fallback to it), I would really wish for a more reliable way to query memory, because OpenCL context could be initialized somewhere out of my control, and other similar but non-OpenCL specific issues could creep in and I wouldn't know about it.
So that's pretty much all I've got. There is one more thing I have not tried yet, because it only works on OSX, but it is to use the approach described in Why does mstats and malloc_zone_statistics not show recovered memory after free?, i.e. use malloc_get_all_zones() and malloc_zone_statistics(), but I think this might be the same problem as mallinfo(), i.e. not take all allocations into account.
So, can anyone suggest a way to query memory usage (as vague of a term as this is, see above for precision) of a given process in Linux (and also OSX even if it's a different method)?
You can try and use information returned by getrusage():
#include <sys/time.h>
#include <sys/resource.h>
int getrusage(int who, struct rusage *usage);
struct rusage {
struct timeval ru_utime; /* user CPU time used */
struct timeval ru_stime; /* system CPU time used */
long ru_maxrss; /* maximum resident set size */
long ru_ixrss; /* integral shared memory size */
long ru_idrss; /* integral unshared data size */
long ru_isrss; /* integral unshared stack size */
long ru_minflt; /* page reclaims (soft page faults) */
long ru_majflt; /* page faults (hard page faults) */
long ru_nswap; /* swaps */
long ru_inblock; /* block input operations */
long ru_oublock; /* block output operations */
long ru_msgsnd; /* IPC messages sent */
long ru_msgrcv; /* IPC messages received */
long ru_nsignals; /* signals received */
long ru_nvcsw; /* voluntary context switches */
long ru_nivcsw; /* involuntary context switches */
};
If the memory information does not fit your purpose, observing the page fault counts can help monitor memory stress, which is what you intend to detect.
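For example, a minimal sketch of sampling that counter (hard faults are the ones that had to hit the disk; the helper name is illustrative):

#include <sys/resource.h>

// Returns cumulative hard page faults for this process, or -1 on error.
long major_faults_so_far(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}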
Have you tried a shared library interposer for Linux for section (5) above? So long as your application is not statically linking the malloc functions, you can interpose a new function between your program and the kernel malloc. I've used this tactic many times to collect stats on memory usage.
It does require setting LD_PRELOAD before running the program, but no source or binary changes. It is an ideal answer in many cases.
Here is an example of a malloc interposer:
http://www.drdobbs.com/building-library-interposers-for-fun-and/184404926
You probably will also want to do calloc and free. Calls to new generally end up as a call to malloc so C++ is covered as well.
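A minimal sketch of such an interposer (glibc/Linux; real code must guard against re-entrancy, since dlsym() and even fprintf() may themselves allocate):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

static void *(*real_malloc)(size_t);

void *malloc(size_t size)
{
    if (!real_malloc)  // lazily resolve the next malloc in the lookup chain (libc's)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);
    fprintf(stderr, "malloc(%zu) = %p\n", size, p);  // or bump an in-memory counter
    return p;
}

// Build: gcc -shared -fPIC interpose.c -o interpose.so -ldl
// Run:   LD_PRELOAD=./interpose.so ./your_program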
OS X seems to have similar capabilities but I have not tried it.
http://tlrobinson.net/blog/2007/12/overriding-library-functions-in-mac-os-x-the-easy-way-dyld_insert_libraries/
--Matt
Here is what I ended up using. I scan /proc/self/maps and sum the size of all the address ranges meeting my criteria, which is:
Only include ranges from inode 0 (i.e. no devices, no mapped file, etc.)
Only include ranges that are at least one of readable, writable or executable
Only include private memory
In my experiments I did not see instances of shared memory from inode 0. Maybe with inter-process shared memory...?
Here is the code for my solution:
#include <stdio.h>
#include <stddef.h>
#include <assert.h>

size_t getValue()
{
    FILE* file = fopen("/proc/self/maps", "r");
    if (!file)
    {
        assert(0);
        return 0;
    }
    size_t value = 0;
    char line[1024];
    while (fgets(line, 1024, file) != NULL)
    {
        ptrdiff_t start_address, end_address;
        char perms[4];
        ptrdiff_t offset;
        unsigned int dev_major, dev_minor; // unsigned to match the %x conversions
        unsigned long int inode;
        const int nb_scanned = sscanf(
            line, "%16tx-%16tx %c%c%c%c %16tx %02x:%02x %lu",
            &start_address, &end_address,
            &perms[0], &perms[1], &perms[2], &perms[3],
            &offset, &dev_major, &dev_minor, &inode
        );
        if (10 != nb_scanned)
        {
            assert(0);
            continue;
        }
        if ((inode == 0) &&
            (perms[0] != '-' || perms[1] != '-' || perms[2] != '-') &&
            (perms[3] == 'p'))
        {
            assert(dev_major == 0);
            assert(dev_minor == 0);
            value += (end_address - start_address);
        }
    }
    fclose(file);
    return value;
}
Since this is looping through all the lines in /proc/self/maps, querying memory that way is significantly slower than using "Virtual Memory currently used by current process" from How to determine CPU and memory consumption from inside a process?.
However, it provides an answer much closer to what I need.

How to use dynamic data structures like std::vector and prevent paging?

Following up on this question Vector push_back only if enough memory is available, I tried to rephrase the question in a more general sense.
Consider this fragment :
vector<double> v1;
cout << "pushing back ..." << endl;
while (true) {
    try {
        v1.push_back(0.0);
    } catch (bad_alloc& ba) {
        cout << "bad_alloc caught: " << ba.what() << endl;
        break;
    }
}
Which of the following statements regarding the above code fragment are true?
1) Eventually, the catch block will be reached
2) You can not determine beforehand if there is enough memory for push_back to not throw bad_alloc
3) Every action in the catch block that involves memory allocation could fail, because there is no memory left
The first thing I did was to run this program on Windows, which led to the observation that before any paging happened, bad_alloc was thrown because obviously the per-process amount of memory had been exceeded. This observation led to the next statement:
4) On most Operating Systems bad_alloc will be thrown before paging happens, but there is no certain way to tell beforehand.
After some research I came up with the following thoughts on the above statements :
A1) True, the catch block will be reached but maybe not before the OS has performed intensive I/O operations due to paging.
A2) True, at least not in an OS independent way
A3) True, you have to preallocate memory in order to do something useful with data in the vector gathered so far (e.g. do some paging on your own, if you find this useful)
A4) True, this is dependent on multiple OS-specific parameters like max amount of RAM per process, process priority, strategy of the OS process scheduler etc ...
I am not sure if A1-A4 are correct, hence my question, but if so, here is the next statement:
5) If you need to write some algorithm and be sure that there will be no paging, do not use dynamic data structures like std::vector. Instead use an Array and make sure it will stay in memory using OS-specific functions like for example mlockall (Unix)
If 5) is true, it leads to the last statement:
6) There is no OS-independent way to write a program that will not cause paging.
Thanks everybody in advance for sharing your thoughts on the above statements.
If your program must run on Windows/Unix/OS X, make wrapper functions:
bool lockMemoryRegion( void *addr, size_t size )
{
#ifdef WIN32
    return VirtualLock( addr, size ) != 0;
#else
    return mlock( addr, size ) == 0;
#endif
}

bool unlockMemoryRegion( void *addr, size_t size )
{
#ifdef WIN32
    return VirtualUnlock( addr, size ) != 0;
#else
    return munlock( addr, size ) == 0;
#endif
}
Then if you need to lock memory used by std::vector:
std::vector<int> v( 1000 );
lockMemoryRegion( v.data(), v.capacity() * sizeof (int) );
Use memory locks only if you really ought to. Locking pages into memory may degrade the performance of the system by reducing the available RAM and forcing the system to swap out other critical pages to the paging file.
What a rambling mess of a question. You still need to get your head around modern memory allocation on the operating systems you're actually interested in. I'd recommend a bit of systematic background reading, as answers to your hodge-podge of questions won't necessarily give you the proper big picture.
1) Eventually, the catch block will be reached
2) You can not determine beforehand if there is enough memory for push_back to not throw bad_alloc
3) Every action in the catch block that involves memory allocation could fail, because there is no memory left
None of these are necessarily true... the OS may allocate the virtual address space then terminate the process when it's accessed and the OS can't find physical memory to back it. Further, a low-memory process killer may decide you've pushed too far and terminate you or any other non-critical process.
For 3) specifically, the Standard explicitly says an implementation may use a separate memory area to convey the thrown object towards the catch statement that will handle it - after all, it doesn't make sense to put it on the same stack you're unwinding during exception processing. So, that memory allocation has far fewer issues than dynamic memory allocation (with new or malloc), but may still page and therefore precipitate process termination in very rare cases. It's still dangerous if the object being thrown internally does dynamic memory allocation (e.g. stores a description in a string or istringstream data member). Similarly, the catch statement may allocate stack space for variables, expression evaluations, function calls etc. - those could also precipitate failure but are less dangerous than new/malloc.
4) On most Operating Systems bad_alloc will be thrown before paging happens, but there is no certain way to tell beforehand.
Certainly not - what would be the point of paging then?
A1) True, the catch block will be reached but maybe not before the OS has performed intensive I/O operations due to paging.
If there happens to be swap disk in use, then yes you should get paging happening before an out of memory condition, but again that may not manifest as an exception.
A2) True, at least not in an OS independent way
Nope... it wasn't true to begin with.
A3) True, you have to preallocate memory in order to do something useful with data in the vector gathered so far (e.g. do some paging on your own, if you find this useful)
You don't have to preallocate anything... which would be done with a constructor parameter or resize... that's optional, but may allow you to process more data without hitting an out of memory condition simply because there's less need for momentarily increased memory usage as the data is moved to a larger memory block. All that has nothing to do with whether you "do something useful", and I have no idea what you imagine by "do some paging on your own". If you access vector elements they may have to be paged in. If you haven't used them for a while they may be paged out. The OS caching algorithms decide this. You may want to at least understand a simple algorithm of this type, such as Least Recently Used (LRU).
A4) True, this is dependent on multiple OS-specific parameters like max amount of RAM per process, process priority, strategy of the OS process scheduler etc ...
You can have a per-process memory allocation limit, but your conception that paging won't happen until you exceed that limit is wrong. Paging can happen to any part of your process - dynamically allocated, stack, executable image, static data, thread-specific data etc. - whenever the OS sees it hasn't been used for a while and wants the physical memory for some other more pressing purpose.
Your question makes it clear the following suppositions are conditional on the truth of the earlier ones, but I'll address them quickly as they have elements of truth and/or relevance anyway....
5) If you need to write some algorithm and be sure that there will be no paging, do not use dynamic data structures like std::vector. Instead use an Array and make sure it will stay in memory using OS-specific functions like for example mlockall (Unix)
Which data type/container you use is irrelevant - the OS doesn't know or care to what use you're putting different parts of the memory it's granted your process. So, functions like that can be applied to arrays or dynamically allocated memory - for example - if you've populated a vector then you can use .data() to get a pointer to the actual memory region storing data, then lock it into physical RAM. Of course, if you do something to force the vector to find a different memory region (e.g. adding elements beyond capacity()) then it will still look for more memory, and having the old, freed memory region locked into physical memory may adversely affect your process and system performance.
If 5) is true it leads to the last statement :
6) There is no OS-independent way to write a program that will not cause paging.
No, there's not. Paging is meant to be transparent to the processes undergoing it, and processes rarely need to avoid it.
1, 2, and 3 are all correct, assuming that 2 refers to portable ways. You can make a decent guess based on OS-specific process memory usage reporting functions. They're not that accurate and they're not portable, but they do offer a fairly good guess.
As for 4, that's just not true. It is a function of the amount of physical memory compared to the virtual address space size of the process. x64 has a way larger address space than there is physical memory. x86's is substantially smaller now, but go back a few years to older machines with 2GB or 1GB of RAM and the address space would have been the bigger of the two.
If you need to write some algorithm and be sure that there will be no paging, do not use dynamic data structures like std::vector. Instead use an Array and make sure it will stay in memory using OS-specific functions like for example mlockall (Unix)
Bullshit. You can reserve the vector to allocate all the memory you need, then call mlock anyway.
But there is most certainly no OS-independent way to write a program that will not cause paging. Paging is an implementation detail of the flat memory model used by C++ and there is certainly no Standard functionality relating to this implementation detail, nor will there ever be.
1) Eventually, the catch block will be reached
This "eventually" doesn't mean "when you allocate up to bytes" but a lot more (virtual memory mapping - if present - would have to be exhausted as well).
About ten years ago I saw a Linux process scheduler that had a habit of killing applications that misbehaved. I think this application would qualify (i.e. it may be terminated by the OS before the catch block is reached).
3) Every action in the catch block that involves memory allocation could fail, because there is no memory left
Theoretically true; practically, probably false. The vector will keep allocating larger and larger contiguous blocks. As it does, it is possible it will no longer be able to allocate a LARGE block, but the previous smaller allocations have been released. It is possible that you will have some free memory available in the catch block.
4) On most Operating Systems bad_alloc will be thrown before paging happens, but there is no certain way to tell beforehand.
Since there is no way to tell beforehand, the only realistic way to find out is to measure it.
5) If you need to write some algorithm and be sure that there will be no paging, do not use dynamic data structures like std::vector. Instead use an Array and make sure it will stay in memory using OS-specific functions like for example mlockall (Unix)
This is incorrect. A vector is a safe wrapper on an allocated contiguous memory block. You can just as well work with a vector and memory locking functions.
For (6): Paging is HW, OS and application dependent (you can run the same application on two different systems and have it paged differently).

Allocating more memory than there exists using malloc

This code snippet will allocate 2GB every time it reads the letter 'u' from stdin, and will initialize all the allocated chars once it reads 'a'.
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <vector>
#define bytes 2147483648
using namespace std;
int main()
{
    char input[8] = {0};  // was char input[1] read by gets(): undefined behavior
    vector<char *> activate;
    while (input[0] != 'q')
    {
        if (fgets(input, sizeof input, stdin) == NULL)  // fgets bounds the read; gets() is unsafe
            break;
        if (input[0] == 'u')
        {
            char *m = (char*)malloc(bytes);
            if (m == NULL) cout << "cant allocate mem" << endl;
            else
            {
                cout << "ok" << endl;
                activate.push_back(m);  // only keep pointers that were actually allocated
            }
        }
        else if (input[0] == 'a')
        {
            for (size_t i = 0; i < activate.size(); i++)
            {
                char *m = activate[i];
                for (size_t x = 0; x < bytes; x++)
                {
                    m[x] = 'a';  // touching every byte (hence every page) forces physical allocation
                }
            }
        }
    }
    return 0;
}
I am running this code on a Linux virtual machine that has 3GB of RAM. While monitoring the system resource usage with the htop tool, I have realized that the malloc operation is not reflected in the reported resource usage.
For example, when I input 'u' only once (i.e. allocate 2GB of heap memory), I don't see the memory usage increasing by 2GB in htop. It is only when I input 'a' (i.e. initialize) that I see the memory usage increasing.
As a consequence, I am able to "malloc" more heap memory than there exists. For example, I can malloc 6GB (which is more than my RAM and swap memory combined) and malloc would allow it (i.e. NULL is not returned by malloc). But when I try to initialize the allocated memory, I can see the memory and swap memory filling up until the process is killed.
My questions:
1. Is this a kernel bug?
2. Can someone explain to me why this behavior is allowed?
It is called memory overcommit. You can disable it by running as root:
echo 2 > /proc/sys/vm/overcommit_memory
and it is not a kernel feature that I like (so I always disable it). See malloc(3) and mmap(2) and proc(5)
NB: echo 0 instead of echo 2 often -but not always- works also. Read the docs (in particular proc man page that I just linked to).
from man malloc (online here):
By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available.
So when you just want to allocate too much, it "lies" to you, when you want to use the allocated memory, it will try to find enough memory for you and it might crash if it can't find enough memory.
No, this is not a kernel bug. You have discovered something known as late paging (or overcommit).
Until you write a byte to the address allocated with malloc (...) the kernel does little more than "reserve" the address range. This really depends on the implementation of your memory allocator and operating system of course, but most good ones do not incur the majority of kernel overhead until the memory is first used.
The hoard allocator is one big offender that comes to mind immediately, through extensive testing I have found it almost never takes advantage of a kernel that supports late paging. You can always mitigate the effects of late paging in any allocator if you zero-fill the entire memory range immediately after allocation.
Real-time operating systems like VxWorks will never allow this behavior because late paging introduces serious latency. Technically, all it does is put the latency off until a later indeterminate time.
For a more detailed discussion, you may be interested to see how IBM's AIX operating system handles page allocation and overcommitment.
This is a result of what Basile mentioned: overcommitted memory. However, the explanation is kind of interesting.
Basically when you attempt to map additional memory in Linux (POSIX?), the kernel will just reserve it, and will only actually end up using it if your application accesses one of the reserved pages. This allows multiple applications to reserve more than the actual total amount of ram / swap.
This is desirable behavior on most Linux environments unless you've got a real-time OS or something where you know exactly who will need what resources, when and why.
Otherwise somebody could come along, malloc up all the ram (without actually doing anything with it) and OOM your apps.
Another example of this lazy allocation is mmap(), where you have a virtual mapping that the file you're mapping can fit inside - but you only have a small amount of real memory dedicated to the effort. This allows you to mmap() huge files (larger than your available RAM) and use them like normal file handles, which is nifty.
-n
Initializing / working with the memory should work:
memset(m, 0, bytes);
Also you could use calloc that not only allocates memory but also fills it with zeros for you:
char* m = (char*) calloc(1, bytes);
1. Is this a kernel bug?
No.
2. Can someone explain to me why this behavior is allowed?
There are a few reasons:
Mitigate the need to know the eventual memory requirement - it's often convenient for an application to be able to allocate an amount of memory that it considers an upper limit on the need it might actually have. For example, if it's preparing some kind of report, either an initial pass just to calculate the eventual size of the report or a realloc() of successively larger areas (with the risk of having to copy) may significantly complicate the code and hurt performance, whereas multiplying some maximum length of each entry by the number of entries could be very quick and easy. If you know virtual memory is relatively plentiful as far as your application's needs are concerned, then making a larger allocation of virtual address space is very cheap.
Sparse data - if you have the virtual address space spare, being able to have a sparse array and use direct indexing, or allocate a hash table with generous capacity() to size() ratio, can lead to a very high performance system. Both work best (in the sense of having low overheads/waste and efficient use of memory caches) when the data element size is a multiple of the memory paging size, or failing that much larger or a small integral fraction thereof.
Resource sharing - consider an ISP offering a "1 giga-bit per second" connection to 1000 consumers in a building - they know that if all the consumers use it simultaneously they'll get about 1 mega-bit, but rely on their real-world experience that, though people ask for 1 giga-bit and want a good fraction of it at specific times, there's inevitably some lower maximum and much lower average for concurrent usage. The same insight applied to memory allows operating systems to support more applications than they otherwise would, with reasonable average success at satisfying expectations. Much as the shared Internet connection degrades in speed as more users make simultaneous demands, paging from swap memory on disk may kick in and reduce performance. But unlike an internet connection, there's a limit to the swap memory, and if all the apps really do try to use the memory concurrently such that that limit's exceeded, some will start getting signals/interrupts/traps reporting memory exhaustion. Summarily, with this memory overcommit behaviour enabled, simply checking malloc()/new returned a non-NULL pointer is not sufficient to guarantee the physical memory is actually available, and the program may still receive a signal later as it attempts to use the memory.