Optimal Way of Using mlockall() for Real-time Application (nanosecond sensitive) - c++

I am reading mlockall()'s manpage: http://man7.org/linux/man-pages/man2/mlock.2.html
It mentions
Real-time processes that are using mlockall() to prevent delays on page
faults should reserve enough locked stack pages before entering the time-
critical section, so that no page fault can be caused by function calls. This
can be achieved by calling a function that allocates a sufficiently large
automatic variable (an array) and writes to the memory occupied by this array in
order to touch these stack pages. This way, enough pages will be mapped for the
stack and can be locked into RAM. The dummy writes ensure that not even copy-
on-write page faults can occur in the critical section.
I am a bit confused by this statement:
This can be achieved by calling a function that allocates a sufficiently large
automatic variable (an array) and writes to the memory occupied by this array in
order to touch these stack pages.
All the automatic variables (variables on stack) are created "on the fly" on the stack when the function is called. So how can I achieve what the last statement says?
For example, let's say I have this function:
void foo() {
char a;
uint16_t b;
std::deque<int64_t> c;
// do something with those variables
}
Or does it mean before I call any function, I should call a function like this in main():
void reserveStackPages() {
int64_t stackPage[4096/8 * 1024 * 1024];
memset(stackPage, 0, sizeof(stackPage));
}
If yes, does it make a difference if I first allocate the stackPage variable on heap, write and then free? Probably yes, because heap and stack are 2 different region in the RAM?
std::deque exists above is just to bring up another related question -- what if I want to reserve memory for things using both stack pages and heap pages. Will calling "heap" version of reserveStackPages() help?
The goal is to minimize all the jitters in the application (yes, I know there are many other things to look at such as TLB miss, etc; just trying to deal with one kind of jitter at once, and slowly progressing into all).
Thanks in advance.
P.S. This is for a low latency trading application if it matters.

You generally don't need to use mlockall, unless you code (more or less hard) real-time applications (I actually never used it).
If you do need it, you'll better code in C (not in genuine C++) the most real-time parts of your code, because you surely want to understand the details of memory allocation. Notice that unless you dive into std::deque implementation, you don't exactly know where it is sitting (probably most of the data is heap allocated, even if your c is an automatic variable).
You should first understand in details the virtual address space of your process. For that, proc(5) is useful: from inside your process, you'll read /proc/self/maps (see this), from outside (e.g. some terminal) you'll do cat /proc/1234/maps for a process of pid 1234. Or use pmap(1).
because heap and stack are 2 different regions in the RAM?
In fact, your process' address space contains many segments (listed in /proc/1234/maps), much more that two. Typically every dynamically linked shared library (such as libc.so) brings a few segments.
Try cat /proc/self/maps and cat /proc/$$/maps in your terminal to get a better intuition about virtual address spaces. On my machine, the first gives 19 segments of the cat process -each displayed as a line- and the second 97 segments of the zsh (my shell) process.
To ensure that your stack has enough space, you indeed could call a function allocating a large enough automatic variable, like your reserveStackPages. Beware that call stacks are practically of limited size (a few megabytes usually, see also setrlimit(2)).
If you really need mlockall (which is unlikely) you might consider linking statically your program (to have less segments in your virtual address space).
Look also into madvise(2) (and perhaps mincore(2)). It is generally much more useful than mlockall. BTW, in practice, most of your virtual memory is in RAM (unless your system experiments thrashing, and then you'll see it immediately).
Read also Operating Systems: Three Easy Pieces to understand the role of paging.
PS. Nano-second sensitive applications does not make much sense (because of cache misses that the software does not control).

Related

Stack allocation for C++ green threads

I'm doing some research in C++ green threads, mostly boost::coroutine2 and similar POSIX functions like makecontext()/swapcontext(), and planning to implement a C++ green thread library on top of boost::coroutine2. Both require the user code to allocate a stack for every new function/coroutine.
My target platform is x64/Linux. I want my green thread library to be suitable for general use, so the stacks should expand as required (a reasonable upper limit is fine, e.g. 10MB), it would be great if the stacks could shrink when too much memory is unused (not required). I haven't figured out an appropriate algorithm to allocate stacks.
After some googling, I figured out a few options myself:
use split stack implemented by the compiler (gcc -fsplit-stack), but split stack has performance overhead. Go has already moved away from split stack due to performance reasons.
allocate a large chunk of memory with mmap() hope the kernel is smart enough to leave the physical memory unallocated and allocate only when the stacks are accessed. In this case, we are at the mercy of the kernel.
reserve a large memory space with mmap(PROT_NONE) and setup a SIGSEGV signal handler. In the signal handler, when the SIGSEGV is caused by stack access (the accessed memory is inside the large memory space reserved), allocate needed memory with mmap(PROT_READ | PROT_WRITE). Here is the problem for this approach: mmap() isn't asynchronous safe, cannot be called inside a signal handler. It still can be implemented, very tricky though: create another thread during program startup for memory allocation, and use pipe() + read()/write() to send memory allocation information from the signal handler to the thread.
A few more questions about option 3:
I'm not sure the performance overhead of this approach, how well/bad the kernel/CPU performs when the memory space is extremely fragmented due to thousands of mmap() call ?
Is this approach correct if the unallocated memory is accessed in kernel space ? e.g. when read() is called ?
Are there any other (better) options for stack allocation for green threads ? How are green thread stacks allocated in other implementations, e.g. Go/Java ?
The way that glibc allocates stacks for normal C programs is to mmap a region with the following mmap flag designed just for this purpose:
MAP_GROWSDOWN
Used for stacks. Indicates to the kernel virtual memory system
that the mapping should extend downward in memory.
For compatibility, you should probably use MAP_STACK too. Then you don't have to write the SIGSEGV handler yourself, and the stack grows automatically. The bounds can be set as described here What does "ulimit -s unlimited" do?
If you want a bounded stack size, which is normally what people do for signal handlers if they want to call sigaltstack(2), just issue an ordinary mmap call.
The Linux kernel always maps physical pages that back virtual pages, catching the page fault when a page is first accessed (perhaps not in real-time kernels but certainly in all other configurations). You can use the /proc/<pid>/pagemap interface (or this tool I wrote https://github.com/dwks/pagemap) to verify this if you are interested.
Why mmap? When you allocate with new (or malloc) the memory is untouched and definitely not mapped.
const int STACK_SIZE = 10 * 1024*1024;
char*p = new char[STACK_SIZE*numThreads];
p now has enough memory for the threads you want. When you need the memory, start accessing p + STACK_SIZE * i

Stack memory not released

I have the following loop, which pops a C++ concurrent queue I have, from the implementation here. https://juanchopanzacpp.wordpress.com/2013/02/26/concurrent-queue-c11/
while (!interrupted)
{
pxData data = queue->pop();
if (data.value == -1)
{
break; // exit loop on terminating condition
}
usleep(7000); // stub to simulate processing
}
I am looking at the memory history using System Monitor in CentOS7.
I'm trying to free up the memory taken up by the queue, after reading the value from the queue. However, as the following while loop runs, I don't see the memory usage going down. I've verified that the queue length does go down.
It does go down, however, when -1 is encountered and the loop exits. (program is still running) But I can't have this, because where usleep is, I want to do some intensive processing.
Question: Why doesn't the memory occupied by data get free-ed? (according to System Monitor) Isn't the stack allocated memory supposed to be free-ed when the variable goes out of scope?
The struct is defined as follows, and populated at the beginning of the program.
typedef struct pxData
{
float value; // -1 value terminates the loop
float x, y, z;
std::complex<float> valueData[65536];
} pxData;
It's populated with ~10000 pxData, which roughly translates to 5GB. System only has ~8GB.
So it's important that the memory is free-ed up for doing other processing in the system.
There are a few things at play here.
Virtual Memory
First, you need to understand that just because your program is "using" 5 GB of memory does not mean that there are only 3 GB of RAM left for other programs. Virtual memory means that those 5 GB might be only 1 GB of actual "resident" data, and the other 4 GB may actually be on disk rather than in RAM. So it's important to look at the "resident set size" rather than the "virtual size" when you're looking at your program. And note that if your system actually runs low on RAM, the OS may shrink the RSS of some programs by "paging out" some of their memory. So don't worry too much about "5 GB" appearing in the system monitor--worry if you have a real, concrete performance problem.
Heap Allocation
The second aspect is why your virtual size does not decrease as you remove items from the queue. We can guess that you put those elements into the queue by creating them with malloc or new one-by-one, then pushing them onto the back of the queue. This means that the first element you allocated will come out of the queue first. And that in turn means that when you have drained 90% of the queue, your memory allocation might look like this:
[program|------------------unused-------------------|pxData]
The problem here is that in the real world, just because you free or delete something does not mean the operating system instantly reclaims that memory. In fact, it may not be able to reclaim any unused spans unless they are at the "end" (i.e. most recently allocated). Since C++ does not have garbage collection and cannot move items around in memory without your consent, you end up with this big "hole" in your program's virtual memory. That hole would be used to satisfy future memory allocation requests, but if you haven't got any, it just sits there, until the queue is completely empty:
[program|------------------unused--------------------------]
Then the system is able to shrink your virtual address space back down:
[program]
Which brings you back to where you started.
Solutions
If you want to "fix" this, one option is to allocate your memory in "reverse", i.e. put the last items allocated into the front of the queue.
Another option is to allocate the elements for the queue via mmap, which is something that e.g. Linux will do automatically for allocations which are "large." You can change the threshold for this by calling mallopt(3) with M_MMAP_THRESHOLD and setting it to be a little bit smaller than your struct size. This makes the allocations independent of each other, so the OS can reclaim them individually. This technique can even be applied to existing programs without recompilation, so is often useful if you need to solve this problem in a program you cannot modify.
A C++ implementation would call some operator delete to release dynamically allocated (using some operator new) memory. In several C++ standard libraries, new calls malloc and delete calls free.
(I am focusing with a Linux point of view, but the principles are similar on other OSes)
But while malloc (or ::operator new) is sometimes asking the OS kernel some more memory by system calls changing the virtual address space like mmap(2), free (or ::operator delete) is often simply marking the released memory zone as re-available to future calls to malloc (or to new)
So from the kernel point of view (e.g. as seen thru /proc/, see proc(5)...), the virtual address space is not changing, and the memory remains consumed, even if inside the application it is marked as "freed" and will be reused at some future allocation (by future calls to malloc or new)
And most C++ standard containers are internally using heap data. In particular your local (stack-allocated) std::map or std::vector (or std::deque) variable will call new & delete for internal data.
BTW, I find quite strange your declaration. Unless every struct pxData has exactly 65536 used valueData slots, I would suggest to use some std::vector so have
std::vector<std::complex<float>> valueData;
and improve your code accordingly. You'll probably need to do some valueData.reserve(somesize); and/or valueData.resize(somesize); and/or valueData.push_back(somecomplexnumber); etc....

How does a computer 'know' what memory is allocated?

When memory is allocated in a computer, how does it know which bytes are already occupied and can't be overwritten?
So if these are some bytes of memory that aren't being used:
[0|0|0|0]
How does the computer know whether they are or not? They could just be an integer that equals zero. Or it could be empty memory. How does it know?
That depends on the way the allocation is performed, but it generally involves manipulation of data belonging to the allocation mechanism.
When you allocate some variable in a function, the allocation is performed by decrementing the stack pointer. Via the stack pointer, your program knows that anything below the stack pointer is not allocated to the stack, while anything above the stack pointer is allocated.
When you allocate something via malloc() etc. on the heap, things are similar, but more complicated: all theses allocators have some internal data structures which they never expose to the calling application, but which allow them to select which memory addresses to return on an allocation request. Some malloc() implementation, for instance, use a number of memory pools for small objects of fixed size, and maintain linked lists of free objects for each fixed size which they track. That way, they can quickly pop one memory region of that list, only doing more expensive computations when they run out of regions to satisfy a certain request size.
In any case, each of the allocators have to request memory from the system kernel from time to time. This mechanism always works on complete memory pages (usually 4 kiB), and works via the syscalls brk() and mmap(). Again, the kernel keeps track of which pages are visible in which processes, and at which addresses they are mapped, so there is additional memory allocated inside the kernel for this.
These mappings are made available to the processor via the page tables, which uses them to resolve the virtual memory addresses to the physical addresses. So here, finally, you have some hardware involved in the process, but that is really far, far down in the guts of the mechanics, much below anything that a userspace process is ever able to see. Still, even the page tables are managed by the software of the kernel, not by the hardware, the hardware only interpretes what the software writes into the page tables.
First of all, I have the impression that you believe that there is some unoccupied memory that doesn't holds any value. That's wrong. You can imagine the memory as a very large array when each box contains a value whereas someone put something in it or not. If a memory was never written, then it contains a random value.
Now to answer your question, it's not the computer (meaning the hardware) but the operating system. It holds somewhere in its memory some tables recording which part of the memory are used. Also any byte of memory can be overwriten.
In general, you cannot tell by looking at content of memory at some location whether that portion of memory is used or not. Memory value '0' does not mean the memory is not used.
To tell what portions of memory are used you need some structure to tell you this. For example, you can divide memory into chunks and keep track of which chunks are used and which are not.
There are memory blocks, they have an occupied or not occupied. On the heap, there are very complex data structures which organise it. But the answer to your question is too broad.

Can CreateThread interfere with VirtualAlloc usage?

Is it possible for stack space allocated by CreateThread to interfere with the usage of VirtualAlloc? I can't find any discussion or documentation explaining precisely where stack space is allowed to be allocated...
The following more precisely illustrates my question:
uint8_t *baseA = (uint8_t*)VirtualAlloc(NULL,1,MEM_RESERVE,PAGE_NOACCESS);
// Create a thread with the default stack size
HANDLE hThread = CreateThread(NULL,0,SomeThreadProc,NULL,NULL,NULL);
// Possibly create even more threads here.
// Can this ever fail in the absence of other allocators? It doesn't here...
uint8_t *baseB = (uint8_t*)VirtualAlloc(NULL,1,MEM_RESERVE,PAGE_NOACCESS);
// Furthermore, in this test, baseB-baseA == 65536 (unless the debugger did something),
// so nothing appeared between baseA and baseB... not even enough space for the
// full 64kb of wastage, as baseA points to 4096 bytes by itself
If it does in fact use some analogue of VirtualAlloc, is there a way to change how Windows allocates stack space in a given process?
Stack space can be allocated anywhere in the address space of the process. There is no documentation on this now and it is unlikely that such documentation will appear in the future.
You can safely assume that creation of the thread and virtual alloc are independent. If this would not be the case, a lot of things would be broken. Allocator cannot give out overlapping address ranges. This is unthinkable. The problem is somewhere else.
The only thing that might look like correlation - amount of memory used and virtual address space fragmentation. In this case the latest request will simply fail.
I worked on a memory analysis utilities.
This picture shows distribution of the numbers of virtual allocations per size of the allocation.
This is example of the address space contents for a 32-bit process (blue - committed, magenta - reserved, green is a free memory).
What I write here is based on a real experience.
the windows NT kernel treats memory alloc operations on a high interrupt priority, also in a thread safe manner.
That means only one thread of a process can allocate memory at the same time, which makes all allocation processes thread safe (in theory).
there shouldn't be any interferences between stack allocation and virtual allocation.
Also you should keep in mine that you can allocate 1GB of space but your program still only uses it's 2mb RAM.
That's because windows "pre allocates" virtual space, but it doesen't assign it until you use it (write on it).
Actually the memory management is alot more complicated but for now you can be shure that no allocation operations should interfere, ever, since windows is locking your process onto one core, delaying all other threads alloc requests, as long as allocation is processed. (deadlock)
*EDIT: That also means that allocation and de-allocation is kinda a performance needing process if you allocate millions of small bits. It's always better to allocate/de-allocate larger memory areas, because of this deadlock behavior.

C++ new / new[], how is it allocating memory?

I would like to now how those instructions are allocating memory.
For example what if I got code:
x = new int[5];
y = new int[5];
If those are allocated how it actually looks like in RAM?
Is whole block reserved for each of the variables or block(memory page or how-you-call-it - 4KB of size on 32bit) is shared for 2 variables?
I couldn't find answer for my question in any manual. Thanks for all replies.
I found on wikipedia:
Internal fragmentation of pages
Rarely do processes require the use of an exact number of pages. As a result, the last page will likely only be partially full, wasting some amount of memory. Larger page sizes clearly increase the potential for wasted memory this way, as more potentially unused portions of memory are loaded into main memory. Smaller page sizes ensure a closer match to the actual amount of memory required in an allocation.
As an example, assume the page size is 1024KB. If a process allocates 1025KB, two pages must be used, resulting in 1023KB of unused space (where one page fully consumes 1024KB and the other only 1KB).
And that was answer for my question. Anyway thanks guys.
A typical allocator implementation will first call the operating system to get huge block of memory, and then to satisfy your request it will give you a piece of that memory, this is known as suballocation. If it runs out of memory, it will get more from the operating system.
The allocator must keep track of both all the big blocks it got from the operating system and also all the small blocks it handed out to its clients. It also must accept blocks back from clients.
A typical suballocation algorithm keeps a list of returned blocks of each size called a freelist and always tries to satisfy a request from the freelist, only going to the main block if the freelist is empty. This particular implementation technique is extremely fast and quite efficient for average programs, though it has woeful fragmentation properties if request sizes are all over the place (which is not usual for most programs).
Modern allocators like GNU's malloc implementation are complex, but have been built with many decades of experience and should be considered so good that it is very rare to need to write your own specialised suballocator.
You didn't find it in the manual because it's not specified by the standard. That is, most of the time x and y will be side by side (go ahead and cout<< hex << their addresses).
But nothing in the standard forces this so you can't rely on it.
Each process has different segments associated which are divided among the process address space :
1) The text segment :: Where your code is placed
2) Stack Segment :: Process stack
3) Data Segment :: This is where memory by "new" is reserved. Besides that it also store initialized and uninitialized static data (bss etc).
So , whenever you call a new function (which i guess uses malloc internally , but the new class makes it much safer to handle memory) it allocates the specified number of bytes in the data segment.
Ofcourse the address you print while running the program is virtual and needs to be translated to physical address..but that is not our headache and the OS Memory management unit does that for us.