How do segmented stacks work? This question also applies to Boost.Coroutine so I am using the C++ tag as well here. The main doubt comes from this article It looks like what they do is keep some space at the bottom of the stack and check if it has gotten corrupted by registering some sort of signal handler with the memory allocated there (perhaps via mmap and mprotect?) And then when they detect that they have run out of space they continue by allocating more memory and then continuing from there. 3 questions about this
Isn't this construct a user space thing? How do they control where the new stack is allocated and how do the instructions the program is compiled down to get aware of that?
A push instruction is basically just adding a value to the stack pointer and then storing the value in a register on the stack, then how can the push instruction be aware of where the new stack starts and correspondingly how can the pop know when it has to move the stack pointer back to the old stack?
They also say
After we've got a new stack segment, we restart the goroutine by retrying the function that caused us to run out of stack
what does this mean? Do they restart the entire goroutine? Won't this possibly cause non deterministic behavior?
How do they detect that the program has overrun the stack? If they keep a canary-ish memory area at the bottom then what happens when the user program creates an array big enough that overflows that? Will that not cause a stack overflow and is a potential security vulnerability?
If the implementations are different for Go and Boost I would be happy to know how either of them deal with this situation 🙂
I'll give you a quick sketch of one possible implementation.
First, assume most stack frames are smaller than some size. For ones that are larger, we can use a longer instruction sequence at entry to make sure there is enough stack space. Let's assume we're on an architecture that that has 4k pages and we're choosing 4k - 1 as the maximum size stack frame handled by the fast path.
The stack is allocated with a single guard page at the bottom. That is, a page that is not mapped for write. At function entry, the stack pointer is decremented by the stack frame size, which is less than the size of a page, and then the program arranges to write a value at the lowest address in the newly allocated stack frame. If the end of the stack has been reached, this write will cause a processor exception and ultimately be turned into some sort of upcall from the OS to the user program -- e.g. a signal in UNIX family OSes.
The signal handler (or equivalent) has to be able to determine this is a stack extension fault from the address of the instruction that faulted and the address it was writing to. This is determinable as the instruction is in the prolog of a function and the address being written to is in the guard page of the stack for the current thread. The instruction being in the prolog can be recognized by requiring a very specific pattern of instructions at the start of functions, or possibly by maintaining metadata about functions. (Possibly using traceback tables.)
At this point the handler can allocate a new stack block, set the stack pointer to the top of the block, do something to handle unchaining the stack block, and then call the function that faulted again. This second call is safe because the fault is in the function prolog the compiler generated and no side effects are allowed before validating there is enough stack space. (The code may also need to fixup the return address for architectures that push it onto the stack automatically. If the return address is in a register, it just needs to be in the same register when the second call is made.)
Likely the easiest way to handle unchaining is to push a small stack frame onto the new extension block for a routine that when returned to unchains the new stack block and frees the allocated memory. It then returns the processor registers to the state they were in when the call was made that caused the stack to need to be extended.
The advantage of this design is that the function entry sequence is very few instructions and is very fast in the non-extending case. The disadvantage is that in the case where the stack does need to be extended, the processor incurs an exception, which may cost much much more than a function call.
Go doesn't actually use a guard page if I understand correctly. Rather the function prolog explicitly checks the stack limit and if the new stack frame won't fit it calls a function to extend the stack.
Go 1.3 changed its design to not use a linked list of stack blocks. This is to avoid the trap cost if the extension boundary is crossed in both directions many times in a certain calling pattern. They start with a small stack, and use a similar mechanism to detect the need for extension. But when a stack extension fault does occur, the entire stack is moved to a larger block. This removes the need for unchaining entirely.
There are quite a few details glossed over here. (E.g. one may not be able to do the stack extension in the signal handler itself. Rather the handler can arrange to have the thread suspended and hand it off to a manager thread. One likely has to use a dedicated signal stack to handle the signal as well.)
Another common pattern with this sort of thing is the runtime requiring there to be a certain amount of valid stack space below the current stack frame for either something like a signal handler or for calling special routines in the runtime. Go works this way and the stack limit test guarantees a certain fixed amount of stack space is available below the current frame. One can e.g. call plain C functions on the stack so long as one guarantees they do not consume more than the fixed stack reserve amount. (One can use this to call C library routines in theory, though most of these have no formal specification of how much stack they might use.)
Dynamic allocation in the stack frame, such as alloca or stack allocated variable length arrays, add some complexity to the implementation. If the routine can compute the entire final size of the frame in the prolog then it is fairly straightforward. Any increase in the frame size while the routine is running likely has to be modeled as a new call, though with Go's new architecture that allows moving the stack, it is possible the alloca point in the routine can be made such that all the state allows a stack move to happen there.
Related
I am slightly confused on how a compiler stores variable on the stack. I read that c++ can store local variables on the stack, but if the stack is LIFO, how could it make sure to call the right variable when the variable is called in the program?
The run-time stack isn't simply a LIFO structure; it's a complex structure which supports random access.
Typically how it works on many platforms is that whe na function is entered, a new stack frame is pushed onto the stack (LIFO fashion). Local variables are stored within the frame and are accessed relative to a more or less stable pointer stored in a register.
When other functions are called, they push their own frames, but they restore everything before returning.
Run-time stacks also typically support the ad hoc pushing and popping of individual values, for the purpose of temporarily saving register values or for parameter passing.
A common implementation strategy for that is to allocate the frame first. For instance if a 512 byte stack frame is needed, the stack pointer is moved by 512 bytes. The stack pointer is then used freely for pushing and popping, provided it doesn't pop too far and start gobbling into the frame.
A separate frame pointer may be used which tracks the location of the frame. The frame is then accessed relative to that frame pointer, which allows the stack pointer to move without interfering with those accesses.
Compilers can generate code without the use of frame pointers also; if the stack pointer only moves in ways that a compiler knows about, it can adjust all the variable references. Over an area of the code where the compiler knows that the stack pointer has moved by four bytes due to something being pushed on it, it can adjust the stack-pointer-relative frame references by four bytes.
The basic principle is that when a given function is executing, then the stack is in the right state that is expected by that function (unless something has gone horribly wrong due to a bug causing corruption or something). The LIFO allocation strategy of the stack frames closely tracks the function calls and returns. Functions that are called must save and restore certain registers (the "callee-saved registers"), which helps to maintain the stable stack environment.
I am reading mlockall()'s manpage: http://man7.org/linux/man-pages/man2/mlock.2.html
It mentions
Real-time processes that are using mlockall() to prevent delays on page
faults should reserve enough locked stack pages before entering the time-
critical section, so that no page fault can be caused by function calls. This
can be achieved by calling a function that allocates a sufficiently large
automatic variable (an array) and writes to the memory occupied by this array in
order to touch these stack pages. This way, enough pages will be mapped for the
stack and can be locked into RAM. The dummy writes ensure that not even copy-
on-write page faults can occur in the critical section.
I am a bit confused by this statement:
This can be achieved by calling a function that allocates a sufficiently large
automatic variable (an array) and writes to the memory occupied by this array in
order to touch these stack pages.
All the automatic variables (variables on stack) are created "on the fly" on the stack when the function is called. So how can I achieve what the last statement says?
For example, let's say I have this function:
void foo() {
char a;
uint16_t b;
std::deque<int64_t> c;
// do something with those variables
}
Or does it mean before I call any function, I should call a function like this in main():
void reserveStackPages() {
int64_t stackPage[4096/8 * 1024 * 1024];
memset(stackPage, 0, sizeof(stackPage));
}
If yes, does it make a difference if I first allocate the stackPage variable on heap, write and then free? Probably yes, because heap and stack are 2 different region in the RAM?
std::deque exists above is just to bring up another related question -- what if I want to reserve memory for things using both stack pages and heap pages. Will calling "heap" version of reserveStackPages() help?
The goal is to minimize all the jitters in the application (yes, I know there are many other things to look at such as TLB miss, etc; just trying to deal with one kind of jitter at once, and slowly progressing into all).
Thanks in advance.
P.S. This is for a low latency trading application if it matters.
You generally don't need to use mlockall, unless you code (more or less hard) real-time applications (I actually never used it).
If you do need it, you'll better code in C (not in genuine C++) the most real-time parts of your code, because you surely want to understand the details of memory allocation. Notice that unless you dive into std::deque implementation, you don't exactly know where it is sitting (probably most of the data is heap allocated, even if your c is an automatic variable).
You should first understand in details the virtual address space of your process. For that, proc(5) is useful: from inside your process, you'll read /proc/self/maps (see this), from outside (e.g. some terminal) you'll do cat /proc/1234/maps for a process of pid 1234. Or use pmap(1).
because heap and stack are 2 different regions in the RAM?
In fact, your process' address space contains many segments (listed in /proc/1234/maps), much more that two. Typically every dynamically linked shared library (such as libc.so) brings a few segments.
Try cat /proc/self/maps and cat /proc/$$/maps in your terminal to get a better intuition about virtual address spaces. On my machine, the first gives 19 segments of the cat process -each displayed as a line- and the second 97 segments of the zsh (my shell) process.
To ensure that your stack has enough space, you indeed could call a function allocating a large enough automatic variable, like your reserveStackPages. Beware that call stacks are practically of limited size (a few megabytes usually, see also setrlimit(2)).
If you really need mlockall (which is unlikely) you might consider linking statically your program (to have less segments in your virtual address space).
Look also into madvise(2) (and perhaps mincore(2)). It is generally much more useful than mlockall. BTW, in practice, most of your virtual memory is in RAM (unless your system experiments thrashing, and then you'll see it immediately).
Read also Operating Systems: Three Easy Pieces to understand the role of paging.
PS. Nano-second sensitive applications does not make much sense (because of cache misses that the software does not control).
I'm using C++ and Windows.h in my source code. I read the CreateThread API in MSDN, but I still don't understand the essence of specifying stack size. By default it is 1 MB. But what will happen if I specify 32 bytes?
What does stack size in a thread define?
Please provide a thorough explanation and I'll appreciate it. Thanks.
The stack is used to store local variables, pass parameters in function calls, store return addresses. A thread's stack has a fixed size which is determined when the thread is created. That is the value that you are referring too.
The stack size is determined when the thread is created since it needs to occupy contiguous address space. That means that the entire address space for the thread's stack has to be reserved at the point of creating the thread.
If the stack is too small then it can overflow. That's an error condition known as stack overflow, from which this website took its name. When you call a function some or all of the following happens:
Parameters are pushed onto the stack.
The return address is pushed onto the stack.
A stack frame containing space for the function's local variables is created.
All of this consumes space from the stack. When the function in turn calls another function, more stack space is consumed. As the call stack goes deeper, more stack space is required.
The consequence therefore of setting the stack size too low is that you can exhaust the stack and overflow it. That is a terminal condition from which you cannot recover. Certainly 32 bytes (rounded up to one page which is 4096 bytes) is too small for almost all threads.
If you have a program with a lot of threads, and you know that the thread's don't need to reserve 1MB of stack size then there can be benefits to using a smaller stack size. Doing so can avoid exhausting the available process address space.
On the other hand you might have a program with a single thread that has deep call stacks that consume large amounts of stack space. In this scenario you might reserve more than the default 1MB.
However, unless you have strong reason to do otherwise, it is likely best to stick to the default stack size.
Stack size is just tradeoff between ability to create many threads and possibility of stack overflow in one of them.
The more stack size is, the less number of threads you can create and the less possibility of stack overflow is. You should worry about stack size only if you are going to create many threads (you will have to lower stack size but remember about stack overflow). Otherwise default value is suffice.
But what will happen if I specify 32 bytes?
I have not read the Windows documentation, but if Windows allows this (to specify only 32 bytes), you will most likely get a stack overflow. According to their documentation the value is rounded up to the page size in anycase, therefore in reality you stack size will be at least the size of a page. The created thread assumes that there exists enough "stack space" for it to use (for allocating automatic variables, storing function addresses etc), and allocates space according to it's needs. When there is not enough stack space, the stack allocator might use invalid memory, overriding memory used elsewhere.
What does stack size in a thread define?
It defines how much memory will be allocated for use by that thread's stack.
There is a good description of what exactly a thread call stack is here
I recently used the /FAsu Visual C++ compiler option to output the source + assembly of a particularly long member function definition. In the assembly output, after the stack frame is set up, there is a single call to a mysterious _chkstk() function.
The MSDN page on _chkstk() does not explain the reason why this function is called. I have also seen the Stack Overflow question Allocating a buffer of more a page size on stack will corrupt memory?, but I do not understand what the OP and the accepted answer are talking about.
What is the purpose of the _chkstk() CRT function? What does it do?
Windows pages in extra stack for your thread as it is used. At the end of the stack, there is one guard page mapped as inaccessible memory -- if the program accesses it (because it is trying to use more stack than is currently mapped), there's an access violation. The OS catches the fault, maps in another page of stack at the same address as the old guard page, creates a new guard page just beyond the old one, and resumes from the instruction that caused the violation.
If a function has more than one page of local variables, then the first address it accesses might be more than one page beyond the current end of the stack. Hence it would miss the guard page and trigger an access violation that the OS doesn't realise is because more stack is needed. If the total stack required is particularly huge, it could perhaps even reach beyond the guard page, beyond the end of the virtual address space assigned to stack, and into memory that's actually in use for something else.
So, _chkstk ensures that there is enough space for the local variables. You can imagine that it does this by touching the memory for the local variables at page-sized intervals, in increasing order, to ensure that it doesn't miss the guard page (so-called "stack probes"). I don't know whether it actually does that, though, possibly it takes a more direct route and instructs the OS to map in a certain amount of stack. Either way, if the total required is greater than the virtual address space available for stack, then the OS can complain about it instead of doing something undefined.
I looked at the code for __chkstk and it does do the repeated stack probes at one-page intervals. So this way, it doesn't need to make any calls to the OS. The parameter in rax is size of data you want to add. It ensures that the target address (current rsp - rax) is accessible. If rax > rsp, it does this for address 0. As an interesting shortcut, it first compares the address with gs:[10h], which is the current lowest page that is mapped; if the target address >= this, then it does nothing.
By the way, for 64-bit code at least, it is spelled with two underscores: __chkstk__.
Why infinite recursion leads to seg fault ?
Why stack overflow leads to seg fault.
I am looking for detailed explanation.
int f()
{
f();
}
int main()
{
f();
}
Every time you call f(), you increase the size of the stack - that's where the return address is stored so the program knows where to go to when f() completes. As you never exit f(), the stack is going to increase by at least one return address each call. Once the stack segment is full up, you get a segfault error. You'll get similar results in every OS.
Segmentation fault is a condition when your program tries to access a memory location that it is not allowed to access. Infinite recursion causes your stack to grow. And grow. And grow. Eventually it will grow to a point when it will spill into an area of memory that your program is forbidden to access by the operating system. That's when you get the segmentation fault.
Your system resources are finite. They are limited. Even if your system has the most memory and storage on the entire Earth, infinite is WAY BIGGER than what you have. Remember that now.
The only way to do something an "infinite number of times" is to "forget" previous information. That is, you have to "forget" what has been done before. Otherwise you have to remember what happened before and that takes storage of one form or another (cache, memory, disk space, writing things down on paper, ...)--this is inescapable. If you are storing things, you have a finite amount of space available. Recall, that infinite is WAY BIGGER than what you have. If you try to store an infinite amount of information, you WILL run out of storage space.
When you employ recursion, you are implicitly storing previous information with each recursive call. Thus, at some point you will exhaust your storage if you try to do this an infinite number of takes. Your storage space in this case is the stack. The stack is a piece of finite memory. When you use it all up and try to access beyond what you have, the system will generate an exception which may ultimately result in a seg fault if the memory it tried to access was write-protected. If it was not write-protected, it will keep on going, overwriting god-knows-what until such time as it either tries to write to memory that just does not exist, or it tries to write to some other piece of write protected memory, or until it corrupts your code (in memory).
It's still a stackoverflow ;-)
The thing is that the C runtime doesn't provide "instrumentalisation" like other managed languages do (e.g. Java, Python, etc.), so writing outside the space designated for the stack instead of causing a detailed exception just raises a lower level error, that has the generic name of "segmentation fault".
This is for performance reasons, as those memory access watchdogs can be set with help of hardware support with little or none overhead; I cannot remember the exact details now, but it's usually done by marking the MMU page tables or with the mostly obsolete segment offsets registers.
AFAIK: The ends of the stack are protected by addresses that aren't accessible to the process. This prevents the stack from growing over allocated data-structures, and is more efficient than checking the stack size explicitly, since you have to check the memory protection anyway.
A program copunter or instruction pointer is a register which contains the value of next instruction to be executed.
In a function call, the current value of program counter pushed into the stack and then program counter points to first instruction of the function. The old value is poped after returning from that function and assigned to program counter. In infinite recursion the value is pushed again and again and leads to the stack overflow.
It's essentially the same principle as a buffer overflow; the OS allocates a fixed amount of memory for the stack, and when you run out (stack overflow) you get undefined behavior, which in this context means a SIGSEGV.
The basic idea:
int stack[A_LOT];
int rsp=0;
void call(Func_p fn)
{
stack[rsp++] = rip;
rip = fn;
}
void retn()
{
rip = stack[--rsp];
}
/*recurse*/
for(;;){call(somefunc);}
eventually rsp moves past the end of the stack and you try to put the next return address in unallocated storage and your program barfs. Obviously real systems are a lot more complicated than that, but that could (and has) take up several large books.
At a "low" level, the stack is "maintained" through a pointer (the stack pointer), kept in a processor register. This register points to memory, since stack is memory after all. When you push values on the stack, its "value" is decremented (stack pointer moves from higher addresses to lower addresses). Each time you enter a function some space is "taken" from the stack (local variables); moreover, on many architectures the call to a subroutine pushes the return value on the stack (and if the processor has no a special register stack pointer, likely a "normal" register is used for the purpose, since stack is useful even where subroutines can be called with other mechanisms), so that the stack is at least diminuished by the size of a pointer (say, 4 or 8 bytes).
In an infinite recursion loop, in the best case only the return value causes the stack to be decremented... until it points to a memory that can't be accessed by the program. And you see the segmentation fault problem.
You may find interesting this page.