Suppose in a program we have implemented a stack. But who creates the stack? Is it the processor, the operating system, or the compiler?
Are you confusing the program's execution stack with a stack container?
You can't "implement" the execution stack. The OS gives you virtual address space and locates your stack pointer there, so you just push and pop from it. You don't "create" it; it's there when you start.
If you mean the data structure: The processor executes the code. The code makes calls to the operating system to get the memory for the stack, and then manipulates it to form it into a stack. The compiler just turns the code you wrote into code the processor can understand.
If you mean the execution stack: The OS is responsible for loading a process into memory and setting up its memory space to form the stack.
Your program performs the required stack manipulation itself. That assembly was inserted by the compiler in place of the function and the function call, based on the calling convention being used.
Learning about calling conventions would probably be the most effective way to answer your question.
None of the above. YOU created it when you implemented it. The compiler only translates your thoughts (expressed in a programming language) into machine or assembly code. The processor only runs that program that you wrote. The operating system (assuming one exists), provides mechanisms to facilitate giving you an execution space and memory to do it, but YOUR PROGRAM determines what happens in that execution space and memory.
If you want to implement a stack of your own, try using std::stack<>. If you're talking about the stack that local variables are on, that's created by the C++ runtime system.
"Suppose in a program we have implemented a stack."
Then you implemented it on top of an underlying, lower-level data structure, for example an array. Your stack = array + functions (push(), pop()) working on that array to provide stack semantics.
"But who creates the stack? Is it the processor, the operating system, or the compiler?"
And who creates the functions and the array? The functions are created by you; the compiler then translates them into machine instructions and keeps this code in the executable. Additionally, it produces a set of instructions to allocate some space in memory for your array. So your program is a mix of instructions and space for an array. The operating system then loads your program and sends the instructions to the processor. The processor performs these instructions and reads/writes data in your array.
Say you have a test C program:
int square( int val ) {
    int result;
    result = val * val;
    return( result );
}

int main( void ) {
    int store;
    store = square( 3 );
    return( 0 );
}
then you can produce the assembler output produced by the compiler using the command gcc -S test.c -o test.s (if you're on a Linux platform).
Looking at the generated code for just the square() function we get:
square:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $16, %esp
        movl    8(%ebp), %eax
        imull   8(%ebp), %eax
        movl    %eax, -4(%ebp)
        movl    -4(%ebp), %eax
        leave
        ret
You can see that the compiler has generated code to move the stack pointer for the local variables in the routine.
The initialisation code for your program will have allocated a certain amount of memory for "the stack" by calling the operating system's memory-allocation facilities. It is then up to the compiled program to choose how to utilise that area of memory.
Fortunately for you all of this is effectively handled by the compiler without you having to think about it (unless, of course, you're likely to have local variables that are too big for a standard stack size, in which case you may have to instruct your compiler, or thread library, to allocate more stack from the system).
Suppose in a program we have implemented a stack. But who creates the stack?
Well if you implemented it, then by definition you created it. You need to be more specific w.r.t. context.
The standard runtime library or the linker-loader creates the stack. It is done in a little section of code that runs before your main. This code is inserted automatically by the linker when you link and it runs at runtime, setting up various things before your main is called, for example any statically initialized global variables. It usually sets up the stack too, although some OSes put this into OS code (the linker-loader) because they want to standardize stack implementation/shape on their systems.
The stack is supported directly by the processor: on x86 the esp register holds the stack pointer. You need to learn a little Win32 assembly programming in order to understand the stack.
A stack is a Last-In, First-Out (LIFO) data structure. The execution stack is created as the program runs: variables are stored on it and removed from it as the program's execution requires.
Related
Previously I had looked at the assembly of many functions in C++. With gcc, all of them start with these instructions:
push rbp
mov rbp, rsp
sub rsp, <X> ; <X> is size of frame
I know that these instructions store the previous function's frame pointer and then set up a frame for the current function. But here, the assembly is neither asking to map memory (like malloc does), nor checking whether the memory pointed to by rbp is actually allocated to the process.
So it assumes that the startup code has mapped enough memory for the entire depth of the call stack. Exactly how much memory is allocated for the call stack? How can the startup code know the maximum depth of the call stack?
It also means that I can access an array well out of bounds: although that memory is not in the current frame, it is still mapped to the process. So I wrote this code:
#include <stdio.h>

int main() {
    int arr[3] = {0};
    printf("%d", arr[900]);
}
This exits with SIGSEGV when the index is 900, but surprisingly not when the index is 901. Similarly, it exits with SIGSEGV for some seemingly random indices and not for others. I observed this behaviour when compiling with gcc 11.2 for x86-64 on Compiler Explorer.
How can the startup code know the maximum depth of the call stack?
It doesn't.
In the most common implementations, the size of the stack is constant.
If the program exceeds the constant-sized stack, that is called a stack overflow. This is why you must avoid creating large objects (typically, but not necessarily, arrays) in automatic storage, and why you must avoid recursion with linear depth (such as recursive linked-list algorithms).
So exactly how much memory is allocated for the call stack?
On most desktop/server systems it's configurable, and defaults to one to a few megabytes. It can be much less on embedded systems.
This is exiting with SIGSEGV when index is 900. But surprisingly not when index is 901.
In both cases, the behaviour of the program is undefined.
Is it possible to know the allocated stack size?
Yes. You can read the documentation of the target system. If you intend to write a portable program, assume the minimum across all target systems; for desktop/server, the one megabyte mentioned above is a reasonable assumption.
There is no standard way to acquire the size within C++.
Would a large amount of stack space required by a function prevent it from being inlined? Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?
int inlineme(int args) {
    char svar[10000];
    return stringyfunc(args, svar);
}
I'm more concerned about gcc, but icc and llvm would also be nice to know.
I know this isn't ideal, but I'm very curious. The code is probably also pretty bad on the cache.
Yes, the decision to inline or not depends on the complexity of the function, its stack and registers usage and the context in which the call is made. The rules are compiler- and target platform-dependent. Always check the generated assembly when performance matters.
Compare this version with a 10000-char array not being inlined (GCC 8.2, x64, -O2):
inline int inlineme(int args) {
    char svar[10000];
    return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}
Generated assembly:
inlineme(int):
        sub     rsp, 10008
        mov     rsi, rsp
        call    stringyfunc(int, char*)
        add     rsp, 10008
        ret
test(int):
        jmp     inlineme(int)
with this one with a much smaller 10-char array, which is inlined:
inline int inlineme(int args) {
    char svar[10];
    return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}
Generated assembly:
test(int):
        sub     rsp, 24
        lea     rsi, [rsp+6]
        call    stringyfunc(int, char*)
        add     rsp, 24
        ret
Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?
Not necessarily in general. In fact, inline expansion can sometimes reduce stack space usage due to not having to set up space for function arguments.
Expanding a "wide" call into a single frame which calls other "wide" functions can be a problem though, and unless the optimiser guards against that separately, it may have to avoid expansion of "wide" functions in general.
In case of recursion: Most likely yes.
An example of LLVM source:
if (IsCallerRecursive &&
AllocatedSize > InlineConstants::TotalAllocaSizeRecursiveCaller) {
InlineResult IR = "recursive and allocates too much stack space";
From GCC source:
For stack growth limits we always base the growth in stack usage
of the callers. We want to prevent applications from segfaulting
on stack overflow when functions with huge stack frames gets
inlined.
Controlling the limit, from GCC manual:
--param name=value
large-function-growth
Specifies maximal growth of large function caused by inlining in percents. For example, parameter value 100 limits large function growth to 2.0 times the original size.
large-stack-frame
The limit specifying large stack frames. While inlining the algorithm is trying to not grow past this limit too much.
large-stack-frame-growth
Specifies maximal growth of large stack frames caused by inlining in percents. For example, parameter value 1000 limits large stack frame growth to 11 times the original size.
Yes, partly because compilers do stack allocation for the whole function once in the prologue/epilogue, rather than moving the stack pointer around as they enter and leave block scopes.
and each inlined call to inlineme() would need its own buffer.
No, I'm pretty sure compilers are smart enough to reuse the same stack space for different instances of the same function, because only one instance of that C variable can ever be in-scope at once.
Optimization after inlining can merge some of the operations of the inline function into calling code, but I think it would be rare for the compiler to end up with 2 versions of the array it wanted to keep around simultaneously.
I don't see why that would be a concern for inlining. Can you give an example of how functions that require a lot of stack would be problematic to inline?
A real example of a problem it could create (which compiler heuristics mostly avoid):
Inlining if (rare_special_case) use_much_stack() into a recursive function that otherwise doesn't use much stack would be an obvious problem for performance (more cache and TLB misses), and even correctness if you recurse deep enough to actually overflow the stack.
(Especially in a constrained environment like Linux kernel stacks, typically 8kiB or 16kiB per thread, up from 4k on 32-bit platforms in older Linux versions. https://elinux.org/Kernel_Small_Stacks has some info and historical quotes about trying to get away with 4k stacks so the kernel didn't have to find 2 contiguous physical pages per task).
Compilers normally make functions allocate all the stack space they'll ever need up front (except for VLAs and alloca). Inlining an error-handling or special-case handling function instead of calling it in the rare case where it's needed will put a large stack allocation (and often save/restore of more call-preserved registers) in the main prologue/epilogue, where it affects the fast path, too. Especially if the fast path didn't make any other function calls.
If you don't inline the handler, that stack space will never be used if there aren't errors (or the special case didn't happen). So the fast-path can be faster, with fewer push/pop instructions and not allocating any big buffers before going on to call another function. (Even if the function itself isn't actually recursive, having this happen in multiple functions in a deep call tree could waste a lot of stack.)
I've read that the Linux kernel does manually do this optimization in a few key places where gcc's inlining heuristics make an unwanted decision to inline: break a function up into fast-path with a call to the slow path, and use __attribute__((noinline)) on the bigger slow-path function to make sure it doesn't inline.
In some cases not doing a separate allocation inside a conditional block is a missed optimization, but more stack-pointer manipulation makes stack unwinding metadata to support exceptions (and backtraces) more bloated (especially saving/restoring of call-preserved registers that stack unwinding for exceptions has to restore).
If you were doing a save and/or allocate inside a conditional block before running some common code that's reached either way (with another branch to decide which registers to restore in the epilogue), then there'd be no way for the exception-handler machinery to know whether to load just R12, or R13 as well (for example), from where this function saved them, without some kind of insanely complicated metadata format that could signal a register or memory location to be tested for some condition. The .eh_frame section in ELF executables / libraries is bloated enough as is! (It's non-optional, BTW. The x86-64 System V ABI (for example) requires it even in code that doesn't support exceptions, or in C. In some ways that's good, because it means backtraces usually work, even where actually propagating an exception up through a function would cause breakage.)
You can definitely adjust the stack pointer inside a conditional block, though. Code compiled for 32-bit x86 (with crappy stack-args calling conventions) can and does use push even inside conditional branches. So as long as you clean up the stack before leaving the block that allocated space, it's doable. That's not saving/restoring registers, just moving the stack pointer. (In functions built without a frame pointer, the unwind metadata has to record all such changes, because the stack pointer is the only reference for finding saved registers and the return address.)
I'm not sure exactly what the details are on why compiler can't / don't want to be smarter allocating large extra stack space only inside a block that uses it. Probably a good part of the problem is that their internals just aren't set up to be able to even look for this kind of optimization.
Related: Raymond Chen posted a blog about the PowerPC calling convention, and how there are specific requirements on function prologues / epilogues that make stack unwinding work. (And the rules imply / require the existence of a red zone below the stack pointer that's safe from async clobber. A few other calling conventions use red zones, like x86-64 System V, but Windows x64 doesn't. Raymond posted another blog about red zones)
This is merely out of interest and I personally use C++Builder 2009
Suppose I allocate: wchar_t Buffer[32] or I allocate wchar_t Buffer[512]
The second call allocates more memory, so you could argue that the second call is more expensive in terms of memory usage.
However, is anything else also possibly affected by allocating more memory this way? Are there more instructions involved? More CPU usage?
Just wondering ?
However, is anything else also possibly affected by allocating more memory this way ?
There can be one related side effect: when you allocate more memory for your buffer, you increase the chance that the data the program needs will be spread across more cache lines, which may ultimately mean the CPU has to wait for a cache miss that otherwise wouldn't have happened. Note that this isn't about using more of the buffer: the "problem" is that the CPU is likely to be asked for data located before and after the buffer, and that data may now be spread across more cache lines. The stack memory around the buffer is likely to be accessed often enough to stay in cache, but in doing so some cache content that hasn't been used for a while may be evicted, and if that content is needed later you get a cache miss. The granularity of the cache (how many bytes per line) also affects how this pans out.
This is usually totally insignificant, but you asked... ;-).
Are there more instructions involved ?
No more instructions are involved.
More CPU usage ?
Only in as much as time spent waiting on a cache miss counts as "usage".
This is "allocating" stack memory. All this requires is adjusting the stack pointer. If you write a function like:
void foo()
{
    char c[32];
    ...
}
The resulting assembly looks like (on a 64-bit machine):
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $48, %rsp // This is the actual "allocation" on the stack
If you change this to char c[512], the only thing that changes is:
subq $528, %rsp // Allocation of 512 bytes on stack
There is no difference in CPU instructions or the time this takes. The only difference is the second uses up more of the limited amount of stack memory.
There won't be any difference in the instructions by allocating more size.
Also, the stack memory requirement is known at compile time, and the compiler generates the required instructions.
For example:
int main()
{
char Buffer[1024] ;
char Buffer2[ 512] ;
return 0 ;
}
00981530 push ebp
00981531 mov ebp,esp
00981533 sub esp,6DCh // 6DCh = 1756: only the value esp is adjusted by changes, to allocate more memory
00981539 push ebx
int main()
{
char Buffer[32] ;
char Buffer2[ 512] ;
return 0 ;
}
00D51530 push ebp
00D51531 mov ebp,esp
00D51533 sub esp,2FCh // 2FCh = 764: esp is now adjusted by 764, with no change to the instructions
00D51539 push ebx
Are there more instructions involved ?
No, as you can see in the example above :)
More CPU usage ?
No, because the same number of instructions get executed.
More Memory usage ?
Yes, because more stack memory is allocated.
In operating systems, a memory-management technique called dynamic loading loads routines only when they are called, instead of loading all of a program's routines into main memory up front. When a routine is loaded, the addresses of its elements have to be entered into the page table for address translation. The content corresponding to a range of addresses is loaded in a unit called a page.
Page sizes are usually small, commonly 4 KB. If the content exceeds the page size, it is split up and occupies more than one page. When a page fault occurs, older content is evicted to swap space according to the page-replacement policy and overwritten with new content. When the replaced content is needed again, the MMU loads it back from swap into a page.
Now consider what happens when larger content is involved in loading and swapping: it is a performance issue and costs CPU cycles.
As for your question: there is no noticeable impact with wchar_t arrays of size 32 or 512. But a data structure with a size in megabytes, or an array of a few thousand such structures, will have some impact on memory and CPU. I suggest you have a look here.
I think if you would otherwise be calling the first one repeatedly to cover the same total amount of data, then it would be better to use wchar_t Buffer[512], simply because repeatedly entering and leaving calls takes longer and, I believe, uses more resources. With the second you set up once and are then tied to that memory, which is fine as long as you don't need it for anything else for a while. Hope that helped.
I know the maximum stack size is usually fixed at link time (at least on Windows, I think).
But I don't know when the program's actual stack usage (not the maximum stack size, just the size actually used) becomes fixed as far as the OS is concerned. At compile time? Link time? Execution?
like this:
int main(){ int a[10]; return 0;}
the program just uses 10 * sizeof(int) bytes of stack. So, is the stack size fixed?
And additionally: does the heap size change on malloc or free?
The stack size is not explicitly provided to the OS when the program is loaded. Instead, the OS uses the mechanism of page faults (if it is supported by the MMU).
If you try to access memory that has not yet been granted by the operating system, the MMU generates a page fault, which is handled by the OS. The OS checks the faulting address and either grows the stack by mapping a new memory page or, if you have exhausted the stack limit, treats it as a stack overflow.
Consider the following program, running on x86 Linux:
void foo(void) {
    volatile int a = 10;
    foo();
}

int main() {
    foo();
}
It crashes because of infinite recursion and stack overflow; it would require an infinite stack to complete. When the program is loaded, the OS allocates the initial stack and writes its address to %rsp (the stack pointer). Let's look at the disassembly of foo():
push %rbp
mov %rsp,%rbp <--- Save stackpointer to %rbp
sub $0x10,%rsp <--- Advance stack pointer by 16 bytes
movl $0xa,-0x4(%rbp) <--- Write memory at %rbp
callq 0x400500 <foo>
leaveq
retq
After at most 4096 / 16 = 256 calls of foo(), you will cross a page boundary by writing memory at an address 4096 bytes below X, where X is the initial %rsp value (the stack grows downward). A page fault is then generated, and the OS provides a new memory page for the stack, allowing the program to use it.
After about 500k calls of foo() (given the default Linux stack ulimit of 8 MB), the OS will detect that the application uses too many stack pages and send it SIGSEGV.
In an answer to a question I provided the following information:
The BSS/DATA segment contains all the global variables, initialized to a specific value or to zero by default. This segment is part of the executable image. At load time, the heap segment is added to this; however, it is not a true "segment" but just an amount of extra data allocated as an extension of the loaded BSS/DATA segment. In the same way, the stack "segment" is not a true segment but is added beyond the BSS + heap. The stack grows down whereas the heap grows up. If these overlap (more heap used while the stack is still growing), an "out of memory" error occurs (heap) or a "stack overflow" (stack); this may be detected with the use of segment registers (Intel) to trigger a hardware-generated exception, or by software checks.
This is the traditional way of laying out the segments. Think of older Intel chips where all program data had to fit in 64 KB. With more modern chips the same layout is often used: an address space of, say, 32 MB is laid out this way, but only the physical memory actually required is used. The stack can thus be pretty big.
This has been bothering me for a long time now: Lets say i have a function:
void test(){
    int t1, t2, t3;
    int t4 = 0;
    int bigvar[10000];
    // do something
}
How does the computer handle the memory allocations for the variables?
I've always thought that the variables' space is reserved in the .exe, which the computer then reads. Is this correct? But as far as I know, the bigvar array doesn't take up space for 10000 ints in the .exe, since it's uninitialized. So how does its memory allocation work when I call the function?
Local variables like those are generally implemented using the processor's stack. That means that the only thing that the compiler needs to do is to compute the size of each variable, and add them together. The total sum is the amount to change the stack pointer with at the entry to the function, and to change back on exit. Each variable is then accessed with its relative offset into that block of memory on the stack.
Your code, when compiled in Linux, ends up looking like this in x86 assembler:
test:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $40016, %esp
        movl    $0, -4(%ebp)
        leave
        ret
In the above, the constant $40016 is the space needed for the four 32-bit ints t1, t2, t3 and t4, while the remaining 40000 bytes account for the 10000-element array bigvar.
I can't add much to what has already been said, except for a few notes. You can actually put local variables into the executable file and have them allocated in the data segment (and initialized) instead of on the stack. To do that, declare them as static. But then all invocations of the function would share the same variables, whereas on the stack each invocation creates a new set of variables. This can lead to a lot of trouble when the function is called simultaneously by several threads, or when there is recursion (try to imagine that). That's why most languages use the stack for local variables, and static is rarely used.
On some old compilers, I've encountered the behavior that such an array is statically allocated: memory is set aside for it when the program loads, and that space is used thereafter. This behavior is not safe (see Sergey's answer), nor do I expect it to be permitted by the standards, but I have encountered it in the wild. (I have no memory of which compiler it was.)
For the most part, local variables are kept on the stack, together with return addresses and all that other stuff. This means the uninitialized values may contain sensitive information. This also includes arrays, as per unwind's answer.
Another valid implementation is that the variable found on the stack is a pointer, and the compiler does the allocation and deallocation (presumably in an exception-safe manner) under the hood. This conserves stack space (which has to be allocated before the program starts and cannot easily be extended on x86 architectures) and is also quite useful for C's VLAs (variable-length arrays, aka the poor man's std::vector).