I’d like to understand the call-stack a bit better, and in an effort to do so, I’d like to work with it some more.
What will I do with this information? I have no idea! It’s fun to learn new things though, and this is my new thing for today.
I don’t like the call-stack in my programs being this mystical entity that I know nothing about. How big is it? How much memory is available on my current call-stack? Where is it? I don't know! And, I would like to know.
My current way to deal with the call-stack is to be aware of it, but to not knowingly interact with it until I hit a stack overflow error. That's just not good enough!
So in my quest, I’d like to figure out how to do the following:
I'd like to know the total size of the current call-stack that I'm operating on.
I'd like to calculate the total memory still available on my current call-stack.
I’d like to be able to find the address of where my call-stack starts.
The declaration could look like this, and be filled in for various platforms:
class call_stack
{
public:
    inline static void* base_address();
    inline static void* end_address();
    inline static std::size_t size();
    inline static std::size_t remaining();
};
How is stack information defined on current desktop and mobile platforms, and how can I access it? How can I code this at compile time, or determine it at runtime?
Looking into this reminds me of a talk that Herb Sutter gave about atomics. He said something along the lines of “as C++ programmers, we like to push the big red buttons that say don’t touch.”
I realize there are many questions about the call-stack, and I believe that mine is both meaningful and unique.
Other questions on the call-stack topic don't ask about the call-stack in a similarly broad manner to what I'm asking here.
The Stack
On modern platforms, the stack structure boils down to a pointer to where the next item should be inserted, and most platforms keep this pointer in a dedicated register.
+---+
| |
+---+
| |
+---+
| |
+---+
| | <-- Next available stack slot
+---+
An item is written at the present location of the stack pointer, and then the stack pointer is advanced. (On most architectures the stack grows downward, so "advanced" actually means decremented.)
The CALL Stack
The minimum that needs to be pushed onto the stack is the return address, i.e. the address of the instruction after the call. That is about as universal as it gets. Anything else pushed onto the stack depends on the convention established by the compiler and operating system.
For example, the compiler may choose to place parameters into registers rather than pushing them onto the stack.
The order of the parameters depends ultimately on the compiler (and language). The compiler may push the leftmost value first or the last parameter first.
Stack Starting Address
The stack's starting address is usually determined by the operating system, or, on embedded systems, set at a fixed location. There is no guarantee that the operating system will place your program's stack in the same location for every invocation.
Stack Size
There are two terms here: capacity and content. Stack size can refer to the number of elements in the stack (content) or the capacity that the stack can hold.
There is no fixed, common limit here. Platforms differ. Usually the OS is involved in allocating the capacity of the stack. And on many platforms, the OS doesn't check to see if you have exceeded the capacity.
Some operating systems provide mechanisms so that the executable can adjust the capacity of its stack. Most OS providers supply a reasonable default.
Sharing Memory with Heap
A common setup is to have the stack grow towards the heap. So all the extra memory for a process is allocated so that, for example, the stack grows from the beginning of the extra memory down and the heap allocates from the bottom up. This allows for programs that use little dynamic memory to have more stack space and those that use little stack space to have more dynamic memory. The big issue is when they cross. No notification, just undefined behavior.
Accessing Stack Information
Most of the time I never look at the value of the stack pointer register directly. The times I have were on performance-critical systems with restricted memory capacities (like embedded systems).
The stack is usually reviewed by the debugger to provide a call trace.
None of what you ask can be done in standard C++, since the stack is not specified in the standard.
You may be able to access that information by reading CPU registers in assembly language. How to do that depends on the CPU architecture, the OS, and possibly the calling convention used by the compiler. The best place to find the information you're looking for is the manual for the architecture, OS, etc. The platform may also provide the information through system calls or virtual filesystems.
As an example, here's a quick look at the Wikipedia page for the common x86 architecture:
SP/ESP/RSP: Stack pointer for top address of the stack.
BP/EBP/RBP: Stack base pointer for holding the address of the current stack frame.
You can unwind the stack and find the top of the stack in the first call frame. How to unwind the stack is, again, specific to the calling convention used. Subtracting the current stack pointer from the base of the first stack frame gives you the current size of the stack. Also remember that each thread has its own call stack, so you must subtract from the bottom of the stack of the correct thread.
But keep in mind that:
Although the main registers (with the exception of the instruction pointer) are "general-purpose" in the 32-bit and 64-bit versions of the instruction set and can be used for anything...
You should check the manual of your target platform before assuming what the registers are used for.
Getting the remaining / total maximum stack space can be a bit trickier. In Linux the stack size is typically limited during runtime. The limit can be accessed through the /proc/ filesystem or using system calls. In Windows, the maximum stack size can be set at linktime and should be accessible in the executable file headers.
Below is an example program that works on Linux. I read the start of stack from /proc/<pid>/stat. I also provide an example for unwinding and for that I use a library that abstracts away all the OS / architecture specific assembly code. The stack is unwound all the way up to the initialization code before main and the stack space used by that is accounted for.
I use the SP register instead of BP to get the bottom of the stack in the first call frame, because BP does not exist on some architectures, and on my platform it was zero in the initialization frames. That means the bottom is off by the size of the first call frame and is therefore just an approximation. See it on coliru; unfortunately, access to /proc/<pid>/stat is denied there.
#include <iostream>
#include <fstream>
#include <sstream>
#include <unistd.h>

using namespace std;
// read bottom of stack from /proc/<pid>/stat
unsigned long bottom_of_stack() {
    unsigned long bottom = 0;
    ostringstream path;
    path << "/proc/" << getpid() << "/stat";
    ifstream stat(path.str());
    // possibly not the best way to parse /proc/pid/stat
    string ignore;
    if(stat.is_open()) {
        // startstack is the 28th field
        for(int i = 1; i < 28; i++)
            getline(stat, ignore, ' ');
        stat >> bottom;
    }
    return bottom;
}
#include <sys/resource.h>
rlim_t get_max_stack_size() {
    rlimit limits;
    getrlimit(RLIMIT_STACK, &limits);
    return limits.rlim_cur;
}
#define UNW_LOCAL_ONLY
#include <libunwind.h>
// using global variables for conciseness
unw_cursor_t frame_cursor;
unw_context_t unwind_context;
// approximate bottom of stack using SP register and unwinding
unw_word_t appr_bottom_of_stack() {
    unw_word_t bottom;
    unw_getcontext(&unwind_context);
    unw_init_local(&frame_cursor, &unwind_context);
    do {
        unw_get_reg(&frame_cursor, UNW_REG_SP, &bottom);
    } while(unw_step(&frame_cursor) > 0);
    return bottom;
}
// must not inline since that would change behaviour
unw_word_t __attribute__((noinline)) current_sp() {
    unw_word_t sp;
    unw_getcontext(&unwind_context);
    unw_init_local(&frame_cursor, &unwind_context);
    unw_step(&frame_cursor); // step to frame before this function
    unw_get_reg(&frame_cursor, UNW_REG_SP, &sp);
    return sp;
}
// a little helper for absolute delta of unsigned integers
#include <algorithm>
template<class UI>
UI abs_udelta(UI left, UI right) {
    return max(left, right) - min(left, right);
}
unsigned long global_bottom;
rlim_t global_max;
// a test function to grow the call stack
int recurse(int index) {
    if(index < 2) {
        auto stack_size = abs_udelta(current_sp(), global_bottom);
        cout << "Current stack size: " << stack_size
             << "\tStack left: " << global_max - stack_size << '\n';
        return index;
    }
    return recurse(index - 1) + recurse(index - 2); // do the fibonacci
}
int main() {
    global_max = get_max_stack_size();
    global_bottom = bottom_of_stack();
    auto appr_bottom = appr_bottom_of_stack();
    cout << "Maximum stack size: "
         << global_max << '\n';
    cout << "Approximate bottom of the stack by unwinding: "
         << (void*)appr_bottom << '\n';
    if(global_bottom > 0) {
        cout << "Bottom of the stack in /proc/<pid>/stat: "
             << (void*)global_bottom << '\n';
        cout << "Approximation error: "
             << abs_udelta(global_bottom, appr_bottom) << '\n';
    } else {
        global_bottom = appr_bottom;
        cout << "Could not parse /proc/<pid>/stat" << '\n';
    }
    // use the result so call won't get optimized out
    cout << "Result of recursion: " << recurse(6);
}
Output:
Maximum stack size: 8388608
Approximate bottom of the stack by unwinding: 0x7fff64570af8
Bottom of the stack in /proc/<pid>/stat: 0x7fff64570b00
Approximation error: 8
Current stack size: 640 Stack left: 8387968
Current stack size: 640 Stack left: 8387968
Current stack size: 576 Stack left: 8388032
Current stack size: 576 Stack left: 8388032
Current stack size: 576 Stack left: 8388032
Current stack size: 576 Stack left: 8388032
Current stack size: 576 Stack left: 8388032
Current stack size: 512 Stack left: 8388096
Current stack size: 576 Stack left: 8388032
Current stack size: 576 Stack left: 8388032
Current stack size: 512 Stack left: 8388096
Current stack size: 512 Stack left: 8388096
Current stack size: 512 Stack left: 8388096
Result of recursion: 8
Linux: start here http://man7.org/linux/man-pages/man3/backtrace.3.html
Windows: start here http://msdn.microsoft.com/en-us/library/windows/desktop/bb204633(v=vs.85).aspx
Here's a way to get a rough estimate of the size of the current call stack.
In main(), save the address of argc to a global variable. This is somewhat close to where your stack starts. Then when you want to check the current size, take the address of the first argument to your current function and subtract it from your saved value. This will be less accurate if you have a large amount of data in automatic variables in main() or the current function (I'm not sure which one - it may vary by platform).
There are probably cases where the stack grows dynamically by chaining blocks, in which case this technique will not be accurate.
In a multithreaded program, you would have to track the start of each thread's stack separately. Similarly to main, get the address of one of the parameters to the top-level thread function when the thread starts up.
I wish I had thought to save the link to the page where I found someone doing something similar. They got the address of one of the automatic variables instead of one of the parameters, which would still give a very similar result.
Related
Since the stack grows downwards, i.e. towards numerically smaller memory addresses, why is &i < &j true? Correct me if I'm wrong, but I'd imagine this was a design decision of the C creators (that C++ maintains), and I wonder why.
It is also strange that the heap-allocated object pin lies at a numerically higher memory address than a stack variable; this also contradicts the fact that the heap lies at numerically smaller memory addresses than the stack (and grows upwards).
#include <iostream>
int main()
{
    int i = 5;        // stack allocated
    int j = 2;        // stack allocated
    int *pi = &i;     // stack allocated
    int *pj = &j;     // stack allocated
    std::cout << std::boolalpha << '\n';
    std::cout << ((&i < &j) && (pi < pj)) << '\n'; // true
    struct S
    {
        int in;
    };
    S *pin            // stack allocated
        = new S{10};  // heap allocated
    std::cout << '\n' << (&(pin->in) > &i) << '\n'; // true
    std::cout << ((void*)pin > (void*)pi) << '\n';  // true
}
Am I right so far? And if so, why did the C designers reverse the situation so that numerically smaller memory addresses appear higher (at least when you compare pointers or use the address-of operator &)? Was this done just 'to make things work'?
Correct me if I'm wrong, but I'd imagine this was a design decision of C creators
It is not part of the design of the C language, nor C++. In fact, there is no such thing as "heap" or "stack" memory recognised by these standards.
It is an implementation detail. Each implementation of each language may do this differently.
Ordered comparisons between pointers to unrelated objects such as &i < &j or (void*)pin > (void*)pi have an unspecified result. Neither is guaranteed to be less or greater than the other.
For what it's worth, your example program outputs three counts of "false" on my system.
The compiler generates code that doesn't allocate space for each individual variable in order; it allocates one block for all the local variables and can arrange them within that block however it chooses.
Usually, all the local variables of one function are allocated as one block, during function entry. Therefore you will only see the stack growing downward if you compare the address of a local variable allocated in an outer function with the address of a local variable allocated in an inner function.
It's really rather easy: such a stack is an implementation detail. The C and C++ language spec doesn't even need to refer to it. A conforming C or C++ implementation does not need to use a stack! And if it does use a stack, still the language spec doesn't guarantee that the addresses on it are allocated in any particular pattern.
Finally, the variables may be stored in registers, or as immediate values in the code text, and not in data memory. Then: taking the address of such a variable is a self-fulfilling prophecy: the language spec forces the value to a memory location, and the address of that is provided to you - this usually wrecks performance, so don't take addresses of things you don't need to know the address of.
A simple cross-platform example (it does the right thing on both gcc and msvc).
#ifdef _WIN32
#define __attribute__(a)
#else
#define __stdcall
#endif
#ifdef __cplusplus
extern "C" {
#endif
__attribute__((stdcall)) void __stdcall other(int);
void test(){
    int x = 7;
    other(x);
    int z = 8;
    other(z);
}
#ifdef __cplusplus
}
#endif
Any reasonable compiler won't put x nor z in memory unnecessarily. They will be either stored in registers, or will be pushed onto the stack as immediate values.
Here's x86-64 output from gcc 9.2 - note that no memory loads nor stores are present, and there's tail call optimization!
gcc -m64 -Os
test:
push rax
mov edi, 7
call other
mov edi, 8
pop rdx
jmp other
On x86, we can force a stdcall calling convention that uses the stack to pass all parameters: even then, the values 7 and 8 are never in a stack location for a variable. They are pushed directly to the stack when other gets called, and don't exist on the stack beforehand:
gcc -m32 -fomit-frame-pointer -Os
test:
sub esp, 24
push 7
call other
push 8
call other
add esp, 24
ret
We all know that the stack grows downward, so it seems a straightforward assumption that if we take the address of the last declared variable, we get the smallest address on the stack, and could treat that address as our remaining available stack.
I did that, and I got a humongous address: {0x000000dc9354f540} = {947364623680}. We know the stack grows downward, and we know that we can't go lower than 0.
so a bit of math:
947364623680 / (1024*1024*1024) = 882.302060425
Does that imply I have 882 GB of stack on my machine?!
I tested it and, obviously, got a stack overflow exception after allocating an additional 2 MB on the stack:
uint8 array[1024*1024*2] = {};
And there my question comes: what is going on here, and how can I get my actual stack size?
Since your question has the tag "visual-studio-debugging", I assume you use Windows.
First you should get the current stack pointer: either take the address of a local dummy variable (as you do now), read esp/rsp directly in raw assembly, or fetch the CPU register via a Win32 API call to GetThreadContext.
Now, in order to find out the available stack size you may use VirtualQuery to see the starting address of this virtual memory region (aka allocation base address). Basically subtracting those pointers would give you the remaining stack size (precision up to the size of the current stack frame).
A long time ago I wrote an article about this subject, including querying the currently allocated/reserved stack size. You can find more info there if you want:
Does that imply I have 882 GB of stack on my machine?!
It has nothing to do with the "stack on your machine". It's about virtual address space, which has nothing to do with the physical storage (RAM + page files) available in the system.
Another approach to get an approximate value of the remaining stack space at any given point in a Win32 application is something like the following function. It uses structured exception handling to catch the stack overflow exception.
Note: @valdo's solution is the correct one. I'm posting this answer because it's an interesting alternative way to solve it. It's going to be very slow because its runtime is linear in the stack size, as opposed to the constant runtime of @valdo's solution.
#include <Windows.h>
#include <cstdint>

static uint64_t GetAvailableStackSpace()
{
    volatile uint8_t var;
    volatile uint8_t* addr = &var;
    volatile uint8_t sink;
    auto filter = [](unsigned int code) -> int
    {
        return (code == EXCEPTION_STACK_OVERFLOW) ? EXCEPTION_EXECUTE_HANDLER
                                                  : EXCEPTION_CONTINUE_SEARCH;
    };
    __try
    {
        while (true)
        {
            addr = addr - 1024;
            sink = *addr;
        }
    }
    __except (filter(GetExceptionCode()))
    {
        return (&var - addr);
    }
    return 0;
}
This is an implementation of the VirtualQuery technique mentioned by @valdo.
This function returns an approximate number of bytes of stack available. I tested this on Windows x64.
#include <Windows.h>
#include <cstdint>

static uint64_t GetAvailableStackSpace()
{
    volatile uint8_t var;
    MEMORY_BASIC_INFORMATION mbi;
    auto virtualQuerySuccess = VirtualQuery((LPCVOID)&var, &mbi, sizeof(mbi));
    if (!virtualQuerySuccess)
    {
        return 0;
    }
    return &var - (volatile uint8_t*)mbi.AllocationBase;
}
I was just going through this Wikipedia entry. Out of curiosity, to find the stack size allocated to a simple process, I tried this:
#include <iostream>
using namespace std;

int main()
{
    static int count = 0;
    cout << " Count = " << count++ << endl;
    main();
    return 0;
}
Compiler: Dev-C++
I got this: the output ended at Count = 43385.
Up to this point everything is fine and understandable. By the way, from that last count, i.e. 43385, can I guess the maximum stack size on a 32-bit machine? (What if I say 43385 × 4 bytes, with 4 bytes for the return address pushed on the stack for each call? I may sound silly on this.)
Now if I modify my program to:
#include <iostream>
using namespace std;

void foo()
{
    static int count = 0;
    cout << " Count = " << count++ << endl;
    foo();
}

int main()
{
    foo();
    return 0;
}
With this I get a stack overflow at count 130156 (OK, fine).
But my question: if I add one function between main and foo, the count decreases by 1 (130155); with 2 functions between foo and main, the count decreases by 2 (130154), and so on. Why this behavior? Is it because one frame's worth of space is consumed by each function in the chain?
Firstly correct your program by adding Count++ (silly).
Stack size is not fixed; most compilers let you specify it. It also depends on factors like platform, toolchain, ulimit, and other parameters. Many static and dynamic properties can influence it.
There are three kinds of memory limits:
For 32-bit (Windows):
Static data - 2 GB
Dynamic data - 2 GB
Stack data - 1 GB (the stack size is set by the linker; the default is 1 MB. This can be increased using the linker property System > Stack Reserve Size).
Using your program you can estimate the current stack size.
The pages "Memory Limits for Applications on Windows", "Stack overflow recursion in C" and "C/C++ maximum stack size of program" will help you.
Inspired by this question.
Apparently in the following code:
#include <Windows.h>
int _tmain(int argc, _TCHAR* argv[])
{
    if( GetTickCount() > 1 ) {
        char buffer[500 * 1024];
        SecureZeroMemory( buffer, sizeof( buffer ) );
    } else {
        char buffer[700 * 1024];
        SecureZeroMemory( buffer, sizeof( buffer ) );
    }
    return 0;
}
compiled with the default stack size (1 megabyte) with Visual C++ 10 with optimizations on (/O2), a stack overflow occurs because the program tries to allocate 1200 kilobytes on the stack.
The code above is of course slightly exaggerated to show the problem - uses lots of stack in a rather dumb way. Yet in real scenarios stack size can be smaller (like 256 kilobytes) and there could be more branches with smaller objects that would induce a total allocation size enough to overflow the stack.
That makes no sense. The worst case would be 700 kilobytes - it would be the codepath that constructs the set of local variables with the largest total size along the way. Detecting that path during compilation should not be a problem.
So the compiler produces a program that tries to allocate even more memory than the worst case. According to this answer LLVM does the same.
That could be a deficiency in the compiler or there could be some real reason for doing it this way. I mean maybe I just don't understand something in compilers design that would explain why doing allocation this way is necessary.
Why would the compiler want a program allocate more memory than the code needs in the worst case?
I can only speculate that this optimization was deemed too unimportant by the compiler designers. Or perhaps, there is some subtle security reason.
BTW, on Windows, stack is reserved in its entirety when the thread starts execution, but is committed on as-needed basis, so you are not really spending much "real" memory even if you reserved a large stack.
Reserving a large stack can be a problem on 32-bit system, where having large number of threads can eat the available address space without really committing much memory. On 64-bit, you are golden.
It could be down to your use of SecureZeroMemory. Try replacing it with regular ZeroMemory and see what happens; the MSDN page essentially indicates that SZM has some additional semantics beyond what its signature implies, and they could be the cause of the bug.
The following code when compiled using GCC 4.5.1 on ideone places the two arrays at the same address:
#include <iostream>
int main()
{
    int x;
    std::cin >> x;
    if (x % 2 == 0)
    {
        char buffer[500 * 1024];
        std::cout << static_cast<void*>(buffer) << std::endl;
    }
    if (x % 3 == 0)
    {
        char buffer[700 * 1024];
        std::cout << static_cast<void*>(buffer) << std::endl;
    }
}
input: 6
output:
0xbf8e9b1c
0xbf8e9b1c
The answer is probably "use another compiler" if you want this optimization.
OS paging and byte alignment could be factors. Housekeeping may also use extra stack, along with the space required for calling other functions within that function.
I am initializing millions of classes that are of the following type
template<class T>
struct node
{
    //some functions
private:
    T m_data_1;
    T m_data_2;
    T m_data_3;
    node* m_parent_1;
    node* m_parent_2;
    node* m_child;
};
The purpose of the template is to enable the user to choose float or double precision, with the idea being that node<float> will occupy less memory (RAM).
However, when I switch from double to float, the memory footprint of my program does not decrease as I expect it to. I have two questions:
Is it possible that the compiler/operating system is reserving more space than required for my floats (or even storing them as doubles)? If so, how do I stop this happening? I'm using Linux on a 64-bit machine with g++.
Is there a tool that lets me determine the amount of memory used by all the different classes (i.e. some sort of memory profiling), to make sure that the memory isn't being gobbled up somewhere else that I haven't thought of?
If you are compiling for 64-bit, then each pointer will be 64-bits in size. This also means that they may need to be aligned to 64-bits. So if you store 3 floats, it may have to insert 4 bytes of padding. So instead of saving 12 bytes, you only save 8. The padding will still be there whether the pointers are at the beginning of the struct or the end. This is necessary in order to put consecutive structs in arrays to continue to maintain alignment.
Also, your structure is primarily composed of 3 pointers. The 8 bytes you save take you from a 48-byte object to a 40 byte object. That's not exactly a massive decrease. Again, if you're compiling for 64-bit.
If you're compiling for 32-bit, then you're saving 12 bytes from a 36-byte structure, which is better percentage-wise. Potentially more if doubles have to be aligned to 8 bytes.
The other answers are correct about the source of the discrepancy. However, pointers (and other types) on x86/x86-64 are not required to be aligned. It is just that performance is better when they are, which is why GCC keeps them aligned by default.
But GCC provides a "packed" attribute to let you exert control over this:
#include <iostream>
template<class T>
struct node
{
private:
    T m_data_1;
    T m_data_2;
    T m_data_3;
    node* m_parent_1;
    node* m_parent_2;
    node* m_child;
};

template<class T>
struct node2
{
private:
    T m_data_1;
    T m_data_2;
    T m_data_3;
    node2* m_parent_1;
    node2* m_parent_2;
    node2* m_child;
} __attribute__((packed));

int main(int argc, char *argv[])
{
    std::cout << "sizeof(node<double>) == " << sizeof(node<double>) << std::endl;
    std::cout << "sizeof(node<float>) == " << sizeof(node<float>) << std::endl;
    std::cout << "sizeof(node2<float>) == " << sizeof(node2<float>) << std::endl;
    return 0;
}
On my system (x86-64, g++ 4.5.2), this program outputs:
sizeof(node<double>) == 48
sizeof(node<float>) == 40
sizeof(node2<float>) == 36
Of course, the "attribute" mechanism and the "packed" attribute itself are GCC-specific.
In addition to the valid points that Nicol makes:
When you call new/malloc, it doesn't necessarily correspond one-to-one with a call to the OS to allocate memory. To reduce the number of expensive system calls, the heap manager may allocate more than is requested and then "suballocate" chunks of that when you call new/malloc. Also, memory can typically only be obtained from the OS a page at a time (commonly 4 KB, the minimum page size). Essentially, there may be chunks of memory allocated that are not currently actively used, in order to speed up future allocations.
To answer your questions directly:
1) Yes, the runtime will very likely allocate more memory than you asked for, but this memory is not wasted; it will be used for future new/malloc calls, though it will still show up in Task Manager or whatever tool you use. No, it will not promote floats to doubles. The more allocations you make, the less likely this edge condition will be the cause of the size difference, and the items in Nicol's answer will dominate. For a smaller number of allocations, this item is likely to dominate (where "large" and "small" depend entirely on your OS and kernel).
2) The Windows Task Manager will give you the total memory allocated. Something like WinDbg will actually give you the virtual memory range chunks (usually allocated in a tree) that were allocated by the runtime. For Linux, I expect this data is available in one of the files in the /proc directory associated with your process.