When a binary (C/C++) is executed under Linux,
How is the stack initialized for the process?
How does the stack grow and up to what limit?
Using ulimit I can see the limit, and using setrlimit I can modify it, but up to what limit? How can I determine it?
Is the same stack size allocated for all executing processes?
As you can see in the code below, I have called func() recursively, only pushing onto the stack, and the stack grew to approximately 8 MB. Then it crashed (stack overflow!).
#include <stdio.h>

void func()
{
    static int i = 0;
    int arr[1024] = {0};   /* roughly 4 KB of locals pushed per call */
    printf("%d KB pushed on stack!\n", (int)(++i * sizeof(int)));
    func();
}

int main()
{
    func();
    return 0;
}
output snippet:
8108 KB pushed on stack!
8112 KB pushed on stack!
8116 KB pushed on stack!
8120 KB pushed on stack!
Segmentation fault (core dumped)
Where did these approximately 8 MB come from?
The stack is one of several memory regions associated with a process at startup time, and it may vary during runtime. Others include text/code, heap, static/bss, etc.
Each time you call a function, the stack grows: a stack frame is added on top of it. A stack frame is what a given function needs in order to execute (parameters, return address, local variables). Each time you return from a function, the stack shrinks by the same amount it grew.
You can try to estimate how deep your function call tree will be (f calls g, which in turn calls h: depth is 3 calls, so 3 stack frames).
Yes, there is a default value that was estimated by the OS designers. That size is generally sufficient.
This is a default constant associated with your OS.
How is the stack initialized for the process?
It depends on the architecture, but in general, the kernel allocates some virtual memory in your process's VM, and sets the stack pointer register to point to the top of it.
How does the stack grow and up to what limit?
Every function call reserves more space on the stack using an architecturally defined procedure; this is typically referred to as the function prologue.
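For illustration (my addition, not from the original answer), this is roughly what a typical unoptimized x86-64 compiler might emit as the prologue and epilogue of a small function; the exact instructions depend on the compiler, ABI, and optimization level:

// A small function whose locals must be reserved on the stack.
void example(int n)
{
    char buf[256];          // local storage carved out of the stack frame
    buf[0] = (char)n;       // touch it so the frame is actually used
}

int main()
{
    example(1);
    return 0;
}

// A typical unoptimized x86-64 prologue/epilogue for example() might look like:
//   push  rbp              ; save the caller's frame pointer
//   mov   rbp, rsp         ; establish the new frame
//   sub   rsp, 272         ; reserve space for locals (size is illustrative)
//   ...                    ; function body
//   leave                  ; restore rsp and rbp
//   ret                    ; pop the return address and return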
Using ulimit, I can see the limit, and using setrlimit, I can modify it, but up to what limit? How can I determine it?
ulimit -s will tell you the maximum stack size (in KB) for the current process (and all child processes which will inherit this value, unless overridden).
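As a minimal sketch (my addition, assuming a Linux/glibc environment), the same limit can also be read from within a program with getrlimit(RLIMIT_STACK, ...):

#include <stdio.h>
#include <sys/resource.h>

int main()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        /* rlim_cur is the soft limit (what ulimit -s reports, but in bytes);
           rlim_max is the hard limit the soft limit may be raised up to. */
        printf("soft stack limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
        printf("hard stack limit: %llu bytes\n", (unsigned long long)rl.rlim_max);
    }
    return 0;
}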
Is the same stack size allocated for all executing processes?
See previous answer.
Related:
Is there a limit of stack size of a process in linux
On the man page for pthread_attr_setstacksize (https://man7.org/linux/man-pages/man3/pthread_attr_setstacksize.3.html), it says:
A thread's stack size is fixed at the time of thread creation. Only the main thread can dynamically grow its stack.
My understanding of Linux pthreads is that the main thread's stack size is limited to the ulimit -s value in effect when the main thread is created. Although physical pages are mapped to virtual memory on demand as the stack is used, the size does not grow beyond that.
What does "dynamically grow" mean here? Does it imply that the main thread's stack size can grow beyond ulimit -s?
The value set by ulimit -s (aka setrlimit(RLIMIT_STACK, ...)), usually 8 MB by default, is the maximum stack size. Initially, a much smaller amount of virtual memory will be allocated and mapped (perhaps just a few kb). When the stack grows larger than the amount actually allocated, it triggers a page fault. The kernel then compares the current usage with the maximum value set in the rlimit. If the maximum has not been reached, the kernel allocates more pages of virtual memory and maps them into place, then returns control to the process; this is completely transparent. If the maximum is reached, it kills the process with SIGSEGV.
It would be inefficient if the system had to reserve a full 8 MB of virtual memory for every process, when most will use far less. By allocating it only as needed, you can still have hundreds of processes, each with an 8 MB stack limit, even if the machine has only (let's say) 64 MB of memory + swap total. It's a form of overcommitment.
Also keep in mind that a process can call setrlimit itself at run time and increase its own maximum stack size, so long as nothing else has been mapped into that address space. The main thread's stack is traditionally located near the top of virtual memory, with everything else near the bottom, so that there is a lot of free address space in between, and so increasing the maximum beyond its initial 8 MB limit is usually possible. However, the stacks of other threads necessarily must be allocated elsewhere, and it is not really possible to ensure that there is a lot of free address space for them to grow into.
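As a rough sketch (my addition, not part of the answer), a process can raise its own soft limit toward the hard limit with setrlimit before starting deep recursion:

#include <stdio.h>
#include <sys/resource.h>

int main()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    /* Ask for a 64 MB soft limit, capped at the hard limit. */
    rlim_t wanted = 64ull * 1024 * 1024;
    rl.rlim_cur = (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max) ? rl.rlim_max : wanted;

    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        perror("setrlimit");

    /* Recursion that needs more than the default 8 MB can run after this point. */
    return 0;
}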
As the title says, can someone explain heap and stack in CUDA to me? Are they any different from the usual heap and stack in CPU memory?
I ran into a problem when increasing the stack size in CUDA; it seems to have a limit, because when I set the stack size above 1024*300 (on a Tesla M2090) with cudaDeviceSetLimit, I got an error: argument invalid.
Another problem I want to ask about: when I set the heap size to a very large number (about 2 GB) to allocate my RTree (data structure) with 2000 elements, I got a runtime error: too many resources requested to launch.
Any ideas?
P.S.: I launch with only a single thread (kernel<<<1,1>>>).
About stack and heap
The stack is allocated per thread and has a hardware limit (see below).
The heap resides in global memory, can be allocated using malloc(), and must be explicitly freed using free() (see the CUDA documentation).
These device functions:
void* malloc(size_t size);
void free(void* ptr);
can be useful, but I would recommend using them only when they are really needed. A better approach is to rethink the code so that the memory is allocated with host-side functions (such as cudaMalloc).
The stack size has a hardware limit which can be computed (according to this answer by njuffa) as the minimum of:
amount of local memory per thread
available GPU memory / number of SMs / maximum resident threads per SM
As you are increasing the size, and you are running only one thread, I guess your problem is the second limit, which in your case (Tesla M2090) should be: 6144 MB / 16 / 512 = 0.75 MB (768 KB).
The heap has a fixed size (default 8 MB) that must be specified before any call to malloc() by using the function cudaDeviceSetLimit. Be aware that the memory actually allocated will be at least the size requested, due to some allocation overhead.
It is also worth mentioning that the heap memory is not per-thread: it has the lifetime of the CUDA context (until released by a call to free()) and can be used by threads of subsequent kernel launches.
Related posts on stack: ... stack frame for kernels, ... local memory per cuda thread
Related posts on heap: ... heap memory ..., ... heap memory limitations per thread
Stack and heap are different things. The stack is the per-thread stack; the heap is the per-context runtime heap that device-side malloc/new uses to allocate memory. You set the stack size with the cudaLimitStackSize flag and the runtime heap with the cudaLimitMallocHeapSize flag, both passed to the cudaDeviceSetLimit API.
It sounds like you want to increase the heap size but are trying to do so by changing the stack size. On the other hand, if you do need a large stack size, you may have to reduce the number of threads per block in order to avoid kernel launch failures.
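For illustration only (my sketch, not code from the answer), both limits are set and queried host-side through cudaDeviceSetLimit/cudaDeviceGetLimit before launching any kernel:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0, heapSize = 0;

    // Per-thread stack size, in bytes.
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);

    // Runtime heap used by in-kernel malloc()/new, in bytes; this must be set
    // before the first kernel that uses malloc() is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256ull * 1024 * 1024);
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);

    printf("stack per thread: %zu bytes, malloc heap: %zu bytes\n", stackSize, heapSize);
    return 0;
}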
I am trying to understand a program which uses multi-threading with shared-memory. The parent thread calls the following function and I don't quite understand how it works.
#define MAX_STACK_SIZE 16384 // 16KB of stack
/*!
* Writes to a 16 KB buffer on the stack. If we are using 4K pages for our
* stack, this will make sure that we won't have a page fault when the stack
* grows. Also mlock's all pages associated with the current process, which
* prevents the program from being swapped out. If we do run out of
* memory, the robot program will be killed by the OOM process killer (and
* leaves a log) instead of just becoming unresponsive.
*/
void HardwareBridge::prefaultStack() {
  printf("[Init] Prefault stack...\n");
  volatile char stack[MAX_STACK_SIZE];
  memset(const_cast<char*>(stack), 0, MAX_STACK_SIZE);
  if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
    initError(
        "mlockall failed. This is likely because you didn't run robot as "
        "root.\n",
        true);
  }
}

// Parent thread
void HardwareBridge::run() {
  printf("[HardwareBridge] Init stack\n");
  prefaultStack();
  //printf("[HardwareBridge] Init scheduler\n"); // Commented because unrelated to current question
  //setupScheduler();
  // Calls multiple threads here
  for (;;) {
    usleep(10000000);
  }
}
Can someone explain the purpose of this function? Based on the comment, I could understand that it prevents the stack size from growing beyond 16 KB. However, the shared memory is predominantly allocated dynamically using the new keyword in the program. Doesn't dynamic memory allocation take place on the heap rather than the stack? How does the function help in this scenario?
Based on the comment, I could understand that it prevents the stack size from growing beyond 16 KB.
That's not what the comment says and not what the function does.
Can someone explain the purpose of this function?
The comment explains it. The function does two things:
It pre-allocates 16K of stack.
It "locks" the allocated memory which prevents it from being swapped to disk.
These two things guarantee that there won't be a page fault when the stack usage grows (as long as it doesn't grow beyond 16K).
However, the shared memory is predominantly allocated dynamically
True. This means that shared memory, or any other dynamic memory allocation, is irrelevant to this function.
In prefaultStack() you make sure that the stack has grown by at least 16 KB beyond its previous size. Then, by calling mlockall(), you lock the current memory pages into RAM, preventing them from being swapped out. After that you exit the function.
I would say that the only real effect of this is to ensure that, no matter what, the calling thread will have at least 16 KB of stack available, even if later on some memory-hungry process eats up all remaining memory.
As per my understanding, my replies are inline below.
Based on the comment, I could understand that it prevents the stack size from growing beyond 16 KB.
Nope. The prefaultStack function has a char[MAX_STACK_SIZE] array on its stack, so the stack segment of the main process will be at least 4 pages (16 KB) plus whatever stack main itself uses. And any virtual memory pages of this process (along with any allocated in the future as the stack or heap grows) will not be swapped out to the swap area, because mlockall is called with MCL_CURRENT and MCL_FUTURE (https://linux.die.net/man/2/mlockall). That is the only functionality of this function. Nothing here relates to dynamic memory, the heap, or shared memory.
However, the shared memory is predominantly allocated dynamically using the new keyword in the program.
You are dealing with multi-threading, so dynamic memory (the heap address space) is shared among the threads. This code does nothing with respect to shared memory between two processes.
Doesn't dynamic memory allocation take place on the heap rather than the stack? How does the function help in this scenario?
Dynamic memory is allocated from the heap, and this function does not deal with the heap in any way. This code only makes sure that all the stack and heap pages of the process (those CURRENTly allocated and those allocated in the FUTURE) are never swapped out to the swap area, preventing page faults, which are time-consuming, costly operations.
I'm using C++ and Windows.h in my source code. I read about the CreateThread API on MSDN, but I still don't understand the essence of specifying the stack size. By default it is 1 MB. But what will happen if I specify 32 bytes?
What does stack size in a thread define?
Please provide a thorough explanation and I'll appreciate it. Thanks.
The stack is used to store local variables, pass parameters in function calls, and store return addresses. A thread's stack has a fixed size which is determined when the thread is created. That is the value that you are referring to.
The stack size is determined when the thread is created since it needs to occupy contiguous address space. That means that the entire address space for the thread's stack has to be reserved at the point of creating the thread.
If the stack is too small then it can overflow. That's an error condition known as stack overflow, from which this website took its name. When you call a function some or all of the following happens:
Parameters are pushed onto the stack.
The return address is pushed onto the stack.
A stack frame containing space for the function's local variables is created.
All of this consumes space from the stack. When the function in turn calls another function, more stack space is consumed. As the call stack goes deeper, more stack space is required.
The consequence therefore of setting the stack size too low is that you can exhaust the stack and overflow it. That is a terminal condition from which you cannot recover. Certainly 32 bytes (rounded up to one page which is 4096 bytes) is too small for almost all threads.
If you have a program with a lot of threads, and you know that the threads don't need to reserve 1 MB of stack each, then there can be benefits to using a smaller stack size. Doing so can avoid exhausting the available process address space.
On the other hand you might have a program with a single thread that has deep call stacks that consume large amounts of stack space. In this scenario you might reserve more than the default 1MB.
However, unless you have strong reason to do otherwise, it is likely best to stick to the default stack size.
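As a sketch only (my addition, not code from the answer), the stack size is the second argument to CreateThread; passing 0 uses the default from the executable header, commonly 1 MB:

#include <windows.h>
#include <cstdio>

DWORD WINAPI worker(LPVOID param)
{
    (void)param;
    // Local variables of this thread live on the stack sized below.
    printf("worker running\n");
    return 0;
}

int main()
{
    // dwStackSize = 64 KB requested; the system rounds this up to page
    // granularity. Passing 0 would use the executable's default instead.
    HANDLE h = CreateThread(nullptr, 64 * 1024, worker, nullptr, 0, nullptr);
    if (h != nullptr) {
        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);
    }
    return 0;
}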
Stack size is just a trade-off between the ability to create many threads and the possibility of a stack overflow in one of them.
The larger the stack size, the fewer threads you can create and the lower the chance of a stack overflow. You should worry about stack size only if you are going to create many threads (you will have to lower the stack size, but remember the risk of stack overflow). Otherwise the default value suffices.
But what will happen if I specify 32 bytes?
I have not read the Windows documentation, but if Windows allows this (specifying only 32 bytes), you will most likely get a stack overflow. According to the documentation the value is rounded up to the page size in any case, so in reality your stack size will be at least the size of a page. The created thread assumes that there is enough stack space for it to use (for allocating automatic variables, storing return addresses, etc.) and consumes space according to its needs. When there is not enough stack space, the stack allocator might use invalid memory, overwriting memory used elsewhere.
What does stack size in a thread define?
It defines how much memory will be allocated for use by that thread's stack.
There is a good description of what exactly a thread call stack is here
Suppose we have the function:
void foo(int x)
{
    foo(x);
}
On my machine (i7) it will run approximately 260k times and then generate a segmentation fault. Any idea why that happens?
Every time you call a function, it requires space on the runtime stack. This is where variables local to that function have their memory allocated. What's happening is that you're recursing so many times that you're running out of stack space -- a stack overflow. (The name of this site!)
See also: http://en.wikipedia.org/wiki/Stack_overflow
Every time a function is called, the system stores its call frame on the stack. In this case the system will keep storing function calls until the stack becomes full; this state is called a stack overflow.