On the man page for pthread_attr_setstacksize:
https://man7.org/linux/man-pages/man3/pthread_attr_setstacksize.3.html
A thread's stack size is fixed at the time of thread creation. Only the main thread can dynamically grow its stack.
My understanding of pthreads on Linux is that the main thread's stack size is limited to the ulimit -s value at the time the main thread is created. Although physical pages are mapped into the virtual stack region on demand as the stack is used, the size itself does not grow any further.
What does "dynamically grow" mean here? Does it imply that the main thread's stack can grow beyond ulimit -s?
The value set by ulimit -s (aka setrlimit(RLIMIT_STACK, ...)), usually 8 MB by default, is the maximum stack size. Initially, a much smaller amount of virtual memory will be allocated and mapped (perhaps just a few kb). When the stack grows larger than the amount actually allocated, it triggers a page fault. The kernel then compares the current usage with the maximum value set in the rlimit. If the maximum has not been reached, the kernel allocates more pages of virtual memory and maps them into place, then returns control to the process; this is completely transparent. If the maximum is reached, it kills the process with SIGSEGV.
It would be inefficient if the system had to commit a full 8 MB of memory (or swap) for every process's stack, when most will use far less. By allocating it only as needed, you can still have hundreds of processes, each with an 8 MB stack limit, even if the machine has only (let's say) 64 MB of memory + swap total. It's a form of overcommitment.
Also keep in mind that a process can call setrlimit itself at run time and increase its own maximum stack size, so long as nothing else has been mapped into that address space. The main thread's stack is traditionally located near the top of virtual memory, with everything else near the bottom, so that there is a lot of free address space in between, and so increasing the maximum beyond its initial 8 MB limit is usually possible. However, the stacks of other threads necessarily must be allocated elsewhere, and it is not really possible to ensure that there is a lot of free address space for them to grow into.
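For what it's worth, here is a minimal sketch (my own illustration, not from the man page) of raising the main thread's soft stack limit at run time with setrlimit(RLIMIT_STACK, ...); the 64 MiB figure is just an arbitrary example, and whether the extra room is actually usable depends on the free address space above the stack, as described above.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Read the current soft/hard limits first. */
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    /* Raise the soft limit to 64 MiB (clamped to the hard limit). */
    rl.rlim_cur = 64UL * 1024 * 1024;
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;

    if (setrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}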
Related
As in the title, can someone help me make more sense of heap and stack in CUDA? Are they any different from the usual heap and stack in CPU memory?
I ran into a problem when increasing the stack size in CUDA: there seems to be a limit, because when I set the stack size above 1024*300 (Tesla M2090) with cudaDeviceSetLimit, I get an error: argument invalid.
Another problem I want to ask about: when I set the heap size to a very large number (about 2 GB) to allocate my RTree (data structure) with 2000 elements, I get a runtime error: too many resources requested to launch.
Any idea?
P.S.: I launch with only a single thread (kernel<<<1,1>>>).
About stack and heap
The stack is allocated per thread and has a hardware limit (see below).
The heap resides in global memory, can be allocated using malloc() and must be explicitly freed using free() (CUDA doc).
These device functions:
void* malloc(size_t size);
void free(void* ptr);
can be useful, but I would recommend using them only when they are really needed. A better approach is usually to rethink the code so that the memory is allocated with host-side functions (such as cudaMalloc).
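For example, here is a minimal host-side sketch (my own illustration, assuming the CUDA runtime API; the buffer name and element count are arbitrary) that allocates the memory with cudaMalloc instead of calling malloc() inside the kernel:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int *d_data = NULL;
    size_t bytes = 2000 * sizeof(int);   /* e.g. 2000 elements, as in the question */

    /* Allocate device (global) memory from the host. */
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    /* ... launch kernels that use d_data here ... */

    cudaFree(d_data);
    return 0;
}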
The stack size has a hardware limit which can be computed (according to this answer by @njuffa) as the minimum of:
amount of local memory per thread
available GPU memory / number of SMs / maximum resident threads per SM
As you are increasing the size, and you are running only one thread, I guess your problem is the second limit, which in your case (Tesla M2090) should be: 6144 MB / 16 / 512 = 768 KB.
The heap has a fixed size (8 MB by default) which, if you need more, must be set with cudaDeviceSetLimit before launching any kernel that calls malloc(). Be aware that the memory actually set aside for the heap will be at least the size requested, due to allocation overhead.
It is also worth mentioning that the memory allocated this way is not per-thread: it has the lifetime of the CUDA context (until released by a call to free()) and can be used by threads in subsequent kernel launches.
Related posts on stack: ... stack frame for kernels, ... local memory per cuda thread
Related posts on heap: ... heap memory ..., ... heap memory limitations per thread
Stack and heap are different things. Stack represents the per thread stack, heap represents the per context runtime heap that device malloc/new uses to allocate memory. You set stack size with the cudaLimitStackSize flag, and runtime heap with the cudaLimitMallocHeapSize flag, both passed to the cudaDeviceSetLimit API.
It sounds like you want to increase the heap size, but are trying to do so by changing the stack size. On the other hand, if you do need a large stack size, you may have to reduce the number of threads per block in order to avoid kernel launch failures.
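As a rough sketch (my own illustration, not from either answer), both limits are set through cudaDeviceSetLimit and can be read back with cudaDeviceGetLimit; the 64 KB and 128 MB values are arbitrary examples, and the heap limit must be set before launching any kernel that uses device malloc():

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t value = 0;

    /* Per-thread stack, in bytes; fails if the requested size
       exceeds the hardware limit discussed above. */
    if (cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024) != cudaSuccess)
        printf("could not set the stack size\n");

    /* Runtime heap used by in-kernel malloc()/new, in bytes. */
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024) != cudaSuccess)
        printf("could not set the malloc heap size\n");

    /* Read the limits back; the runtime may round them up. */
    cudaDeviceGetLimit(&value, cudaLimitStackSize);
    printf("stack size per thread: %zu bytes\n", value);
    cudaDeviceGetLimit(&value, cudaLimitMallocHeapSize);
    printf("malloc heap size: %zu bytes\n", value);
    return 0;
}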
I'm using C++ and Windows.h in my source code. I read the CreateThread API in MSDN, but I still don't understand the essence of specifying stack size. By default it is 1 MB. But what will happen if I specify 32 bytes?
What does stack size in a thread define?
Please provide a thorough explanation and I'll appreciate it. Thanks.
The stack is used to store local variables, to pass parameters in function calls, and to store return addresses. A thread's stack has a fixed size which is determined when the thread is created. That is the value that you are referring to.
The stack size is determined when the thread is created since it needs to occupy contiguous address space. That means that the entire address space for the thread's stack has to be reserved at the point of creating the thread.
If the stack is too small then it can overflow. That's an error condition known as stack overflow, from which this website took its name. When you call a function some or all of the following happens:
Parameters are pushed onto the stack.
The return address is pushed onto the stack.
A stack frame containing space for the function's local variables is created.
All of this consumes space from the stack. When the function in turn calls another function, more stack space is consumed. As the call stack goes deeper, more stack space is required.
The consequence therefore of setting the stack size too low is that you can exhaust the stack and overflow it. That is a terminal condition from which you cannot recover. Certainly 32 bytes (rounded up to one page which is 4096 bytes) is too small for almost all threads.
If you have a program with a lot of threads, and you know that the threads don't need to reserve 1 MB of stack each, then there can be benefits to using a smaller stack size. Doing so can avoid exhausting the available process address space.
On the other hand you might have a program with a single thread that has deep call stacks that consume large amounts of stack space. In this scenario you might reserve more than the default 1MB.
However, unless you have strong reason to do otherwise, it is likely best to stick to the default stack size.
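To make that concrete, here is a minimal sketch (my own illustration, not part of the answer) that creates a thread with an explicit stack size; note that without the STACK_SIZE_PARAM_IS_A_RESERVATION creation flag the size passed is the initially committed stack, and the reserved size still comes from the executable's /STACK value (1 MB by default):

#include <windows.h>
#include <stdio.h>

static DWORD WINAPI worker(LPVOID arg)
{
    int local[1024] = {0};   /* ~4 KB of locals on this thread's stack */
    printf("worker running, arg=%p, local[0]=%d\n", arg, local[0]);
    return 0;
}

int main(void)
{
    /* Ask for 64 KB of initially committed stack for the new thread. */
    HANDLE h = CreateThread(NULL, 64 * 1024, worker, NULL, 0, NULL);
    if (h == NULL) {
        printf("CreateThread failed: %lu\n", (unsigned long)GetLastError());
        return 1;
    }
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
}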
Stack size is just a tradeoff between the ability to create many threads and the possibility of a stack overflow in one of them.
The larger the stack size, the fewer threads you can create, but the lower the chance of a stack overflow. You should worry about the stack size only if you are going to create many threads (you will have to lower the stack size, but remember the risk of stack overflow). Otherwise the default value suffices.
But what will happen if I specify 32 bytes?
I have not read the Windows documentation, but if Windows allows you to specify only 32 bytes, you will most likely get a stack overflow. According to their documentation the value is rounded up to the page size in any case, so in reality your stack size will be at least the size of a page. The created thread assumes that there is enough stack space for it to use (for allocating automatic variables, storing return addresses etc.) and allocates space according to its needs. When there is not enough stack space, the stack allocator might use invalid memory, overwriting memory used elsewhere.
What does stack size in a thread define?
It defines how much memory will be allocated for use by that thread's stack.
There is a good description of what exactly a thread call stack is here
When a binary (C/C++) is executed under Linux,
How is the stack initialized for the process?
How does the stack grow and up to what limit?
Using ulimit, I can have a limit number and by using setrlimit, I can modify it, but up to what limit, how can I determine it?
Is the same stack size allocated for all executing processes?
As you can see in the code below, I call func() recursively, doing nothing but pushing onto the stack, and the stack grew to approximately 8 MB before it crashed (stack overflow!).
#include <stdio.h>

void func()
{
    static int i = 0;
    int arr[1024] = {0};    /* ~4 KB of locals per call */
    printf("%zu KB pushed on stack!\n", ++i * sizeof(int));
    func();
}
int main()
{
func();
return 0;
}
output snippet:
8108 KB pushed on stack!
8112 KB pushed on stack!
8116 KB pushed on stack!
8120 KB pushed on stack!
Segmentation fault (core dumped)
Where did these approximately 8 MB come from?
The stack is one of several memory regions associated with a process at startup time, and it may vary during runtime. Others include text/code, heap, static/bss, etc.
Each time you call a function the stack grows: a stack frame is added on top of it. A stack frame is what a given function needs in order to execute (parameters, return value, local variables). Each time you return from a function, the stack shrinks by the same amount it grew.
You can try to estimate how deep your function call tree will be (f calls g, which in turn calls h: the depth is 3 calls, so 3 stack frames).
Yes, there is a default value that was estimated by the OS designers. That size is in general sufficient.
This is a default constant associated with your OS.
How is the stack initialized for the process?
It depends on the architecture, but in general, the kernel allocates some virtual memory in your process's VM, and sets the stack pointer register to point to the top of it.
How does the stack grow, and up to what limit?
Every function call reserves more space on the stack using an architecturally defined procedure. This is typically referred to as a "function prologue".
Using ulimit, I can get a limit number, and by using setrlimit I can modify it, but up to what limit, and how can I determine it?
ulimit -s will tell you the maximum stack size (in KB) for the current process (and all child processes which will inherit this value, unless overridden).
Is the same stack size allocated for all executing processes?
See previous answer.
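If you want to determine the limit programmatically rather than from the shell, here is a minimal sketch (my own illustration) using getrlimit(RLIMIT_STACK), which reports the same value as ulimit -s but in bytes:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("stack soft limit: unlimited\n");
    else
        printf("stack soft limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
    return 0;
}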
Related:
Is there a limit of stack size of a process in linux
int A[10000000];                             // This gives a segmentation fault
int *A = (int*)malloc(10000000*sizeof(int)); // goes without any seg fault
Now my question, just out of curiosity: if we can ultimately allocate larger amounts of space for our data structures (say, BSTs and linked lists built with pointers in C, which have no real memory limit unless the total size exceeds the machine's RAM, and likewise the second statement above using a pointer), why can't we declare an array of a similarly large size (up to the memory limit)? Is it because the space allocated for a statically sized array must be contiguous? But then where do we get the guarantee that no other piece of code is running in the next 1000000 words of RAM?
P.S.: I may be wrong in some of the statements I made; please correct me in that case.
Firstly, in a typical modern OS with virtual memory (Linux, Windows etc.) the amount of RAM makes no difference whatsoever. Your program is working with virtual memory, not with RAM. RAM is just a cache for virtual memory access. The absolute limiting factor for maximum array size is not RAM, it is the size of the available address space. Address space is the resource you have to worry about in OSes with virtual memory. In 32-bit OSes you have 4 gigabytes of address space, part of which is taken up for various household needs and the rest is available to you. In 64-bit OSes you theoretically have 16 exabytes of address space (less than that in practical implementations, since CPUs usually use less than 64 bits to represent the address), which can be perceived as practically unlimited.
Secondly, the amount of available address space in a typical C/C++ implementation depends on the memory type. There's static memory, there's automatic memory, there's dynamic memory. The address space limits for each memory type are pre-set in advance by the compiler. Which raises the question: where are you declaring your large array? Which memory type? Automatic? Static? You provided no information, but this is absolutely necessary. If you are attempting to declare it as a local variable (automatic memory), then no wonder it doesn't work, since automatic memory (aka "stack memory") has very limited address space assigned to it. Your array simply does not fit. Meanwhile, malloc allocates dynamic memory, which normally has the largest amount of address space available.
Thirdly, many compilers provide you with options that control the initial distribution of address space between different kinds of memory. You can request a much larger stack size for your program by manipulating such options. Quite possibly you can request a stack so large that your local array will fit in it without any problems. But in practice, for obvious reasons, it makes very little sense to declare huge arrays as local variables.
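As a small sketch of the options (my own illustration, not the asker's code): the same 10,000,000-int array placed in static memory and in dynamic memory, with the automatic (stack) version left commented out because it would likely overflow the stack:

#include <stdio.h>
#include <stdlib.h>

#define N 10000000

static int B[N];                 /* static memory: not on the stack, fine */

int main(void)
{
    /* int A[N]; */              /* automatic (stack) memory: ~40 MB, far beyond
                                    a typical 1-8 MB stack limit, so likely a crash */

    int *C = malloc(N * sizeof *C);   /* dynamic (heap) memory */
    if (C == NULL) {                  /* malloc reports failure by returning NULL */
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    B[0] = 1;
    C[0] = 2;
    printf("%d %d\n", B[0], C[0]);
    free(C);
    return 0;
}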
Assuming local variables, this is because on modern implementations automatic variables will be allocated on the stack which is very limited in space. This link gives some of the common stack sizes:
platform default size
=====================================
SunOS/Solaris 8172K bytes
Linux 8172K bytes
Windows 1024K bytes
cygwin 2048K bytes
The linked article also notes that the stack size can be changed. For example, on Linux, one possible way from the shell before running your process would be:
ulimit -s 32768 # sets the stack size to 32M bytes
malloc, on modern implementations, allocates from the heap instead, which is limited only by the memory available to the process, and in many cases you can even allocate more than is available due to overcommit.
I THINK you're missing the difference between total memory and your program's memory space. Your program runs in an environment created by your operating system, which grants the program a specific memory range, and the program has to work within that.
The catch: Your compiler can't 100% know the size of this range.
That means your compiler will build successfully, and the memory is only REQUESTED at run time, when malloc is called (or when the stack pointer moves as a function is called). When a function is called (creating a stack frame) and the stack cannot grow any further, you get a segmentation fault caused by the stack overflow. When malloc fails, you won't get a segfault unless you try USING memory you never got. (If you look at the manpage for malloc() you'll see it returns NULL when there's not enough memory.)
To explain the two failures: your program is granted two memory spaces, the stack and the heap. Memory allocated using malloc() is obtained via a system call and comes from the heap of your program. The request is dynamically accepted or rejected, and either the start address or NULL is returned, depending on success or failure. The stack is used when you call a function: room for all the local variables is made on the stack by program instructions. Calling a function can't just FAIL, as that would break program flow completely, so instead the system says "You're now overstepping" and segfaults, stopping the execution.
What is the difference between reserve argument and commit argument to CreateThread Windows API function?
I can't understand the following lines ..
The reserve argument sets the amount of address space the system should reserve for the thread's stack. The default is 1 MB.
The commit argument specifies the amount of physical storage that should be initially committed to the stack's reserved region.
You will find these two lines in the following paragraph, which explains one of the parameters of the CreateThread function in C++:
cbStackSize
The cbStackSize parameter specifies how much address space the thread can use for its own stack. Every thread owns its own stack. When CreateProcess starts a process, it internally calls CreateThread to initialize the process' primary thread. For the cbStackSize parameter, CreateProcess uses a value stored inside the executable file. You can control this value using the linker's /STACK switch:
/STACK:[reserve][,commit]
The reserve argument sets the amount of address space the system should reserve for the thread's stack. The default is 1 MB.
The commit argument specifies the amount of physical storage that should be initially committed to the stack's reserved region.
The distinction is the one between virtual and physical memory.
In any operating system worthy of that name, including Windows, pointers don't designate locations on the memory chip directly. They are locations in a process-specific virtual memory space, and the operating system allocates parts of the physical memory chip, on demand, to back the parts of that space where the process actually stores something. It may also swap some data to disk when it runs out of RAM.
The reserve is the size of the contiguous virtual memory block to allocate for the stack. Below and above that range other things will be stored, so the reserve puts an upper limit on how big the stack can grow.
Fortunately virtual memory is usually plentiful. You have 2 GiB on 32-bit Windows, 3 GiB if you link with the /LARGEADDRESSAWARE flag, and a huge amount if you compile for 64 bits (x64). The only exception is WinCE before 5.0, where you have only 32 MiB. So unless you are creating zillions of threads, you can be generous here, and you should be, because if you don't have enough, the process will crash.
The commit is the size of physical memory that the system should preallocate for the stack. This makes the system immediately grab some space in physical memory, which is a shared resource and may be scarce; it may need to swap out or discard its previous contents to get it. When you exceed the committed amount, the system will automatically commit more, at the cost of a small delay. So the only thing you gain by increasing the value here is a little speed-up if you actually need the memory, and a slowdown if you don't. So you should be conservative here.
The stack is where local variables are placed. If you use large local buffers (which is often reasonable, since stack allocation is much faster than heap allocation via malloc/new/anything that uses std::allocator), you need to reserve enough stack. If you don't use large buffers, the 1 MiB default is usually plenty.
The reserve places a ceiling on how much stack space the thread will have. The commit places a floor on it. So it starts out consuming commit amount of memory and stops consuming when it hits reserve.
Every process has an address space. The stack of each thread is located somewhere in this space. When the thread is created, the OS allocates a piece of address space of the reserve size.
But it does not back all of this space with real memory. It assigns only the commit amount of memory.
The stack can grow over time and the OS will add more pages to it, extending the committed amount. But it cannot grow indefinitely: it cannot grow bigger than the reserve.
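Here is a minimal sketch (my own illustration, not from the answers above) of how the single dwStackSize argument of CreateThread maps to either the commit or the reserve, depending on the STACK_SIZE_PARAM_IS_A_RESERVATION creation flag; the sizes are arbitrary examples:

#include <windows.h>

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;                   /* nothing to do; the thread exists only for the demo */
    return 0;
}

int main(void)
{
    /* 256 KB is the initial COMMIT; the RESERVE still comes from the
       executable's /STACK value (1 MB by default). */
    HANDLE a = CreateThread(NULL, 256 * 1024, worker, NULL, 0, NULL);

    /* 8 MB is the RESERVE (the ceiling the stack may grow to); the initial
       COMMIT still comes from the executable's /STACK value. */
    HANDLE b = CreateThread(NULL, 8 * 1024 * 1024, worker, NULL,
                            STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);

    if (a) { WaitForSingleObject(a, INFINITE); CloseHandle(a); }
    if (b) { WaitForSingleObject(b, INFINITE); CloseHandle(b); }
    return 0;
}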