I'm using C++ and Windows.h in my source code. I read the CreateThread API in MSDN, but I still don't understand the essence of specifying stack size. By default it is 1 MB. But what will happen if I specify 32 bytes?
What does stack size in a thread define?
Please provide a thorough explanation and I'll appreciate it. Thanks.
The stack is used to store local variables, pass parameters in function calls, store return addresses. A thread's stack has a fixed size which is determined when the thread is created. That is the value that you are referring too.
The stack size is determined when the thread is created since it needs to occupy contiguous address space. That means that the entire address space for the thread's stack has to be reserved at the point of creating the thread.
If the stack is too small then it can overflow. That's an error condition known as stack overflow, from which this website took its name. When you call a function some or all of the following happens:
Parameters are pushed onto the stack.
The return address is pushed onto the stack.
A stack frame containing space for the function's local variables is created.
All of this consumes space from the stack. When the function in turn calls another function, more stack space is consumed. As the call stack goes deeper, more stack space is required.
The consequence therefore of setting the stack size too low is that you can exhaust the stack and overflow it. That is a terminal condition from which you cannot recover. Certainly 32 bytes (rounded up to one page which is 4096 bytes) is too small for almost all threads.
If you have a program with a lot of threads, and you know that the thread's don't need to reserve 1MB of stack size then there can be benefits to using a smaller stack size. Doing so can avoid exhausting the available process address space.
On the other hand you might have a program with a single thread that has deep call stacks that consume large amounts of stack space. In this scenario you might reserve more than the default 1MB.
However, unless you have strong reason to do otherwise, it is likely best to stick to the default stack size.
Stack size is just tradeoff between ability to create many threads and possibility of stack overflow in one of them.
The more stack size is, the less number of threads you can create and the less possibility of stack overflow is. You should worry about stack size only if you are going to create many threads (you will have to lower stack size but remember about stack overflow). Otherwise default value is suffice.
But what will happen if I specify 32 bytes?
I have not read the Windows documentation, but if Windows allows this (to specify only 32 bytes), you will most likely get a stack overflow. According to their documentation the value is rounded up to the page size in anycase, therefore in reality you stack size will be at least the size of a page. The created thread assumes that there exists enough "stack space" for it to use (for allocating automatic variables, storing function addresses etc), and allocates space according to it's needs. When there is not enough stack space, the stack allocator might use invalid memory, overriding memory used elsewhere.
What does stack size in a thread define?
It defines how much memory will be allocated for use by that thread's stack.
There is a good description of what exactly a thread call stack is here
Related
As in title, can someone make sense for me more about heap and stack in CUDA? Does it have any different with original heap and stack in CPU memory?
I got a problem when I increase stack size in CUDA, it seem to have its limitation, because when I set stack size over 1024*300 (Tesla M2090) by cudaDeviceSetLimit, I got an error: argument invalid.
Another problem I want to ask is: when I set heap size to very large number (about 2GB) to allocate my RTree (data structure) with 2000 elements, I got an error in runtime: too many resources requested to launch
Any idea?
P/s: I launch with only single thread (kernel<<<1,1>>>)
About stack and heap
Stack is allocated per thread and has an hardware limit (see below).
Heap reside in global memory, can be allocated using malloc() and must be explicitly freed using free() (CUDA doc).
This device functions:
void* malloc(size_t size);
void free(void* ptr);
can be useful but I would recommend to use them only when they are really needed. It would be a better approach to rethink the code to allocate the memory using the host-side functions (as cudaMalloc).
The stack size has an hardware limit which can be computed (according to this answer by #njuffa) by the minimum of:
amount of local memory per thread
available GPU memory / number of SMs / maximum resident threads per SM
As you are increasing the size, and you are running only one thread, I guess your problem is the second limit, which in your case (TESLA M2090) should be: 6144/16/512 = 750KB.
The heap has a fixed size (default 8MB) that must be specified before any call to malloc() by using the function cudaDeviceSetLimit. Be aware that the memory allocated will be at least the size requested due to some allocation overhead.
Also it is worth mentioning that the memory limit is not per-thread but instead has the lifetime of the CUDA context (until released by a call to free()) and can be used by thread in a subsequent kernel launch.
Related posts on stack: ... stack frame for kernels, ... local memory per cuda thread
Related posts on heap: ... heap memory ..., ... heap memory limitations per thread
Stack and heap are different things. Stack represents the per thread stack, heap represents the per context runtime heap that device malloc/new uses to allocate memory. You set stack size with the cudaLimitStackSize flag, and runtime heap with the cudaLimitMallocHeapSize flag, both passed to the cudaDeviceSetLimit API.
It sounds like you are wanting to increase the heap size, but are trying to do so by changing the stack size. On the other hand, if you need a large stack size, you may have to reduce the number of threads per block you use in order to avoid kernel launch failures.
I am slightly confused on how a compiler stores variable on the stack. I read that c++ can store local variables on the stack, but if the stack is LIFO, how could it make sure to call the right variable when the variable is called in the program?
The run-time stack isn't simply a LIFO structure; it's a complex structure which supports random access.
Typically how it works on many platforms is that whe na function is entered, a new stack frame is pushed onto the stack (LIFO fashion). Local variables are stored within the frame and are accessed relative to a more or less stable pointer stored in a register.
When other functions are called, they push their own frames, but they restore everything before returning.
Run-time stacks also typically support the ad hoc pushing and popping of individual values, for the purpose of temporarily saving register values or for parameter passing.
A common implementation strategy for that is to allocate the frame first. For instance if a 512 byte stack frame is needed, the stack pointer is moved by 512 bytes. The stack pointer is then used freely for pushing and popping, provided it doesn't pop too far and start gobbling into the frame.
A separate frame pointer may be used which tracks the location of the frame. The frame is then accessed relative to that frame pointer, which allows the stack pointer to move without interfering with those accesses.
Compilers can generate code without the use of frame pointers also; if the stack pointer only moves in ways that a compiler knows about, it can adjust all the variable references. Over an area of the code where the compiler knows that the stack pointer has moved by four bytes due to something being pushed on it, it can adjust the stack-pointer-relative frame references by four bytes.
The basic principle is that when a given function is executing, then the stack is in the right state that is expected by that function (unless something has gone horribly wrong due to a bug causing corruption or something). The LIFO allocation strategy of the stack frames closely tracks the function calls and returns. Functions that are called must save and restore certain registers (the "callee-saved registers"), which helps to maintain the stable stack environment.
How do segmented stacks work? This question also applies to Boost.Coroutine so I am using the C++ tag as well here. The main doubt comes from this article It looks like what they do is keep some space at the bottom of the stack and check if it has gotten corrupted by registering some sort of signal handler with the memory allocated there (perhaps via mmap and mprotect?) And then when they detect that they have run out of space they continue by allocating more memory and then continuing from there. 3 questions about this
Isn't this construct a user space thing? How do they control where the new stack is allocated and how do the instructions the program is compiled down to get aware of that?
A push instruction is basically just adding a value to the stack pointer and then storing the value in a register on the stack, then how can the push instruction be aware of where the new stack starts and correspondingly how can the pop know when it has to move the stack pointer back to the old stack?
They also say
After we've got a new stack segment, we restart the goroutine by retrying the function that caused us to run out of stack
what does this mean? Do they restart the entire goroutine? Won't this possibly cause non deterministic behavior?
How do they detect that the program has overrun the stack? If they keep a canary-ish memory area at the bottom then what happens when the user program creates an array big enough that overflows that? Will that not cause a stack overflow and is a potential security vulnerability?
If the implementations are different for Go and Boost I would be happy to know how either of them deal with this situation 🙂
I'll give you a quick sketch of one possible implementation.
First, assume most stack frames are smaller than some size. For ones that are larger, we can use a longer instruction sequence at entry to make sure there is enough stack space. Let's assume we're on an architecture that that has 4k pages and we're choosing 4k - 1 as the maximum size stack frame handled by the fast path.
The stack is allocated with a single guard page at the bottom. That is, a page that is not mapped for write. At function entry, the stack pointer is decremented by the stack frame size, which is less than the size of a page, and then the program arranges to write a value at the lowest address in the newly allocated stack frame. If the end of the stack has been reached, this write will cause a processor exception and ultimately be turned into some sort of upcall from the OS to the user program -- e.g. a signal in UNIX family OSes.
The signal handler (or equivalent) has to be able to determine this is a stack extension fault from the address of the instruction that faulted and the address it was writing to. This is determinable as the instruction is in the prolog of a function and the address being written to is in the guard page of the stack for the current thread. The instruction being in the prolog can be recognized by requiring a very specific pattern of instructions at the start of functions, or possibly by maintaining metadata about functions. (Possibly using traceback tables.)
At this point the handler can allocate a new stack block, set the stack pointer to the top of the block, do something to handle unchaining the stack block, and then call the function that faulted again. This second call is safe because the fault is in the function prolog the compiler generated and no side effects are allowed before validating there is enough stack space. (The code may also need to fixup the return address for architectures that push it onto the stack automatically. If the return address is in a register, it just needs to be in the same register when the second call is made.)
Likely the easiest way to handle unchaining is to push a small stack frame onto the new extension block for a routine that when returned to unchains the new stack block and frees the allocated memory. It then returns the processor registers to the state they were in when the call was made that caused the stack to need to be extended.
The advantage of this design is that the function entry sequence is very few instructions and is very fast in the non-extending case. The disadvantage is that in the case where the stack does need to be extended, the processor incurs an exception, which may cost much much more than a function call.
Go doesn't actually use a guard page if I understand correctly. Rather the function prolog explicitly checks the stack limit and if the new stack frame won't fit it calls a function to extend the stack.
Go 1.3 changed its design to not use a linked list of stack blocks. This is to avoid the trap cost if the extension boundary is crossed in both directions many times in a certain calling pattern. They start with a small stack, and use a similar mechanism to detect the need for extension. But when a stack extension fault does occur, the entire stack is moved to a larger block. This removes the need for unchaining entirely.
There are quite a few details glossed over here. (E.g. one may not be able to do the stack extension in the signal handler itself. Rather the handler can arrange to have the thread suspended and hand it off to a manager thread. One likely has to use a dedicated signal stack to handle the signal as well.)
Another common pattern with this sort of thing is the runtime requiring there to be a certain amount of valid stack space below the current stack frame for either something like a signal handler or for calling special routines in the runtime. Go works this way and the stack limit test guarantees a certain fixed amount of stack space is available below the current frame. One can e.g. call plain C functions on the stack so long as one guarantees they do not consume more than the fixed stack reserve amount. (One can use this to call C library routines in theory, though most of these have no formal specification of how much stack they might use.)
Dynamic allocation in the stack frame, such as alloca or stack allocated variable length arrays, add some complexity to the implementation. If the routine can compute the entire final size of the frame in the prolog then it is fairly straightforward. Any increase in the frame size while the routine is running likely has to be modeled as a new call, though with Go's new architecture that allows moving the stack, it is possible the alloca point in the routine can be made such that all the state allows a stack move to happen there.
I understand that if you have a multithreaded application, and you need to allocate a lot of memory, then you should allocate on heap. Stack space is divided up amongst threads of your application, thus the size of stack for each thread gets smaller as you create new threads. Thus, if you tried to allocate lots of memory on stack, it could overflow. But, assuming that you have a single-threaded application, is the stack size essentially the same as that for heap?
I read elsewhere that stack and heap don't have a clearly defined boundary in the address space, rather that they grow into each other.
P.S. Lifetime of the objects being allocated is not an issue. The objects gets created first thing in the program, and gets cleaned at exit. I don't have to worry about it going out of scope, and thus getting cleaned from stack space.
No, stack size is not the same as heap. Stack objects get pushed/popped in a LIFO manner, and used for things such as program flow. For example, arguments are "pushed" into the stack before a function call, then "popped" into function arguments to be accessed. Recursion therefore, uses a lot of stack space if you go too deep. Heap is really for pointers and allocated memory. In the real world, the stack is like the gears in your clock, and the heap is like your desk. Your clock sits on your desk, because it takes up room - but you use it for something completely different than your desk.
Check out this question on Stack Overflow:
Why is memory split up into stack and heap?'
int A[10000000]; //This gives a segmentation fault
int *A = (int*)malloc(10000000*sizeof(int));//goes without any set fault.
Now my question is, just out of curiosity, that if ultimately we are able to allocate higher space for our data structures, say for example, BSTs and linked lists created using the pointers approach in C have no as such memory limit(unless the total size exceeds the size of RAM for our machine) and for example, in the second statement above of declaring a pointer type, why is that we can't have an array declared of higher size(until it reaches the memory limit!!)...Is this because the space allocated is contiguous in a static sized array?.But then from where do we get the guarantee that in the next 1000000 words in RAM no other piece of code would be running...??
PS: I may be wrong in some of the statements i made..please correct in that case.
Firstly, in a typical modern OS with virtual memory (Linux, Windows etc.) the amount of RAM makes no difference whatsoever. Your program is working with virtual memory, not with RAM. RAM is just a cache for virtual memory access. The absolute limiting factor for maximum array size is not RAM, it is the size of the available address space. Address space is the resource you have to worry about in OSes with virtual memory. In 32-bit OSes you have 4 gigabytes of address space, part of which is taken up for various household needs and the rest is available to you. In 64-bit OSes you theoretically have 16 exabytes of address space (less than that in practical implementations, since CPUs usually use less than 64 bits to represent the address), which can be perceived as practically unlimited.
Secondly, the amount of available address space in a typical C/C++ implementation depends on the memory type. There's static memory, there's automatic memory, there's dynamic memory. The address space limits for each memory type are pre-set in advance by the compiler. Which raises the question: where are you declaring your large array? Which memory type? Automatic? Static? You provided no information, but this is absolutely necessary. If you are attempting to declare it as a local variable (automatic memory), then no wonder it doesn't work, since automatic memory (aka "stack memory") has very limited address space assigned to it. Your array simply does not fit. Meanwhile, malloc allocates dynamic memory, which normally has the largest amount of address space available.
Thirdly, many compilers provide you with options that control the initial distribution of address space between different kinds of memory. You can request a much larger stack size for your program by manipulating such options. Quite possibly you can request a stack so large, than your local array will fit in it without any problems. But in practice, for obvious reasons, it makes very little sense to declare huge arrays as local variables.
Assuming local variables, this is because on modern implementations automatic variables will be allocated on the stack which is very limited in space. This link gives some of the common stack sizes:
platform default size
=====================================
SunOS/Solaris 8172K bytes
Linux 8172K bytes
Windows 1024K bytes
cygwin 2048K bytes
The linked article also notes that the stack size can be changed for example in Linux, one possible way from the shell before running your process would be:
ulimit -s 32768 # sets the stack size to 32M bytes
While malloc on modern implementations will come from the heap, which is only limited to the memory you have available to the process and in many cases you can even allocate more than is available due to overcommit.
I THINK you're missing the difference between total memory, and your programs memory space. Your program runs in an environment created by your operating system. It grants it a specific memory range to the program, and the program has to try to deal with that.
The catch: Your compiler can't 100% know the size of this range.
That means your compiler will successfully build, and it will REQUEST that much room in memory when the time comes to make the call to malloc (or move the stack pointer when the function is called). When the function is called (creating a stack frame) you'll get a segmentation fault, caused by the stack overflow. When the malloc is called, you won't get a segfault unless you try USING the memory. (If you look at the manpage for malloc() you'll see it returns NULL when there's not enough memory.)
To explain the two failures, your program is granted two memory spaces. The stack, and the heap. Memory allocated using malloc() is done using a system call, and is created on the heap of your program. This dynamically accepts or rejects the request and returns either the start address, or NULL, depending on a success or fail. The stack is used when you call a new function. Room for all the local variables is made on the stack, this is done by program instructions. Calling a function can't just FAIL, as that would break program flow completely. That causes the system to say "You're now overstepping" and segfault, stopping the execution.