How to set openmp thread stack to unlimited? - fortran

Can someone tell me how to set OpenMP stack size to unlimited?
Like this link: Why Segmentation fault is happening in this openmp code?
I also have a project written in Fortran (a customer's complex code). If I set OMP_STACKSIZE, the project runs normally; if I unset it, the project fails.
However, different input data need different OMP_STACKSIZE values, so I have to find a suitable value by trial for each input (because I must save memory).
Can I set the OpenMP stack size to unlimited, as with pthread (ulimit -s unlimited)? Or is there some way to set the OpenMP stack size dynamically?
I'm using RHEL 6.1, and the Intel compiler.
Thanks a lot!

There is a big difference between how the stack of the main thread and the stacks of the worker threads are implemented.
The "unlimited" stack of the main thread starts at the highest virtual address available in user mode and grows downwards until it meets the program break (the end of the data segment) or hits another memory allocation (either named or anonymous mapping) at which point the program crashes.
Any additional stacks have to be placed somewhere in memory between the program break and the bottom of the main stack. They cannot have an arbitrarily extensible length, since their initial placement (i.e. the distance between their beginnings) determines their maximum sizes (and vice versa - the specified maximum sizes determine their initial placement). This is the reason why the Linux man page for pthread_create(3) (used by virtually all OpenMP runtimes in order to create new threads) states:
On Linux/x86-32, the default stack size for a new thread is 2 megabytes. Under the NPTL threading implementation, if the RLIMIT_STACK soft resource limit at the time the program started has any value other than "unlimited", then it determines the default stack size of new threads. Using pthread_attr_setstacksize(3), the stack size attribute can be explicitly set in the attr argument used to create a thread, in order to obtain a stack size other than the default.
In other words, the answer is no - you cannot specify unlimited stack size for threads other than the main one.
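What a thread library can do instead is request a fixed, explicit size per thread. The following is a minimal sketch in C++ of the pthread_attr_setstacksize(3) call quoted above, assuming Linux with glibc (pthread_getattr_np is a GNU extension). The helper names run_with_stack_size and report_stack_size are hypothetical and only serve the illustration; this is not how any particular OpenMP runtime is written.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>

// Hypothetical worker: stores the stack size of the thread it runs on.
static void* report_stack_size(void* out) {
    pthread_attr_t attr;
    std::size_t size = 0;
    // GNU extension: fills attr with the attributes of a running thread.
    if (pthread_getattr_np(pthread_self(), &attr) == 0) {
        pthread_attr_getstacksize(&attr, &size);
        pthread_attr_destroy(&attr);
    }
    *static_cast<std::size_t*>(out) = size;
    return nullptr;
}

// Hypothetical helper: starts one thread with the requested stack size and
// returns the size that thread observed (0 if the request was rejected).
std::size_t run_with_stack_size(std::size_t requested) {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;
    std::size_t observed = 0;
    pthread_t tid;
    // The size must be fixed before the thread exists; it cannot be "unlimited".
    if (pthread_attr_setstacksize(&attr, requested) == 0 &&
        pthread_create(&tid, &attr, report_stack_size, &observed) == 0) {
        pthread_join(tid, nullptr);
    }
    pthread_attr_destroy(&attr);
    return observed;
}
```

Calling run_with_stack_size(16u * 1024 * 1024) should report roughly 16 MB, while a request below PTHREAD_STACK_MIN (such as 32 bytes) is rejected by pthread_attr_setstacksize and the helper returns 0.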

Related

How do segmented stacks work

How do segmented stacks work? This question also applies to Boost.Coroutine, so I am using the C++ tag as well. The main doubt comes from this article. It looks like what they do is keep some space at the bottom of the stack and check whether it has been corrupted by registering some sort of signal handler with the memory allocated there (perhaps via mmap and mprotect?). When they detect that they have run out of space, they allocate more memory and continue from there. I have three questions about this:
Isn't this construct a user space thing? How do they control where the new stack is allocated and how do the instructions the program is compiled down to get aware of that?
A push instruction is basically just adding a value to the stack pointer and then storing the value in a register on the stack, then how can the push instruction be aware of where the new stack starts and correspondingly how can the pop know when it has to move the stack pointer back to the old stack?
They also say
After we've got a new stack segment, we restart the goroutine by retrying the function that caused us to run out of stack
what does this mean? Do they restart the entire goroutine? Won't this possibly cause non deterministic behavior?
How do they detect that the program has overrun the stack? If they keep a canary-ish memory area at the bottom then what happens when the user program creates an array big enough that overflows that? Will that not cause a stack overflow and is a potential security vulnerability?
If the implementations are different for Go and Boost, I would be happy to know how either of them deals with this situation 🙂
I'll give you a quick sketch of one possible implementation.
First, assume most stack frames are smaller than some size. For ones that are larger, we can use a longer instruction sequence at entry to make sure there is enough stack space. Let's assume we're on an architecture that has 4k pages and we're choosing 4k - 1 as the maximum stack frame size handled by the fast path.
The stack is allocated with a single guard page at the bottom. That is, a page that is not mapped for write. At function entry, the stack pointer is decremented by the stack frame size, which is less than the size of a page, and then the program arranges to write a value at the lowest address in the newly allocated stack frame. If the end of the stack has been reached, this write will cause a processor exception and ultimately be turned into some sort of upcall from the OS to the user program -- e.g. a signal in UNIX family OSes.
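The layout described above can be sketched in C++ as follows, assuming a POSIX system with mmap and mprotect. The names GuardedStack, allocate_guarded_stack and free_guarded_stack are hypothetical, and the sketch marks the guard page PROT_NONE (neither readable nor writable), which is one way of making it "not mapped for write". It only builds the stack block; the fault handling discussed next is not shown.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical descriptor for one stack block with a guard page at the bottom.
struct GuardedStack {
    void* base = nullptr;   // lowest address of the mapping (start of the guard page)
    std::size_t total = 0;  // mapped bytes, guard page included
    std::size_t page = 0;   // system page size
    char* usable_low() const { return static_cast<char*>(base) + page; }
    char* top() const { return static_cast<char*>(base) + total; }
};

// Maps usable_bytes (rounded up to whole pages) plus one guard page below them.
GuardedStack allocate_guarded_stack(std::size_t usable_bytes) {
    GuardedStack s;
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) return s;
    s.page = static_cast<std::size_t>(ps);
    std::size_t usable = (usable_bytes + s.page - 1) / s.page * s.page;
    std::size_t total = usable + s.page;
    void* p = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return s;
    // The lowest page becomes the guard: any access to it raises a fault.
    if (mprotect(p, s.page, PROT_NONE) != 0) {
        munmap(p, total);
        return s;
    }
    s.base = p;
    s.total = total;
    return s;
}

// Unmaps the whole block, guard page included.
void free_guarded_stack(GuardedStack& s) {
    if (s.base != nullptr) munmap(s.base, s.total);
    s.base = nullptr;
    s.total = 0;
}
```

A stack pointer for this block would start at top() and move down toward usable_low(); a write that lands below usable_low() hits the guard page and triggers the processor exception the answer relies on.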
The signal handler (or equivalent) has to be able to determine this is a stack extension fault from the address of the instruction that faulted and the address it was writing to. This is determinable as the instruction is in the prolog of a function and the address being written to is in the guard page of the stack for the current thread. The instruction being in the prolog can be recognized by requiring a very specific pattern of instructions at the start of functions, or possibly by maintaining metadata about functions. (Possibly using traceback tables.)
At this point the handler can allocate a new stack block, set the stack pointer to the top of the block, do something to handle unchaining the stack block, and then call the function that faulted again. This second call is safe because the fault is in the function prolog the compiler generated and no side effects are allowed before validating there is enough stack space. (The code may also need to fixup the return address for architectures that push it onto the stack automatically. If the return address is in a register, it just needs to be in the same register when the second call is made.)
Likely the easiest way to handle unchaining is to push a small stack frame onto the new extension block for a routine that when returned to unchains the new stack block and frees the allocated memory. It then returns the processor registers to the state they were in when the call was made that caused the stack to need to be extended.
The advantage of this design is that the function entry sequence is very few instructions and is very fast in the non-extending case. The disadvantage is that in the case where the stack does need to be extended, the processor incurs an exception, which may cost much much more than a function call.
Go doesn't actually use a guard page if I understand correctly. Rather the function prolog explicitly checks the stack limit and if the new stack frame won't fit it calls a function to extend the stack.
Go 1.3 changed its design to not use a linked list of stack blocks. This is to avoid the trap cost if the extension boundary is crossed in both directions many times in a certain calling pattern. They start with a small stack, and use a similar mechanism to detect the need for extension. But when a stack extension fault does occur, the entire stack is moved to a larger block. This removes the need for unchaining entirely.
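As a rough illustration of the "check in the prolog, then move the whole stack" idea, here is a toy model in C++. It is not Go's runtime code: ToyStack and its members are hypothetical names, the "stack" is just a byte buffer, and frames are addressed by their distance from the top of the stack so that they stay valid after a move (a real runtime has to adjust pointers into the stack instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model of a downward-growing stack that is copied to a larger block on overflow.
class ToyStack {
public:
    ToyStack(std::size_t initial_bytes, std::size_t reserve_bytes)
        : mem_(initial_bytes), sp_(initial_bytes), reserve_(reserve_bytes) {
        assert(initial_bytes > 0);
    }

    // "Prolog": explicit limit check; grow until the frame plus the fixed
    // reserve fits, then allocate the frame. Returns the frame's depth,
    // i.e. the distance from the top of the stack to its lowest byte.
    std::size_t enter(std::size_t frame_bytes) {
        while (sp_ < frame_bytes + reserve_) grow();
        sp_ -= frame_bytes;
        return mem_.size() - sp_;
    }

    // "Epilog": release the most recent frame.
    void leave(std::size_t frame_bytes) { sp_ += frame_bytes; }

    // Current address of the frame at the given depth (changes after a move).
    char* at(std::size_t depth) { return mem_.data() + (mem_.size() - depth); }

    std::size_t capacity() const { return mem_.size(); }
    std::size_t moves() const { return moves_; }

private:
    // Allocate a block twice as large and copy the used part to its top.
    void grow() {
        std::size_t used = mem_.size() - sp_;
        std::vector<char> bigger(mem_.size() * 2);
        if (used > 0) {
            std::memcpy(bigger.data() + bigger.size() - used, mem_.data() + sp_, used);
        }
        sp_ = bigger.size() - used;
        mem_.swap(bigger);
        ++moves_;
    }

    std::vector<char> mem_;
    std::size_t sp_;       // offset of the lowest used byte
    std::size_t reserve_;  // bytes that must remain free below every new frame
    std::size_t moves_ = 0;
};
```

For example, with ToyStack s(64, 16), a call to s.enter(32) fits in the initial block, while a following s.enter(100) fails the limit check, moves the stack to a larger block, and leaves the contents of the first frame intact at s.at(32). Because there is only ever one contiguous block, nothing has to be unchained on return.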
There are quite a few details glossed over here. (E.g. one may not be able to do the stack extension in the signal handler itself. Rather the handler can arrange to have the thread suspended and hand it off to a manager thread. One likely has to use a dedicated signal stack to handle the signal as well.)
Another common pattern with this sort of thing is the runtime requiring there to be a certain amount of valid stack space below the current stack frame for either something like a signal handler or for calling special routines in the runtime. Go works this way and the stack limit test guarantees a certain fixed amount of stack space is available below the current frame. One can e.g. call plain C functions on the stack so long as one guarantees they do not consume more than the fixed stack reserve amount. (One can use this to call C library routines in theory, though most of these have no formal specification of how much stack they might use.)
Dynamic allocation in the stack frame, such as alloca or stack-allocated variable-length arrays, adds some complexity to the implementation. If the routine can compute the entire final size of the frame in the prolog, then it is fairly straightforward. Any increase in the frame size while the routine is running likely has to be modeled as a new call, though with Go's new architecture that allows moving the stack, it is possible that the alloca point in the routine can be arranged so that all the state allows a stack move to happen there.

Stack allocation for C++ green threads

I'm doing some research in C++ green threads, mostly boost::coroutine2 and similar POSIX functions like makecontext()/swapcontext(), and planning to implement a C++ green thread library on top of boost::coroutine2. Both require the user code to allocate a stack for every new function/coroutine.
My target platform is x64/Linux. I want my green thread library to be suitable for general use, so the stacks should expand as required (a reasonable upper limit is fine, e.g. 10MB), it would be great if the stacks could shrink when too much memory is unused (not required). I haven't figured out an appropriate algorithm to allocate stacks.
After some googling, I figured out a few options myself:
use the split stack implemented by the compiler (gcc -fsplit-stack), but split stacks have performance overhead. Go has already moved away from split stacks for performance reasons.
allocate a large chunk of memory with mmap() and hope the kernel is smart enough to leave the physical memory unallocated and allocate it only when the stacks are accessed. In this case, we are at the mercy of the kernel.
reserve a large memory space with mmap(PROT_NONE) and set up a SIGSEGV signal handler. In the signal handler, when the SIGSEGV is caused by a stack access (the accessed memory is inside the reserved memory space), allocate the needed memory with mmap(PROT_READ | PROT_WRITE). Here is the problem with this approach: mmap() isn't async-signal-safe, so it cannot be called inside a signal handler. It can still be implemented, though it is very tricky: create another thread during program startup for memory allocation, and use pipe() + read()/write() to send memory allocation information from the signal handler to that thread.
A few more questions about option 3:
I'm not sure about the performance overhead of this approach. How well or badly does the kernel/CPU perform when the memory space is extremely fragmented due to thousands of mmap() calls?
Is this approach correct if the unallocated memory is accessed in kernel space, e.g. when read() is called?
Are there any other (better) options for stack allocation for green threads? How are green thread stacks allocated in other implementations, e.g. Go/Java?
On Linux, one way to get a stack that grows on demand is to mmap a region with the following mmap flag, which is designed just for this purpose:
MAP_GROWSDOWN
Used for stacks. Indicates to the kernel virtual memory system
that the mapping should extend downward in memory.
For compatibility, you should probably use MAP_STACK too. Then you don't have to write the SIGSEGV handler yourself, and the stack grows automatically. The bounds can be set as described here What does "ulimit -s unlimited" do?
If you want a bounded stack size, which is normally what people do for signal handlers if they want to call sigaltstack(2), just issue an ordinary mmap call.
The Linux kernel maps the physical pages that back virtual pages lazily, on the page fault raised when a page is first accessed (perhaps not in real-time kernels, but certainly in all other configurations). You can use the /proc/<pid>/pagemap interface (or this tool I wrote https://github.com/dwks/pagemap) to verify this if you are interested.
Why mmap? When you make a large allocation with new (or malloc), the memory is untouched and typically not yet backed by physical pages.
const std::size_t STACK_SIZE = 10 * 1024 * 1024;  // 10 MB per thread; size_t avoids int overflow
char* p = new char[STACK_SIZE * numThreads];
p now has enough memory for the threads you want. When thread i needs its stack, start accessing p + STACK_SIZE * i.

Reducing the heap size of a C++ program after large calculation

Consider an MPI application based on two steps which we shall call load and globalReduce. Just for simplicity the software is being described as such yet there is a lot more going on, so it is not purely a Map/Reduce problem.
During the load step, all ranks in each given node are queued so that one and only one rank has full access to all memory of the node. The reason for this design arises from the fact that during the load stage, there is a set of large IO blocks being read, and they all need to be loaded in memory before a local reduction can take place. We shall call the result of this local reduction a named variable myRankVector. Once the myRankVector variable is obtained, the IO blocks are released. The variable myRankVector itself uses little memory, so while during its creation the node can be using all the memory, after completion the rank only needs to use 2-3 GB to hold myRankVector.
During the globalReduce stage in the node, it is expected that all ranks in the node have loaded their corresponding globalReduce.
So here is my problem: while I have ensured that there are absolutely no memory leaks (I program using shared pointers, I double-checked with Valgrind, etc.), I am positive that the heap remains expanded even after all the destructors have released the IO blocks. When the next rank in the queue comes to do its job, it starts asking for lots of memory just as the previous rank did, and of course the program gets killed by Linux with "Out of memory: Kill process xxx (xxxxxxxx) score xxxx or sacrifice child". It is clear why this is the case: the second rank in the queue wants to use all the memory, yet the first rank still holds a large heap.
So, after setting the context of this question: is there a way to manually reduce the heap size in C++ to truly release memory not being used?
Thanks.
Heaps are implemented using mmap on Linux, so you would need to use your own heap, which you can dispose of and munmap completely.
The munmap call would return that space to the operating system.
Look at the code in Boost.Pool for an implementation that would allow you to manage the underlying heaps independently.
In my experience, it is very difficult to manage std containers with custom allocators, as the allocator is tied to the container's class (type) rather than to an individual instance.
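A minimal sketch of such a private heap in C++, assuming Linux/POSIX mmap and munmap. MappedArena is a hypothetical name, and the bump allocation inside it is only a stand-in for a real pool such as the one in Boost.Pool: individual blocks are never freed, the whole region is returned to the kernel at once.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Hypothetical private heap: one anonymous mapping, handed out by bumping a
// pointer, and given back to the operating system in one munmap call.
class MappedArena {
public:
    explicit MappedArena(std::size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED) {
            base_ = static_cast<char*>(p);
            size_ = bytes;
        }
    }
    ~MappedArena() { release(); }
    MappedArena(const MappedArena&) = delete;
    MappedArena& operator=(const MappedArena&) = delete;

    bool valid() const { return base_ != nullptr; }

    // Returns n bytes aligned to 16, or nullptr if the arena is full or released.
    void* allocate(std::size_t n) {
        std::size_t aligned = (used_ + 15) & ~std::size_t{15};
        if (base_ == nullptr || aligned > size_ || n > size_ - aligned) return nullptr;
        used_ = aligned + n;
        return base_ + aligned;
    }

    // Unmaps the whole region; every pointer handed out becomes invalid.
    bool release() {
        if (base_ == nullptr) return true;
        bool ok = munmap(base_, size_) == 0;
        base_ = nullptr;
        size_ = 0;
        used_ = 0;
        return ok;
    }

private:
    char* base_ = nullptr;
    std::size_t size_ = 0;
    std::size_t used_ = 0;
};
```

In the scenario from the question, the large IO blocks of the load step would be placed in such an arena, and release() would be called once myRankVector has been computed, before the next rank in the queue starts loading.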
So, after setting the context of this question: is there a way to manually reduce the heap size in C++ to truly release memory not being used?
That's operating system dependent, but most probably not possible.
Most operating systems leave you with memory allocations you've done from a single process until that process is completely done and killed.
Could shared memory solve your problem (even if you do not want to share this memory)?
You can allocate a block of shared memory in your "load" phase and detach it after "myRankVector" is calculated.
(see shmget, shmat, shmdt, shmctl( ..., IPC_RMID, . ) )

What does stack size in a thread define in C++?

I'm using C++ and Windows.h in my source code. I read the CreateThread API in MSDN, but I still don't understand the essence of specifying stack size. By default it is 1 MB. But what will happen if I specify 32 bytes?
What does stack size in a thread define?
Please provide a thorough explanation and I'll appreciate it. Thanks.
The stack is used to store local variables, pass parameters in function calls, and store return addresses. A thread's stack has a fixed size which is determined when the thread is created. That is the value that you are referring to.
The stack size is determined when the thread is created since it needs to occupy contiguous address space. That means that the entire address space for the thread's stack has to be reserved at the point of creating the thread.
If the stack is too small then it can overflow. That's an error condition known as stack overflow, from which this website took its name. When you call a function some or all of the following happens:
Parameters are pushed onto the stack.
The return address is pushed onto the stack.
A stack frame containing space for the function's local variables is created.
All of this consumes space from the stack. When the function in turn calls another function, more stack space is consumed. As the call stack goes deeper, more stack space is required.
The consequence therefore of setting the stack size too low is that you can exhaust the stack and overflow it. That is a terminal condition from which you cannot recover. Certainly 32 bytes (rounded up to one page which is 4096 bytes) is too small for almost all threads.
If you have a program with a lot of threads, and you know that the threads don't need to reserve 1 MB of stack each, then there can be benefits to using a smaller stack size. Doing so can avoid exhausting the available process address space.
On the other hand you might have a program with a single thread that has deep call stacks that consume large amounts of stack space. In this scenario you might reserve more than the default 1MB.
However, unless you have a strong reason to do otherwise, it is likely best to stick to the default stack size.
Stack size is just a trade-off between the ability to create many threads and the possibility of a stack overflow in one of them.
The larger the stack size, the fewer threads you can create and the lower the chance of a stack overflow. You should worry about stack size only if you are going to create many threads (you will have to lower the stack size, but keep stack overflow in mind). Otherwise the default value suffices.
But what will happen if I specify 32 bytes?
I have not read the Windows documentation, but if Windows allows this (specifying only 32 bytes), you will most likely get a stack overflow. According to their documentation the value is rounded up to the page size in any case, so in reality your stack size will be at least the size of a page. The created thread assumes that there is enough stack space for it to use (for allocating automatic variables, storing function addresses, etc.) and allocates space according to its needs. When there is not enough stack space, the stack allocator might use invalid memory, overwriting memory used elsewhere.
What does stack size in a thread define?
It defines how much memory will be allocated for use by that thread's stack.
There is a good description of what exactly a thread call stack is here

Concurrent writes in the same global memory location

I have several blocks, each having some integers in a shared memory array of size 512. How can I check if the array in every block contains a zero as an element?
What I am doing is creating an array that resides in the global memory. The size of this array depends on the number of blocks, and it is initialized to 0. Hence every block writes to a[blockid] = 1 if the shared memory array contains a zero.
My problem is when I have several threads in a single block writing at the same time. That is, if the array in the shared memory contains more than one zero, then several threads will write a[blockid] = 1. Would this generate any problem?
In other words, would it be a problem if 2 threads write the exact same value to the exact same array element in global memory?
For a CUDA program, if multiple threads in a warp write to the same location then the location will be updated but it is undefined how many times the location is updated (i.e. how many actual writes occur in series) and it is undefined which thread will write last (i.e. which thread will win the race).
For devices of compute capability 2.x, if multiple threads in a warp write to the same address then only one thread will actually perform the write, which thread is undefined.
From the CUDA C Programming Guide section F.4.2:
If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.
See also section 4.1 of the guide for more info.
In other words, if all threads writing to a given location write the same value, then it is safe.
In the CUDA execution model, there are no guarantees that every simultaneous write from threads in the same block to the same global memory location will succeed. At least one write will work, but it isn't guaranteed by the programming model how many write transactions will occur, or in what order they will occur if more than one transaction is executed.
If this is a problem, then a better approach (from a correctness point of view) would be to have only one thread from each block do the global write. You can use either a shared memory flag set atomically or a reduction operation to determine whether the value should be set. Which you choose might depend on how many zeros there are likely to be. The more zeros there are, the more attractive the reduction will be. CUDA includes warp-level __any() and __all() operators which can be built into a very efficient boolean reduction in a few lines of code.
Yes, it will be a problem, known as a race condition.
You should consider synchronizing access to the global data through process semaphores.
While not a mutex or semaphore, CUDA does contain a synchronization primitive you can utilize for serializing access to a given code segment or memory location. Through the __syncthreads() function, you can create a barrier so that any given thread blocks at the point of the call until all the threads in a given block have executed the __syncthreads() command. That way you can hopefully serialize access to your memory location and avoid a situation where two threads need to write to the same memory location at the same time. The only warning is that all the threads have to execute __syncthreads() at some point, or else you will end up in a deadlock. So don't place the call inside a conditional if-statement where some threads may never execute the command. If you do approach your problem like this, some provision will need to be made for the threads that don't initially call __syncthreads() to call the function later in order to avoid deadlock.