Stack overflow despite dynamically allocated arrays in Modern Fortran - fortran

I am currently running into stack overflow issues while trying to use large high-dimensional arrays in an Abaqus user material (UMAT) subroutine written in Modern Fortran. To give an idea of the magnitude: there are about fifteen 4-D and 5-D double-precision arrays and derived types with sizes like (100,12,52,10) and (6,7,100,100,6), among many other double-precision and integer variables local to the sub-program units (smaller subroutines contained in modules). I am testing it on a single-element model, resulting in 32 calls to the UMAT subroutine, which is clearly not that intensive. Below is the error message from the Abaqus/Standard solver.
*** Error: Runtime stack limit has been exceeded.
This may be caused by user subroutines with large data structures allocated on
the stack or recursion. For suggestions on how to resolve this problem, please
refer to the chapter "Ensuring thread safety" in the ABAQUS documentation.
*** ERROR CATEGORY: ELEMENT LOOP
I learnt that automatic arrays could be the issue (Stack overflow in Fortran 90, Anything helpful about Fortran stack overflow?) and looked at memory-management solutions that use heap instead of stack memory. So far I have tried dynamic allocation and deallocation using Fortran allocatable arrays and Abaqus-specific thread-safe allocatable arrays (https://abaqus-docs.mit.edu/2017/English/SIMACAESUBRefMap/simasub-c-localarrays.htm). The solver executes successfully with smaller arrays (both dynamic and static), e.g. (20,12,52,3) and (6,7,20,20,6), but none of these approaches solved the issue with the large arrays.
I don't have much background in memory and process management and have tried all the options I could find on the internet. I am unable to provide the code blocks as the code base is large; I hope the information provided suffices to give a reasonable picture.
What would be the stack size for the above array sizes? Could it be that the other double-precision variables, or the intermediate calculations performed with the arrays, are still enough to exceed the stack limit within 32 calls to the subroutine? What else could be the reason for the stack memory bottleneck? Any suggestions to resolve this would be helpful.
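For a rough sense of scale (simple arithmetic only): a (100,12,52,10) double-precision array occupies 100 × 12 × 52 × 10 × 8 bytes ≈ 5 MB, and a (6,7,100,100,6) array about 20 MB, so even a few such arrays held as automatic (stack) variables would exceed a typical default stack of a few MB. For reference, here is a minimal sketch of the allocatable-array pattern described above; the subroutine and array names are hypothetical and not taken from the actual UMAT code:

! Minimal sketch (hypothetical names): replace an automatic array, which most
! compilers place on the stack, with an allocatable array, which lives on the heap.
subroutine compute_state(npt)
    implicit none
    integer, intent(in) :: npt
    ! automatic array: sized on entry and typically placed on the stack
    ! real(8) :: state(100, 12, 52, npt)
    ! allocatable array: explicitly allocated on the heap
    real(8), allocatable :: state(:,:,:,:)
    integer :: istat
    allocate(state(100, 12, 52, npt), stat=istat)
    if (istat /= 0) stop 'allocation failed'
    state = 0.0d0
    ! ... use state in the constitutive update ...
    deallocate(state)
end subroutine compute_state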

Related

Using whole stack memory

Hello, I heard that in C++ stack memory is used for "normal" variables. How do I make the stack full? I tried to use a ton of arrays but it didn't help. How big is the stack and where is it located?
The C++ language doesn't specify such a thing as a "stack". It is an implementation detail, and as such it doesn't make sense to deliberate about it unless we are discussing a particular implementation of C++.
But yes, in a typical C++ implementation, automatic variables are stored on the execution stack.
How do I make the stack full?
Step 1: Use a language implementation that has limited stack size. This is quite common.
Step 2: Create an automatic variable that exceeds the limit. Or nest too many non-tail-recursive function calls. If you're lucky, the program may crash.
You wouldn't want stack to be exhausted in production use.
How big is the stack
Depends on language implementation. It may even be configurable. The default is one to a few megabytes on common desktop/server systems. Less on embedded systems.
and where is it located?
Somewhere in memory where the language implementation has chosen.
The most important thing to take out of this is that the memory available for automatic variables is typically limited. As such:
Don't use large automatic variables.
Don't use recursion when asymptotic growth of depth is linear or worse.
Don't let user input affect the amount or size of automatic variables or depth of recursion without constraint.
Hello, I heard that in C++ stack memory is used for "normal" variables.
Local (automatic) variables declared in a function or in main are allocated mostly on the stack (or in registers) and are deallocated when execution of the function is done.
How do I make the stack full? I tried to use a ton of arrays but it didn't help.
Using a ton of arrays, making many recursive calls, and passing by value large structs that contain a ton of arrays are all ways. Another way might be to reduce the stack size: -Wl,--stack,number (for gcc).
How big is the stack and where is it located?
It depends on the platform, operating system, and so on. The standard does not specify any stack size. Its location is determined by the OS before the program starts; the OS allocates memory for the stack from virtual memory.

Heap vs stack allocation in a large project

I have read numerous other answers on this topic but they don't quite answer what I'm looking for. Examples:
Class members and explicit stack/heap allocation
When should a class be allocated on the stack instead of the heap
Member function memory allocation stack or heap?
C++ stack vs heap allocation
These answers cover the mechanical differences between the two (automatic vs. manual memory management, variable lifetimes, etc.) but I am more interested in best practices and how to write code that can scale.
Context
I am writing a class which processes a large stream of data, say 10s to 100s of GB. Let's assume that the performance bottleneck is how fast my class can process the data, e.g. the source and destination of the data are both fast.
My class works by splitting the data into chunks of size N bytes and processing each chunk. The optimal size N for maximal throughput depends on the processing performed and is only known at runtime. N can range from tens of bytes up to thousands of bytes. If I did everything on the stack, then for, say, N = 256, the total size of the member variables in the class is < 1 MB.
I also tried stack-allocating arrays of several different sizes for a small set of different Ns, using only one at any given time. This is so far the fastest implementation. Nevertheless, comparing implementations that use all stack vs. all heap, the performance difference from using the heap is fairly small, so the heap ends up being simpler.
Questions
If I make the choice to use stack vs. heap now, how does that affect future users of my class? For example, in theory one could write a program that has hundreds of instances of this class. If I used all stack, and the user put all of my instances on the stack, it would blow up.
How is stack usage factored into the design of a large hierarchical system? I don't see it mentioned in what I read online or in books. Mostly the stack is mentioned in the context of excessive recursion, trying to declare a 100 MB array outright, etc.
Generally, is the author of a class (whose underlying workings are abstracted away) supposed to give the end user some information about the stack footprint? Or some direction on when/whether heap allocation is required?

Fortran, Open MP, indirect recursion, and limited stack memory

There are many responses on other posts related to the issue of stack space, OpenMP, and how to deal with it. However, I could not find information to truly understand why OpenMP adjusts the compiler options:
What is the reasoning behind why -fopenmp in gfortran implies -frecursive?
The documentation says:
Allow indirect recursion by forcing all local arrays to be allocated on the stack
However, I don't have the context to understand this. Why would parallelization require indirect recursion?
Why would parallelization want all local arrays to be on the stack?
I wish to understand so I know the consequences of overriding these options, say, with -fmax-stack-var-size=n, to avoid issues with stack overflows.
Without -frecursive, the compiler will put local variables exceeding the limit -fmax-stack-var-size= in static memory instead of on the stack. That is, they will behave as if they had the SAVE attribute, and they are shared among all the threads. These semantics are nonsensical for a multi-threaded program, hence -fopenmp implies -frecursive.
Due to the increasing prevalence of multi-threaded programs, and because F2018 specifies that procedures are recursive by default, this behavior will change in a future release of GFortran, most likely by switching to heap allocation when exceeding the size limit for stack variables instead of using static memory. But for now, this is not an option.
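As a minimal illustration of the behaviour described above (assuming gfortran; the subroutine name, array size, and flag value are arbitrary examples):

! With gfortran -fopenmp (which implies -frecursive), 'work' is an automatic
! variable on the thread-local stack. Without -frecursive, and with e.g.
! -fmax-stack-var-size=65536, 'work' exceeds the limit and is placed in static
! memory as if it had the SAVE attribute, i.e. shared by all threads.
subroutine big_local()
    implicit none
    real(8) :: work(1000, 1000)   ! ~8 MB local array
    work = 1.0d0
    print *, sum(work)
end subroutine big_local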

UNIX: What should be Stack Size (ulimit -s) in UNIX? [closed]

How can I calculate the minimum stack size required for my program in UNIX, so that my program never crashes?
Suppose my program is
int main()
{
    int number;
    number++;
    return 0;
}
1) What stack size is required to run this program? How is it calculated?
2) My Unix system gives ulimit -s 512000. Is this value of 512 MB really required for my small program?
3) And what if I have a big program with multiple threads, some 500 functions, some libraries, macros, dynamically allocated memory, etc.? How much stack size is required for that?
Your program in itself uses a few bytes - one int - but there is of course the part of the runtime that runs BEFORE main to take into account as well. It's unlikely to be more than a few dozen bytes, maybe a couple of hundred bytes at a stretch. Since the minimum stack size in any modern OS is "one page" = 4 KB, this should easily fit in that.
512000 (ulimit -s reports kilobytes) = 512 MB, but that seems quite high. On my Linux Fedora 16 x86-64 machine, it is 8192 (8 MB).
Threads don't really matter, as each thread has its own stack. The number of functions is in itself not a huge contributor to stack usage. Running out of stack is nearly always caused by large local variables and/or deep recursion. For any program that is more than a little bit complex, calculating precise stack usage can be quite tricky. Typically, it involves running the program a lot and seeing if the stack "explodes". If it doesn't, you have enough stack. Library functions, generally speaking, tend not to use huge amounts of stack, but there are always exceptions.
To exemplify:
void func()
{
    int x, y, z;
    float w;
    ...
}
This function takes up approximately 16 bytes of stack, plus the general overhead of calling a function, typically 1-3 "machine words" (4-12 bytes on a 32-bit machine, 8-24 bytes for a 64-bit machine).
void func2()
{
    int x[10000];
    ...
}
This function will take 40000 bytes of stack-space. Obviously, you don't need many recursive calls to this function to run out of stack.
There is no magic way to tell how much space your program will require on the stack. It depends on what the code is actually doing. Infinite (or very deep) recursion would result in a stack overflow even if the program doesn't seem to do anything.
As an example, see the following:
$ ulimit
unlimited
$ echo "foo(){foo();} main(){foo();}" | gcc -x c -
$ ./a.out
Segmentation fault (core dumped)
Most people rely on the stack being “large” and their programs not using all of it, simply because the size has been set so large that programs rarely fail because they run out of stack space unless they use very large arrays with automatic storage duration.
This is an engineering failure, in the sense that it is not engineering: A known and largely preventable source of complete failure is uncontrolled.
In general, it can be difficult to compute the actual stack needs of a program. Especially when there is recursion, a compiler cannot generally predict how many times a routine will be called recursively, so it cannot know how many times that routine will need stack space. Another complication is calls to addresses prepared at run-time, such as calls to virtual functions or through other pointers-to-functions.
However, compilers and linkers could provide some assistance. For any routine that uses a fixed amount of stack space, a compiler, in theory, could provide that information. A routine may include blocks that are or are not executed, and each block might have different stack space requirements. This would interfere with a compiler providing a fixed number for the routine, but a compiler might provide information about each block individually and/or a maximum for the routine.
Linkers could, in theory, examine the call tree and, if it is static and not recursive, provide a maximum stack use for the linked program. They could also provide the stack use along a particular call subchain (e.g., from one routine through the chain of calls that leads to the same routine being called recursively), so that a human could then apply knowledge of the algorithm to multiply the stack use of the subchain by the maximum number of times it might be called recursively.
I have not seen compilers or linkers with these features. This suggests there is little economic incentive for developing these features.
There are times when stack use information is important. Operating system kernels may have a stack that is much more limited than user processes, so the maximum stack use of the kernel code ought (as a good engineering practice) to be calculated so that the stack size can be set appropriately (or the code redesigned to use less stack).
If you have a critical need for calculating stack space requirements, you can examine the assembly code generated by the compiler. In many routines on many computing platforms, a fixed number is subtracted from the stack pointer at the beginning of the routine. In the absence of additional subtractions or “push” instructions, this is the stack use of the routine, excluding further stack used by subroutines it calls. However, routines may contain blocks of code that contain additional stack allocations, so you must be careful about examining the generated assembly code to ensure you have found all stack adjustments.
Routines may also contain stack allocations computed at run-time. In a situation where calculating stack space is critical, you might avoid writing code that causes such allocations (e.g., avoid using C’s variable-length array feature).
Once you have determined the stack use of each routine, you can determine the total stack use of the program by adding the stack use of each routine along various routine-call paths (including the stack use of the start routine that runs before main is called).
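Expressed as a formula (simply a restatement of the paragraph above, where $s_r$ is the fixed stack use of routine $r$ and the maximum is taken over all call paths $p$ from the start routine):

$$ S_{\text{total}} = \max_{p \,\in\, \text{call paths}} \; \sum_{r \,\in\, p} s_r $$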
This sort of calculation of the stack use of a complete program is generally difficult and is rarely performed.
You can generally estimate the stack use of a program by knowing how much data it “needs” to do its work. Each routine generally needs stack space for the objects it uses with automatic storage duration plus some overhead for saving processor registers, passing parameters to subroutines, some scratch work, and so on. Many things can alter stack use, so only an estimate can be obtained this way. For example, your sample program does not need any space for number. Since no result of declaring or using number is ever printed, the optimizer in your compiler can eliminate it. Your program only needs stack space for the start routine; the main routine does not need to do anything except return zero.

How can I determine appropriate stack and heap sizes for ARM Cortex, using C++

The Cortex-M3 processor startup file allows you to specify the amount of RAM dedicated to the stack and the heap. For a C++ code base, is there a general rule of thumb, or perhaps some more explicit way, to determine the values for the stack and heap sizes? For example, would you count the number and size of unique objects, or maybe use the compiled code size?
The Cortex-M3 processor startup file allows you to specify the amount of RAM dedicated to the stack and the heap.
That is not a feature of the Cortex-M3, but rather the start-up code provided by your development toolchain. It is the way the Keil ARM-MDK default start-up files for M3 work. It is slightly unusual; more commonly you would specify a stack size, and any remaining memory after stack and static memory allocation by the linker becomes the heap; this is arguably better since you do not end up with a pool of unusable memory. You could modify that and use an alternative scheme, but you'd need to know what you are doing.
If you are using Keil ARM-MDK, the linker options --info=stack and --callgraph add information to the map file that aids stack requirement analysis. These and other techniques are described here.
If you are using an RTOS or multi-tasking kernel, each task will have its own stack. The OS may provide stack analysis tools, Keil's RTX kernel viewer shows current stack usage but not peak stack usage (so is mostly useless, and it only works correctly for tasks with default stack lengths).
If you have to implement stack-checking tools yourself, the normal method is to fill the stack with a known value and, starting from the high address, inspect the values until you find the first one that is not the fill byte; this will give the likely high-tide mark of the stack. You can implement code to do this, or you can manually fill the memory from the debugger and then monitor stack usage in a debugger memory window.
Heap requirements will depend on the run-time behaviour of your code; you'll have to analyse that yourself. However, in ARM/Keil RealView the MemManage exception handler will be called when C++'s new throws an exception; I am not sure whether malloc() does that or simply returns NULL. You can place a breakpoint in the exception handler, or modify the handler to emit an error message, to detect heap exhaustion during testing. There is also a __heapstats() function that can be used to output heap information. It has a somewhat cumbersome interface; I wrapped it thus:
#include <cstdio>   // for std::fprintf and stdout

void heapinfo()
{
    // __heapstats() is provided by the ARM C library; it takes a
    // printf-like output function and a stream to write to.
    typedef int (*__heapprt)(void *, char const *, ...);
    __heapstats( (__heapprt)std::fprintf, stdout );
}
The compiled code size will not help as the code does not run in the stack nor the heap. Cortex-M3 devices are typically implemented on microcontrollers with built in Flash and a relatively small amount of RAM. In this configuration, the code will typically run from Flash.
The heap is used for dynamic memory allocation. Counting the number of unique objects will give you a rough estimate but you also have to account for any other elements that use dynamic memory allocation (using the new keyword in C++). Generally, dynamic memory allocation is avoided in embedded systems for the precise reason that heap size is hard to manage.
The stack will be used for parameter passing, local variables, and context saving during exception handling routines. It is generally hard to get a good idea of stack usage unless your code allocates a large block of local memory or large objects. One technique that may help is to allocate all of the available RAM you have for the stack. Fill the stack with a known pattern (0x00 or 0xff are not the best choices since these values occur frequently), run the system for a while, then examine the stack to see how much was used. Admittedly, this is not a very precise nor scientific approach, but it is still helpful in many cases.
The latest version of the IAR Compiler has a feature that will determine what stack size you need, based on a static analysis of your code (assuming you don't have any recursion).
The general approach, if you don't have an exact number, is to make the stack as big as you can, and then when you start running out of memory, start trimming it down until your program crashes due to a stack overflow. I wish that were a joke, but that is the way it is usually done.
Reducing until it crashes is a quick ad-hoc way. You can also fill the stack with a known value, say, 0xCCCC, and then monitor maximum stack usage by scanning for the 0xCCCC.
It's imperfect, but much better than looking for a crash.
The rationale being that reducing the stack size does not guarantee that a stack overflow will munch something "visible".