Fortran, OpenMP, indirect recursion, and limited stack memory - fortran

There are many responses on other posts related to the issue of stack space, OpenMP, and how to deal with it. However, I could not find information to truly understand why OpenMP adjusts the compiler options:
What is the reasoning behind why -fopenmp in gfortran implies -frecursive?
The documentation says:
Allow indirect recursion by forcing all local arrays to be allocated on the stack
However, I don't have the context to understand this. Why would parallelization require indirect recursion?
Why would parallelization want all local arrays to be on the stack?
I wish to understand so I know the consequences of overriding these options, say, with -fmax-stack-var-size=n, to avoid issues with stack overflows.

Without -frecursive, the compiler places local variables larger than the -fmax-stack-var-size= limit in static memory instead of on the stack. They then behave as if they had the SAVE attribute and are shared among all threads. These semantics are nonsensical for a multi-threaded program, hence -fopenmp implies -frecursive.
Due to the increasing prevalence of multi-threaded programs, and because F2018 specifies that procedures are recursive by default, this behavior will change in a future release of GFortran, most likely by switching to heap allocation when exceeding the size limit for stack variables instead of using static memory. But for now, this is not an option.
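To make the "shared among all the threads" point concrete, here is a minimal C++ analogy (not gfortran's actual code generation). A function-local static array plays the role of a SAVE'd Fortran array, and an ordinary local array plays the role of a stack-allocated one. All function names are made up for this illustration.

```cpp
#include <cstdint>
#include <thread>

// Address of a function-local static array. Like a SAVE'd Fortran array,
// there is exactly one copy of it for the whole program.
std::uintptr_t static_array_address() {
    static int scratch[1024];
    return reinterpret_cast<std::uintptr_t>(&scratch[0]);
}

// True if a second thread sees the very same static array as the caller.
bool static_array_is_shared_between_threads() {
    std::uintptr_t seen_by_worker = 0;
    std::thread worker([&seen_by_worker] { seen_by_worker = static_array_address(); });
    worker.join();
    return seen_by_worker == static_array_address();
}

// True if a second thread gets its own automatic (stack) array.
bool automatic_arrays_are_private_to_each_thread() {
    int mine[1024];  // lives on the calling thread's stack
    mine[0] = 0;
    std::uintptr_t seen_by_worker = 0;
    std::thread worker([&seen_by_worker] {
        int theirs[1024];  // lives on the worker thread's own stack
        theirs[0] = 0;
        seen_by_worker = reinterpret_cast<std::uintptr_t>(&theirs[0]);
    });
    worker.join();  // 'mine' is still alive here, so the two arrays must differ
    return seen_by_worker != reinterpret_cast<std::uintptr_t>(&mine[0]);
}
```

Both functions return true: every thread works on the same static array, which is a data race waiting to happen, while each thread has its own copy of an automatic array. That is the trade-off behind overriding the defaults with -fmax-stack-var-size=n in an OpenMP program.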

Related

Using whole stack memory

Hello, I heard that in C++ stack memory is being used for "normal" variables. How do I make stack full? I tried to use ton of arrays but it didn't help. How big is stack and where is it located?
The C++ language doesn't specify any such thing as a "stack". It is an implementation detail, so it doesn't make sense to discuss it except with respect to a particular implementation of C++.
But yes, in a typical C++ implementation, automatic variables are stored on the execution stack.
How do I make stack full?
Step 1: Use a language implementation that has limited stack size. This is quite common.
Step 2: Create an automatic variable that exceeds the limit. Or nest too many non-tail-recursive function calls. If you're lucky, the program may crash.
You wouldn't want stack to be exhausted in production use.
How big is stack
Depends on language implementation. It may even be configurable. The default is one to a few megabytes on common desktop/server systems. Less on embedded systems.
and where is it located?
Somewhere in memory where the language implementation has chosen.
The most important thing to take out of this is that the memory available for automatic variables is typically limited. As such:
Don't use large automatic variables.
Don't use recursion when asymptotic growth of depth is linear or worse.
Don't let user input affect the amount or size of automatic variables or depth of recursion without constraint.
Hello, I heard that in C++ stack memory is being used for "normal" variables.
Local (automatic) variables declared in a function, including main, are usually allocated on the stack (or kept in registers) and are released when the function returns.
How do I make stack full? I tried to use ton of arrays but it didn't help.
Allocating a ton of arrays, making many recursive calls, and passing large structs (for example, ones containing large arrays) by value are all ways to do it. Another way is to reduce the stack size, for example with -Wl,--stack,number (for GCC on targets whose linker supports that option, such as MinGW).
How big is stack and where is it located?
It depends on the platform, the operating system, and so on. The standard does not specify any stack size. The stack's location is determined by the OS before the program starts; the OS allocates memory for the stack from virtual memory.
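Since both answers say the stack size depends on the platform and may be configurable, one way to see the actual number on a POSIX system is to ask the OS. The sketch below assumes a POSIX platform (Linux, macOS, BSD) and uses getrlimit with RLIMIT_STACK; the StackLimit struct and the function name are made up for this example. It reports the soft limit for the main thread's stack, not the stacks of threads you create yourself.

```cpp
#include <sys/resource.h>  // POSIX only: getrlimit, RLIMIT_STACK

// Result of the query; 'bytes' is only meaningful if ok && !unlimited.
struct StackLimit {
    bool ok;
    bool unlimited;
    unsigned long long bytes;
};

// Asks the OS for the soft stack-size limit of the current process.
StackLimit query_stack_soft_limit() {
    struct rlimit limit = {};
    if (getrlimit(RLIMIT_STACK, &limit) != 0) {
        return StackLimit{false, false, 0};  // the query itself failed
    }
    if (limit.rlim_cur == RLIM_INFINITY) {
        return StackLimit{true, true, 0};    // "ulimit -s unlimited"
    }
    return StackLimit{true, false, static_cast<unsigned long long>(limit.rlim_cur)};
}
```

On other platforms the limit is set elsewhere (for example by a linker option or in the executable header), so treat this as one implementation-specific answer rather than a portable one.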

Allocating on heap vs allocating on stack in recursive functions

When I'm defining a recursive function, is it better/safer to allocate local variables on the heap, and then clean them up before the function returns, than to allocate them on the stack? The stack size on embedded systems is very limited, and there is a danger of stack overflow when the recursion runs too deep.
The answer depends on your application domain and the specifics of your platform. "Embedded system" is a fuzzy term. A big oscilloscope running MS Windows or Linux is on one end of the spectrum. Most of it can be programmed like a normal PC. They should not fail, but if they do, one simply reboots them.
On the other end of the spectrum are controllers in safety-critical fields. Consider these Siemens switches which must react under all circumstances within milliseconds. Failure is not an option. They probably do not run Windows. Available resources are limited. Programming rules and procedures are much different here.
Now let's examine the options you have, recursion (or not!) with dynamic or automatic memory allocation.
Performance-wise the stack is much faster, so it is preferred. Dynamic allocation also involves some memory overhead for bookkeeping, which is significant if the data units are small. And there can be issues with memory fragmentation, which cannot occur with automatic memory (although the scenarios which lead to fragmentation -- different lifetimes of objects -- are probably not directly solvable without dynamic allocation).
But it is true that on some systems the stack size is (much) smaller than the heap size; you must read about the memory layout on your system. The oscilloscope will have lots of memory and big stacks; the power switch will not.
If you are concerned about running out of memory I would follow Christian's advice to avoid recursion altogether and instead use a loop. Iteration may just keep the memory usage flat. Besides, recursion always uses stack space, e.g. for the return address and the return value. The idea of allocating "local" variables dynamically would only make sense for larger data structures, because you would still have to hold pointers to the data as automatic variables, which use up space on the stack too (and would increase the overall memory footprint).
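As a small illustration of "use a loop instead", here is the same computation written both ways. The function names are invented for this sketch, and the example is deliberately trivial. Note that an optimising compiler may turn simple recursion like this into a loop on its own, but that is not something to rely on where the stack is tight.

```cpp
#include <cstdint>

// Recursive form: one stack frame per step, so stack use grows linearly with n.
std::uint64_t sum_to_recursive(std::uint32_t n) {
    return n == 0 ? 0 : n + sum_to_recursive(n - 1);
}

// Iterative form: a single frame, so stack use stays flat however large n is.
std::uint64_t sum_to_iterative(std::uint32_t n) {
    std::uint64_t total = 0;
    for (; n > 0; --n) {
        total += n;
    }
    return total;
}
```

Both return the same value for the same n; only the iterative one has a stack requirement that can be stated without knowing n.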
In general, on systems with limited resources it is important to limit the maximum resource use of your program. Time is also a resource; dynamic allocation makes hard real-time guarantees nearly impossible.
The application domain dictates the safety requirements. In safety-critical fields (pacemaker!) the program must not fail at all. That ideal is impossible to achieve unless the program is trivial, but great efforts are made to come close. In other fields a program may fail in a defined way under circumstances it cannot handle, but it must not fail silently or in undefined ways (for example, by overwriting data). For example, instead of allocating unknown amounts of data dynamically one would just have a pre-defined array of fixed size for data and use the elements in the array instead, with bound checks.
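The last suggestion, a pre-defined array of fixed size with bounds checks, could look roughly like this. FixedBuffer and its member names are hypothetical and not taken from any particular library.

```cpp
#include <array>
#include <cstddef>

// Fixed-capacity buffer: storage is reserved up front, so the worst-case
// memory use is known before the program runs.
template <typename T, std::size_t Capacity>
class FixedBuffer {
public:
    // Appends a value; returns false (instead of overflowing) when full.
    bool push(const T& value) {
        if (size_ >= Capacity) return false;
        items_[size_++] = value;
        return true;
    }

    // Copies element 'index' into 'out'; returns false if out of range.
    bool get(std::size_t index, T& out) const {
        if (index >= size_) return false;
        out = items_[index];
        return true;
    }

    std::size_t size() const { return size_; }

private:
    std::array<T, Capacity> items_{};
    std::size_t size_ = 0;
};
```

A FixedBuffer<int, 2> accepts two push calls and rejects the third by returning false, so the program fails in a defined way that the caller can handle, rather than by overwriting neighbouring data.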
When I'm defining a recursive function, is it better/safer to allocate local variables on the heap, and then clean them up before the function returns, than to allocate them on the stack?
You have both the C and C++ tags. It's a valid question for both, but I can only comment on C++.
In C++, it's better to use the heap even though it's slightly less efficient. That's because new can fail if you run out of heap memory, and in that case it throws an exception. Running out of stack space, however, does not cause an exception, and that's one of the reasons alloca is frowned upon in C++.
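A minimal sketch of what "new will throw an exception" buys you: the failure can be caught and turned into an ordinary error path. The function name is made up, and the sketch assumes a toolchain with exceptions enabled, which is not a given on embedded targets.

```cpp
#include <cstddef>
#include <memory>
#include <new>

// Tries to allocate 'count' zero-initialised ints on the heap.
// Returns a null pointer instead of crashing if the allocation fails.
std::unique_ptr<int[]> try_allocate_ints(std::size_t count) {
    try {
        return std::unique_ptr<int[]>(new int[count]());
    } catch (const std::bad_alloc&) {
        return nullptr;  // recoverable: the caller decides what to do next
    }
}
```

The caller checks the returned pointer before using it. There is no equivalent check for a local array that does not fit on the stack.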
I think generally speaking you should avoid recursion in embedded systems altogether, as every time the function is called the return address is pushed to the stack. This could cause an unexpected overflow. Try switching to a loop.
Back to your question though, mallocing will be slower but safer. If there is no heap space left, malloc returns NULL, which you can check for and then clean up safely. This comes at a large cost in speed, as malloc can be quite slow.
If you know how many iterations you expect ahead of time, you have the option of mallocing an array of the variables that you need. That way you will only malloc once which saves time, and won't risk unexpectedly filling up the heap or stack. Also you will only be left with one variable to free.

Is program runtime affected by where the objects reside in the memory?

My program allocates all of its resources, slightly below 1 MB in total, at startup and nothing more after that, except primitive local variables. The allocation was originally done with malloc, so on the heap, but I wondered whether putting them on the stack would make any difference.
In various tests with program runtimes from 3 seconds to 3 minutes, accessing the stack consistently appears to be up to 10% faster. All I changed was whether to malloc the structs or to declare them as automatic variables.
Another interesting thing I found is that when I declare the objects as static, the program runs 20~30% slower. I have no idea why. I double-checked whether I made a mistake, but the only difference really is whether static is there or not. Do static variables go somewhere else in the stack than automatic variables?
Earlier I had quite the opposite experience: in a C++ class, when I changed a const member array from non-static to static, the program ran faster. The memory consumption was the same because there was only one instance of that object.
Is program runtime affected by where the objects reside in the memory? Even if so, can't the compiler manage to place the objects in the right place for maximum efficiency?
Well, yeah, program performance is affected by where objects reside in memory.
The problem is, unless you have intimate knowledge of how your compiler works and how it uses features of your particular host system (operating system services, hardware, processor cache, etc), and how those things are configured, you will not be able to consistently exploit it. Even if you do succeed, small changes (e.g. upgrading a compiler, changing optimisation settings, a change of process quotas, changing amount of physical memory, changing a hard drive [e.g. that is used for swap space]) can affect performance, and it won't always be easy to predict whether a change will improve or degrade performance. Performance is sensitive to all these things - and to the interaction between them, in ways that are not always obvious without close analysis.
Is program runtime affected by where the objects reside in the memory?
Yes, program performance will be affected by the location of an object in memory, among other factors.
Whenever an object in the heap is accessed, it's done by dereferencing a pointer. Dereferencing pointers requires additional computation to find the next memory address and every additional pointer that's between you and your data (e.g. pointer -> pointer -> actual data) will make it slightly worse. In addition to this, it increases the cache miss rate for the CPU because the data that it expects to access next is not really in contiguous memory blocks. In other words, the assumption the CPU made to try and optimize its pipeline turns out to be false, and it pays the penalty for an incorrect prediction.
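The pointer-chasing point can be tried out with two containers that hold the same values but lay them out differently. Both std::vector and std::list keep their elements on the heap; the difference is that the vector's elements are contiguous, while each list node is reached through a pointer from the previous one. The function names are invented for this sketch, and no timings are claimed here: how much the layout matters has to be measured on your own system.

```cpp
#include <list>
#include <numeric>
#include <vector>

// Walks a contiguous block of memory; the next element is always adjacent.
long long sum_contiguous(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Follows a pointer from node to node; nodes may be scattered across the heap.
long long sum_linked(const std::list<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}
```

Both functions compute the same result, so they are convenient bodies to put inside a benchmark loop when comparing the two layouts.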
Even if so, can't the compiler manage to place the objects in the right place for maximum efficiency?
C and C++ compilers will place objects wherever you tell them to. If you're using malloc, you shouldn't expect the compiler to put things on the stack, and vice versa.
The compiler wouldn't be able to do this even in principle because malloc dynamically allocates memory at runtime, long after the compiler's job has ended. What is known at compile time is the size for the stack, but how the contents of this memory are organized depends entirely on you.
You might be able to get some benefit from compiler optimization settings, but most of your optimization effort will be better spent improving the data structures and/or algorithms being used.

Disadvantages of using large variables/arrays on the stack?

What are the disadvantages, if any, of defining large arrays or objects on the stack? Take the following example:
int doStuff() {
    int poolOfObjects[1500];
    // do stuff with the pool
    return 0;
}
Are there any performance issues with this? I've heard of stack overflow issues, but I'm assuming that this array isn't big enough for that.
Stack overflow is a problem if:
the array is larger than the thread stack
the call tree is very deep, or
the function uses recursion
In all other cases, stack allocation is a very fast and efficient way to get memory. Calling a large number of constructors and destructors can be slow, however, so if your constructor/destructor are non-trivial, you might want to look at a longer-lived pool.
As you have mentioned, overrunning the stack is the primary issue. Even if your particular case isn't very large, consider what will happen if the function is recursive.
Even worse, if the function is called from a recursive function, it may be inlined - thus leading to "surprise" stack overflow problems. (I've run into this issue multiple times with the Intel Compiler.)
As far as performance issues go, these are better measured than guessed. But having a very large array on the stack could potentially hurt data-locality if it separates other variables.
Other than that, stack allocation is dirt-cheap and faster than heap allocation. On some compilers (like MSVC), a function that uses more than 4k of stack makes the compiler emit a stack probe at function entry. (The threshold can be changed with the /Gs option.)
https://github.com/kennethlaskoski/SubtleBug
This code contains a bug that I call subtle because the binary runs flawlessly on iOS Simulator and crashes shamefully on a real device.
Spoiler ahead: what happens on a real device is a classic stack overflow due to allocation of a very large variable.
The simulator uses the much larger x86 stack, so all goes well.
http://en.wikipedia.org/wiki/Stack_overflow#Very_large_stack_variables
The stack overflow issues should be your main technical concern, but if this code is going to be used by others, it's worth considering whether the magic size of 1500 might disrupt someone's usage of this function.
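If the pool ever needs to grow, or the function may end up deep in a call chain, one hedged alternative is to keep the storage on the heap and give the size a name. The sketch below is a variation on the question's doStuff(); the names doStuffOnHeap and kDefaultPoolSize are made up, and the function returns the pool size (rather than 0) only so that its behaviour can be checked.

```cpp
#include <cstddef>
#include <vector>

// Named constant instead of a magic number; 1500 is the value from the question.
constexpr std::size_t kDefaultPoolSize = 1500;

// The pool lives on the heap. Only the small std::vector handle sits on the
// stack, whatever the pool size is.
int doStuffOnHeap(std::size_t poolSize = kDefaultPoolSize) {
    std::vector<int> poolOfObjects(poolSize);  // elements are zero-initialised
    // ... do stuff with the pool ...
    return static_cast<int>(poolOfObjects.size());
}
```

This trades the near-free stack allocation for one heap allocation per call, so for a small array in a hot, non-recursive function the original stack version may still be the better choice.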

How can I determine appropriate stack and heap sizes for ARM Cortex, using C++

The cortex M3 processor startup file allows you to specify the amount of RAM dedicated to the stack and the heap. For a c++ code base, is there a general rule of thumb or perhaps some more explicit way to determine the values for the stack and heap sizes? For example, would you count the number and size of unique objects, or maybe use the compiled code size?
The cortex M3 processor startup file allows you to specify the amount of RAM dedicated to the stack and the heap.
That is not a feature of the Cortex-M3, but rather of the start-up code provided by your development toolchain. It is the way the Keil ARM-MDK default start-up files for M3 work. It is slightly unusual; more commonly you would specify a stack size, and any memory remaining after stack and static memory allocation by the linker becomes the heap. This is arguably better since you do not end up with a pool of unusable memory. You could modify that and use an alternative scheme, but you'd need to know what you are doing.
If you are using Keil ARM-MDK, the linker options --info=stack and --callgraph add information to the map file that aids stack requirement analysis. These and other techniques are described here.
If you are using an RTOS or multi-tasking kernel, each task will have its own stack. The OS may provide stack analysis tools. Keil's RTX kernel viewer shows current stack usage but not peak stack usage (so it is mostly useless, and it only works correctly for tasks with default stack lengths).
If you have to implement stack-checking tools yourself, the normal method is to fill the stack with a known value and then, starting from the end the stack grows toward (the low address on a Cortex-M, whose stack grows downward), inspect each value until you find the first one that is not the fill value. This gives the likely high-tide mark of the stack. You can implement code to do this, or you can manually fill the memory from the debugger and then monitor stack usage in a debugger memory window.
Heap requirement will depend on the run-time behaviour of your code; you'll have to analyse that yourself. However, in ARM/Keil Realview, the MemManage exception handler will be called when C++'s new throws an exception; I am not sure whether malloc() does that or simply returns NULL. You can place a breakpoint in the exception handler, or modify the handler to emit an error message, to detect heap exhaustion during testing. There is also a __heapstats() function that can be used to output heap information. It has a somewhat cumbersome interface, so I wrapped it thus:
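#include <cstdio>    // std::fprintf, stdout
#include <stdlib.h>  // __heapstats (ARM C library extension, not standard C++)

// Prints heap statistics to stdout via the ARM library's __heapstats().
void heapinfo()
{
    // Callback type __heapstats() expects: a printf-like function taking a void* context.
    typedef int (*__heapprt)(void *, char const *, ...);
    __heapstats( (__heapprt)std::fprintf, stdout ) ;
}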
The compiled code size will not help as the code does not run in the stack nor the heap. Cortex-M3 devices are typically implemented on microcontrollers with built in Flash and a relatively small amount of RAM. In this configuration, the code will typically run from Flash.
The heap is used for dynamic memory allocation. Counting the number of unique objects will give you a rough estimate but you also have to account for any other elements that use dynamic memory allocation (using the new keyword in C++). Generally, dynamic memory allocation is avoided in embedded systems for the precise reason that heap size is hard to manage.
The stack will be used for variable passing, local variables, and context saving during exception handling routines. It is generally hard to get a good idea of stack usage unless your code allocates a large block of local memory or large objects. One technique that may help is to allocate all of the available RAM you have for the stack. Fill the stack with a known pattern (0x00 or 0xff are not the best choices since these values occur frequently), run the system for a while, then examine the stack to see how much was used. Admittedly, this is neither a very precise nor a scientific approach, but it is still helpful in many cases.
The latest version of the IAR Compiler has a feature that will determine what stack size you need, based on a static analysis of your code (assuming you don't have any recursion).
The general approach, if you don't have an exact number, is to make the stack as big as you can and then, when you start running out of memory, trim it down until your program crashes due to a stack overflow. I wish that were a joke, but that is the way it is usually done.
Reducing until it crashes is a quick ad-hoc way. You can also fill the stack with a known value, say, 0xCCCC, and then monitor maximum stack usage by scanning for the 0xCCCC.
It's imperfect, but much better than looking for a crash.
The rationale being, reducing stack size does not guarantee that stack overflow will munch something "visible".
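The fill-and-scan technique that several answers describe can be sketched on a simulated stack region. Everything here is an assumption made for illustration: the region is an ordinary std::vector rather than real stack memory, the stack is assumed to grow downward (from the highest index toward index 0), the 32-bit pattern 0xCCCCCCCC stands in for the 0xCCCC mentioned above, and all names are invented. On a real target you would paint the memory between the linker-defined stack limits instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kFillPattern = 0xCCCCCCCCu;  // assumed fill value

// Paints every word of the (simulated) stack region with the fill pattern.
void paint(std::vector<std::uint32_t>& region) {
    for (std::uint32_t& word : region) {
        word = kFillPattern;
    }
}

// Simulates a descending stack that has used 'words' words: it overwrites
// the region from the highest index downward.
void simulate_usage(std::vector<std::uint32_t>& region, std::size_t words) {
    for (std::size_t i = 0; i < words && i < region.size(); ++i) {
        region[region.size() - 1 - i] = 0x12345678u;
    }
}

// Scans up from the low end until the first word that no longer holds the
// pattern and returns the number of words at or above that point.
std::size_t high_water_mark_words(const std::vector<std::uint32_t>& region) {
    std::size_t untouched = 0;
    while (untouched < region.size() && region[untouched] == kFillPattern) {
        ++untouched;
    }
    return region.size() - untouched;
}
```

After paint() the mark is 0; after simulating 40 words of use it is 40. As the answers note, the result is only a likely peak: a frame that reserves space without writing to it, or data that happens to equal the pattern, will make the scan under-report.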