gfortran flag for intel's -heap-arrays [size] - fortran

What is the gfortran flag equivalent for intel ifort's
-heap-arrays [size]

I've found this:
-fmax-stack-var-size=n
This option specifies the size in bytes of the largest array that will be put on the stack; if the size is exceeded, static memory is used (except in procedures marked as RECURSIVE). Use the option -frecursive to allow for recursive procedures which do not have a RECURSIVE attribute or for parallel programs. Use -fno-automatic to never use the stack. This option currently only affects local arrays declared with constant bounds, and may not apply to all character variables. Future versions of GNU Fortran may improve this behavior. The default value for n is 32768.
from gfortran's website. I think it'll do the trick.

This is an old question, but the accepted answer is not fully correct, and I would like to add context for future users who come across this post looking for answers.
Both Intel's ifort and GCC's gfortran have a byte limit above which arrays are no longer allocated on the stack.
Intel's -heap-arrays [size] puts any array bigger than [size] kilobytes on the heap rather than on the stack.
GCC does not have this option and instead only has -fmax-stack-var-size=n, where any variable above n bytes is not placed on the stack. The documentation (https://gcc.gnu.org/onlinedocs/gfortran/Code-Gen-Options.html) says:
if the size is exceeded static memory is used (except in procedures marked as RECURSIVE).
The key difference here is that these large variables are NOT guaranteed to be placed on the heap.
Therefore the two options from Intel and GCC are not identical, and more care needs to be taken to ensure large arrays in gfortran do not end up shared in static memory.
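For concreteness, here is a sketch of roughly comparable invocations (the size arguments are illustrative, and note the unit mismatch: ifort's -heap-arrays takes kilobytes, gfortran's -fmax-stack-var-size takes bytes):

```shell
# ifort: arrays larger than 64 KB go on the heap
ifort -heap-arrays 64 prog.f90

# gfortran: arrays larger than 65536 bytes leave the stack,
# but land in static memory rather than on the heap
gfortran -fmax-stack-var-size=65536 prog.f90
```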

Related

C++ Initialize Array Waste Value

I often forget to initialize the values of an array.
In theory, an uninitialized array should contain garbage values.
In practice, however, many values turn out to be zero, so the program appears to work on small examples.
This makes debugging difficult.
Can you tell me why this happens?
Uninitialized values usually appear to be zero in simple test cases because modern operating systems blank memory before handing it to processes as a security precaution. This won't hold once your program has been running for a while, so don't depend on it. This applies to both automatic (stack) variables and heap allocations. For stack allocations it's actually worse: the variable can take on a value it couldn't normally contain, potentially crashing your program outright. On the Itanium processor, for example, merely assigning an uninitialized integer variable to another variable could crash with a memory fault.
Or try it in DOS. It will not work because DOS doesn't blank memory.
On the other hand, static and global allocations are guaranteed to be zeroed if not initialized by the standard.
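A minimal sketch of the static-versus-automatic distinction (the function names are mine):

```cpp
#include <cassert>

int global_arr[4]; // static storage duration: guaranteed zero-initialized

int read_zeroed_local() {
    int local[4] = {}; // explicit value-initialization: all four elements are 0
    return local[0] + local[3];
}

int read_uninit_local() {
    int local[4];      // automatic storage: values are indeterminate;
                       // reading them before writing is undefined behavior
    local[0] = 7;      // only safe to read after we write
    return local[0];
}
```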
If you want to be warned about uninitialised memory, clang++ supports MemorySanitizer: add -fsanitize=memory to your compiler flags and you'll get runtime errors when you read uninitialised memory. (g++ does not implement MemorySanitizer, but both compilers can flag many such bugs at compile time with -Wuninitialized.)

Memory considerations for an array in a function with a variable size

I have the size of the array as a variable instead of as an actual number. For my program I call the function diagonalize three times with different values of array_size -- would the array be allocated and deallocated for each value of array_size, or would only one array be used and overwritten during the program? The code is below. Would it just be better to make three separate diagonalize functions which each internally declare an array with the size given by a unique global constant? Unfortunately I have to use arrays instead of vectors.
#include <mkl.h>
#include "mkl_lapacke.h"

void diagonalize(unsigned long long int array_size) {
    lapack_complex_double z[array_size]; // VLA: a compiler extension, placed on the stack
}
Details:
I am compiling with icpc -std=c++11 -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -ldl 10_site_main.cpp
icpc version 19.0.4.243 (gcc version 4.8.5 compatibility)
In
void diagonalize(unsigned long long int array_size) {
lapack_complex_double z[array_size];
}
z is a variable-length array (VLA), a feature from C99. The C++ standard doesn't have it, but some C++ compilers support it as an extension for built-in types.
The default stack size is around 8 MB on Linux, so using unsigned long long int for array_size is overkill; unsigned would suffice.
Notably, the Linux kernel got rid of all VLAs in 2018 because they can overflow the stack and corrupt it, and hence provide attack vectors for kernel exploits.
Whether a VLA overflows the stack depends on how much stack space is available when the function is called, and that depends on the current call chain. In one call chain there can be plenty of stack space, in another not so much, which makes it hard to guarantee or prove that the function always has enough stack space for the VLA. Not using VLAs eliminates these opportunities to overflow the stack and corrupt it, which are among the most popular and easy avenues for exploits.
would the array be allocated and deallocated for each value of array_size, or would only one array be used and overwritten during the program?
It is an automatic function-local variable, allocated on the stack on each call. The allocation just reserves stack space and normally takes one CPU instruction, such as sub rsp, <vla-size-in-bytes> on x86-64. When a function returns, all of its automatic variables cease to exist.
See https://en.cppreference.com/w/cpp/language/storage_duration, automatic storage duration for full details.
Would it just be better to make three separate diagonalize functions which each internally declare an array with the size given by a unique global constant?
It would make no difference because all automatic variables are destroyed and cease to exist when a function returns.
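If the worry is stack size, a standard-C++ alternative is to put z on the heap instead of using the VLA extension. The sketch below aliases lapack_complex_double to std::complex<double> (which is what MKL's LAPACKE uses by default) since <mkl.h> isn't available here:

```cpp
#include <complex>
#include <cstddef>
#include <memory>

using lapack_complex_double = std::complex<double>; // stand-in for the MKL typedef

double diagonalize(std::size_t array_size) {
    // Heap allocation: limited by available RAM, not the ~8 MB stack.
    // make_unique value-initializes, so every element starts at (0,0).
    auto z = std::make_unique<lapack_complex_double[]>(array_size);
    z[array_size - 1] = {1.0, 0.0};
    // ... pass z.get() to the LAPACKE routines here ...
    return z[array_size - 1].real();
}   // z is freed automatically when it goes out of scope
```

5,000,000 complex doubles (80 MB) would overflow a default Linux stack as a VLA, but is unremarkable on the heap.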

Initializing C++ array in constant time

char buffer[1000] = {0};
This initializes all 1000 elements to 0. Is this constant time? If not, why?
It seems like the compiler could optimize this to O(1) based on the following facts:
The array is of fixed size and known at compile time
The array is located on the stack, which means that presumably the executable could contain this data in the data segment of the executable (on Windows) as a chunk of data that is already filled with 0's.
Note that answers can be general to any compiler, but I'm specifically interested in answers tested on the MSVC compiler (any version) on Windows.
Bonus Points: Links to any articles, white papers, etc. on the details of this would be greatly appreciated.
If it's inside a function, no, it's not constant time.
Your second assumption isn't correct:
"The array is located on the stack, which means that presumably the executable could contain this data in the data segment of the executable (on Windows) as a chunk of data that is already filled with 0's."
The stack isn't already filled with zeros. It's filled with junk leftover from previous function calls.
So it's not possible to do it in O(1) because it will have to zero it.
It can be O(1) only as a global variable. If it is a local variable (on the stack) it is O(n), where n is the size of the array.
The stack is reused memory: you have to actively zero it every time you want 1000 zeros there. An array like the one you defined is not implemented as a pointer into the data segment; it is 1000 bytes on the stack and must be initialized in O(1000).
EDIT: Dani is right, I have to fix my statement: if it is a global array, it is initialized when the program starts, and that is O(n) as well.
It will never be constant time, global or not. It's true the compiler initializes it, but the operating system must load the whole file into memory, which takes O(n) time.
The array is located on the stack, which means that presumably the executable could contain this data in the data segment of the executable (on Windows) as a chunk of data that is already filled with 0's.
What if you recurse into the function that defines the array? The global DATA segment would need a copy of the array for each function call so that each invocation has its own array to work on. The compiler would have to run your code to determine the maximum recursion depth.
Also, what happens when your program has multiple threads and each calls foo? All of a sudden you have shared data in DATA that has to be locked. The locking might cause more performance problems than the initialization you got rid of.
I wouldn't worry about it too much too. Most platforms have fairly efficient ways of zero filling memory. Unless you profile it and find a problem, don't sweat it.
As others have pointed out, assumption 2 is wrong. Stack variables are allocated at run time in O(1) time but are not normally initialised unless you're running a debug build.
PUSH ebp
MOV ebp, esp
SUB esp, 10
; function body code goes here
Here, the stack pointer esp is decremented by 10 to make room for some local function variables. They are not initialised; that would require a loop.
This article seems friendly enough.
If it's a global static, the "constant" here is ZERO - initialization is done at COMPILE TIME.
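A small sketch (the function names are mine) of why the zeroing has to happen on every call: the stack region may hold leftovers from an earlier call, yet the initializer must still produce all zeros, so the compiler emits an O(n) fill (typically a memset) at function entry:

```cpp
#include <cstddef>

// Scribble over an uninitialized local buffer, then return its first byte.
char dirty_the_stack() {
    char buffer[1000];
    for (std::size_t i = 0; i < sizeof buffer; ++i)
        buffer[i] = 'x';
    return buffer[0];
}

// The same-sized buffer, initialized: guaranteed all zeros on every call,
// regardless of what previous calls left behind on the stack.
char zeroed_buffer_front() {
    char buffer[1000] = {0};
    return buffer[0];
}
```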

Variable Length Array overhead in C++?

Looking at this question: Why does a C/C++ compiler need know the size of an array at compile time? it came to me that compiler implementers should have had some time to get their feet wet by now (VLAs are part of the C99 standard, which is 10 years old) and provide efficient implementations.
However it still seems (from the answers) to be considered costly.
This somehow surprises me.
Of course, I understand that a static offset is much better than a dynamic one in terms of performance, and unlike one suggestion I would not actually have the compiler perform a heap allocation of the array since this would probably cost even more [this has not been measured ;)]
But I am still surprised at the supposed cost:
if there is no VLA in a function, then there would not be any cost, as far I can see.
if there is one single VLA, then one can either put it before or after all the variables, and therefore get a static offset for most of the stack frame (or so it seems to me, but I am not well-versed in stack management)
The question arises with multiple VLAs, of course, and I was wondering if having a dedicated VLA stack would work. This means a VLA would be represented by a count and a pointer (both of known size), with the actual memory taken from a secondary stack used only for this purpose (and thus really a stack too).
[rephrasing]
How VLAs are implemented in gcc / VC++ ?
Is the cost really that impressive ?
[end rephrasing]
It seems to me it can only be better than using, say, a vector, even with present implementations, since you do not incur the cost of a dynamic allocation (at the cost of not being resizable).
EDIT:
There is a partial response here, however comparing VLAs to traditional arrays seem unfair. If we knew the size beforehand, then we would not need a VLA. In the same question AndreyT gave some pointers regarding the implementation, but it's not as precise as I would like.
How VLAs are implemented in gcc / VC++ ?
AFAIK VC++ doesn't implement VLA. It's primarily a C++ compiler, and its C mode supports only C89 (no VLA, no restrict). I don't know how gcc implements VLAs, but the fastest possible way is to store the pointer to the VLA and its size in the static portion of the stack frame. This way you can access one of the VLAs with the performance of a constant-sized array: the last VLA if the stack grows downwards, as on x86 (dereference [stack pointer + index*element size + size of last temporary pushes]), or the first VLA if it grows upwards (dereference [stack-frame pointer + offset from stack frame + index*element size]). All the other VLAs need one more indirection to get their base address from the static portion of the stack frame.
[ Edit: Also when using VLA the compiler can't omit stack-frame-base pointer, which is redundant otherwise, because all the offsets from the stack pointer can be calculated during compile time. So you have one less free register. — end edit ]
Is the cost really that impressive ?
Not really. Moreover, if you don't use it, you don't pay for it.
[ Edit: Probably a more correct answer would be: Compared to what? Compared to a heap allocated vector, the access time will be the same but the allocation and deallocation will be faster. — end edit ]
If it were to be implemented in VC++, I would assume the compiler team would use some variant of _alloca(size). And I think the cost is equivalent to using variables with greater than 8-byte alignment on the stack (such as __m128); the compiler has to store the original stack pointer somewhere, and aligning the stack requires an extra register to store the unaligned stack.
So the overhead is basically an extra indirection (you have to store the address of VLA somewhere) and register pressure due to storing the original stack range somewhere as well.
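For what it's worth, gcc and clang do accept VLAs in C++ as an extension, and the generated code simply bumps the stack pointer by a runtime amount; a minimal sketch:

```cpp
#include <cstddef>

// Sum 0..n-1 through a VLA (a GCC/Clang extension in C++, not standard).
// The allocation is a runtime stack-pointer adjustment, freed on return.
long sum_first_n(std::size_t n) {
    long buf[n];                       // size known only at run time
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = static_cast<long>(i);
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += buf[i];
    return s;
}
```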

Stack Overflow in Fortran program

I have a problem with my simple Fortran program. I am working in Fortran 77, using Compaq Visual Fortran. The program structure must be in the form of a main and a subroutine, because it is part of a big program related to the finite element method.
My issue is that I would like to set the values 10000 & 10000 for NHELE and NVELE respectively, but when I run the code, the program stops and gives the following error:
forrtl: severe (170): Program Exception - stack overflow
I've tried iteratively reducing the required values, until I reached 507 & 507. At this point the code runs without errors.
However, increasing the values to 508 & 508 causes the same error to reappear.
I think the problem is related to the subroutine NIGTEE, because when I rearrange the program without it, everything works fine.
I've tried increasing the stack size to a maximum by using the menu project>>settings>>link>>output>>reserve & commit
but this didn't make a difference.
How can I solve this problem?
Here is my program:
      PARAMETER(NHELE=508,NVELE=508)
      PARAMETER(NHNODE=NHELE+1,NVNODE=NVELE+1)
      PARAMETER(NTOTALELE=NHELE*NVELE)
      DIMENSION MELE(NTOTALELE,4)
      CALL NIGTEE(NHELE,NVELE,NHNODE,NVNODE,NTOTALELE,MELE)
      OPEN(UNIT=7,FILE='MeshNO For Rectangular.TXT',STATUS='UNKNOWN')
      WRITE(7,500) ((MELE(I,J),J=1,4),I=1,NTOTALELE)
  500 FORMAT(4I20)
      STOP
      END

      SUBROUTINE NIGTEE(NHELE,NVELE,NHNODE,NVNODE,NTOTALELE,MELE)
      DIMENSION NM(NVNODE,NHNODE),NODE(4)
      DIMENSION MELE(NTOTALELE,4)
      KK=0
      DO 20 I=1,NVNODE
        DO 20 J=1,NHNODE
          KK=KK+1
          NM(I,J)=KK
   20 CONTINUE
      KK=0
      DO 30 I=1,NVELE
        DO 30 J=1,NHELE
          NODE(1)=NM(I,J)
          NODE(2)=NM(I,J+1)
          NODE(3)=NM(I+1,J+1)
          NODE(4)=NM(I+1,J)
          KK=KK+1
          DO 50 II=1,4
   50       MELE(KK,II)=NODE(II)
   30 CONTINUE
      RETURN
      END
Thanks.
Update:
Here's your actual problem. Your NM array is being declared as being a two-dimensional array of NHNODE cells by NVNODE rows. If that is 10,000 by 10,000, then you will need more than 381 megabytes of memory to allocate this array alone, aside from any other memory being used by your program. (By contrast, if the array is 500 by 500, you only need about 1 megabyte of memory for the same array.)
The problem is that old Fortran would allocate all the arrays directly in the code segment or on the stack. The concept of an OS "heap" (general purpose memory for large objects) had been invented by 1977, but Fortran 77 still didn't have any constructs for making use of it. So every time your subroutine is called, it has to push the stack pointer to make room for 381 megabytes of space on the stack. This is almost certainly larger than the amount of space your operating system is allowing for the stack segment, and you are overflowing stack memory (and hence getting a stack overflow).
The solution is to allocate that memory from a different place. I know in old Fortran it is possible to use COMMON blocks to statically allocate memory directly from your code segment. You still can't dynamically allocate more, so your subroutine can't be reentrant, but if your subroutine only gets called once at a time (which it appears to be) this may be the best solution.
A better solution would be to switch to Fortran 90 or newer and use the ALLOCATE keyword to dynamically allocate the arrays on the heap instead of the stack. Then you can allocate as large a chunk as your OS can give you, but you won't have to worry about overflowing the stack, since the memory will be coming from another place.
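A sketch of what the Fortran 90 version of the subroutine could look like, with NM moved to the heap (only the changed declarations are shown; the loop bodies stay as in the original):

```
      SUBROUTINE NIGTEE(NHELE,NVELE,NHNODE,NVNODE,NTOTALELE,MELE)
      INTEGER, ALLOCATABLE :: NM(:,:)
      DIMENSION MELE(NTOTALELE,4),NODE(4)
      ALLOCATE(NM(NVNODE,NHNODE))   ! heap allocation: no stack overflow
C     ... same loops as before ...
      DEALLOCATE(NM)
      RETURN
      END
```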
You may be able to fix this by changing it in the compiler, as M.S.B. suggests, but a better solution is to simply fix the code.
Does that compiler have an option to put arrays on the heap?
You could try a different compiler, such as one that is still supported. Fortran 95 compilers will compile FORTRAN 77. There are many choices, including open source. Intel Visual Fortran, the successor to Compaq Visual Fortran, has the heap option on Windows & Mac OS X for placing automatic and temporary arrays on the heap.
MELE is actually a larger array than NM: 10000 x 10000 x 4 elements versus 10001 x 10001 (assuming 4-byte integers, as Daniel did) -- 1.49 GB versus 381 MB. MELE is declared in your main program and, from your tests, is acceptable even though it is larger. So either adding NM pushes the memory usage over a limit (the total for these two arrays is 1.86 GB), or the difference in how they are declared matters.
The size of MELE is known at compile time, that of NM only at run time, so the compiler probably allocates the memory differently. Really, in this program the size of NM is known, but in the subroutine the dimensions are received as arguments, so to the compiler the size is unknown. If you change this, the compiler may change how it allocates the memory for NM and the program may run: don't pass the dimensions of NM as arguments -- make them constants. The elegant way would be to put the three PARAMETER statements that set up the array sizes into an include file, and include it wherever needed. Quick and dirty, as a test, would be to repeat identical PARAMETER statements in the subroutine -- but then you have the same information twice, which has to be changed twice whenever you make changes. In either case, you have to remove the array dimensions from the subroutine arguments in both the call and the subroutine declaration, or use different names, because the same variable in a subroutine can't be both a parameter and an argument. If that doesn't work, declare NM in the main program and pass it as an argument to the subroutine.
Re the COMMON block -- the dimensions need to be known at compile time, and so can't be subroutine arguments -- same as above. As Daniel explained, putting the array into COMMON would definitely cause it not to be on the stack.
This is beyond the language standard -- how the compiler provides memory is an implementation detail, "under the hood". So the solution is partially guess work. The manual or help for the compiler might have answers, e.g., a heap allocation option.
Stack overflows related to array size is a warning sign that they are being pushed whole onto the call stack, instead of on the heap. Have you tried making the array variables allocatable? (I'm not sure if this is possible in F77, though)