Is a matrix automatically deallocated at the end? [duplicate] - fortran

I am interested in the difference between alloc_array and automatic_array in the following extract:
subroutine mysub(n)
integer, intent(in) :: n
integer :: automatic_array(n)
integer, allocatable :: alloc_array(:)
allocate(alloc_array(n))
...[code]...
I am familiar enough with the basics of allocation (not so much on advanced techniques) to know that allocation allows you to change the size of the array in the middle of the code (as pointed out in this question), but I'm interested in considering the case where you don't need to change the size of the array; they might be passed onto other subroutines for operation, but the only purpose of both variables in the code and any subroutine is to hold the data of an array of dimension n (and maybe change the data, but not the size).
(1) Is there any difference in memory usage? I am not an expert in low level procedures, but I have a very slight knowledge of how they matter and how they can impact on the higher level programming (kind of experience I'm talkng about: once trying to run a big code in fortran I was getting a mistake I didn't understand, sysadmin told me "oh, yeah, you are probably saturating the stack; try adding this line in your running script"; anything that gives me insight into how to consider this things when actually coding and not having to patch them later is welcomed). I've been told by people that it might be dependent on many other things like compiler or architecture, but I interpreted from those responses that they were not completely sure of exactly how this was so. Is it so absolutely dependant on a multitude of factors or is there a default/intended behavior in the coding that may then be over-riden by optional compiling keywords or system preferences?
(2) Would the subroutines have different interface needs? Again, not an expert, but it had happened to me before that because of the way I declare variables of subroutine, I end up having to put the subroutines in a module. I've been given to understand this may vary depending on whether I use things that are special for allocatable variables. I am thinking about the case in which everything I do with the variables can be done both by allocatables and automatics, not intentionally using anything specific of allocatables (other than allocation before usage, that is).
Finally, in case this is of use: the reason I am asking is because we are developing in a group and we have recently noticed different people use those two declarations in different ways and we needed to determine if this is something that can be left to personal preference or if there might be any reasons why it might be a good idea to set a clear criteria (and how to set that criteria). I don't need extremely detailed answers, I am trying to determine if this is something I should be doing research about to be careful on how we use it and in what aspects of it should the research be directed.
Though I would be interested to know of "interesting tricks" than can be done with allocation but are not directly related to the need of having size variability, I am leaving those for a possible future follow-up question and focusing here on the strictly functional differences (meaning: what I am explicitly telling compilers to do with my code). The two items I mentioned are the thing I could come up with due to previous experiences, but any other important one that I am missing and should consider, please do mention them.

Because gfortran or ifort + Linux(x86_64) are among the most popular combinations used for HPC, I made some performance comparison between local allocatable vs automatic arrays for these combinations. The CPU used is Xeon E5-2650 v2#2.60GHz, and the compilers are gfortran4.8.2 and ifort14.0. The test program is like the following.
In test.f90:
!------------------------------------------------------------------------
subroutine use_automatic( n )
integer :: n
integer :: a( n ) !! local automatic array (with unknown size at compile-time)
integer :: i
do i = 1, n
a( i ) = i
enddo
call sub( a )
end
!------------------------------------------------------------------------
subroutine use_alloc( n )
integer :: n
integer, allocatable :: a( : ) !! local allocatable array
integer :: i
allocate( a( n ) )
do i = 1, n
a( i ) = i
enddo
call sub( a )
deallocate( a ) !! not necessary for modern Fortran but for clarity
end
!------------------------------------------------------------------------
program main
implicit none
integer :: i, nsizemax, nsize, nloop, foo
common /dummy/ foo
nloop = 10**7
nsizemax = 10
do i = 1, nloop
nsize = mod( i, nsizemax ) + 1
call use_automatic( nsize )
! call use_alloc( nsize )
enddo
print *, "foo = ", foo !! to check if sub() is really called
end
In sub.f90:
!------------------------------------------------------------------------
subroutine sub( a )
integer a( * )
integer foo
common /dummy/ foo
foo = a( 1 )
ends
In the above program, I tried avoiding compiler optimization that eliminates a(:) itself (i.e., no operation) by placing sub() in a different file and making the interface implicit. First, I compiled the program using gfortran as
gfortran -O3 test.f90 sub.f90
and tested different values of nsizemax while keeping nloop = 10^7. The result is in the following table (time is in sec, measured several times by the time command).
nsizemax use_automatic() use_alloc()
10 0.30 0.31 # average result
50 0.48 0.47
500 1.0 0.90
5000 4.3 4.2
100000 75.6 75.7
So the overall timing seems almost the same for two calls when -O3 is used (but see Edit for different options). Next, I compiled with ifort as
[O3] ifort -O3 test.f90 sub.f90
or
[O3h] ifort -O3 -heap-arrays test.f90 sub.f90
In the former case the automatic array is stored on the stack, while when -heap-arrays is attached the array is stored on the heap. The obtained result is
use_automatic() use_alloc()
[O3] [O3h] [O3] [O3h]
10 0.064 0.39 0.48 0.48
50 0.094 0.56 0.65 0.66
500 0.45 1.03 1.12 1.12
5000 3.8 4.4 4.4 4.4
100000 74.5 75.3 76.5 75.5
So for ifort, the use of automatic arrays seems beneficial when relatively small arrays are mainly used. On the other hand, gfortran -O3 shows no difference because both arrays are treated the same way (see Edit for more details).
Additional comparison:
Below is the result for Oracle Fortran compiler 12.4 for Linux (used with f90 -O3). The overall trend seems similar; automatic arrays are faster for small n, indicating the internal use of stack.
nsizemax use_automatic() use_alloc()
10 0.16 0.45
50 0.17 0.62
500 0.37 0.97
5000 2.04 2.67
100000 65.6 65.7
Edit
Thanks to Vladimir's comment, it has turned out that gfortran -O3 put automatic arrays (with unknown size at compile-time) on the heap. This explains why use_automatic() and use_alloc() did not make any difference above. So I made another comparison between different options below:
[O3] gfortran -O3
[O5] gfortran -O5
[O3s] gfortran -O3 -fstack-arrays
[Of] gfortran -Ofast # this includes -fstack-arrays
Here, -fstack-arrays means that the compiler puts all local arrays with unknown size on the stack. Note that this flag is enabled by default with -Ofast. The obtained result is
nsizemax use_automatic() use_alloc()
[Of] [O3s] [O5] [O3] [Of] [O3s] [O5] [O3]
10 0.087 0.087 0.29 0.29 0.29 0.29 0.29 0.29
50 0.15 0.15 0.43 0.43 0.45 0.44 0.44 0.45
500 0.57 0.56 0.84 0.84 0.92 0.92 0.92 0.92
5000 3.9 3.9 4.1 4.1 4.2 4.2 4.2 4.2
100000 75.1 75.0 75.6 75.6 75.6 75.3 75.7 76.0
where the average of ten measurements are shown. This table demonstrates that if -fstack-arrays is included, the execution time for small n becomes shorter. This trend is consistent with the results obtained for ifort above.
It should be mentioned, however, that the above comparison probably corresponds to the "best-case" scenario that highlights the difference between them, so the timing difference can be much smaller in practice. For example, I have compared the timing for the above options by using some other program (involving both small and large arrays), and the results were not much affected by the stack options. Also the result should depend on machine architecture as well as compilers, of course. So your mileage may vary.

For the sake of clarity, I'll briefly mention terminology. The two arrays are both local variables and arrays of rank 1.
alloc_array is an allocatable array;
automatic_array is an explicit-shape automatic object.
Being local variables their scope is that of the procedure. Automatic arrays and unsaved allocatable arrays come to an end when execution of the procedure completes (with the allocatable array being deallocated); automatic objects cannot be saved and saved allocatable objects are not deallocated on completion of execution.
Again, as in the linked question, after the allocation statement both arrays are of size n. These are still two very different things. Of course, the allocatable array can have its allocation status changed and its allocation moved. I'll leave both of those (mostly) out of the scope of this answer. An allocatable array, of course, doesn't have to have these things changed once it's been allocated.
Memory usage
What was partly contentious about a previous revision of the question is how ill-defined the concept of memory usage is. Fortran, as a language definition, tells us that both arrays come to be the same size and they'll have the same storage layout, and are both contiguous. Beyond that, much follows terms you'll hear a lot: implementation specific and processor dependent.
In a comment you expressed interest in ifort. So that I don't wander too far, I'll stick to that one compiler. Other compilers have similar concepts, albeit with different names and options.
Often, ifort will place automatic objects and array temporaries onto stack. There is a (default) compiler option -no-heap-arrays described as having effect
The compiler puts automatic arrays and temporary arrays in the stack storage area.
Using the alternative option -heap-arrays allows one to control that slightly:
This option puts automatic arrays and arrays created for temporary computations on the heap instead of the stack.
There is a possibility to control size thresholds for which heap/stack would be chosen (when that is known at compile-time):
If the compiler cannot determine the size at compile time, it always puts the automatic array on the heap.
As n isn't a constant, one would expect automatic_array to be on the heap with this option, regardless of the size specified. To determine the size, n, of the array at compile time, the compiler would potentially need to do quite a bit of code analysis, even if it is possible.
There's probably more to be said, but this answer would be far too long if I tried. One thing to note, however, is that automatic local objects and (post-Fortran 90) allocatable local objects can be expected not to leak memory.
Interface needs
There is nothing special about the interface requirements of the subroutine mysub: local variables have no impact on that. Any program unit calling that would be happy with an implicit interface. What you are asking about is how the two local arrays can be used.
This largely comes down to what uses the two arrays can be put to.
If the dummy argument of a second procedure has the allocatable attribute then only the allocatable array here can be passed to that procedure. It will also need to have an explicit interface. This is true whether or not the procedure changes the allocation.
Of course, both arrays could be passed as arguments to a dummy argument without the allocatable attribute and then we don't have different interface requirements.
Anyway, why would one want to pass an argument to an allocatable dummy when there will be no change in allocation status, etc.? There are good reasons:
there may be a code path in the procedure which does have an allocation change (controlled by a switch, say);
allocatable dummy arguments also pass bounds;
etc.,
This second one is more obvious if the subroutine had specification
subroutine mysub(n)
integer, intent(in) :: n
integer :: automatic_array(2:n+1)
integer, allocatable :: alloc_array(:)
allocate(alloc_array(2:n+1))
Finally, an automatic object has quite strict conditions on its size. n here is clearly allowed, but things don't have to be much more complicated before allocation is the only plausible way. Depending on how much one wants to play with block constructs.
Taking also a comment from IanH: if we have a very large n the automatic object is likely to lead to crash-and-burn. With the allocatable, one could use the stat= option to come to some amicable agreement with the compiler run-time.

Related

fortran limits of vectorization

I have written a function that returns a vector A equal to the product of a sparse matrix Sparse by another vector F. The non-zero values of the matrix are in Sparse(nnz), rowind(nnz) and colind(nnz) each contain the row and column of each particular value of Sparse. It was relatively simple to vectorize the (now commented) inner loop by the two lines beneath do kx.... I cannot see how to vectorize the outer loop, since pos has different size for different kx.
The question is : can the outer loop (do kx=1,nxy) be vectorized, and if yes how?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Vladimir F correctly surmises that I come from the Python/Octave world. I have moved (back) to fortran to get more performance out of my hardware, as the PDE I solve become larger. As of a half hour ago, vectorization meant to get rid of do loops, something that fortran seems very good at: the time savings involved in replacing the "inner loop" (do ky=1,size(pos)..) by the two lines above is astonishing. I look at the info given by gfortran (really gcc?) when -fopt-info is invoked and see loop modification is often used. I will immediately go and read about SIMD and array notation. Please,please if there are good sources on this topic please let me know.
In reply to Holz, there are myriad ways to store sparse matrices, usually resulting in lowering the rank of the operator by 1: The example I cooked up involves forcing and solution vectors that are evaluated at each position in some field,and therefore have rank 1. The operator that relates then (S, as in A= S . F) is two dimensional BUT sparse. It is stored in such a way that only nonzero values are kept. If there are nnz non-zero values in S, then Sp, the sparse equivalent to S, is Sp(1:nnz). If pos represents the location within that sequence of some number Sp(pos), then the column and row position in the original matrix S is given by colind(pos) and rowind(pos).
With that background, I might enlarge the question to: What is the very best (measured by execution time) that can be done to accomplish the multiplication?
pure function SparseMul(Sparse,F) result(A)
implicit none
integer (kind=4),allocatable :: pos(:)
integer (kind=4) :: kx,ky ! gp counters
real (kind=8),intent(in) :: Sparse(:),F(:)
real (kind=8),allocatable :: A(:)
allocate(A(nxy))
do kx=1,nxy !for each row
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=sum(Sparse(pos)*F(colind(pos)))
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$ A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
end do
end function SparseMul
I assume the question "as is", i.e.:
we do not want to change the matrix storage format
we do not want to use an external library to perform the task
Otherwise, I think that using an external library should
be the best way to approach the problem, e.g.
https://software.intel.com/en-us/node/520797.
It is not easy to predict the "best" Fortran way to write the
multiplication. It depends on several factors (compiler, architecture,
matrix size,...). I think that the best strategy is to propose
some (reasonable) attempts and test them in a realistic configuration.
If I correctly understand the matrix storage format, my attempts -- including those reported in the question -- are provided
below:
save non-zero positions using pack
do kx=1,nxy
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=0
do ky=1,size(pos)
A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
end do
end do
as the previous one but using Fortran array syntax
do kx=1,nxy
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=sum(Sparse(pos)*F(colind(pos)))
end do
use a conditional to determine the components to be used
do kx=1,nxy
A(kx)=0
do ky=1,nnz
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
as the previous one but interchanging loops
A(:)=0
do ky=1,nnz
do kx=1,nxy
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
use the intrisic sum with the mask argument
do kx=1,nxy
A(kx)=sum(Sparse*F(colind), mask=(rowind==kx))
enddo
as the previous one but using an implied do-loop
A =[(sum(Sparse*F(colind), mask=(rowind==kx)), kx=1,nxy)]
These are the results using a 1000x1000 matrix with
33% non-zero values. The machine is an Intel Xeon
and my tests were performed using Intel v17 and GNU 6.1 compiler
using no optimization, high optimization but
without vectorization, and high optimization.
V1 V2 V3 V4 V5 V6
-O0
ifort 4.28 4.26 0.97 0.91 1.33 2.70
gfortran 2.10 2.10 1.10 1.05 0.30 0.61
-O3 -no-vec
ifort 0.94 0.91 0.23 0.22 0.23 0.52
gfortran 1.73 1.80 0.16 0.15 0.16 0.32
-O3
ifort 0.59 0.56 0.23 0.23 0.30 0.60
gfortran 1.52 1.50 0.16 0.15 0.16 0.32
A few short comments on the results:
Versions 3-4-5 are usually the fastest ones
The role of compiler optimizations is crucial for any version
The vectorization seems to play an important role only for
the non-optimal versions
Version 4 is the best for both compilers
gfortran V4 is the "best" version
Elegance does not always mean good performance (V6 is not very good)
Additional comments can be done analyzing the reports of the
compiler optimizations.
If we have a multi-core machine, we can try to use
all the cores. This implies dealing with the code parallelization, which
is a wide issue, but just to give some hints let us test
two possible OpenMP parallelizations. We work on the
serial fastest version (even though there is no guarantee it
is also the best version to be parallelized).
OpenMP 1.
!$omp parallel
!$omp workshare
A(:)=0
!$omp end workshare
!$omp do
do ky=1,nnz
do kx=1,nxy !for each row
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
!$omp end do
!$omp end parallel
</pre>
OpenMP 2. add firstprivate to read-only vectors to improve memory access
!$omp parallel firstprivate(Sparse, colind, rowind)
...
!$omp end parallel
These are the results for up to 16 threads on 16 cores:
#threads 1 2 4 8 16
OpenMP v1
ifort 0.22 0.14 0.088 0.050 0.027
gfortran 0.155 0.11 0.064 0.035 0.020
OpenMP v2
ifort 0.24 0.12 0.065 0.042 0.029
gfortran 0.157 0.11 0.052 0.036 0.029
The scalability (around 8 at 16 threads) is reasonable considering
that it is a memory-bound computation. The firstprivate optimization
has advantages only for a small number of threads. gfortran using
16 threads is the "best" OpenMP solution.
I am having a hard time seeing where COLIND is and what it is doing... And also KX and KY. For the inner loop you want that vectorized, and that seems easiest for me using OpenMP SIMD REDUCTION. I am specifically looking here:
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$ A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
If you have to gather (PACK) then it may not help much. If there are more than 7/8 of zeros in F then F is likely better to PACK. Otherwise it may be better to vector multiply everything (including the zero-sums).
The main rule is that the data needs to be contiguous, so you cannot vectorize across the second dimension... If feels like Sparse and F are rank=2, but they are shown as being RANK=1. That works fine for going through as a vector, even if they are really a rank=2 array. UNION/MAP can also be used to implement a 2D array as also being a 1D vector.
Are Sparse and F really rank=1? and what are nax, nay, nxy and colind used for? And many of those are not defined (e.g. nay , nnz and colind )

passing a noncontiguous array section in Fortran

I am using intel fortran compiler and intel mkl for a performance check. I am passing some array sections to Fortran 77 interface with calls like
call dgemm( transa,transb,sz_s,P,P,&
a, Ts_tilde,&
sz_s,R_alpha,P,b,tr(:sz_s,:),sz_s)
as evident, tr(:sz_s,:) is not contiguous in memory and the Fortran 77 interface is expecting a continuous block and creating a temporary for this.
What I was wondering is that will there be a difference if I create my temporary array explicitly in the code for tr and copy information from that temporary back and forth before and after the operation, or will that be the same as compiler itself creating the temporary from a performance point of view? I guess compiler will always be more efficient.
And of course any more suggestions to eliminate these temporaries are welcome.
One more point, If I use the Fortran 95 interface of the library apparently, with a similar call on a simpler test problem, no warning is issued for the creation of a temporary. Then I read in the manual of mkl that Fortran 95 interface uses assumed shape arrays which explains why temporaries are not created.
However at that point, I can not seem to use some support functions like timing routines.
Namely, intel mkl has some timing support functions but if I use them with the mkl_service routine like below then I get 'This name does not have a type, and must have an explicit type' error for dsecnd. Any idea for this problem is also welcome. A simple example for this is given as
program dgemm95_test
! some modules for Fortran 95 interface
use mkl_service
use mkl95_precision
use mkl95_blas
!
implicit none
!
double precision, dimension(4,3) :: a
double precision, dimension(6,4) :: b
double precision, dimension(5,5) :: r ! result array
double precision, dimension(3,2) :: dummy_b
!
character(len=1) :: transa
character(len=1) :: transb
!
double precision :: alpha, beta, t1, t2, t
integer :: sz1, sz2
! initialize some variables
alpha = 1.0
beta = 0.0
a = 2.3
b = 4.5
r = 0.0
transa = 'n'
transb = 'n'
dummy_b = 0.0
! Fortran 95 interface
t1 = dsecnd()
call gemm( a, b(4:6,1:3:2), r(2:5,3:4),&
transa, transb, alpha, beta )
t2 = dsecnd()
!
write(*,*) r
dummy_b = r(2:4,4:5)
!
end program dgemm95_test
The temporary is absolutely necessary when passing your array section to an assumed size array dummy argument, which the old routines use, because the array section is not contiguous in memory.
You can of course make your own temporary arrays. Whether it will be faster or not depends on many factors. Among others the important thing is whether the temporary is allocated on the stack or on the heap. The Intel Fortran compiler is capable of both, there are compiler switches to control the behavior (-heap-arrays n) and it can depend on the array size. Stack allocation is much faster and it is usually the default. Automatic arrays, which you might use for your own temporary are allocated on the stack by default too. Be careful with large arrays on the stack, you can easily overflow it and cause a crash.
I would suggest you to make a performance test and use the simpler variant if it is not too slow. Probably it will be the Fortran 95 interface, but you should measure the times, really.
As for the timing, MKL manual page for second()/dsecnd() states you must includemkl_lapack.fi and doesn't speak about any Fortran95 interface. You could get away declaring it external double precision too, but I would use the include. Or use system_clock() as a portable standard Fortran 95.

Program crash for array copy with ifort

This program crashes with Illegal instruction: 4 on MacOSX Lion and ifort (IFORT) 12.1.0 20111011
program foo
real, pointer :: a(:,:), b(:,:)
allocate(a(5400, 5400))
allocate(b(5400, 3600))
a=1.0
b(:, 1:3600) = a(:, 1:3600)
print *, a
print *, b
deallocate(a)
deallocate(b)
end program
The same program works with gfortran. I don't see any problem. Any ideas ? Unrolling the copy and performing the explicit loop over the columns works in both compilers.
Note that with allocatable instead of pointer I have no problems.
The behavior is the same if the statement is either inside a module or not.
I confirm the same behavior on ifort (IFORT) 12.1.3 20120130.
Apparently, no problem occurs with Linux and ifort 12.1.5
I tried to increase the stack size with the following linking options
ifort -Wl,-stack_size,0x40000000,-stack_addr,0xf0000000 test.f90
but I still get the same error. Increasing ulimit -s to hard same problem.
Edit 2: I did some more debugging and apparently the problem happens when the array splicing operation
b(:, 1:3600) = a(:, 1:3600)
involves a value suspiciously close to 16 M of data.
I am comparing the opcodes produced, but if there is a way to see an intermediate code form that is more communicative, I'd gladly appreciate it.
Your program is correct (though I would prefer allocatable to pointer if you do not need to be able to repoint it). The problem is that ifort by default places all array temporaries on the stack, no matter how large they are. And it seems to need an array temporary for the copy operation you are doing here. To work around ifort's stupid default behavior, always use the -heap-arrays flag when compiling. I.e.
ifort -o test test.f90 -heap-arrays 1600
The number behind -heap-arrays is the threshold where it should begin using the heap. For sizes below this, the stack is used. I chose a pretty low number here - you can probably safely use higher ones. In theory stack arrays are faster, but the difference is usually totally negligible. I wish intel would fix this behavior. Every other compiler has sensible defaults for this setting.
Use "allocatable" instead of "pointer".
real, allocatable :: a(:,:), b(:,:)
Assigning a floating point number to a pointer looks dubious to me.

High Performance Fortran (HPF) without directives?

In High Performance Fortran (HPF), I could specify the distribution of arrays involved in a parallel calculation using the DISTRIBUTE directive. For example, the following minimal subroutine will sum two arrays in parallel:
subroutine mysum(x,y,z)
integer, intent(in) :: y(10000), z(10000)
integer, intent(out) :: x(10000),
!HPF$ DISTRIBUTE x(BLOCK), y(BLOCK), z(BLOCK)
x = y + z
end subroutine mysum
My question is, is the DISTRIBUTE directive necessary? I know in practise this is of little interest, but I'm curious as to whether an unadorned, directive-free, Fortran program could also be a valid HPF program?
I do not believe DISTRIBUTE statement is necessary, and I never used it.
You can achieve this implicitly by using FORALL statements instead of DO loops where applicable. Originally, DO loops would give explicit order of operation on array elements, whereas FORALL would allow the processor to determine an optimal order at runtime. I do not think this makes much difference nowadays, because modern compilers are able to optimize/vectorize/parallelize DO loops where possible. I cannot tell for sure for other compilers, but I remember using Intel Fortran Compiler to compile and run a program on 2 and 4 processors in parallel without using DISTRIBUTE.
However, depending on the processor architecture and compiler, it is best to try out what you have and see what gives you optimal results or efficiency.

Stack overflow in Fortran 90

I have written a fairly large program in Fortran 90. It has been working beautifully for quite a while, but today I tried to step it up a notch and increase the problem size (it is a research non-standard FE-solver, if that helps anyone...) Now I get the "stack overflow" error message and naturally the program terminates without giving me anything useful to work with.
The program starts with setting up all relevant arrays and matrices, and after that is done it prints a few lines of stats regarding this to a log-file. Even with my new, larger problem, this works fine (albeit a little slow), but then it fails as the "number crunching" gets going.
What confuses me is that everything at that point is already allocated (and that worked without errors). I'm not entirely sure what the stack is (Wikipedia and several treads here didn't do much since I have only a quite basic knowledge of the "behind the scenes" workings of a computer).
Assume that I for instance have some arrays initialized as:
INTEGER,DIMENSION(64) :: IA
REAL(8),DIMENSION(:,:),ALLOCATABLE :: AA, BB
which after some initialization routines (i.e. read input from file and such) are allocated as (I store some size-integers for easier passing to subroutines in IA of fixed size):
ALLOCATE( AA(N1,N2) , BB(N1,N2) )
IA(1) = N1
IA(2) = N2
This is basically what happens in the initial portion, and so far so good. But when I then call a subroutine
CALL ROUTINE_ONE(AA,BB,IA)
And the routine looks like (nothing fancy):
SUBROUTINE ROUTINE_ONE(AA,BB,IA)
IMPLICIT NONE
INTEGER,DIMENSION(64) :: IA
REAL(8),DIMENSION(IA(1),IA(2)) :: AA, BB
...
do lots of other stuff
...
END SUBROUTINE ROUTINE_ONE
Now I get an error! The output to the screen says:
forrtl: severe (170): Program Exception - stack overflow
However, when I run the program with the debugger it breaks at line 419 in a file called winsig.c (not my file, but probably part of the compiler?). It seems to be part of a routine called sigreterror: and it is the default case that has been invoked, returning the text Invalid signal or error. There is a comment line attached to this which strangely says /* should never happen, but compiler can't tell */ ...?
So I guess my question is, why does this happen and what is actually happening? I thought that as long as I can allocate all the relevant memory I should be fine? Does the call to the subroutine make copies of the arguments, or just pointers to them? If the answer is copies then I can see where the problem might be, and if so: any ideas on how to get around it?
The problem I try to solve is big, but not insane in any way. Standard FE-solvers can handle bigger problems than my current one. I run the program on a Dell PowerEdge 1850 and the OS is Microsoft Server 2008 R2 Enterprise. According to systeminfo at the cmd prompt I have 8GB of physical memory and almost 16GB virtual. As far as I understand the total of all my arrays and matrices should not add up to more than maybe 100MB - about 5.5M integer(4) and 2.5M real(8) (which according to me should be only about 44MB, but let's be fair and add another 50MB for overhead).
I use the Intel Fortran compiler integrated with Microsoft Visual Studio 2008.
Adding some actual source code to clarify a bit
! Update continuum state
CALL UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,&
bmtrx,detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
is the actual call to the routine. Big arrays are posc, bmtrx and aa - all other are at least an order of magnitude smaller (if not more). posc is INTEGER(4) and bmtrx and aa is REAL(8)
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
IMPLICIT NONE
!I/O
INTEGER(4) :: iTask, errmsg
INTEGER(4) :: iArray(64)
INTEGER(4),DIMENSION(iArray(15),iArray(15),iArray(5)) :: posc
INTEGER(4),DIMENSION(iArray(22),iArray(21)+1) :: nodedof
INTEGER(4),DIMENSION(iArray(29),iArray(3)+2) :: elm
REAL(8),DIMENSION(iArray(14)) :: dof, dof_k
REAL(8),DIMENSION(iArray(12)*iArray(17),iArray(15)*iArray(5)) :: bmtrx
REAL(8),DIMENSION(iArray(5)*iArray(17)) :: detjac
REAL(8),DIMENSION(iArray(17)) :: w
REAL(8),DIMENSION(iArray(23),iArray(19)) :: mtrlprops
REAL(8),DIMENSION(iArray(8),iArray(8),iArray(23)) :: demtrx
REAL(8) :: dt
REAL(8),DIMENSION(2,iArray(12)*iArray(17)*iArray(5)) :: stress
REAL(8),DIMENSION(iArray(12)*iArray(17)*iArray(5)) :: strain
REAL(8),DIMENSION(2,iArray(17)*iArray(5)) :: effstrain, effstress
REAL(8),DIMENSION(iArray(25)) :: aa
REAL(8),DIMENSION(iArray(14)) :: fi
!Locals
INTEGER(4) :: i, e, mtrl, i1, i2, j1, j2, k1, k2, dim, planetype, elmnodes, &
Nec, elmpnodes, Ndisp, Nstr, Ncomp, Ngpt, Ndofelm
INTEGER(4),DIMENSION(iArray(15)) :: doflist
REAL(8),DIMENSION(iArray(12)*iArray(17),iArray(15)) :: belm
REAL(8),DIMENSION(iArray(17)) :: jelm
REAL(8),DIMENSION(iArray(12)*iArray(17)*iArray(5)) :: dstrain
REAL(8),DIMENSION(iArray(12)*iArray(17)) :: s
REAL(8),DIMENSION(iArray(17)) :: ep, es, dep
REAL(8),DIMENSION(iArray(15),iArray(15)) :: kelm
REAL(8),DIMENSION(iArray(15)) :: felm
dim = iArray(1)
...
And it fails before the last line above.
As per steabert's request, I'll just summarize the conversation in the comments here where it's a bit more visible, even though M.S.B.'s answer already gets right to the nub of the problem.
In technical programming, where procedures often have large local arrays for intermediate computation, this happens a lot. Local variables are generally stored on the stack, which typically (and quite reasonably) a small fraction of overall system memory -- usually of order 10MB or so. When the local variable sizes exceed the stack size, you see exactly the symptoms described here -- a stack overflow occuring after a call to the relevant subroutine but before its first executable statement.
So when this problem happens, the best thing to do is to find the relevant large local variables, and decide what to do. In this case, at least the variables belm and dstrain were getting quite sizable.
Once the variables are located, and you've confirmed that's the problem, there's a few options. As MSB points out, if you can make your arrays smaller, that's one option. Alternatively, you can make the stack size larger; under linux, that's done with ulimit -s [newsize]. That really just postpones the problem, though, and you have to do something different on windows machines.
The other class of ways to avoid this problem is not to put the large data on the stack, but in the rest of memory (the "heap"). You can do that by giving the arrays the save attribute (in C, static); this puts the variable on the heap and thus makes the values persistent between calls. The downside there is that this potentially changes the behavior of the subroutine, and means the subroutine can't be used recursively, and similarly is non-threadsafe (if you're ever in a position where multiple threads will enter the routine simulatneously, they'll each see the same copy of the local varaiable and potentially overwrite each other's results). The upside is that it's easy and very portable -- it should work everywhere. However, this will only work with fixed-size local variables; if the temporary arrays have sizes that depend on the inputs, you can't do this (since there'd no longer be a single variable to save; it could be different size every time the procedure is called).
There are compiler-specific options which put all arrays (or all arrays of larger than some given size) on the heap rather than on the stack; every Fortran compiler I know has an option for this. For ifort, used in the OPs post, it's -heap-arrays in linux, or /heap-arrays for windows. For gfortran, this may actually be the default. This is good for making sure you know what's going on, but it means you have to have different incantations for every compiler to make sure your code works.
Finally, you can make the offending arrays allocatable. Allocated memory goes on the heap; but the variable which points to them is on the stack, so you get the benefits of both approaches. Also, this is completely standard fortran and so totally portable. The downside is that it requires code changes. Also, the allocation process can take nontrivial amounts of time; so if you're going to be calling the routine zillions of times, you may notice this slows things down slightly. (This possible performance regression is easy to fix, though; if you'll be calling it zillions of times with the same size arrays, you can have an optional argument to pass in a pre-allocated local array and use that instead, so that you only allocate/deallocate once).
Allocating/deallocating each time would look like:
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
IMPLICIT NONE
!...arguments....
!Locals
!...
REAL(8),DIMENSION(:,:), allocatable :: belm
REAL(8),DIMENSION(:), allocatable :: dstrain
allocate(belm(iArray(12)*iArray(17),iArray(15))
allocate(dstrain(iArray(12)*iArray(17)*iArray(5))
!... work
deallocate(belm)
deallocate(dstrain)
Note that if the subroutine does a lot of work (eg, takes seconds to execute), the overhead from a couple allocate/deallocates should be negligable. If not, and you want to avoid the overhead, using the optional arguments for preallocated worskpace would look something like:
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg,workbelm,workdstrain)
IMPLICIT NONE
!...arguments....
real(8),dimension(:,:), optional, target :: workbelm
real(8),dimension(:), optional, target :: workdstrain
!Locals
!...
REAL(8),DIMENSION(:,:), pointer :: belm
REAL(8),DIMENSION(:), pointer :: dstrain
if (present(workbelm)) then
belm => workbelm
else
allocate(belm(iArray(12)*iArray(17),iArray(15))
endif
if (present(workdstrain)) then
dstrain => workdstrain
else
allocate(dstrain(iArray(12)*iArray(17)*iArray(5))
endif
!... work
if (.not.(present(workbelm))) deallocate(belm)
if (.not.(present(workdstrain))) deallocate(dstrain)
Not all of the memory is created when the program starts. When you call the subroutine the executable is creating the memory that the subroutine needs for local variables. Typically arrays with simple declarations that are local to that subroutine -- neither allocatable, nor pointer -- are allocated on the stack. You could have simply run of of stack space when you reached these declarations. You might have reached a 2GB limit on a 32-bit OS with some array. Sometimes executable statements implicitly create a temporary array on the stack.
Possible solutions: 1) make your arrays smaller (not attractive), 2) make the stack larger), 3) some compilers have options to switch from placing arrays on the stack to dynamically allocating them, similar to the method used for "allocate", 4) identify large arrays and make them allocatable.
The stack is the memory area where the information needed to return from a function, and the information locally defined in a function is stored. So a stack overflow may indicate you have a function that calls another function which in its turn calls another function, etc.
I am not familiar with Fortran (anymore) but another cause might be that those functions declare tons of local variables, or at least variables that need a lot of place.
A last one: the stack is typically rather small, so it's not a priori relevant how much memory the machine has. It should be quite simple to instruct the linker to increase the stack size, at least if you are certain it's just a lack of space, and not a bug in your application.
Edit: do you use recursion in your program? Recursive calls can eat through the stack very quickly.
Edit: have a look at this: (emphasis mine)
On Windows, the stack space to
reserved for the program is set using
the /Fn compiler option, where n is
the number of bytes. Additionally,
the stack reserve size can be
specified through the Visual Studio
IDE which adds the Microsoft Linker
option /STACK: to the linker command
line. To set this, go to Property
Pages>Configuration
Properties>Linker>System>Stack Reserve
Size. There you can specify the stack
size in bytes in either decimal or
C-language notation. If not specified,
the default stack size is 1MB.
The only problem I ran into with a similar test code, is the 2Gb allocation limit for 32-bit compilation. When I exceed it I get an error message on line 419 in winsig.c
Here is the test code
program FortranCon
implicit none
! Variables
INTEGER :: IA(64), S1
REAL(8), DIMENSION(:,:), ALLOCATABLE :: AA, BB
REAL(4) :: S2
INTEGER, PARAMETER :: N = 10960
IA(1)=N
IA(2)=N
ALLOCATE( AA(N,N), BB(N,N) )
AA(1:N,1:N) = 1D0
BB(1:N,1:N) = 2D0
CALL TEST(AA,BB,IA)
S1 = SIZEOF(AA) !Size of each array
S2 = 2*DBLE(S1)/1024/1024 !Total size for 2 arrays in Mb
WRITE (*,100) S2, ' Mb' ! When allocation reached 2Gb then
100 FORMAT (F8.1,A) ! exception occurs in Win32
DEALLOCATE( AA, BB )
end program FortranCon
SUBROUTINE TEST(AA,BB,IA)
IMPLICIT NONE
INTEGER, DIMENSION(64),INTENT(IN) :: IA
REAL(8), DIMENSION(IA(1),IA(2)),INTENT(INOUT) :: AA,BB
... !Do stuff with AA,BB
END SUBROUTINE
When N=10960 it runs ok showing 1832.9 Mb. With N=11960 it crashes. Of course when I compile with x64 it works ok. Each array has 8*N^2 bytes storage. I don't know if it helps but I recommend using the INTENT() keywords for the dummy variables.
Are you using some parallelization? This can be a problem with statically declared arrays. Try all bigger arrays make ALLOCATABLE, otherwise, they will be placed on the stack in autoparallel or OpenMP threads.
For me the issue was the stack reserve size. I went and changed the stack reserved size from 0 to 100000000 and recompiled the code. The code now runs smoothly.