Fortran compiler differences in treating local variables as having the 'save' attribute?

We have this old Fortran script that we're trying to recompile using Intel's Visual Fortran but we get calculation errors and different results than the old compiled version of the code.
We found what we believe to be the problem in the code below (which is loosely from Numerical Recipes).
The problem is that the 'it' variable is reset on each call; it should, however, be preserved between calls.
Our best guess is that an older compiler might have treated 'it' as having the 'save' attribute and therefore stored it between calls.
We may be completely wrong here, and if some Fortran guru can confirm this or has a better explanation, we would love some help!
      subroutine TrapezoidalRule(Func, a, b, s, n)
*
*     This routine performs the trapezoidal rule, see Numerical Recipes
*
      implicit none
      real*8 Func, a, b, s
      Integer*4 n
      external Func
*
      real*8 del, x, sum
      Integer*4 it, tnm, j
*
      if (n .eq. 1) then
*
         s=0.5d0*(b-a)*(Func(a)+Func(b))
         it=1
*
      else
*
         tnm=it
         del=(b-a)/dble(tnm)
         x=a+0.5d0*del
         sum=0.d0
         do 11 j=1,it
*
            sum=sum+Func(x)
            x=x+del
*
   11    continue
*
         s=0.5d0*(s+(b-a)*sum/dble(tnm))
         it=2*it
*
      endif
*
      return
      end

Yes, the explanation is plausible. The code accesses the variable it at
tnm=it
and this value is undefined when it does not have the save attribute.
The old compiler might not have used the stack at all and might have used static storage for all variables. It also might have used the stack, but the value never got overwritten and happened to still sit at the same place. Who knows; we don't have the information to tell.
There are compiler options that force the save attribute onto all variables, for legacy code like this (implicit SAVE for everything was never standard!). For Intel Fortran it is -save on Linux, or /Qsave with Intel Visual Fortran on Windows.
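The cleaner fix is to give 'it' the save attribute explicitly in the routine itself, rather than relying on a compiler switch. A minimal sketch of the change (the rest of the routine stays exactly as posted):

      real*8 del, x, sum
      Integer*4 it, tnm, j
      save it
*
*     'it' now keeps its value between calls, which the n > 1 branch assumes
*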

Related

Difference between variable types for the same computation in Fortran

I am new to Fortran and I was experimenting with integer and double precision variables. I saw that when you divide, for example,
integer :: a = 5
integer :: b = 2
then a/b gives 2.
However, I was wondering: when we use different types, is there a difference in speed? Are they calculated the same way?
For example
double precision :: a = 2.0
integer :: b = 2
1) a**b
2) a**a
3) b**a
Of course the outcome for all of these will be the same, since they are converted to double precision. However, are they calculated the same way? Is there a difference in the speed at which they are calculated?
EDIT: I must admit I did not know that the compiler plays a role. So far I know about 3 compilers: gfortran, nagfor and ifort. Personally I have experience with just gfortran; I tried it and got the same results for all 3 calculations. However, are they calculated the same way?
Normally, when optimizations are enabled, a**2 with a literal 2 will be changed to a*a. It is less likely, but not impossible, for the compiler to do such a thing for a variable integer exponent.
A completely generic exponentiation to a real exponent is implemented using logarithms, i.e. x**y = exp(y*log(x)): you just need the exp() function and then you can raise any number to the power y if you know the logarithm of your number.
You can test the gfortran optimizations online https://godbolt.org/z/MvGEnn
You get a call to __powidf2() in the first case, and calls to pow() in the other cases.
Those are functions from the C runtime library.
double __powidf2 (double a, int b)
https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
double pow(double x, double y);
https://linux.die.net/man/3/pow
The former is a specialized function for raising a double to an integer power and is much faster; the latter is for two doubles.
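A minimal test subroutine for the three cases (the names are illustrative; it is just something to paste into Compiler Explorer and inspect, as in the godbolt links mentioned here):

subroutine powers(a, b, r1, r2, r3)
  implicit none
  double precision, intent(in)  :: a
  integer,          intent(in)  :: b
  double precision, intent(out) :: r1, r2, r3
  r1 = a**b   ! real**integer: typically lowered to __powidf2 (or repeated multiplication)
  r2 = a**a   ! real**real:    typically lowered to a call to pow()
  r3 = b**a   ! integer**real: b is converted to double precision first, then pow()
end subroutine powers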
You can play with the optimization level, and you can also make one of the numbers known.
Like this one, where the optimizer can treat it as a constant even though it is a variable:
https://godbolt.org/z/YT3KP8
However, the compiler will not do that if the value is only known outside the subroutine.
But when you use -fwhole-program, the compiler is actually able to pre-compute the result from the subroutine: https://godbolt.org/z/zs43jv
I hope it illustrates that the problem is actually quite complex and cannot be answered in all generality.

Trying to use netlib code (QUADPACK). What is xerror?

I'm trying to figure out how to use quadpack.
In a single folder, I placed the contents of "qag.f plus dependencies" and the code below as qag_test.f:
(maybe this code itself is not very important; it is in fact just a snippet from the QUADPACK documentation)
      REAL A,ABSERR,B,EPSABS,EPSREL,F,RESULT,WORK
      INTEGER IER,IWORK,KEY,LAST,LENW,LIMIT,NEVAL
      DIMENSION IWORK(100),WORK(400)
      EXTERNAL F
      A = 0.0E0
      B = 1.0E0
      EPSABS = 0.0E0
      EPSREL = 1.0E-3
      KEY = 6
      LIMIT = 100
      LENW = LIMIT*4
      CALL QAG(F,A,B,EPSABS,EPSREL,KEY,RESULT,ABSERR,NEVAL,
     *         IER,LIMIT,LENW,LAST,IWORK,WORK)
C     INCLUDE WRITE STATEMENTS
      STOP
      END
C
      REAL FUNCTION F(X)
      REAL X
      F = 2.0E0/(2.0E0+SIN(31.41592653589793E0*X))
      RETURN
      END
Using gfortran *.f (installed as MinGW 64bit), I got:
C:\Users\username\AppData\Local\Temp\ccIQwFEt.o:qag.f:(.text+0x1e0): undefined reference to `xerror_'
C:\Users\username\AppData\Local\Temp\cc6XR3D0.o:qage.f:(.text+0x83): undefined reference to `r1mach_'
(and a lot more of the same r1mach_ error)
It seems r1mach is a part of BLAS (and I don't know why it's not packaged in here but obtained as "auxiliary"), but what is xerror?
How do I properly compile this snippet in my environment, Win7 64bit (hopefully without Cygwin)?
Your help is very much appreciated.
xerror is an error reporting routine. Looking at the way it is called, it appears to use Hollerith constants (the ones where "foo" is written as 3hfoo).
      if (ier.ne.0) call xerror(26habnormal return from qag ,
     *   26,ier,lvl)
xerror in turn calls xerrwv, passing along the arguments (plus a few more).
This was definitely written before Fortran 77 became widespread.
Your best bet would be to use a compiler which still supports Hollerith constants, pull in all the dependencies (xerrwv has a few more; I don't know why you didn't get them from netlib) and run it through the compiler of your choice. Most compilers, including gfortran, support Hollerith constants; just ignore the warnings :-)
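In practice that can be as simple as compiling everything in one go with gfortran's legacy mode (a hedged example; the exact file list depends on what you actually pulled from netlib):

gfortran -std=legacy *.f -o qag_test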
You will possibly need to modify one routine, that is xerprt. With gfortran, you could write this one as
subroutine xerprt(c, n)
  integer :: n
  character(len=1), dimension(n) :: c
  write (*, '(500A)') c
end subroutine xerprt
and put this one into a separate file so that the compiler doesn't catch the rank violation (I know, I know...)
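Alternatively, if you do not care about pretty error messages at all, you can satisfy the linker with a stub of your own. A sketch, assuming the classic xerror(messg, nmessg, nerr, level) argument order that the QUADPACK call above uses (the Hollerith message is simply ignored here):

      subroutine xerror (messg, nmessg, nerr, level)
c     minimal stand-in, not the SLATEC original: ignore the message
c     text and just report the error number and severity level
      integer messg(*)
      integer nmessg, nerr, level
      write (*, *) 'xerror: nerr =', nerr, '  level =', level
      if (level .ge. 2) stop
      end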

passing a noncontiguous array section in Fortran

I am using the Intel Fortran compiler and Intel MKL for a performance check. I am passing some array sections to the Fortran 77 interface with calls like
call dgemm( transa, transb, sz_s, P, P, &
            a, Ts_tilde, &
            sz_s, R_alpha, P, b, tr(:sz_s,:), sz_s )
As is evident, tr(:sz_s,:) is not contiguous in memory, and the Fortran 77 interface expects a contiguous block, so a temporary is created for it.
What I was wondering is: will there be a difference, from a performance point of view, if I explicitly create my own temporary array for tr in the code and copy the information back and forth before and after the operation, or will that be the same as the compiler creating the temporary itself? I guess the compiler will always be more efficient.
And of course any more suggestions to eliminate these temporaries are welcome.
One more point: if I use the Fortran 95 interface of the library with a similar call on a simpler test problem, no warning is issued about the creation of a temporary. I then read in the MKL manual that the Fortran 95 interface uses assumed-shape arrays, which explains why temporaries are not created.
However, at that point I cannot seem to use some support functions like the timing routines.
Namely, Intel MKL has some timing support functions, but if I use them together with the mkl_service module as below, then I get a 'This name does not have a type, and must have an explicit type' error for dsecnd. Any idea about this problem is also welcome. A simple example is given as
program dgemm95_test
  ! some modules for the Fortran 95 interface
  use mkl_service
  use mkl95_precision
  use mkl95_blas
  !
  implicit none
  !
  double precision, dimension(4,3) :: a
  double precision, dimension(6,4) :: b
  double precision, dimension(5,5) :: r   ! result array
  double precision, dimension(3,2) :: dummy_b
  !
  character(len=1) :: transa
  character(len=1) :: transb
  !
  double precision :: alpha, beta, t1, t2, t
  integer :: sz1, sz2
  ! initialize some variables
  alpha = 1.0
  beta = 0.0
  a = 2.3
  b = 4.5
  r = 0.0
  transa = 'n'
  transb = 'n'
  dummy_b = 0.0
  ! Fortran 95 interface
  t1 = dsecnd()
  call gemm( a, b(4:6,1:3:2), r(2:5,3:4), &
             transa, transb, alpha, beta )
  t2 = dsecnd()
  !
  write(*,*) r
  dummy_b = r(2:4,4:5)
  !
end program dgemm95_test
The temporary is absolutely necessary when passing your array section to an assumed-size array dummy argument, which the old routines use, because the array section is not contiguous in memory.
You can of course make your own temporary arrays. Whether that will be faster or not depends on many factors. Among others, an important one is whether the temporary is allocated on the stack or on the heap. The Intel Fortran compiler is capable of both; there are compiler switches to control the behavior (-heap-arrays n), and it can depend on the array size. Stack allocation is much faster and is usually the default. Automatic arrays, which you might use for your own temporary, are allocated on the stack by default too. Be careful with large arrays on the stack: you can easily overflow it and cause a crash.
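For concreteness, an explicit temporary for the section passed as C could look roughly like this (a sketch built around the call from the question, reusing its names sz_s, P, tr, Ts_tilde, R_alpha, a and b; whether it beats the compiler-generated temporary is exactly what you would have to measure):

double precision, allocatable :: tr_tmp(:,:)

allocate( tr_tmp(sz_s, P) )
tr_tmp = tr(:sz_s,:)                 ! copy in (C is also an input when beta /= 0)
call dgemm( transa, transb, sz_s, P, P, &
            a, Ts_tilde, &
            sz_s, R_alpha, P, b, tr_tmp, sz_s )
tr(:sz_s,:) = tr_tmp                 ! copy the result back
deallocate( tr_tmp )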
I would suggest that you do a performance test and use the simpler variant if it is not too slow. Probably that will be the Fortran 95 interface, but you should really measure the times.
As for the timing, the MKL manual page for second()/dsecnd() states that you must include mkl_lapack.fi and does not mention any Fortran 95 interface. You could also get away with declaring dsecnd as an external double precision function, but I would use the include. Or use system_clock() as portable standard Fortran 95.
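If you only need wall-clock time and want to stay independent of MKL, a minimal standard-Fortran sketch (to be dropped into the test program above, with the extra declarations added) would be:

integer :: count0, count1, count_rate
double precision :: elapsed

call system_clock(count0, count_rate)
call gemm( a, b(4:6,1:3:2), r(2:5,3:4), transa, transb, alpha, beta )
call system_clock(count1)
elapsed = dble(count1 - count0) / dble(count_rate)   ! seconds
write(*,*) 'elapsed seconds: ', elapsed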

Stack overflow in Fortran 90

I have written a fairly large program in Fortran 90. It has been working beautifully for quite a while, but today I tried to step it up a notch and increase the problem size (it is a research non-standard FE-solver, if that helps anyone...) Now I get the "stack overflow" error message and naturally the program terminates without giving me anything useful to work with.
The program starts with setting up all relevant arrays and matrices, and after that is done it prints a few lines of stats regarding this to a log-file. Even with my new, larger problem, this works fine (albeit a little slow), but then it fails as the "number crunching" gets going.
What confuses me is that everything at that point is already allocated (and that worked without errors). I'm not entirely sure what the stack is (Wikipedia and several threads here didn't help much, since I have only a quite basic knowledge of the "behind the scenes" workings of a computer).
Assume that I for instance have some arrays initialized as:
INTEGER,DIMENSION(64) :: IA
REAL(8),DIMENSION(:,:),ALLOCATABLE :: AA, BB
which, after some initialization routines (i.e. reading input from file and such), are allocated as follows (I store some size integers in the fixed-size IA for easier passing to subroutines):
ALLOCATE( AA(N1,N2) , BB(N1,N2) )
IA(1) = N1
IA(2) = N2
This is basically what happens in the initial portion, and so far so good. But when I then call a subroutine
CALL ROUTINE_ONE(AA,BB,IA)
And the routine looks like (nothing fancy):
SUBROUTINE ROUTINE_ONE(AA,BB,IA)
IMPLICIT NONE
INTEGER,DIMENSION(64) :: IA
REAL(8),DIMENSION(IA(1),IA(2)) :: AA, BB
...
do lots of other stuff
...
END SUBROUTINE ROUTINE_ONE
Now I get an error! The output to the screen says:
forrtl: severe (170): Program Exception - stack overflow
However, when I run the program with the debugger it breaks at line 419 in a file called winsig.c (not my file, but probably part of the compiler?). It seems to be part of a routine called sigreterror: and it is the default case that has been invoked, returning the text Invalid signal or error. There is a comment line attached to this which strangely says /* should never happen, but compiler can't tell */ ...?
So I guess my question is, why does this happen and what is actually happening? I thought that as long as I can allocate all the relevant memory I should be fine? Does the call to the subroutine make copies of the arguments, or just pointers to them? If the answer is copies then I can see where the problem might be, and if so: any ideas on how to get around it?
The problem I try to solve is big, but not insane in any way. Standard FE-solvers can handle bigger problems than my current one. I run the program on a Dell PowerEdge 1850 and the OS is Microsoft Server 2008 R2 Enterprise. According to systeminfo at the cmd prompt I have 8GB of physical memory and almost 16GB virtual. As far as I understand the total of all my arrays and matrices should not add up to more than maybe 100MB - about 5.5M integer(4) and 2.5M real(8) (which according to me should be only about 44MB, but let's be fair and add another 50MB for overhead).
I use the Intel Fortran compiler integrated with Microsoft Visual Studio 2008.
Adding some actual source code to clarify a bit
! Update continuum state
CALL UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,&
bmtrx,detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
is the actual call to the routine. The big arrays are posc, bmtrx and aa; all others are at least an order of magnitude smaller (if not more). posc is INTEGER(4), and bmtrx and aa are REAL(8).
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
IMPLICIT NONE
!I/O
INTEGER(4) :: iTask, errmsg
INTEGER(4) :: iArray(64)
INTEGER(4),DIMENSION(iArray(15),iArray(15),iArray(5)) :: posc
INTEGER(4),DIMENSION(iArray(22),iArray(21)+1) :: nodedof
INTEGER(4),DIMENSION(iArray(29),iArray(3)+2) :: elm
REAL(8),DIMENSION(iArray(14)) :: dof, dof_k
REAL(8),DIMENSION(iArray(12)*iArray(17),iArray(15)*iArray(5)) :: bmtrx
REAL(8),DIMENSION(iArray(5)*iArray(17)) :: detjac
REAL(8),DIMENSION(iArray(17)) :: w
REAL(8),DIMENSION(iArray(23),iArray(19)) :: mtrlprops
REAL(8),DIMENSION(iArray(8),iArray(8),iArray(23)) :: demtrx
REAL(8) :: dt
REAL(8),DIMENSION(2,iArray(12)*iArray(17)*iArray(5)) :: stress
REAL(8),DIMENSION(iArray(12)*iArray(17)*iArray(5)) :: strain
REAL(8),DIMENSION(2,iArray(17)*iArray(5)) :: effstrain, effstress
REAL(8),DIMENSION(iArray(25)) :: aa
REAL(8),DIMENSION(iArray(14)) :: fi
!Locals
INTEGER(4) :: i, e, mtrl, i1, i2, j1, j2, k1, k2, dim, planetype, elmnodes, &
Nec, elmpnodes, Ndisp, Nstr, Ncomp, Ngpt, Ndofelm
INTEGER(4),DIMENSION(iArray(15)) :: doflist
REAL(8),DIMENSION(iArray(12)*iArray(17),iArray(15)) :: belm
REAL(8),DIMENSION(iArray(17)) :: jelm
REAL(8),DIMENSION(iArray(12)*iArray(17)*iArray(5)) :: dstrain
REAL(8),DIMENSION(iArray(12)*iArray(17)) :: s
REAL(8),DIMENSION(iArray(17)) :: ep, es, dep
REAL(8),DIMENSION(iArray(15),iArray(15)) :: kelm
REAL(8),DIMENSION(iArray(15)) :: felm
dim = iArray(1)
...
And it fails before the last line above.
As per steabert's request, I'll just summarize the conversation in the comments here where it's a bit more visible, even though M.S.B.'s answer already gets right to the nub of the problem.
In technical programming, where procedures often have large local arrays for intermediate computation, this happens a lot. Local variables are generally stored on the stack, which is typically (and quite reasonably) a small fraction of overall system memory -- usually of order 10MB or so. When the local variable sizes exceed the stack size, you see exactly the symptoms described here -- a stack overflow occurring after a call to the relevant subroutine but before its first executable statement.
So when this problem happens, the best thing to do is to find the relevant large local variables, and decide what to do. In this case, at least the variables belm and dstrain were getting quite sizable.
Once the variables are located, and you've confirmed that's the problem, there's a few options. As MSB points out, if you can make your arrays smaller, that's one option. Alternatively, you can make the stack size larger; under linux, that's done with ulimit -s [newsize]. That really just postpones the problem, though, and you have to do something different on windows machines.
The other class of ways to avoid this problem is not to put the large data on the stack, but elsewhere in memory. You can do that by giving the arrays the save attribute (in C, static); this places the variable in static storage rather than on the stack and thus makes the values persistent between calls. The downside is that this potentially changes the behavior of the subroutine, means the subroutine can't be used recursively, and similarly is not thread-safe (if you're ever in a position where multiple threads enter the routine simultaneously, they'll each see the same copy of the local variable and potentially overwrite each other's results). The upside is that it's easy and very portable -- it should work everywhere. However, this will only work with fixed-size local variables; if the temporary arrays have sizes that depend on the inputs, you can't do this (since there would no longer be a single variable to save; it could be a different size every time the procedure is called).
There are compiler-specific options which put all arrays (or all arrays larger than some given size) on the heap rather than on the stack; every Fortran compiler I know has an option for this. For ifort, used in the OP's post, it's -heap-arrays on Linux, or /heap-arrays on Windows. For gfortran, this may actually be the default. This is good for making sure you know what's going on, but it means you have to have different incantations for every compiler to make sure your code works.
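As a rough illustration (the exact spelling and the unit of the size threshold should be checked against your compiler version's documentation), the ifort variants look something like:

ifort -heap-arrays 10 mycode.f90       (Linux)
ifort /heap-arrays:10 mycode.f90       (Windows, Intel Visual Fortran)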
Finally, you can make the offending arrays allocatable. Allocated memory goes on the heap; but the variable which points to them is on the stack, so you get the benefits of both approaches. Also, this is completely standard fortran and so totally portable. The downside is that it requires code changes. Also, the allocation process can take nontrivial amounts of time; so if you're going to be calling the routine zillions of times, you may notice this slows things down slightly. (This possible performance regression is easy to fix, though; if you'll be calling it zillions of times with the same size arrays, you can have an optional argument to pass in a pre-allocated local array and use that instead, so that you only allocate/deallocate once).
Allocating/deallocating each time would look like:
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
                                detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
                                effstress,aa,fi,errmsg)
  IMPLICIT NONE
  !...arguments....
  !Locals
  !...
  REAL(8),DIMENSION(:,:), allocatable :: belm
  REAL(8),DIMENSION(:), allocatable :: dstrain

  allocate(belm(iArray(12)*iArray(17),iArray(15)))
  allocate(dstrain(iArray(12)*iArray(17)*iArray(5)))
  !... work
  deallocate(belm)
  deallocate(dstrain)
Note that if the subroutine does a lot of work (e.g., takes seconds to execute), the overhead from a couple of allocates/deallocates should be negligible. If not, and you want to avoid the overhead, using optional arguments for preallocated workspace would look something like:
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
                                detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
                                effstress,aa,fi,errmsg,workbelm,workdstrain)
  IMPLICIT NONE
  !...arguments....
  real(8),dimension(:,:), optional, target :: workbelm
  real(8),dimension(:), optional, target :: workdstrain
  !Locals
  !...
  REAL(8),DIMENSION(:,:), pointer :: belm
  REAL(8),DIMENSION(:), pointer :: dstrain

  if (present(workbelm)) then
     belm => workbelm
  else
     allocate(belm(iArray(12)*iArray(17),iArray(15)))
  endif
  if (present(workdstrain)) then
     dstrain => workdstrain
  else
     allocate(dstrain(iArray(12)*iArray(17)*iArray(5)))
  endif
  !... work
  if (.not. present(workbelm)) deallocate(belm)
  if (.not. present(workdstrain)) deallocate(dstrain)
Not all of the memory is created when the program starts. When you call the subroutine, the executable creates the memory that the subroutine needs for its local variables. Typically, arrays with simple declarations that are local to that subroutine -- neither allocatable nor pointer -- are allocated on the stack. You could simply have run out of stack space when you reached those declarations. You might also have hit a 2GB limit on a 32-bit OS with some array. Sometimes executable statements implicitly create a temporary array on the stack.
Possible solutions: 1) make your arrays smaller (not attractive), 2) make the stack larger, 3) some compilers have options to switch from placing arrays on the stack to dynamically allocating them, similar to the method used for "allocate", 4) identify large arrays and make them allocatable.
The stack is the memory area where the information needed to return from a function, and the information defined locally inside a function, is stored. So a stack overflow may indicate that you have a function that calls another function, which in turn calls another function, and so on.
I am not familiar with Fortran (anymore), but another cause might be that those functions declare tons of local variables, or at least variables that need a lot of space.
A last one: the stack is typically rather small, so it's not a priori relevant how much memory the machine has. It should be quite simple to instruct the linker to increase the stack size, at least if you are certain it's just a lack of space, and not a bug in your application.
Edit: do you use recursion in your program? Recursive calls can eat through the stack very quickly.
Edit: have a look at this: (emphasis mine)
On Windows, the stack space to reserve for the program is set using the /Fn compiler option, where n is the number of bytes. Additionally, the stack reserve size can be specified through the Visual Studio IDE, which adds the Microsoft Linker option /STACK: to the linker command line. To set this, go to Property Pages > Configuration Properties > Linker > System > Stack Reserve Size. There you can specify the stack size in bytes in either decimal or C-language notation. If not specified, the default stack size is 1MB.
The only problem I ran into with a similar test code is the 2GB allocation limit for 32-bit compilation. When I exceed it, I get an error message on line 419 in winsig.c.
Here is the test code
program FortranCon
  implicit none
  ! Variables
  INTEGER :: IA(64), S1
  REAL(8), DIMENSION(:,:), ALLOCATABLE :: AA, BB
  REAL(4) :: S2
  INTEGER, PARAMETER :: N = 10960

  IA(1) = N
  IA(2) = N
  ALLOCATE( AA(N,N), BB(N,N) )
  AA(1:N,1:N) = 1D0
  BB(1:N,1:N) = 2D0
  CALL TEST(AA,BB,IA)
  S1 = SIZEOF(AA)             ! size of each array
  S2 = 2*DBLE(S1)/1024/1024   ! total size for the 2 arrays in Mb
  WRITE (*,100) S2, ' Mb'     ! when the allocation reaches 2Gb an
100 FORMAT (F8.1,A)           ! exception occurs in Win32
  DEALLOCATE( AA, BB )
end program FortranCon

SUBROUTINE TEST(AA,BB,IA)
  IMPLICIT NONE
  INTEGER, DIMENSION(64), INTENT(IN) :: IA
  REAL(8), DIMENSION(IA(1),IA(2)), INTENT(INOUT) :: AA, BB
  ... !Do stuff with AA,BB
END SUBROUTINE
When N=10960 it runs fine, showing 1832.9 Mb. With N=11960 it crashes. Of course, when I compile with x64 it works fine. Each array needs 8*N^2 bytes of storage. I don't know if it helps, but I recommend using the INTENT() attributes for the dummy arguments.
Are you using some parallelization? That can be a problem with statically declared arrays. Try making all of the bigger arrays ALLOCATABLE; otherwise they will be placed on the stack in auto-parallelized or OpenMP threads.
For me the issue was the stack reserve size. I went and changed the stack reserve size from 0 to 100000000 and recompiled the code. The code now runs smoothly.

Fortran SAVE statement

I've read about the save statement in Intel's language reference document, but I cannot quite grasp what it does. Could someone explain to me in simple language what it means when the save statement is included in a module?
In principle, when a module goes out of scope, the variables of that module become undefined -- unless they are declared with the SAVE attribute, or a SAVE statement is used. "Undefined" means that you are not allowed to rely on the variable having its previous value if you use the module again -- it might have the previous value when you re-access the module, or it might not -- there is no guarantee. But many compilers don't do this for module variables -- the variables probably retain their values -- it isn't worth the effort for the compiler to figure out whether a module remains in scope or not, and module variables are probably treated as global variables -- but don't rely on that! To be safe, either use "save" or "use" the module from the main program so that it never goes out of scope.
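For a module variable, that amounts to no more than this (a minimal sketch):

module counters
  implicit none
  integer, save :: ncalls = 0   ! keeps its value even if the module goes out of
                                ! scope; the initializer alone already implies
                                ! the save attribute in Fortran 90 and later
end module counters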
"save" is also important in procedures, to store "state" across invocations of the subroutine or function (as written by #ire_and_curses) -- "first invocation" initializations, counters, etc.
subroutine my_sub (y)
  integer :: var
  integer, save :: counter = 0
  logical, save :: FirstCall = .TRUE.

  counter = counter + 1
  write (*, *) counter
  if (FirstCall) then
    FirstCall = .FALSE.
    ....
  end if
  var = ....
  etc.
In this code fragment, "counter" will report the number of invocations of the subroutine. Though actually, in Fortran >= 90 one can omit the "save" because initialization in the declaration implies "save".
In contrast to the module case, with modern compilers and without the save attribute or initialization-on-a-declaration, it is normal for the local variables of procedures to lose their values across invocations. So if you attempt to use "var" on a later call before redefining it in that call, the value is undefined and probably won't be the value calculated on a previous invocation of the procedure.
This is different from the behavior of many FORTRAN 77 compilers, some of which retained the values of all local variables, even though this wasn't required by the language standard. Some old programs were written relying on this non-standard behavior -- these programs will fail on the newer compilers. Many compilers have an option to use the non-standard behavior and "save" all local variables.
LATER EDIT: update with a code example that shows incorrect usage of a local variable that should have the save attribute but doesn't:
module subs
contains

  subroutine asub (i, control)
    implicit none
    integer, intent (in) :: i
    logical, intent (in) :: control
    integer, save :: j = 0
    integer :: k

    j = j + i
    if ( control ) k = 0
    k = k + i
    write (*, *) 'i, j, k=', i, j, k
  end subroutine asub

end module subs

program test_saves
  use subs
  implicit none

  call asub ( 3, .TRUE. )
  call asub ( 4, .FALSE. )
end program test_saves
The local variable k of the subroutine is intentionally misused -- in this program it is initialized in the first call, since control is TRUE, but on the second call control is FALSE, so k is not redefined. Without the save attribute k is then undefined, so using its value is illegal.
Compiling the program with gfortran, I found that k retained its value anyway:
i, j, k= 3 3 3
i, j, k= 4 7 7
Compiling the program with ifort and aggressive optimization options, k lost its value:
i, j, k= 3 3 3
i, j, k= 4 7 4
Using ifort with debugging options, the problem was detected at runtime!
i, j, k= 3 3 3
forrtl: severe (193): Run-Time Check Failure. The variable 'subs_mp_asub_$K' is being used without being defined
Normally, local variables go out of scope once execution leaves the current procedure, and so have no 'memory' of their value on previous invocations. SAVE is a way of specifying that a variable in a procedure should maintain its value from one call to the next. It's useful when you want to store state in a procedure, for example to keep a running total or maintain a variable's configuration.
There's a good explanation here, with an example.
A short explanation could be: the save attribute says that the value of a variable must be preserved across different calls to the same subroutine/function. Otherwise, normally, when you return from a subroutine/function, "local" variables lose their values, since the memory where those variables were stored is released. It is like static in C, if you know that language.