Fortran runtime error "fixed" by writing output

I'm having trouble with some old research code that I would like to compile using the Intel Fortran compiler. In one particular subroutine, I get segmentation faults unless I add a write statement that just outputs the value of the loop index.
do j=1,ne
   ! SOME STUFF
   write(*,*) 'j=', j
end do
What could be causing my error such that this write statement would fix my segmentation fault? (Note: j is declared as an integer)
thanks,
keely

Classic causes of this type of error, which is 'fixed' by inserting write statements:
walking off the end of an array -- use your compiler to switch on bounds-checking and debugging options to check for this;
disagreement between the arguments provided to a sub-program and the arguments it expects. Again, use your compiler if possible, your eyes otherwise.
Odds are 5-to-1 that one of these is the cause. The write statement does not repair anything; it only changes the memory layout or the optimization of the loop, so the invalid access happens to land somewhere harmless.
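As a rough illustration of the two checks suggested above, here is a minimal sketch. The module name work_mod, the routine fill, and the file name demo.f90 are made up for this example; only ne is borrowed from the question. Putting the routine in a module gives it an explicit interface, so the compiler can compare the actual arguments against the dummy arguments, and the flags in the leading comment switch on run-time bounds checking.

```fortran
! Build with run-time checks enabled, for example:
!   gfortran -g -fcheck=all -fbacktrace demo.f90
!   ifort    -g -check all  -traceback  demo.f90
! (demo.f90 is a hypothetical file name.)
module work_mod
  implicit none
contains
  ! Hypothetical stand-in for the legacy subroutine.
  subroutine fill(a, n)
    integer, intent(in)  :: n
    real,    intent(out) :: a(n)
    integer :: j
    do j = 1, n          ! stays inside the declared bounds 1..n
      a(j) = real(j)
    end do
  end subroutine fill
end module work_mod

program demo
  use work_mod, only: fill   ! explicit interface: arguments are checked
  implicit none
  integer, parameter :: ne = 10
  real :: a(ne)
  call fill(a, ne)
  print *, 'sum =', sum(a)
end program demo
```

If the loop bound in fill were changed to run past n, a build with the checking flags should stop with an out-of-bounds message at that line rather than corrupting memory silently.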

Segmentation fault for array, but only if a component of a derived type

Pretty simple setup, using gfortran 4.8.5 on Linux (Red Hat):
I get a segfault if my array of reals (inside a derived type) has size > 2,000,000. This seems to be a standard stack/heap issue, as my stack size is 8 MB if I check with ulimit.
There is no problem if the array is NOT inside a derived type.
Note that, as @francescalus guesses, removing the initial value = 0.0 eliminates the problem.
Edit to add: Note that I have posted a followup question, "Segmentation fault related to component of derived type", that represents a more realistic use case and further narrows down the conditions under which this seems to occur.
program main
   call sub1 ! seg fault if col size > 2,100,000
   call sub2 ! works fine at col size = 100,000,000
end program main

subroutine sub1
   type table
      real :: col(2100000) = 0.0 ! works if "= 0.0" removed
   end type table
   type(table) :: table1
   table1%col = 1.0
end subroutine sub1

subroutine sub2
   real :: col(100000000) = 0.0
   col = 1.0
end subroutine sub2
Some obvious questions here:
Is this expected behavior, or some bug that was fixed in newer versions of gfortran?
Am I following standard Fortran practice here, or doing something wrong?
What is the recommended way to avoid this (please assume that I am unable to update to a newer version of gfortran in the near term)? I will almost certainly solve with an allocatable array component for reasons not specific to this question, but that might not be an ideal general solution and I would like to know of all good options I have here.
In particular, is initializing the components of a derived type bad practice?

This is likely to be a runtime issue due to insufficient stack, rather than a bug with gfortran.
Gfortran uses the stack to store automatic arrays and other initialization data. When code does not create problems when one such array is small, but segfaults when the size of the array increases, a possible reason is running out of stack.
The issue seems to be the same in more recent versions of gfortran. I compiled and ran your program with gfortran 4.8.4, 4.9.3, 5.5.0, 6.4.0, 7.3.0 and 8.2.0. In all cases I obtained a segmentation fault with the default stack size, but no error when the stack size was slightly increased.
$ ./sfa
Segmentation fault
$ ulimit -s
8192
$ ulimit -s 8256
$ ./sfa && echo "DONE"
DONE
Your problem may be solved by running
$ ulimit -s unlimited
before executing your binary. I am not aware of any particular penalty for doing this, but programmers more aware of the fine details of memory management, such as compiler developers, may think otherwise.
Initializing the components of a derived type is not bad practice, but as you can see, it can create problems with the stack if the component is a big array, whether because of the storage of the component itself or because of the temporary storage needed for the RHS of the assignment. If the component is made allocatable and allocated in a subroutine, the array is stored on the heap rather than on the stack, and this issue is usually avoided. In this case, that means setting the values of the array dynamically in a subroutine rather than at compile time. It may be less elegant, but I think it's worth it, since it prevents avoidable, environment-related errors when executing the binary.
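A minimal sketch of the allocatable approach, assuming the type from the question is otherwise unchanged. The routine name init_table and its arguments are made up for illustration, and the module is called table_mod only for convenience.

```fortran
module table_mod
  implicit none
  type :: table
    real, allocatable :: col(:)   ! allocated at run time, so it lives on the heap
  end type table
contains
  ! Hypothetical helper: allocate the component and give it its initial value.
  subroutine init_table(t, n)
    type(table), intent(out) :: t
    integer,     intent(in)  :: n
    allocate(t%col(n))
    t%col = 0.0                   ! replaces the compile-time "= 0.0"
  end subroutine init_table
end module table_mod

program main
  use table_mod
  implicit none
  type(table) :: table1
  call init_table(table1, 2100000)
  table1%col = 1.0
  print *, size(table1%col), table1%col(1)
end program main
```

The initial value is now assigned at run time inside init_table, so no large initialization data has to be placed on the stack.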
Your code above is standards compliant. As explained in the comments, lack of explicit interfaces for subroutines is not good practice, but for these simple subroutines it's not against the rules.
Some compilers have flags that allow you to change where some objects are allocated in memory. While such a flag may fix a particular issue, flags are compiler dependent and usually not equivalent across compilers. In my experience, using dynamic memory via allocatables is a more robust solution.
Finally, note that, if you are using OpenMP, the ulimit command above only affects the master thread - you need to set the stack size of each of the other threads via the environment variable OMP_STACKSIZE, which cannot be unlimited. And bear in mind that non-master threads running out of stack are a problem much more difficult to diagnose, since the binary may stop without a proper Segmentation fault error.

These are not necessarily useful solutions, but below are some conditions under which the seg fault disappears. A couple of people mentioned the lack of an explicit interface (as bad practice though not technically incorrect), and it seems that this might be one key here as either of these two changes to the code gets rid of the seg fault, although it's not quite that simple, as I'll explain:
1. Put everything in main, with no subroutine calls
2. Put the type definition table in a module
Let me expand on #2 briefly. Simply taking the example in the OP and then giving it an explicit interface by putting the subroutine in a module does NOT work. However, if I put the type definition in a module and then use it (as shown below) the segfault does not occur:
program main
use table_mod
type(table) :: table1
table1%col = 1.0
end program main

How to loop over an array only with one index?

I have the code kind of like this in F90:
real(8), dimension(10,10,10) :: A
do i = 1, 1000
print*,A(i,1,1)
enddo
I'm very surprised this worked and it's faster than simply looping over 3 dimensions by i,j,k.
Can someone please explain why this works?

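Your code is illegal: the first subscript runs up to 1000, but that dimension only has extent 10. Under the hood, the memory layout of the array happens to be in column-major order, the same order the triple loop k,j,i (with i innermost) visits, so the code appears to work. But it is illegal.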
If you enable runtime error checks in your compiler (see the manual), it will find the error and report it.
It may be slightly faster, if you do not enable compiler optimizations, because there is some overhead in nested loops, but an optimizing compiler will optimize the code to one loop.
If you actually did (you should always show your code!!!)
do i=1,10
   do j=1,10
      do k=1,10
         something with A(i,j,k)
      end do
   end do
end do
then please note that that is the wrong order and you should loop in the k,j,i order.
Also please note that measuring the speed of printing to the screen is not useful and can be very tricky. Some mathematical operations are more useful.
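For what it's worth, here is a small sketch of a legal way to get the single-index view. It is not the asker's code: the program name flat_loop and the routine check_flat are made up. It fills the array in the k,j,i order recommended above, then passes it to a routine whose dummy argument is a rank-1 explicit-shape array, which the standard permits through sequence association.

```fortran
program flat_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: a(10,10,10)
  integer :: i, j, k, n

  ! Fill in storage (column-major) order: first index varies fastest.
  n = 0
  do k = 1, 10            ! slowest-varying index outermost
    do j = 1, 10
      do i = 1, 10        ! fastest-varying index innermost
        n = n + 1
        a(i,j,k) = real(n, dp)
      end do
    end do
  end do

  ! Legal one-index traversal: the rank-3 array is associated with a
  ! rank-1 explicit-shape dummy argument of the same total size.
  call check_flat(a, size(a))

contains

  subroutine check_flat(v, m)
    integer,  intent(in) :: m
    real(dp), intent(in) :: v(m)
    integer :: p
    do p = 1, m
      if (v(p) /= real(p, dp)) then
        print *, 'mismatch at position', p
        return
      end if
    end do
    print *, 'all', m, 'elements visited in storage order'
  end subroutine check_flat

end program flat_loop
```

Because v is declared with all 1000 elements, a build with bounds checking enabled has nothing to complain about, unlike A(i,1,1) with i > 10.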

What do I do about a FORTRAN intrinsic that was not part of the standard?

I'm trying to get a legacy FORTRAN code working by building it from source using gfortran. I have finally been able to build it successfully, but now I'm getting an out-of-bounds error when it runs. I used gdb and traced the error to a function that uses the loc() intrinsic. When I try to print the value of loc(ae), with ae being my integer value being passed, I get the error "No symbol "loc" in current context." I tried compiling with ifort 11.x and debugged with DDT and got the same error. To me, this means that the compiler knows nothing of the intrinsic.
A little reading revealed that the loc intrinsic wasn't part of the F77 standard, so maybe that's part of the problem. I posted the definition of the intrinsic below, but I don't know how I can implement that into my code so loc() can be used.
Any advice or am I misinterpreting my problem? Because both gfortran and ifort crash in the same place due to an out of bounds error, but the function utilizing loc() returns the same large number between both compilers. It seems a bit strange that loc() wouldn't be working if both compilers shoot back the same value for loc.
Usage:
iaddr = loc(obj)
Where:
obj
is a variable, array, function or subroutine whose address is wanted.
iaddr
is an integer with the address of "obj". The address is in the same
format as stored by an LARn
instruction.
Description:
LOC is used to obtain the address of
something. The value returned is not
really useful within Fortran, but may
be needed for GMAP subroutines, or
very special debugging.

Well, no, the fact that it compiles means that loc is known by the compiler; the fact that gdb doesn't know about it just means it's not known by the debugger (which probably doesn't know the matmul intrinsic, either).
loc is a widely-available non-standard extension. I hate those. If you want something standard that should work everywhere, c_loc, which is part of the C<->Fortran interoperability standard in Fortran 2003, is something you could use. It returns a C pointer (type(c_ptr)) that can be passed to C routines.
How is the value from the loc call being used?
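A minimal sketch of the c_loc route, under some assumptions: the variable is called ae as in the question but is given a made-up array shape and the target attribute (which c_loc requires), and the address is turned into an integer with transfer, which is a common idiom rather than something the question's code does.

```fortran
program addr_demo
  use, intrinsic :: iso_c_binding, only: c_loc, c_ptr, c_int, &
                                         c_intptr_t, c_associated
  implicit none
  integer(c_int), target :: ae(4)     ! hypothetical shape; target is required by c_loc
  type(c_ptr)            :: p_whole, p_first
  integer(c_intptr_t)    :: addr

  ae = 0
  p_whole = c_loc(ae)                 ! address of the array
  p_first = c_loc(ae(1))              ! address of its first element

  ! Reinterpret the C pointer as an integer wide enough to hold an address.
  addr = transfer(p_first, addr)
  print *, 'address as integer:', addr

  ! c_associated with two arguments compares the two C addresses.
  print *, 'array and first element share an address:', &
           c_associated(p_whole, p_first)
end program addr_demo
```

The printed address will differ from run to run and machine to machine; the point is only that the value comes from a standard intrinsic rather than a vendor extension.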

Gfortran's loc seems to treat arrays a bit differently from some other compilers. If you are using it to, e.g., check for array copies, it can be better to take loc of the first element, loc(obj(1,1)) or similar. I think this is equivalent to what loc does with Intel, but in gfortran loc of the whole array gives some other address instead (so two arrays which share exactly the same memory layout have different loc results).

Fortran 77 handling C++ memory allocations

I'm trying to write a C++ program that utilizes a few tens of thousands of lines of Fortran 77 code, but running into some strange errors. I'm passing three coordinates (x,y,z) and the address of three vectors from C++ into fortran, then having fortran run some computations on the initial points and return results in the three vectors.
I do this a few hundred times in a C++ function, leave that function, and then come back to do it again. It works perfectly the first time through, but the second time through it stops returning useful results (returns nan) for points with a positive x component.
Initially it seems like an algorithm problem, except for three things:
It works perfectly the first 200 times I run it
It works if I call it from fortran and eliminate C++ altogether (not viable for the final program)
I've tried adding print statements to Fortran to debug where it goes wrong, but it turns out that if I add print statements to a specific subroutine (even something as simple as PRINT *,'Here'), the program starts returning NaNs even on the first run.
This is why I think it's something to do with how memory is being allocated and deallocated between C and fortran function/subroutine calls. The basic setup looks like this:
C++:
void GetPoints(void);
extern "C"
{
    void getfield_(float*,float*,float*,float[],float[],float[],int*,int*);
}
int main(void)
{
    GetPoints(); //Works
    GetPoints(); //Doesn't
}
void GetPoints(void)
{
    float x,y,z;
    int i,n,l;
    l=50;
    n=1;
    x=y=z=0.0;
    float xx[l],yy[l],zz[l];
    for(i=0;i<l;i++)
        getfield_(&x,&y,&z,xx,yy,zz,&n,&l);
    //Store current xx,yy,zz in large global array
}
Fortran:
SUBROUTINE GETFIELD(XI,YI,ZI,XX,YY,ZZ,IIN,NP)
DIMENSION XX(NP),YY(NP),ZZ(NP)
EXTERNAL T89c
CALL T89c(XI,YI,ZI,XX,YY,ZZ)
RETURN
END
!In T89c.f
SUBROUTINE T89c(XI,YI,ZI,XX,YY,ZZ)
COMMON /STUFF/ ARRAY(100)
!Lots of calculations
!Calling ~20 other subroutines
RETURN
END
Do any of you see any glaring memory issues that I'm creating? Perhaps common blocks that fortran thinks exist but are really deallocated by C++? Without the ability to debug using print statements, nor the time to try to understand the few thousand lines of someone else's Fortran 77 code, I'm open to trying just about anything you all can suggest or think of.
I'm using g++ 4.5.1 for compiling the C++ code and final linking, and gfortran 4.5.1 for compiling the fortran code.
Thanks
Edit:
I've tracked the error down to some obscure piece of the code that was written before I was even born. It appears it's looking for some common variable that got removed in the updates over the years. I don't know why it only affected one dimension, nor why the bug was reproducible by adding a print statement, but I've eliminated it nonetheless. Thank you all for the help.

You may be running into the "off-by-one" error. Fortran arrays are 1-based, while C arrays are 0-based. Make sure the array sizes you pass into Fortran are not 1 less than they should be.
Edit:
I guess it looks right... Still, I would try allocating 51 elements in the C++ function, just to see what happens.
By the way float xx[l]; is not standard. This is a gcc feature. Normally you should be allocating memory with new here, or you should be using std::vector.
Also, I am confused by the call to getfield_ in the loop. Shouldn't you be passing i to getfield_?

You should declare XX, YY and ZZ as arrays also in the subroutine T89c as follows:
REAL*4 XX(*)
REAL*4 YY(*)
REAL*4 ZZ(*)
C/C++ should in general never deallocate any Fortran common blocks. These are like structs in C (i.e. memory is reserved at compile time, not at runtime).
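For some reason, gfortran seems to accept the following in T89c even without the above declarations:
print *,XX(1)
during compilation, but when executing it I get a segmentation fault. (A likely explanation: without an array declaration, XX(1) is parsed as a reference to an external function XX, so at run time the program tries to call the array's address as if it were code.)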
