So, I want to help my researchers a bit with debugging Fortran programs, and for demonstration purposes I created a program that intentionally causes a segfault.
Here's the source:
program segfault
  implicit none
  integer :: n(10), i
  integer :: ios, u
  open(newunit=u, file='data.txt', status='old', action='read', iostat=ios)
  if (ios /= 0) STOP "error opening file"
  i = 0
  do
    i = i + 1
    read(u, *, iostat=ios) n(i)
    if (ios /= 0) exit
  end do
  close(u)
  print *, sum(n)
end program segfault
The data.txt file contains 100 random numbers:
for i in {1..100}; do
echo $RANDOM >> data.txt;
done
When I compile this program with
gfortran -O3 -o segfault.exe segfault.f90
the resulting executable dutifully crashes. But when I compile with debugging enabled:
gfortran -O0 -g -o segfault.exe segfault.f90
then it reads in only the first 10 values and prints their sum. For what it's worth, -O2 causes the desired segfault; -O1 does not.
I find this deeply concerning. After all, how can I debug properly if the bug goes away when I compile with debugging symbols enabled?
Can someone explain this behaviour?
I am using GNU Fortran (MacPorts gcc5 5.3.0_1) 5.3.0
A segfault is undefined behaviour. The program does not conform to the Fortran standard, so you cannot expect any particular outcome; it can do anything at all. You cannot count on a segfault happening, much less be deeply concerned when it does not happen.
There are compiler checks (-fcheck=) and sanitizers (-fsanitize=) available for a reason. Waiting for a segfault is not guaranteed to work. Not in Fortran, not in C, not in any similar language.
The outcome of a non-conforming program may depend on many things, like the placement of a variable in memory or in a register, the alignment of variables in memory, the position of stack frames... You can't count on anything at all. These details obviously depend on the optimization level.
If the program accesses an array out of bounds, but the address happens to fall in memory that still belongs to the process, a segfault may not happen. It is just some bytes in memory which the process is allowed to read or write to (or both). You may be overwriting some other variable, you may be reading garbage from some old stack frame, you may be overwriting malloc's internal book-keeping data and corrupting the heap. The crash may be waiting to happen somewhere else, or maybe just the numeric result of the program will be slightly wrong. Anything can happen.
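If you actually want the error to surface reliably, ask the compiler to check for it instead of waiting for the hardware to notice. A sketch with gfortran's bounds checking (the exact wording of the diagnostic varies between versions):
gfortran -O3 -g -fcheck=bounds -o segfault.exe segfault.f90
./segfault.exe
This aborts with a runtime error naming the out-of-bounds index on n as soon as i exceeds 10, at every optimization level, instead of leaving you to hope for a segfault. -fsanitize=address works similarly for heap and stack corruption.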
I was making some tests with the code below when I came across some strange behavior in my program. When I call the intrinsic subroutine sleep in my program, nothing is written to the file testing.dat. If I remove the call to this subroutine, it works fine and the numbers are written. I tried the same code (calling sleep) with Intel Fortran and it worked fine as well.
It seems to me that the sleep subroutine somehow halts execution before the file is written when the program is compiled with gfortran, behavior that does not occur with Intel Fortran. I'm not a computer science expert, but that is my guess; does anyone have a better one?
I tried all the flags below and nothing changed:
gfortran -g file.f90 -o executable
gfortran file.f90 -o executable
gfortran -O3 file.f90 -o executable
I am using a Xubuntu 18.01 OS.
program test
  implicit none
  integer :: i, j, k
  open(34, file="testing.dat")
  do i = 1, 9999999
    do j = 1, 9999999
      do k = 1, 9999999
        print *, i, j, k
        write(34, '(3I8)') i, j, k
        call sleep(1)
      end do
    end do
  end do
end program
File output can be buffered. That means that the characters or bytes to be written to the external file are first gathered somewhere in memory and then written to the external file in larger chunks, which can speed up file output. If you look at the external file at a random moment, it does not have to contain the output of all write statements executed so far; some of it may still sit in the buffers. The flush(unit) statement makes the data visible to external processes by flushing these buffers. The gfortran manual for the older flush intrinsic subroutine states:
The FLUSH intrinsic and the Fortran 2003 FLUSH statement have identical effect: they flush the runtime library's I/O buffer so that the data becomes visible to other processes. This does not guarantee that the data is committed to disk.
File buffering can also typically be controlled by compiler flags or runtime-library environment variables. For gfortran, the runtime variables are documented at https://gcc.gnu.org/onlinedocs/gfortran/Runtime.html#Runtime
There are four variables you might be interested in:
GFORTRAN_UNBUFFERED_ALL: Do not buffer I/O for all units
GFORTRAN_UNBUFFERED_PRECONNECTED: Do not buffer I/O for preconnected units.
GFORTRAN_FORMATTED_BUFFER_SIZE: Buffer size for formatted files
GFORTRAN_UNFORMATTED_BUFFER_SIZE: Buffer size for unformatted files
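In this program, then, the simplest fix is to flush the unit after each write. A minimal sketch of the innermost loop body, using the same unit number as in the question:
do k = 1, 9999999
  print *, i, j, k
  write(34, '(3I8)') i, j, k
  flush(34)   ! push the buffered records to testing.dat before sleeping
  call sleep(1)
end do
Alternatively, run the unmodified binary with buffering disabled, e.g. GFORTRAN_UNBUFFERED_ALL=1 ./executable, at some cost in output speed.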
Pretty simple setup, using gfortran 4.8.5 on Linux (Red Hat):
I get a segfault if my array of reals (inside a derived type) has size > 2,000,000. This seems to be a standard stack/heap issue, as my stack size is 8 MB if I check with ulimit.
There is no problem if the array is NOT inside a derived type.
Note that, as @francescalus guesses, removing the initial value = 0.0 eliminates the problem.
Edit to add: Note that I have posted a followup question Segmentation fault related to component of derived type that represents a more realistic use case and further narrows down the conditions under which this seems to occur.
program main
  call sub1   ! seg fault if col size > 2,100,000
  call sub2   ! works fine at col size = 100,000,000
end program main

subroutine sub1
  type table
    real :: col(2100000) = 0.0   ! works if "= 0.0" removed
  end type table
  type(table) :: table1
  table1%col = 1.0
end subroutine sub1

subroutine sub2
  real :: col(100000000) = 0.0
  col = 1.0
end subroutine sub2
Some obvious questions here:
Is this expected behavior, or some bug that was fixed in newer versions of gfortran?
Am I following standard fortran operating procedures here, or doing something wrong?
What is the recommended way to avoid this (please assume that I am unable to update to a newer version of gfortran in the near term)? I will almost certainly solve with an allocatable array component for reasons not specific to this question, but that might not be an ideal general solution and I would like to know of all good options I have here.
In particular, is initializing the components of a derived type bad practice?
This is likely to be a runtime issue due to insufficient stack, rather than a bug with gfortran.
Gfortran uses the stack to store automatic arrays and other initialization data. When code runs fine while one such array is small but segfaults as the size of the array increases, a likely reason is running out of stack.
The issue seems to be the same in more recent versions of gfortran. I compiled and ran your program with gfortran 4.8.4, 4.9.3, 5.5.0, 6.4.0, 7.3.0 and 8.2.0. In all cases I obtained a segmentation fault with the default stack size, but no error when the stack size was slightly increased.
$ ./sfa
Segmentation fault
$ ulimit -s
8192
$ ulimit -s 8256
$ ./sfa && echo "DONE"
DONE
Your problem may be solved by running
$ ulimit -s unlimited
before executing your binary. I am not aware of any particular penalty for doing this, but programmers more aware of the fine details of memory management, such as compiler developers, may think otherwise.
Initializing the components of a derived type is not bad practice, but as you can see, it can create problems with the stack if the component is a big array - be it due to the storage of the component itself, or to the storage of memory to work on the RHS of the assignment. If the component is made allocatable and allocated in a subroutine, the array is stored in the heap rather than in the stack, and this issue is usually avoided. In this case, it may be about actually setting the values of the array dynamically in a subroutine rather than at compile time. It may be less elegant, but I think it's worth it, since it's the typical example of code development work that prevents avoidable, environment-related errors when executing the binary.
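As a minimal sketch of the allocatable variant of sub1 (same sizes as in the question; the array now lives on the heap):
subroutine sub1
  type table
    real, allocatable :: col(:)
  end type table
  type(table) :: table1
  allocate(table1%col(2100000), source=0.0)   ! heap allocation, zero-initialized at run time
  table1%col = 1.0
end subroutine sub1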
Your code above is standards compliant. As explained in the comments, lack of explicit interfaces for subroutines is not good practice, but for these simple subroutines it's not against the rules.
Some compilers have flags that allow you to change where some objects are allocated in memory. While it may fix a particular issue, flags are compiler dependent, and usually not equivalent when comparing different compilers. Using dynamic memory via allocatables is a more robust solution, according to my experience.
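For illustration only (treat the thresholds as placeholders to tune): gfortran can force large local objects off the stack with -fmax-stack-var-size, and ifort has the -heap-arrays flag discussed in other answers:
gfortran -fmax-stack-var-size=65536 main.f90 -o sfa
ifort -heap-arrays 10 main.f90 -o sfa
Whether either flag catches a particular temporary depends on how the compiler implements it, which is exactly why I consider allocatables more robust.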
Finally, note that, if you are using OpenMP, the ulimit command above only affects the master thread - you need to set the stack size of each of the other threads via the environment variable OMP_STACKSIZE, which cannot be unlimited. And bear in mind that non-master threads running out of stack are a problem much more difficult to diagnose, since the binary may stop without a proper Segmentation fault error.
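For an OpenMP build, that means something like:
ulimit -s unlimited
export OMP_STACKSIZE=512M
./sfa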
These are not necessarily useful solutions, but below are some conditions under which the seg fault disappears. A couple of people mentioned the lack of an explicit interface (bad practice, though not technically incorrect), and it seems that this might be one key here, as either of the following two changes gets rid of the seg fault, although it's not quite that simple, as I'll explain:
1. Put everything in main, with no subroutine calls
2. Put the type definition table in a module
Let me expand on #2 briefly. Simply taking the example in the OP and then giving it an explicit interface by putting the subroutine in a module does NOT work. However, if I put the type definition in a module and then use it (as shown below) the segfault does not occur:
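First the module, which is just the type definition from the original code moved verbatim (the name table_mod is the one used below):
module table_mod
  type table
    real :: col(2100000) = 0.0
  end type table
end module table_mod
And the main program that uses it: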
program main
  use table_mod
  type(table) :: table1
  table1%col = 1.0
end program main
As the title says, really.
Is there anything against not using stop, like this:
PROGRAM myprog
.
. < do stuff >
.
END PROGRAM myprog
rather than using an explicit stop, as in this:
PROGRAM myprog
.
. < do stuff >
.
STOP
END PROGRAM myprog
I see a lot of older Fortran code that has a STOP before the END PROGRAM statement, but is it really needed there?
On our Cray machine, having a STOP statement at the end of the program writes the string "STOP" to STDERR, which is a bit annoying...
The code
stop
end program
is redundant as far as the program return value is concerned in modern Fortran. A stop with no integer or character stop-code should return a 0 exit code to the OS if exit codes are supported. If end program is encountered the behavior is the same, returning 0 to the OS.
The difference arises in program output. As you've noted, stop produces output. The standard (Fortran 2008 cl. 8.4) says
When an image is terminated by a STOP or ERROR STOP statement, its stop code, if any, is made available in a processor-dependent manner. If any exception (14) is signaling on that image, the processor shall issue a warning indicating which exceptions are signaling; this warning shall be on the unit identified by the named constant ERROR_UNIT (13.8.2.8). It is recommended that the stop code is made available by formatted output to the same unit.
This recommends the stop-code be made available on standard error, which is where your STOP output is coming from. If you had given a stop-code to stop, it would have been output with STOP. Additionally, if there are floating point exceptions signalling, you will get additional output on standard error detailing that condition.
If you don't desire the additional output from stop and are not using it to return a non-zero error code to the OS, you can omit it from your program.
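As a sketch of the one case where stop still earns its keep, returning a failure status on a hypothetical error path (the file name is only for illustration, and the exact text echoed to standard error is processor dependent):
program myprog
  integer :: ios
  open(10, file='input.txt', status='old', iostat=ios)
  if (ios /= 0) stop 1   ! nonzero exit status for the OS; stop-code echoed to ERROR_UNIT
end program myprog
On systems with exit codes, ./myprog; echo $? then prints 1 when the file is missing.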
There is probably a historical reason for the stop,end ending of the main program, but my brief skimming of a FORTRAN66 manual did not enlighten me.
The following program makes a common mistake: it modifies a function argument that was originally passed as a constant. The constant is usually stored in a read-only section of the object code, so at run time one gets an access violation.
It's exactly what happens with gfortran, with optimization -O0 or -O1 (gfortran 4.8.1 on Windows).
But it disappears with -O2, and the second PRINT shows the value 100, like the first.
By inspection of the assembly output, I can see that in the -O1 case, the function F is optimized out, but the computations are still done in the code of A, and storing 117 causes a crash. With -O2, no computation is done, the result (201) is included in the assembly output as a constant, and the value 117 is never stored.
program bob
  implicit none
  call a(100)
contains
  subroutine a(n)
    integer :: n
    print *, "In A:", f(n), n
    print *, n
  end subroutine
  function f(n)
    integer :: n, f
    f = 2*n + 1
    n = 117
  end function
end program
Is this behaviour accepted by the standard? Is this a bug?
My first thought was that maybe it's a bug of the optimizer (it does not do something that would have indeed an effect, since the modified value is printed afterwards). But I'm aware that usually, an undefined behaviour in the standard can have any consequence when actually run.
If I replace the constant 100 in the call, with a variable previously initialized to 100, the compiler produces the expected result (the second PRINT gives me 117, with any optimization level).
So maybe the optimizer is being very clever in the "constant" case: since the code would crash, the print would never happen, so the value is not needed, so it is optimized out, and in the end the program won't crash. But I still find it a bit puzzling.
The behaviour of the erroneous program is consistent with what the standard requires.
The standard doesn't require the compiler to diagnose this particular error (it is not a violation of the numbered syntax rules or numbered constraints). Beyond that, if a program is in error in this way, then the standard doesn't impose any requirements on the Fortran processor.
It does not reveal a bug in the compiler. Any behaviour is valid, including things like the compiler beating you over the head with a stick.
Perhaps you should have stated your INTENT.
This is probably a bug in the constants propagation module of the GCC optimiser. It is enabled by default for any optimisation level greater than -O1 and could be disabled by passing -fno-ipa-cp.
This example only serves to illustrate the importance of giving each dummy argument the correct INTENT attribute. When n is marked as INTENT(INOUT) in a, the compiler gives an error, no matter what the optimisation level.
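A sketch of that fix applied to the question's code; only the declaration changes, and gfortran then rejects call a(100) at compile time because a constant is not definable:
subroutine a(n)
  integer, intent(inout) :: n   ! compiler now refuses a literal actual argument
  print *, "In A:", f(n), n
  print *, n
end subroutine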
This program crashes with Illegal instruction: 4 on MacOSX Lion and ifort (IFORT) 12.1.0 20111011
program foo
  real, pointer :: a(:,:), b(:,:)
  allocate(a(5400, 5400))
  allocate(b(5400, 3600))
  a = 1.0
  b(:, 1:3600) = a(:, 1:3600)
  print *, a
  print *, b
  deallocate(a)
  deallocate(b)
end program
The same program works with gfortran. I don't see any problem. Any ideas? Unrolling the copy and performing an explicit loop over the columns works in both compilers.
Note that with allocatable instead of pointer I have no problems.
The behavior is the same if the statement is either inside a module or not.
I confirm the same behavior on ifort (IFORT) 12.1.3 20120130.
Apparently, no problem occurs with Linux and ifort 12.1.5
I tried to increase the stack size with the following linking options
ifort -Wl,-stack_size,0x40000000,-stack_addr,0xf0000000 test.f90
but I still get the same error. Increasing ulimit -s to the hard limit gives the same problem.
Edit 2: I did some more debugging, and apparently the problem happens when the array slicing operation
b(:, 1:3600) = a(:, 1:3600)
involves an amount of data suspiciously close to 16 MB.
I am comparing the opcodes produced, but if there is a way to see a more communicative intermediate code form, I'd appreciate it.
Your program is correct (though I would prefer allocatable to pointer if you do not need to be able to repoint it). The problem is that ifort by default places all array temporaries on the stack, no matter how large they are. And it seems to need an array temporary for the copy operation you are doing here. To work around ifort's stupid default behavior, always use the -heap-arrays flag when compiling. I.e.
ifort -o test test.f90 -heap-arrays 1600
The number behind -heap-arrays is the threshold where it should begin using the heap. For sizes below this, the stack is used. I chose a pretty low number here - you can probably safely use higher ones. In theory stack arrays are faster, but the difference is usually totally negligible. I wish intel would fix this behavior. Every other compiler has sensible defaults for this setting.
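If changing the compile line is not an option, the column-by-column copy already mentioned in the question sidesteps the full-size temporary as well. A sketch (j is just an integer loop index):
integer :: j
do j = 1, 3600
  b(:, j) = a(:, j)   ! one column at a time, no full-size array temporary
end do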
Use "allocatable" instead of "pointer".
real, allocatable :: a(:,:), b(:,:)
Assigning a floating point number to a pointer looks dubious to me.