I am working on re-engineering an old FORTRAN77 program to Python for a while now. I'm running into an issue, though: when dividing by zero, it appears that the FORTRAN program just continues processing the data without issue. However, predictably it causes an issue in Python. I'm not able to find a discussion about this on any official channel for F77, and I only have an old version of the source code for the program I am translating that I can't get to compile.
TL;DR: How does F77 handle division by zero for the following cases?:
REAL division
INT division
The numerator is also zero (e.g. 0./0.)
Yes, I also have code that does nothing when a divide by zero error is encountered. Usually, it is the programmers responsibility to ensure that the results are either expected (the target variable's value is unchanged) or an error is thrown etc. In other words, you need to inspect any division operation for a possible zero divisor. Modern operating systems throw an internal exception on divide by zero (and assign NAN to the target variable if the system would pause under these circumstances), most older Fortran code is written such that divide by zero doesn't matter.
Related
Here is a little program in FORTRAN 77
dimension totlev(20)
do 100 i=1,24
totlev(i)=0.0
write(0,*) 'totlev i=',i, totlev(i)
100 continue
end
I compile it using MinGW by typing gfortran test.f and I do get a warning (not an error):
test.f:4:14:
do 100 i=1,25
2
totlev(i)=0.0
1
Warning: Array reference at (1) out of bounds (25 > 20) in loop beginning at (2)
test.f:5:40:
test.f:3:72:
do 100 i=1,25
2
test.f:5:40:
write(0,*) 'totlev i=',i, totlev(i)
1
Warning: Array reference at (1) out of bounds (25 > 20) in loop beginning at (2)
However, not always such a warning would be produced if it was a longer program. An executable is created. When I run it it behaves like an infinite loop.
And this is my problem: How is an infinite loop even possible with the DO iteration? Isn't it a logical impossibility? My only explanation is that overindexing in this case reaches to the program code itself and changes it. Is that possible?
I use Windows 7 OS if that's relevant.
It's not changing the code, it's changing the variable i. Both the array totlev(20) and the scalar i are local variables, and thus typically stored in the program's stack frame (though the standard leaves this choice to the 'processor', Fortran-speak for implementation). In this case the compiler apparently put i 4 'real's (probably 16 bytes) after the end of totlev, so assigning to totlev(24) actually changes i. Fortran basically requires that an integer and single/default-precision real variable be the same size, and while it doesn't require any particular relationship between the representations for integers and reals, most machines today use 'IEEE 754' floating-point and in that system a real 0.0 has the same representation as an integer 0.
On many though not all computer architectures it is possible to address code by indexing an array out of range, but this almost always requires indexes far out of range: millions or billions or more, not one or two. On older architectures it was often possible both to read and write code this way, but most systems since about 1980 have memory protection so that you can't write to code. In particular all Windows NT-series systems do this, which includes Windows 7.
I'm trying to launch a certain kernel using clEnqueueNDRangeKernel(), from within a C++ program. But instead of enqueuing, or returning an error, it gets a floating-point exception signal (SIGFPE).
For IP reasons which I can't go into, it's difficult for me to provide an example triggering this signal. But - there doesn't seem to be any legitimate reason for this to occur. Are there known cases of that function itself actually performing an invalid floating-point operation?
tl;dr: It's a divide-by zero due to a problem with your dimensions.
With NVIDIA CUDA's OpenCL library (at least with v11.2.152), passing workgroup dimensions and overall grid dimensions of different dimensionality may indeed trigger such a situation. OpenCL users have reported similar behavior in the past.
NVIDIA has deplorably chosen to provide its libraries as closed-source only, so I can only speculate about the reason, but it seems to be the following: When you construct a cl::NDRange with two dimensions, the third value in the international representation is initialized to 0 (explicitly and necessarily, or just sometimes). Now, if you read the documentation for clEnqueueNDRangeKernel() carefully, you'll notice that the both the global dimensions and the local dimensions must have the same dimensionality, i.e. same number of dimensions; clEnqueueNDRangeKernel() probably assumes this is the case, and calculates the number of grid blocks (number of workgroups) in the third dimension using global_dims_it_got[i] / local_dims_it_got[i] for every dimension of the global dimensions it received. Thus when the global dimensionality is higher, it ends up dividing by 0. That triggers SIGFPE, which, despite its name, is not only used for invalid floating-point operations, but rather any arithmetic error.
We have numerical code written in C++. Rarely but under certain specific inputs, some of the computations result in an 'nan' value.
Is there a standard or recommended method by which we can stop and alert the user when a certain numerical calculation results in an 'nan' being generated? (under debug mode).Checking for each result if it is equal to 'nan' seems impractical given the huge sizes of matrices and vectors.
How do standard numerical libraries handle this situation? Could you throw some light on this?
NaN is propagated, when applied to a numeric operation. So, it is enough to check the final result for being a NaN. As for, how to do it -- if building for >= C++11, there is std::isnan, as Goz noticed. For < C++11 - if want to be bulletproof - I would personally do bit-checking (especially, if there may be an optimization involved). The pattern for NaN is
? 11.......1 xx.......x
sign bit ^ ^exponent^ ^fraction^
where ? may be anything, and at least one x must be 1.
For platform dependent solution, there seams to be yet another possibility. There is the function feenableexcept in glibc (probably with the signal function and the compiler option -fnon-call-exceptions), which turns on a generation of the SIGFPE sinals, when an invalid floating point operation occure. And the function _control87 (probably with the _set_se_translator function and compiler option /EHa), which allows pretty much the same in VC.
Although this is a nonstandard extension originally from glibc, on many systems you can use the feenableexcept routine declared in <fenv.h> to request that the machine trap particular floating-point exceptions and deliver SIGFPE to your process. You can use fedisableexcept to mask trapping, and fegetexcept to query the set of exceptions that are unmasked. By default they are all masked.
On older BSD systems without these routines, you can use fpsetmask and fpgetmask from <ieeefp.h> instead, but the world seems to be converging on the glibc API.
Warning: glibc currently has a bug whereby (the C99 standard routine) fegetenv has the unintended side effect of masking all exception traps on x86, so you have to call fesetenv to restore them afterward. (Shows you how heavily anyone relies on this stuff...)
On many architectures, you can unmask the invalid exception, which will cause an interrupt when a NaN would ordinarily be generated by a computation such as 0*infinity. Running in the debugger, you will break on this interrupt and can examine the computation that led to that point. Outside of a debugger, you can install a trap handler to log information about the state of the computation that produced the invalid operation.
On x86, for example, you would clear the Invalid Operation Mask bit in FPCR (bit 0) and MXCSR (bit 7) to enable trapping for invalid operations from x87 and SSE operations, respectively.
Some individual platforms provide a means to write to these control registers from C, but there's no portable interface that works cross-platform.
Testing f!=f might give problems using g++ with -ffast-math optimization enabled: Checking if a double (or float) is NaN in C++
The only foolproof way is to check the bitpattern.
As to where to implement the checks, this is really dependent on the specifics of your calculation and how frequent Nan errors are i.e. performance penalty of continuing tainted calculations versus checking at certain stages.
Here a small snippet of code that returns epsilon() for a real value:
program epstest
real :: eps=1.0, d
do
d=1.0+eps
if (d==1.0) then
eps=eps*2
exit
else
eps=eps/2
end if
end do
write(*,*) eps, epsilon(d)
pause
end program
Now, when I replace the if condition by
if (1.0+eps==1.0) then
the program should have the same in return but it unfortunately does not! I tested it with the latest (snapshot) release of g95 on Linux and Windows.
Can someone explain that issue to me?
Floating point arithmetic has many subtle issues. One form of source code can generate different machine instructions than another seemingly almost identical source code. For example, "d" might get stored to a memory location, while "1.0 + eps" might be evaluated solely with registers ... this can cause different precision.
More generally, why not use the intrinsic functions provided for Fortran 95 that disclose the characteristics of a particular precision of reals?
I just wonder if there is some convenient way to detect if overflow happens to any variable of any default data type used in a C++ program during runtime? By convenient, I mean no need to write code to follow each variable if it is in the range of its data type every time its value changes. Or if it is impossible to achieve this, how would you do?
For example,
float f1=FLT_MAX+1;
cout << f1 << endl;
doesn't give any error or warning in either compilation with "gcc -W -Wall" or running.
Thanks and regards!
Consider using boosts numeric conversion which gives you negative_overflow and positive_overflow exceptions (examples).
Your example doesn't actually overflow in the default floating-point environment in a IEEE-754 compliant system.
On such a system, where float is 32 bit binary floating point, FLT_MAX is 0x1.fffffep127 in C99 hexadecimal floating point notation. Writing it out as an integer in hex, it looks like this:
0xffffff00000000000000000000000000
Adding one (without rounding, as though the values were arbitrary precision integers), gives:
0xffffff00000000000000000000000001
But in the default floating-point environment on an IEEE-754 compliant system, any value between
0xfffffe80000000000000000000000000
and
0xffffff80000000000000000000000000
(which includes the value you have specified) is rounded to FLT_MAX. No overflow occurs.
Compounding the matter, your expression (FLT_MAX + 1) is likely to be evaluated at compile time, not runtime, since it has no side effects visible to your program.
In situations where I need to detect overflow, I use SafeInt<T>. It's a cross platform solution which throws an exception in overflow situations.
SafeInt<float> f1 = FLT_MAX;
f1 += 1; // throws
It is available on codeplex
http://www.codeplex.com/SafeInt/
Back in the old days when I was developing C++ (199x) we used a tool called Purify. Back then it was a tool that instrumented the object code and logged everything 'bad' during a test run.
I did a quick google and I'm not quite sure if it still exists.
As far as I know nowadays several open source tools exist that do more or less the same.
Checkout electricfence and valgrind.
Clang provides -fsanitize=signed-integer-overflow and -fsanitize=unsigned-integer-overflow.
http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation