I am new to Fortran and coding in general so I apologize if my terminology is not correct.
I am using a Linux machine with the gfortran compiler.
I am doing research this summer which involves me getting a program written in about 1980 working again. It is written in Fortran 77. I have all the code as well as some documentation about it.
In its current form, I am receiving an "IEEE_UNDERFLOW_FLAG IEEE_DENORMAL" error. My first thought is that this code was meant to be developed under a different environment/architecture.
The documentation states “This program was designed to run on the HARRIS computer system. It also can be run on VAX system if the single precision variables are changed into double precision variables both in the main code and the subroutine package.”
I have tried changing the single precision variables to double precision variables, but I may have done that wrong. If this is the correct thing to do, any insight would be great.
I have also tried compiling the code with -std=legacy and -m32. I receive the same error from these as well.
Any advice to get me going in the right direction would be greatly appreciated.
"IEEE_UNDERFLOW_FLAG IEEE_DENORMAL is signalling" is not that uncommon. It is NOT an error message.
The meaning is that there are denormal numbers generated when running the code.
It may be a hint about numerical problems in your code, but it is not an error per se. Probably it means that your program finished successfully.
Fortran, in its latest edition, requires that all floating-point exceptions that are signalling be reported when a STOP statement is executed; see gfortran IEEE exception inexact. By the way, that also means that your program is not being compiled as Fortran 77 but as Fortran 2003 or higher.
Note that even if you request the Fortran 95 standard by -std=f95 the note is still displayed, but it can be controlled by the -ffpe-summary=list flag.
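To see where the note comes from, here is a minimal demo (the program name is made up) that produces a denormal on purpose; compiled with gfortran and no special options, it prints a note like yours when it reaches the STOP:

program denormal_demo
  implicit none
  real :: x, half
  half = 0.5
  x = tiny(1.0)     ! smallest normal single-precision number
  x = x * half      ! the result underflows to a denormal
  x = x * 2.0       ! and here the denormal is used as an operand
  print *, x
  stop              ! the STOP is what makes gfortran print the exception summary
end program denormal_demo

Compiling the same program with -ffpe-summary=none suppresses the note entirely, without changing the computed results.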
The linked answer also says that a way to avoid these warnings is not to finish the program with a STOP statement, but to let it run through to the END PROGRAM. If you have something like
STOP
END
or
STOP
END PROGRAM
in your code, just remove the STOP; it is useless there, if not outright harmful.
You may or may not be successful in getting rid of the report by using double precision. If there are numerical problems in the algorithms, they will stay there even with doubles. They may become less apparent, or they may not; it depends. You don't have to re-write your code for that, just use -fdefault-real-8 or -freal-4-real-8 or similar. Read more about these options in your gfortran manual. You could even try quadruple precision, but double should normally be sufficient for all reasonable algorithms.
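For illustration, this is the kind of by-hand conversion the documentation asks for (a made-up fragment, not your actual code). The classic mistake is converting the declarations but leaving the constants in single precision:

C     original single-precision form
      REAL X
      X = 1.0E0/3.0E0
C     converted by hand: the constants need a D exponent as well
      DOUBLE PRECISION X
      X = 1.0D0/3.0D0

Compiling the unmodified single-precision source with -fdefault-real-8 promotes the default-real variables and constants in one go, which is usually the less error-prone route.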
Related
I am currently using a large computational package written in C++, which I have downloaded from GitHub and compiled myself, as I want to use it for some work I am doing.
The code works well for most purposes. Unfortunately, I have found that for certain inputs the code gives the error: Floating point exception (core dumped)
Now, I am a beginner at C++ and I have had no luck trying to browse through the many source files that make up the code. My question is therefore: is there a simple way to get a C++ program to report which line and which file the error occurred in? Being used to Python, this is where I would always start, but unfortunately the compiled code does not return any more details about the error. Do I need to compile it in some form of debugging mode to get it to do so?
Yes, you should build the program in debug mode and run it through a debugger. It'll "break" when the error happens and tell you exactly what line of code triggers it. Furthermore, you can examine the values of variables in that stack frame and lower to diagnose the cause of the problem.
In fact, while developing, you should be doing this anyway.
It is impossible to give general steps as to how to do this, but if you're using an IDE (Visual Studio, Xcode) this should automatically happen; if you're using GCC on the command line, research GDB; if you're using Clang on the command line, research LLDB.
Speaking generally, though, a Floating-Point Exception (not a C++ exception!) is usually, and perhaps confusingly, triggered by an integer division by zero. Though, there are other reasons it can occur. You'll know more once you're debugging.
I'm developing a cross-platform game which plays over a network using a lockstep model. As a brief overview, this means that only inputs are communicated, and all game logic is simulated on each client's computer. Therefore, consistency and determinism is very important.
I'm compiling the Windows version on MinGW32, which uses GCC 4.8.1, and on Linux I'm compiling using GCC 4.8.2.
What struck me recently was that, when my Linux version connected to my Windows version, the program would diverge, or de-sync, instantly, even though the same code was compiled on both machines! It turns out the problem was that the Linux build was compiled as 64-bit, whereas the Windows version was 32-bit.
After compiling a 32-bit Linux version, I was relieved to find the problem resolved. However, it got me thinking and researching floating-point determinism.
This is what I've gathered:
A program will be generally consistent if it's:
run on the same architecture
compiled using the same compiler
So if I assume, targeting a PC market, that everyone has an x86 processor, then that solves requirement one. However, the second requirement seems a little silly.
MinGW, GCC, and Clang (Windows, Linux, and Mac, respectively) are all different compilers based on, or compatible with, GCC. Does this mean it's impossible to achieve cross-platform determinism? Or does that only apply to Visual C++ vs GCC?
As well, do the optimization flags -O1 or -O2 affect this determinism? Would it be safer to leave them off?
In the end, I have three questions to ask:
1) Is cross-platform determinism possible when using MinGW, GCC, and Clang for compilers?
2) What flags should be set across these compilers to ensure the most consistency between operating systems / CPUs?
3) Floating point accuracy isn't that important for me -- what's important is that the results are consistent. Is there any method of reducing floating-point numbers to a lower precision (like 3-4 decimal places) to ensure that the little rounding errors across systems become non-existent? (Every implementation I've tried to write so far has failed)
Edit: I've done some cross-platform experiments.
Using floating-point values for velocity and position, I kept a Linux Intel laptop and a Windows AMD desktop in sync for up to 15 decimal places of the float values. Both systems are, however, x86_64. The test was simple though -- it was just moving entities around over a network, trying to determine any visible error.
Would it make sense to assume that the same results would hold if an x86 computer were to connect to an x86_64 computer? (32-bit vs 64-bit operating system)
Cross-platform and cross-compiler consistency is of course possible. Anything is possible given enough knowledge and time! But it might be very hard, or very time-consuming, or indeed impractical.
Here are the problems I can foresee, in no particular order:
Remember that even an extremely small error of plus-or-minus 1/10^15 can blow up to become significant (multiply a number carrying that error margin by one billion, and now you have a plus-or-minus 0.000001 error, which might be significant). These errors can accumulate over time, over many frames, until you have a desynchronized simulation. Or they can manifest when you compare values (even naively using "epsilons" in floating-point comparisons might not help; it only displaces or delays the manifestation.)
The above problem is not unique to distributed deterministic simulations (like yours). They touch on the issue of "numerical stability", which is a difficult and often neglected subject.
Different compiler optimization switches, and different floating-point behavior switches, might lead the compiler to generate slightly different sequences of CPU instructions for the same statements. Obviously these must be the same across compilations, using the exact same compilers, or the generated code must be rigorously compared and verified.
32-bit and 64-bit programs (note: I'm saying programs and not CPUs) will probably exhibit slightly different floating-point behaviors. By default, 32-bit programs cannot rely on anything more advanced than the x87 instruction set from the CPU (no SSE, SSE2, AVX, etc.) unless you specify this on the compiler command line (or use the intrinsics/inline assembly instructions in your code.) On the other hand, a 64-bit program is guaranteed to run on a CPU with SSE2 support, so the compiler will use those instructions by default (again, unless overridden by the user.) While x87 and SSE2 float datatypes and operations on them are similar, they are, AFAIK, not identical, which will lead to inconsistencies in the simulation if one program uses one instruction set and the other program uses the other.
The x87 instruction set includes a "control word" register, which contains flags that control some aspects of floating-point operations (e.g. exact rounding behavior, etc.) This is a runtime thing: your program can do one set of calculations, then change this register, and after that do the exact same calculations and get a different result. Obviously, this register must be checked and handled and kept identical on the different machines (a small sketch of pinning the rounding mode follows after this list). It is possible for the compiler (or the libraries you use in your program) to generate code that changes these flags at runtime inconsistently across the programs.
Again, in the case of the x87 instruction set, Intel and AMD have historically implemented things a little differently. For example, one vendor's CPU might internally do some calculations using more bits (and therefore arrive at a more accurate result) than the other, which means that if you happen to run on two different CPUs (both x86) from two different vendors, the results of simple calculations might not be the same. I don't know how and under what circumstances these higher-accuracy calculations are enabled, and whether they happen under normal operating conditions or you have to ask for them specifically, but I do know these discrepancies exist.
Random numbers, and generating them consistently and deterministically across programs, have nothing to do with floating-point consistency. They're important and a source of many bugs, but in the end they're just a few more bits of state that you have to keep synched.
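To illustrate the control-word point: in Fortran, for example, the standard ieee_arithmetic module (Fortran 2003) lets you read and pin the rounding mode once at startup; most languages and compilers offer something equivalent. A minimal sketch, with a made-up subroutine name:

subroutine pin_rounding_mode()
  ! Force round-to-nearest so every peer runs with the same rounding behavior.
  use, intrinsic :: ieee_arithmetic
  implicit none
  type(ieee_round_type) :: mode
  call ieee_get_rounding_mode(mode)
  if (mode /= ieee_nearest) call ieee_set_rounding_mode(ieee_nearest)
end subroutine pin_rounding_mode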
And here are a couple of techniques that might help:
Some projects use "fixed-point" numbers and fixed-point arithmetic to avoid rounding errors and the general unpredictability of floating-point numbers. Read the Wikipedia article for more information and external links; a minimal sketch also follows after this list.
In one of my own projects, during development, I used to hash all the relevant state (including a lot of floating-point numbers) in all the instances of the game and send the hash across the network each frame to make sure even one bit of that state wasn't different on different machines. This also helped with debugging, where instead of trusting my eyes to see when and where inconsistencies existed (which wouldn't tell me where they originated, anyways) I would know the instant some part of the state of the game on one machine started diverging from the others, and know exactly what it was (if the hash check failed, I would stop the simulation and start comparing the whole state.)
This feature was implemented in that codebase from the beginning, and was used only during the development process to help with debugging (because it had performance and memory costs.)
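To make the fixed-point idea concrete, here is a minimal sketch (the module name and scale factor are made up): positions are kept as integers in units of 1/1024 of a world unit, so everything is exact integer arithmetic and identical on every machine. Addition and subtraction need no rescaling; only multiplication and division do.

module fixed_point_demo
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)  ! a 64-bit integer kind
  integer, parameter :: FP_SCALE = 1024             ! 1 world unit = 1024 ticks
contains
  ! Convert a real to fixed point once, at the edges of the simulation (input, level data).
  integer function to_fixed(x)
    real, intent(in) :: x
    to_fixed = nint(x * FP_SCALE)
  end function to_fixed

  ! Fixed-point multiply: widen to 64 bits, rescale, truncate -- exact and deterministic.
  integer function fixed_mul(a, b)
    integer, intent(in) :: a, b
    fixed_mul = int((int(a, i8) * int(b, i8)) / FP_SCALE)
  end function fixed_mul
end module fixed_point_demo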
Update (in answer to first comment below): As I said in point 1, and others have said in other answers, that doesn't guarantee anything. If you do that, you might decrease the probability and frequency of an inconsistency occurring, but the likelihood doesn't become zero. If you don't analyze what's happening in your code and the possible sources of problems carefully and systematically, it is still possible to run into errors no matter how much you "round off" your numbers.
For example, if you have two numbers (e.g. as results of two calculations that were supposed to produce identical results) that are 1.111499999 and 1.111500001 and you round them to three decimal places, they become 1.111 and 1.112 respectively. The original numbers' difference was only 2E-9, but it has now become 1E-3. In fact, you have increased your error 500'000 times. And still they are not equal even with the rounding. You've exacerbated the problem.
True, this doesn't happen much, and the examples I gave are two unlucky numbers to get in this situation, but it is still possible to find yourself with these kinds of numbers. And when you do, you're in trouble. The only sure-fire solution, even if you use fixed-point arithmetic or whatever, is to do rigorous and systematic mathematical analysis of all your possible problem areas and prove that they will remain consistent across programs.
Short of that, for us mere mortals, you need to have a water-tight way to monitor the situation and find exactly when and how the slightest discrepancies occur, to be able to solve the problem after the fact (instead of relying on your eyes to see problems in game animation or object movement or physical behavior.)
No, not in practice. For example, sin() might come from a library or from a compiler intrinsic, and differ in rounding. Sure, that's only one bit, but that's already out of sync. And that one bit error may add up over time, so even an imprecise comparison may not be sufficient.
N/A
You can't reduce FP precision for a given type, and I don't even see how it would help you. You'd turn the occasional 1E-6 difference into an occasional 1E-4 difference.
Next to your concerns on determinism, I have another remark: if you are worried about calculation consistency on a distributed system, you may have a design issue.
You could think of your application as a bunch of nodes, each responsible for its own calculations. If information about another node is needed, it should be sent to you by that node.
1.)
In principle, cross-platform/OS/hardware consistency is possible, but in practice it's a pain.
In general your results will depend on which OS you use, which compiler, and which hardware you use. Change any one of those and your results might change. You have to test all changes.
I use Qt Creator and qmake (cmake is probably better but qmake works for me) and test my code in MSVC on Windows, GCC on Linux, and MinGW-w64 on Windows. I test both 32-bit and 64-bit. This has to be done whenever code changes.
2.) and 3.)
In terms of floating point, some compilers will use x87 instead of SSE in 32-bit mode. See Why a number crunching program starts running much slower when diverges into NaNs? for an example of the consequences when that happens. All 64-bit systems have SSE, so I think most use SSE/AVX in 64-bit mode; otherwise, e.g. in 32-bit mode, you might need to force SSE with something like -mfpmath=sse and -msse2.
But if you want a more compatible version of GCC on Windows then I would use MinGW-w64 targeting 32-bit (aka MinGW-w32) or MinGW-w64 targeting 64-bit. This is not the same thing as MinGW (aka mingw32); the projects have diverged. MinGW depends on MSVCRT (the MSVC C runtime library) and MinGW-w64 does not. The Qt project has a pretty good description of MinGW-w64 and its installation: http://qt-project.org/wiki/MinGW-64-bit
You might also want to consider writing a CPU dispatcher (see cpu dispatcher for visual studio) for AVX and SSE.
I need some pointers to solve a problem that I can describe only in a limited way. I got a code written in f77 from a senior scientist. I can't post the code on a public forum for ownership reasons. It's not big (750 lines) but, given the implicit declarations and goto statements, it is very unreadable. Hence I am having trouble finding the source of the error. Here is the problem:
When I compile the code with ifort, it runs fine and gives me sensible numbers but when I compile it with gfortran, it compiles fine but does not give me the right answer. The code is a numerical root finder for a complex plasma physics problem. The ifort compiled version finds the root but the gfortran compiled version fails to find the root.
Any ideas on how to proceed looking for a solution? I will update the question to reflect the actual problem when I find one.
Some things to investigate, not necessarily in the order I would try them:
Use your compiler(s) to check everything that they are capable of checking, including and especially array bounds (for run-time confidence) and subroutine argument matching.
Use of uninitialised variables.
The kinds of real, complex and integer variables; the compilers (or your compilation options) may default to different kinds.
Common blocks, equivalences, entry, ... other now deprecated or obsolete features.
Finally, perhaps not a matter for immediate investigation but something you ought to do sooner (right choice) or later (wrong choice): make the effort to declare IMPLICIT NONE in all scopes and to write explicit declarations for all entities.
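As an illustration of that last point (made-up names, not from the actual code): with implicit typing a misspelled variable silently becomes a brand-new, undefined variable, while IMPLICIT NONE turns the same typo into a compile-time error.

      SUBROUTINE STEP(ENERGY, DE)
      IMPLICIT NONE
      DOUBLE PRECISION ENERGY, DE
C     Typo below: ENRGY instead of ENERGY.  With IMPLICIT NONE the
C     compiler rejects it; without IMPLICIT NONE, ENRGY would quietly
C     be treated as a new REAL variable holding whatever is in memory.
      ENERGY = ENRGY + DE
      END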
I have a thermal hydraulics code written in Fortran that I work on. For my debug version, I use the -check bounds option in ifort 11.1 during compile time. I have caught array bounds errors in the past in this way. Recently, though, I was seeing that the solution was quickly blowing up for a given case. The peculiar thing was that it was converging nicely for the release version of the code. Sure enough, removing the -check bounds flag from my debug makefile cleared up the problem.
The strange thing is that the debug version was working fine for many other test cases I used before and it wasn't throwing up any errors on going outside of any array bounds in my code. This behavior seems very strange to me and I have no idea if there is some kind of bug in my code or what. Anybody have any ideas what could be causing this sort of behavior?
As requested, the flags I use for release and debug are:
Release: -c -r8 -traceback -extend-source -override-limits -zero -unroll -O3
Debug: -c -r8 -traceback -extend-source -override-limits -zero -g -O0
Of course, as my original question indicates, I toggle the -check bounds flag on and off for the debug case.
I would suspect your numerical algorithm here more than the Fortran code. Have you ensured that all of the convergence and stability criteria have been met?
What it sounds like is that round-off error is causing the solution to fail to converge. If you are on the edges of safe convergence, compiler optimizations can definitely tip things one way or another.
I use gfortran more than ifort, so I don't know all the specifics of the -unroll option, but unrolling loops can change some rounding even though the calculations seem like they should remain the same. Also, debug will definitely change the exact order of memory and register access. If a number is held in the processor in some internal representation, then written to memory and read back again, the value can change. This can be alleviated to some extent by careful selection of kind. By its nature, this will be processor specific rather than portable.
In theory, full compliance with IEEE 754 would make floating point operations reproducible, but this is not always the case. If debug is actually causing these problems as opposed to some other bug in your code, then other mysterious things related to the inner workings of the processor could also cause it to blow up.
I would add write statements at various key points in the code to output your data matrices (or whatever data structures you are using). Be sure to use binary output. Open with form='unformatted' and access='direct'.
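For example, a sketch of such a dump (the array name, its dimensions and the iteration counter ISTEP stand in for whatever your data structures are):

C     Write the full array A once per iteration so the debug and
C     release runs can be compared bit-for-bit afterwards.
      DOUBLE PRECISION A(NX,NY)
      INTEGER LRECL, ISTEP
      INQUIRE(IOLENGTH=LRECL) A
      OPEN(99, FILE='state.bin', FORM='UNFORMATTED',
     &     ACCESS='DIRECT', RECL=LRECL)
      WRITE(99, REC=ISTEP) A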
I have to work on a Fortran program which used to be compiled using Microsoft Compaq Visual Fortran 6.6. I would prefer to work with gfortran, but I have run into lots of problems.
The main problem is that the generated binaries have different behaviours. My program takes an input file and then has to generate an output file. But sometimes, when using the binary compiled by gfortran, it crashes before its end, or gives different numerical results.
This is a program written by researchers which uses a lot of floating-point numbers.
So my question is: what are the differences between these two compilers which could lead to this kind of problem?
edit:
My program computes the values of some parameters and there are numerous iterations. At the beginning, everything goes well. After several iterations, some NaN values appear (only when compiled by gfortran).
edit:
Thank you everybody for your answers.
So I used the Intel compiler, which helped me by giving some useful error messages.
The origin of my problems is that some variables are not initialized properly. It looks like when compiling with Compaq Visual Fortran these variables automatically take 0 as a value, whereas with gfortran (and Intel) they take random values, which explains the numerical differences that add up over the following iterations.
So now the solution is to gain a better understanding of the program in order to correct these missing initializations.
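A typical example of the kind of pattern involved (made-up names, not the actual code): the accumulator below is never set before the loop, so the old compiler's implicit zeroing hid the bug.

      SUBROUTINE ACCUM(X, N, TOTAL)
      INTEGER N, I
      DOUBLE PRECISION X(N), TOTAL
C     BUG: TOTAL is never initialized.  Compaq Visual Fortran zeroed
C     it; gfortran and ifort leave whatever happens to be in memory.
C     The fix is to add "TOTAL = 0.0D0" before the loop.
      DO 10 I = 1, N
   10 TOTAL = TOTAL + X(I)
      END

gfortran's -finit-real=nan option can also help flush out the remaining spots by poisoning uninitialized reals so they show up immediately.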
There can be several reasons for such behaviour.
What I would do is:
Switch off any optimization
Switch on all debug options. If you have access to e.g. intel compiler, use ifort -CB -CU -debug -traceback. If you have to stick to gfortran, use valgrind, its output is somewhat less human-readable, but it's often better than nothing.
Make sure there are no implicitly typed variables; use implicit none in all the modules and all the code blocks.
Use consistent float types. I personally always use real*8 as the only float type in my codes. If you are using external libraries, you might need to change call signatures for some routines (e.g., BLAS has different routine names for single and double precision variables).
If you are lucky, it's just that some variable doesn't get initialized properly, and you'll catch it with one of these techniques. Otherwise, as M.S.B. was suggesting, a deeper understanding of what the program really does is necessary. And, yes, you might need to check the algorithm manually, starting from the point where you say 'some NaN values appear'.
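On the consistent-float-types point, one pattern that helps (sketched here with a made-up module name) is to define a single working-precision kind in one place and use it everywhere, so switching the whole code between single and double precision is a one-line change:

module working_precision
  implicit none
  ! Change this one constant to switch the whole code between kinds.
  integer, parameter :: wp = kind(1.0d0)
end module working_precision

program demo_use
  use working_precision
  implicit none
  real(wp) :: a, b
  a = 1.0_wp / 3.0_wp    ! constants carry the same kind as the variables
  b = sqrt(a)
  print *, a, b
end program demo_use

The same parameter also makes it obvious which external routine names the code should be calling (e.g. the single- vs double-precision BLAS routines mentioned above).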
Different compilers can emit different instructions for the same source code. If a numerical calculation is on the boundary of working, one set of instructions might work, and another not. Most compilers have options to use more conservative floating point arithmetic, versus optimizations for speed -- I suggest checking the compiler options that you are using for the available options. More fundamentally this problem -- particularly that the compilers agree for several iterations but then diverge -- may be a sign that the numerical approach of the program is borderline. A simplistic solution is to increase the precision of the calculations, e.g., from single to double. Perhaps also tweak parameters, such as a step size or similar parameter. Better would be to gain a deeper understanding of the algorithm and possibly make a more fundamental change.
I don't know about the crash, but some differences in the results of numerical code on an Intel machine can be due to one compiler using 80-bit doubles and the other 64-bit doubles, even if not for variables then perhaps for temporary values. Moreover, floating-point computation is sensitive to the order in which elementary operations are performed. Different compilers may generate different sequences of operations.
Differences in different type implementations, differences in various non-Standard vendor extensions, could be a lot of things.
Here are just some of the language features that differ (look at gfortran and intel). Programs written to the Fortran standard work the same on every compiler, but a lot of people don't know which are standard language features and which are language extensions, and so they use them ... and when compiled with a different compiler, troubles arise.
If you post the code somewhere I could take a quick look at it; otherwise, like this, 'tis hard to say for certain.