Questions about Intel Fortran compiler options - fortran

I am currently running a Fortran code both in serial (single core) and in parallel (48 cores), and there are values such as "Infinity" or "NaN" in the output (which shouldn't appear), without any other information. I would like to use compiler options to help me locate the source of the Infinity/NaN. I tried the combination "-O0 -g -traceback -fpe3"; at run time the Infinity then appears earlier than in the normal case (without the debug options), but there is still no information about which line in the source code causes this behavior. So, I was wondering: are there any compiler options that can help me locate the source of the Infinity/NaN? Or am I already using the right combination of flags?
Thanks in advance! =)

The fpe option was the right idea! But you used the wrong number: according to the Intel Fortran Compiler documentation, when the integer after -fpe is 3:
All floating-point exceptions are disabled. Floating-point underflow is gradual, unless you explicitly specify a compiler option that enables flush-to-zero, such as -ftz or /Qftz, -O3, or -O2 on IA-32 and Intel EM64T systems. This setting provides full IEEE support.
You need to use -fpe0, which results in:
Floating-point invalid, divide-by-zero, and overflow exceptions are enabled. If any such exceptions occur, execution is aborted. This option sets the -ftz (Linux and Mac OS) or /Qftz (Windows) option; therefore underflow results will be set to zero unless you explicitly specify -no-ftz (Linux and Mac OS) or /Qftz- (Windows). On Itanium®-based systems, underflow behavior is equivalent to specifying option -ftz or /Qftz.
Together with -g and -traceback, the run will then abort at the first invalid operation, divide-by-zero, or overflow, and the traceback should point you to the source line that produced the Infinity/NaN.

Related

Cross Platform Floating Point Consistency

I'm developing a cross-platform game which plays over a network using a lockstep model. As a brief overview, this means that only inputs are communicated, and all game logic is simulated on each client's computer. Therefore, consistency and determinism are very important.
I'm compiling the Windows version on MinGW32, which uses GCC 4.8.1, and on Linux I'm compiling using GCC 4.8.2.
What struck me recently was that, when my Linux version connected to my Windows version, the program would diverge, or de-sync, instantly, even though the same code was compiled on both machines! It turns out the problem was that the Linux build was compiled as 64-bit, whereas the Windows version was 32-bit.
After compiling a Linux 32-bit version, I was relieved to find that the problem was resolved. However, it got me thinking and researching floating point determinism.
This is what I've gathered:
A program will generally be consistent if it's:
run on the same architecture
compiled using the same compiler
So if I assume, targeting a PC market, that everyone has an x86 processor, then that solves requirement one. However, the second requirement seems a little silly.
MinGW, GCC, and Clang (Windows, Linux, Mac, respectively) are all different compilers that are based on or compatible with GCC. Does this mean it's impossible to achieve cross-platform determinism? Or is it only an issue between, say, Visual C++ and GCC?
Also, do the optimization flags -O1 or -O2 affect this determinism? Would it be safer to leave them off?
In the end, I have three questions to ask:
1) Is cross-platform determinism possible when using MinGW, GCC, and Clang for compilers?
2) What flags should be set across these compilers to ensure the most consistency between operating systems / CPUs?
3) Floating point accuracy isn't that important to me -- what's important is that the results are consistent. Is there any way to reduce floating point numbers to a lower precision (like 3-4 decimal places) so that the little rounding errors across systems become non-existent? (Every implementation I've tried to write so far has failed.)
Edit: I've done some cross-platform experiments.
Using floating-point values for velocity and position, I kept a Linux Intel laptop and a Windows AMD desktop computer in sync for up to 15 decimal places of the float values. Both systems are, however, x86_64. The test was simple, though -- it was just moving entities around over a network, trying to determine any visible error.
Would it make sense to assume that the same results would hold if an x86 computer were to connect to an x86_64 computer? (32-bit vs 64-bit operating system)
Cross-platform and cross-compiler consistency is of course possible. Anything is possible given enough knowledge and time! But it might be very hard, or very time-consuming, or indeed impractical.
Here are the problems I can foresee, in no particular order:
Remember that even an extremely small error of plus-or-minus 1/10^15 can blow up to become significant (multiply a number carrying that error margin by one billion, and now you have a plus-or-minus 0.000001 error, which might be significant.) These errors can accumulate over time, over many frames, until you have a desynchronized simulation. Or they can manifest when you compare values (even naively using "epsilons" in floating-point comparisons might not help; it only displaces or delays the manifestation.)
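To make the accumulation point concrete, here is a minimal C sketch (the step size, per-frame error, and tolerance are made-up illustration values): two simulations that disagree by a hair each frame eventually drift past any fixed epsilon.

#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 0.0, b = 0.0;
    const double step = 0.1;    /* per-frame position increment              */
    const double err  = 1e-12;  /* tiny per-frame disagreement between the
                                   two "machines"                            */
    const double eps  = 1e-6;   /* naive comparison tolerance                */

    for (long frame = 1; frame <= 10000000L; ++frame) {
        a += step;              /* machine A                                 */
        b += step + err;        /* machine B, off by a hair every frame      */
        if (fabs(a - b) > eps) {
            printf("desync at frame %ld, |a-b| = %g\n", frame, fabs(a - b));
            return 0;
        }
    }
    printf("still within epsilon after 10 million frames\n");
    return 0;
}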
The above problem is not unique to distributed deterministic simulations (like yours.) It touches on the issue of "numerical stability", which is a difficult and often neglected subject.
Different compiler optimization switches, and different floating-point behavior switches, might lead the compiler to generate slightly different sequences of CPU instructions for the same statements. Obviously these must be the same across compilations, using the same exact compilers, or the generated code must be rigorously compared and verified.
32-bit and 64-bit programs (note: I'm saying programs and not CPUs) will probably exhibit slightly different floating-point behaviors. By default, 32-bit programs cannot rely on anything more advanced than the x87 instruction set from the CPU (no SSE, SSE2, AVX, etc.) unless you specify this on the compiler command line (or use intrinsics/inline assembly instructions in your code.) On the other hand, a 64-bit program is guaranteed to run on a CPU with SSE2 support, so the compiler will use those instructions by default (again, unless overridden by the user.) While x87 and SSE2 float datatypes and operations on them are similar, they are - AFAIK - not identical, which will lead to inconsistencies in the simulation if one program uses one instruction set and another program uses another.
The x87 instruction set includes a "control word" register, which contains flags that control some aspects of floating-point operations (e.g. exact rounding behavior, etc.) This is a runtime thing: your program can do one set of calculations, then change this register, and after that do the exact same calculations and get a different result. Obviously, this register must be checked and handled and kept identical on the different machines. It is possible for the compiler (or the libraries you use in your program) to generate code that changes these flags at runtime inconsistently across the programs.
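As a small illustration of how this runtime state matters, the standard C99 <fenv.h> interface can change the rounding mode (which on x86 lives in the x87 control word and/or MXCSR) between two otherwise identical calculations. This is only a sketch; compile it without -ffast-math (and possibly with -frounding-math) so the compiler respects the mode changes.

#include <stdio.h>
#include <fenv.h>

int main(void) {
    volatile double x = 1.0, y = 3.0;   /* volatile: keep the divisions at runtime */

    fesetround(FE_TONEAREST);
    double a = x / y;                   /* 1/3 rounded to nearest               */

    fesetround(FE_UPWARD);
    double b = x / y;                   /* same expression, rounded upward      */

    fesetround(FE_TONEAREST);           /* restore the default mode             */

    printf("to nearest: %.20f\n", a);
    printf("upward:     %.20f\n", b);
    printf("equal? %s\n", (a == b) ? "yes" : "no");
    return 0;
}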
Again, in the case of the x87 instruction set, Intel and AMD have historically implemented things a little differently. For example, one vendor's CPU might internally do some calculations using more bits (and therefore arrive at a more accurate result) than the other, which means that if you happen to run on two different CPUs (both x86) from two different vendors, the results of simple calculations might not be the same. I don't know how and under what circumstances these higher-accuracy calculations are enabled, and whether they happen under normal operating conditions or you have to ask for them specifically, but I do know these discrepancies exist.
Random numbers, and generating them consistently and deterministically across programs, have nothing to do with floating-point consistency. They're important and a source of many bugs, but in the end they're just a few more bits of state that you have to keep in sync.
And here are a couple of techniques that might help:
Some projects use "fixed-point" numbers and fixed-point arithmetic to avoid rounding errors and general unpredictability of floating-point numbers. Read the Wikipedia article for more information and external links.
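For illustration, here is a minimal Q16.16 fixed-point sketch in C; it is not a real library (it ignores the overflow and rounding concerns a production implementation would handle), but because it uses only integer arithmetic the results are bit-identical across compilers and CPUs.

#include <stdint.h>
#include <stdio.h>

typedef int32_t fix16;                      /* Q16.16: 16 integer bits, 16 fraction bits */
#define FIX_ONE ((fix16)1 << 16)

static fix16  fix_from_int(int n)       { return (fix16)(n * 65536); }
static double fix_to_double(fix16 a)    { return a / 65536.0; }
static fix16  fix_add(fix16 a, fix16 b) { return a + b; }
static fix16  fix_mul(fix16 a, fix16 b) { return (fix16)(((int64_t)a * b) >> 16); }

int main(void) {
    fix16 half  = FIX_ONE / 2;                              /* 0.5             */
    fix16 three = fix_from_int(3);                          /* 3.0             */
    fix16 sum   = fix_add(fix_mul(half, three), FIX_ONE);   /* 0.5*3 + 1 = 2.5 */
    printf("result = %f\n", fix_to_double(sum));
    return 0;
}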
In one of my own projects, during development, I used to hash all the relevant state (including a lot of floating-point numbers) in all the instances of the game and send the hash across the network each frame to make sure even one bit of that state wasn't different on different machines. This also helped with debugging: instead of trusting my eyes to see when and where inconsistencies existed (which wouldn't tell me where they originated, anyway), I would know the instant some part of the state of the game on one machine started diverging from the others, and know exactly what it was (if the hash check failed, I would stop the simulation and start comparing the whole state.)
This feature was implemented in that codebase from the beginning, and was used only during the development process to help with debugging (because it had performance and memory costs.)
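A minimal sketch of that hashing idea in C (the state struct and its fields are made up for the example; a real game would fold every piece of simulation state into the hash and compare the value with its peers every frame):

#include <stdint.h>
#include <stdio.h>

/* Example state; assumes no uninitialized padding bytes, since we hash raw memory. */
typedef struct {
    float    pos_x, pos_y;
    float    vel_x, vel_y;
    uint32_t rng_state;
} EntityState;

/* FNV-1a over raw bytes: any single-bit difference changes the hash. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t h) {
    const unsigned char *p = (const unsigned char *)data;
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void) {
    EntityState e = { 1.0f, 2.0f, 0.5f, -0.25f, 12345u };
    uint64_t h = fnv1a(&e, sizeof e, 1469598103934665603ULL);
    /* If this value differs between two peers on the same frame, their
       simulations have already diverged by at least one bit. */
    printf("frame hash: %016llx\n", (unsigned long long)h);
    return 0;
}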
Update (in answer to first comment below): As I said in point 1, and others have said in other answers, that doesn't guarantee anything. If you do that, you might decrease the probability and frequency of an inconsistency occurring, but the likelihood doesn't become zero. If you don't analyze what's happening in your code and the possible sources of problems carefully and systematically, it is still possible to run into errors no matter how much you "round off" your numbers.
For example, if you have two numbers (e.g. as results of two calculations that were supposed to produce identical results) that are 1.111499999 and 1.111500001 and you round them to three decimal places, they become 1.111 and 1.112 respectively. The original numbers' difference was only 2E-9, but it has now become 1E-3. In fact, you have increased your error 500'000 times. And still they are not equal even with the rounding. You've exacerbated the problem.
True, this doesn't happen much, and the examples I gave are two unlucky numbers to get in this situation, but it is still possible to find yourself with these kinds of numbers. And when you do, you're in trouble. The only sure-fire solution, even if you use fixed-point arithmetic or whatever, is to do rigorous and systematic mathematical analysis of all your possible problem areas and prove that they will remain consistent across programs.
Short of that, for us mere mortals, you need to have a water-tight way to monitor the situation and find exactly when and how the slightest discrepancies occur, to be able to solve the problem after the fact (instead of relying on your eyes to see problems in game animation or object movement or physical behavior.)
1) No, not in practice. For example, sin() might come from a library or from a compiler intrinsic, and differ in rounding. Sure, that's only one bit, but that's already out of sync. And that one-bit error may add up over time, so even an imprecise comparison may not be sufficient.
2) N/A
3) You can't reduce FP precision for a given type, and I don't even see how it would help you. You'd turn the occasional 1E-6 difference into an occasional 1E-4 difference.
Next to your concerns on determinism, I have another remark: if you are worried about calculation consistency on a distributed system, you may have a design issue.
You could think about your application as a bunch of nodes, each responsible for its own calculations. If information about another node is needed, it should be sent to you by that node.
1.)
In principle, cross-platform, cross-OS, cross-hardware consistency is possible, but in practice it's a pain.
In general your results will depend on which OS you use, which compiler, and which hardware you use. Change any one of those and your results might change. You have to test all changes.
I use Qt Creator and qmake (cmake is probably better but qmake works for me) and test my code in MSVC on Windows, GCC on Linux, and MinGW-w64 on Windows. I test both 32-bit and 64-bit. This has to be done whenever code changes.
2.) and 3.)
In terms of floating point, some compilers will use x87 instead of SSE in 32-bit mode. See this question as an example of the consequences when that happens: Why a number crunching program starts running much slower when diverges into NaNs? All 64-bit x86 systems have SSE2, so I think most compilers use SSE/AVX by default in 64-bit mode; otherwise, e.g. in 32-bit mode, you might need to force SSE with something like -mfpmath=sse and -msse2.
But if you want a more compatible version of GCC on Windows, then I would use MinGW-w64 for 32-bit (aka MinGW-w32) or MinGW-w64 for 64-bit. This is not the same thing as MinGW (aka mingw32); the projects have diverged. MinGW depends on MSVCRT (the MSVC C runtime library) and MinGW-w64 does not. The Qt project has a pretty good description of MinGW-w64 and its installation. http://qt-project.org/wiki/MinGW-64-bit
You might also want to consider writing a CPU dispatcher; see the question "cpu dispatcher for visual studio for AVX and SSE".
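That question is about MSVC, but with GCC or Clang on x86 a rough dispatcher sketch could look like this (do_work_avx and do_work_sse2 are hypothetical placeholders for routines you would compile with the appropriate -m options, e.g. in separate translation units):

#include <stdio.h>

static void do_work_sse2(void) { puts("SSE2 code path"); }
static void do_work_avx(void)  { puts("AVX code path");  }

int main(void) {
    __builtin_cpu_init();                 /* initialize the CPU feature data  */
    if (__builtin_cpu_supports("avx"))    /* GCC/Clang builtin (GCC >= 4.8)   */
        do_work_avx();
    else
        do_work_sse2();
    return 0;
}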

Why is bounds checking changing the behavior of my program?

I have a thermal hydraulics code written in Fortran that I work on. For my debug version, I use the -check bounds option in ifort 11.1 during compile time. I have caught array bounds errors in the past in this way. Recently, though, I was seeing that the solution was quickly blowing up for a given case. The peculiar thing was that it was converging nicely for the release version of the code. Sure enough, removing the -check bounds flag from my debug makefile cleared up the problem.
The strange thing is that the debug version was working fine for many other test cases I used before, and it wasn't reporting any errors about going outside of any array bounds in my code. This behavior seems very strange to me, and I have no idea if there is some kind of bug in my code or what. Does anybody have any ideas what could be causing this sort of behavior?
As requested, the flags I use for release and debug are:
Release: -c -r8 -traceback -extend-source -override-limits -zero -unroll -O3
Debug: -c -r8 -traceback -extend-source -override-limits -zero -g -O0
Of course, as my original question indicates, I toggle the -check bounds flag on and off for the debug case.
I would suspect your numerical algorithm here more than the Fortran code. Have you ensured that all of convergence and stability criteria have been met?
What it sounds like is that round-off error is causing the solution to fail to converge. If you are on the edges of safe convergence, compiler optimizations can definitely tip things one way or another.
I use gfortran more than ifort, so I don't know all the specifics of the -unroll option, but unrolling loops can change some rounding even though the calculations seem like they should remain the same. Also, a debug build will definitely change the exact order of memory and register access. If a number is held in the processor in some internal representation, then is written to memory and read back again, the value can change. This can be alleviated to some extent by careful selection of kind. By its nature, this will be processor-specific rather than portable.
In theory, full compliance with IEEE 754 would make floating point operations reproducible, but this is not always the case. If debug is actually causing these problems as opposed to some other bug in your code, then other mysterious things related to the inner workings of the processor could also cause it to blow up.
I would add write statements at various key points in the code to output your data matrices (or whatever data structures you are using). Be sure to use binary output. Open with form='unformatted' and access='direct'.

How do I ensure lrint is inlined in gcc?

After reading around the subject, there is overwhelming evidence from numerous sources that using standard C or C++ casts to convert from floating point to integer numbers on Intel is very slow. In order to meet the ANSI/ISO specification, Intel CPUs need to execute a large number of instructions, including those needed to switch the rounding mode of the FPU hardware.
There are a number of workarounds described in various documents, but the cleanest and most portable seems to be the lrint() call added to the C99 and C++0x standards. Many documents say that a compiler should inline-expand these functions when optimization is enabled, leading to code which is faster than a conventional cast or a function call.
I even found references to gcc feature-tracking bugs about adding this inline expansion to the gcc optimizer, but in my own performance tests I have been unable to get it to work. All my attempts show lrint performance to be much slower than a simple C or C++ style cast. Examining the assembly output of the compiler, and disassembling the compiled objects, always shows an explicit call to an external lrint() or lrintf() function.
The gcc versions I have been working with are 4.4.3 and 4.6.1, and I have tried a number of flag combinations on 32-bit and 64-bit x86 targets, including options to explicitly enable SSE.
How do I get gcc to inline expand lrint, and give me fast conversions?
The lrint() function may raise domain and range errors. One possible way the libc deals with such errors is setting errno (see C99/C11 section 7.12.1). The overhead of the error checking can be quite significant and in this particular case seems to be enough for the optimizer to decide against inlining.
The gcc flag -fno-math-errno (which is part of -ffast-math) will disable these checks. It might be a good idea to look into -ffast-math if you do not rely on standards-compliant handling of floating-point semantics, in particular NaNs and infinities...
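As a test harness, something like the following sketch can be used; whether the call actually gets inlined depends on your gcc version, target, and flags, so treat the suggested flags as things to try rather than a guarantee.

/* Try:  gcc -O2 -fno-math-errno -msse2 -mfpmath=sse -S lrint_test.c
   then look in lrint_test.s for a cvtsd2si instruction instead of
   "call lrint". Link with -lm if you build an executable. */
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile double x = 2.5;        /* volatile: stop compile-time folding     */
    long n = lrint(x);              /* rounds using the current rounding mode  */
    printf("lrint(%.1f) = %ld\n", x, n);
    return 0;
}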
Have you tried the -finline-functions flag with gcc?
You can also direct GCC to try to integrate all “simple enough” functions into their callers with the option -finline-functions.
see http://gcc.gnu.org/onlinedocs/gcc/Inline.html
This tells gcc to try to inline every function, but not all of them will actually be inlined.
The compiler uses heuristics to determine whether a function is small enough to be inlined. One more thing: recursive functions are also not going to be inlined here.

Is there a (Linux) g++ equivalent to the /fp:precise and /fp:fast flags used in Visual Studio?

Background:
Many years ago, I inherited a codebase that was using the Visual Studio (VC++) flag '/fp:fast' to produce faster code in a particular calculation-heavy library. Unfortunately, '/fp:fast' produced results that were slightly different from the same library under a different compiler (Borland C++). As we needed to produce exactly the same results, I switched to '/fp:precise', which worked fine, and everything has been peachy ever since. However, now I'm compiling the same library with g++ on Ubuntu Linux 10.04 and I'm seeing similar behavior, and I wonder if it might have a similar root cause. The numerical results from my g++ build are slightly different from the numerical results from my VC++ build. This brings me to my question:
Question:
Does g++ have equivalent or similar parameters to the '/fp:fast' and '/fp:precise' options in VC++? (And what are they? I want to activate the '/fp:precise' equivalent.)
More Verbose Information:
I compile using 'make', which calls g++. So far as I can tell (the makefiles are a little cryptic, and weren't written by me), the only parameters added to the g++ call are the "normal" ones (include folders and the files to compile) and -fPIC (I'm not sure what this switch does; I don't see it on the 'man' page).
The only relevant parameters in 'man g++' seem to be for turning optimization options ON (e.g. -funsafe-math-optimizations). However, I don't think I'm turning anything ON; I just want to turn the relevant optimization OFF.
I've tried Release and Debug builds, VC++ gives the same results for release and debug, and g++ gives the same results for release and debug, but I can't get the g++ version to give the same results as the VC++ version.
From the GCC manual:
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
To expand a bit, most of these discrepancies come from the use of the x86 80-bit floating point registers for calculations (vs. the 64-bits used to store double values). If intermediate results are kept in the registers without writing back to memory, you effectively get 16 bits of extra precision in your calculations, making them more precise but possibly divergent from results generated with write/read of intermediate values to memory (or from calculations on architectures that only have 64-bit FP registers).
These flags (both in GCC and MSVC) generally force truncation of each intermediate result to 64-bits, thereby making calculations insensitive to the vagaries of code generation and optimization and platform differences. This consistency generally comes with a slight runtime cost in addition to the cost in terms of accuracy/precision.
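Here is a small C sketch of the effect being described. Whether you actually see a difference depends on the target and flags: an x87 build without -ffloat-store may keep the right-hand side of the comparison at 80-bit precision, while an SSE build (or one compiled with -ffloat-store) should print "consistent".

#include <stdio.h>

int main(void) {
    volatile double x = 1.0 / 3.0;  /* volatile: block constant folding        */
    volatile double y = 3.0;
    double stored = x * y;          /* may be rounded to 64 bits when spilled  */

    if (stored == x * y)            /* the right side may stay in an 80-bit
                                       x87 register and compare unequal        */
        puts("consistent");
    else
        puts("excess precision visible");
    return 0;
}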
Excess register precision is an issue only on FPU registers, which compilers (with the right enabling switches) tend to avoid anyway. When floating point computations are carried out in SSE registers, the register precision equals the memory one.
In my experience most of the /fp:fast impact (and potential discrepancy) comes from the compiler taking the liberty to perform algebraic transforms. This can be as simple as changing summands order:
( a + b ) + c --> a + ( b + c)
it can be distributing multiplications like a*(b+c) at will, and it can get to some rather complex transforms - all intended to reuse previous calculations.
In infinite precision such transforms are benign, of course - but in finite precision they actually change the result. As a toy example, try the summand-order-example with a=b=2^(-23), c = 1. MS's Eric Fleegal describes it in much more detail.
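A quick way to watch reassociation change a result in single precision is the sketch below. Note that with a = b = 2^(-23) the two groupings happen to round to the same float, so the sketch uses 2^(-24), where round-to-nearest-even makes them differ by one ulp; compile for SSE math (the default on x86_64, or -mfpmath=sse on 32-bit) so x87 excess precision doesn't hide the difference.

#include <stdio.h>

int main(void) {
    float a = 0x1p-24f, b = 0x1p-24f, c = 1.0f;

    float left  = (a + b) + c;   /* 2^-23 is exact, then 1 + 2^-23 is exact     */
    float right = a + (b + c);   /* 1 + 2^-24 rounds to 1, then rounds to 1 again */

    printf("(a+b)+c = %.9g\n", left);    /* 1.00000012 */
    printf("a+(b+c) = %.9g\n", right);   /* 1          */
    return 0;
}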
In this respect, the gcc switch nearest to /fp:precise is -fno-unsafe-math-optimizations. I think it's on by default - perhaps you can try setting it explicitly and see if it makes a difference. Similarly, you can try explicitly turning off all -ffast-math optimizations: -fno-finite-math-only, -fmath-errno, -ftrapping-math, -frounding-math and -fsignaling-nans (the last 2 options are non default!)
I don't think there's an exact equivalent. You might try -mfpmath=sse instead of the default -mfpmath=387 to see if that helps.
This is definitely not related to optimization flags, assuming by "Debug" you mean "with optimizations off." If g++ gives the same results in debug as in release, that means it's not an optimization-related issue.
Debug builds should always store each intermediate result in memory, thereby guaranteeing the same results as /fp:precise does for MSVC.
This likely means there is (a) a compiler bug in one of the compilers, or more likely (b) a math library bug. I would drill into individual functions in your calculation and narrow down where the discrepancy lies. You'll likely find a workaround at that point, and if you do find a bug, I'm sure the relevant team would love to hear about it.
-mpc32 or -mpc64?
But you may need to recompile C and math libraries with the switch to see the difference... This may apply to options others suggested as well.

Getting hardware floating point with android NDK

I've begun playing with the Android NDK. One of the things I've just learnt is about creating an Application.mk file to specify the armv7 ABI.
I'm building the san-angeles example with the following parameters.
APP_MODULES := sanangeles
APP_PROJECT_PATH := $(call my-dir)/../
APP_OPTIM := release
APP_ABI := armeabi-v7a
However, this seems to run at exactly the same speed as it did before (i.e. badly). Am I just GL-limited and not CPU-limited, or is something wrong here?
I have noticed when I compile that I get the following command line options emitted:
-march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb
The thing that worries me there is the "softfp". There IS mention of the v7 ABI, the VFP FPU stuff, and I'm guessing "thumb" refers to the Thumb-2 instructions (though I don't know what exactly these are). However, that "softfp" does concern me. Shouldn't it be "hardfp"?
Anyone got any ideas on these questions? I think I'm probably about ready to start implementing some GL ES 2.0 code for my HTC Desire but I'd like to make sure I'm getting the best possible speed out of it :)
Cheers in advance!
The options you supply to the NDK will only affect the way your code is compiled. It won't change the GL libs or anything else that's part of the platform, which are always generated in an appropriate fashion. If you're just throwing geometry at the GL hardware, you're not going to see a difference.
If you want to see if your options are having an effect, download (or create) a simple benchmark that does a bunch of operations with double-precision floating point values, and time how long it takes to execute before and after.
The -mfloat-abi=softfp argument determines how floating-point values are passed between functions. softfp means they're always passed in integer registers or on the stack. If Android didn't specify softfp, the ARMv7-A version of the library would expect floats to show up in hardware registers, and any code built for ARMv5TE would break.
"softfp" adds a little overhead to some functions, but the instructions for moving values in and out of fp registers are cheap on ARM, and the ABI compatibility provided makes it worthwhile.
The "-mthumb" enables generation of Thumb/Thumb2 code. Thumb code tends to be a bit slower but a bit smaller than equivalent ARM; sometimes smaller means you'll fit better in the CPU i-cache and will actually run faster. Size is always a concern on these devices, so Thumb is enabled by default.
When in doubt, "arm-eabi-objdump -d whatever.o" will show you a disassembly of your code.
Update: NDK r9b added support for -mhard-float. This allows you to build NDK libraries with hard-float API conventions for armeabi-v7a targets.