I am porting Fortran code from Fortran PowerStation (version 4.0) to the Fortran 11 (2003) compiler. The old compiler (PowerStation) uses 53-bit precision. When porting to the new compiler, I am not getting the proper or exact values for my real/float variables. I hope the new compiler is 64-bit precision. So I think I need to change the FPU (floating point unit) from 53-bit to 64-bit precision. Is this correct? If so, how do I go about changing from 53-bit to 64-bit precision using the properties of the new compiler? If not, what should I be doing?
Thanks in advance.
The portable way to request floating point precision in Fortran 90/95/2003 is with the selected_real_kind intrinsic function. For example,
integer, parameter :: DoubleReal_K = selected_real_kind (14)
will define an integer constant DoubleReal_K that identifies a real kind with at least 14 decimal digits of precision:
real (DoubleReal_K) :: MyFloat
Requesting 14 decimal digits will typically get you a double-precision real with a 53-bit significand -- but the only guarantee is 14 decimal digits.
If you need more precision, use a value larger than 14 to request a longer type -- 17 decimal digits might get you extended precision (a 64-bit significand), or quadruple precision, or nothing, depending on the compiler. If the compiler has a larger type available, it will provide it; otherwise, get a better compiler. Why are you using such an old and unsupported compiler? Also, don't expect exact results from floating point calculations -- it is normal for a change of compiler or platform to cause small changes in the results.
You hope the new compiler is 64-bit precision? I would sort of expect you to read the manual and figure that out yourself, but if you can't do that, tell us which compiler you are using and someone might help.
How different are the results of the old code and the code compiled with the new compiler? Of course the results won't be exactly the same if the precision has changed -- how could they be, unless you take special steps to ensure sameness?
Related
acos(double) gives different result on x64 and x32 Visual Studio.
printf("%.30g\n", double(acosl(0.49990774364240564)));
printf("%.30g\n", acos(0.49990774364240564));
on x64: 1.0473040763868076
on x32: 1.0473040763868078
on linux4.4 x32 and x64 with sse enabled: 1.0473040763868078
is there a way to make VSx64 acos() give me 1.0473040763868078 as result?
TL:DR: this is normal and you can't reasonably change it.
The 32-bit library may be using 80-bit FP values in x87 registers for its temporaries, avoiding rounding off to 64-bit double after every operation. (Unless there's a whole separate library, compiling your own code to use SSE doesn't change what's inside the library, or even the calling convention for passing data to the library. But since 32-bit passes double and float in memory on the stack, a library is free to load them with SSE2 or with x87. Still, you only get the performance advantage of passing FP values in xmm registers if the library requires SSE, i.e. is unusable from non-SSE code.)
It's also possible that they're different simply because they use a different order of operations, producing different temporaries along the way. That's less plausible, unless they're separately hand-written in asm. If they're built from the same C source (without "unsafe" FP optimizations), then the compiler isn't allowed to reorder things, because of the non-associative behaviour of FP math.
glibc's libm (used on Linux) typically favours precision over speed, so it's giving you the correctly-rounded result out to the last bit of the mantissa on both 32-bit and 64-bit. The IEEE FP standard only requires the basic operations (+ - * / FMA and FP remainder) to be "correctly rounded" out to the last bit of the mantissa (i.e. rounding error of at most 0.5 ulp). (The exact result, according to calc, is 1.047304076386807714.... Keep in mind that double (on x86 with normal compilers) is IEEE754 binary64, so internally the mantissa and exponent are in base 2. If you print enough extra decimal digits, though, you can tell that ...7714 should round up to ...78, although really you should print more digits in case they're not zero beyond that. I'm just assuming it's ...78000.)
So Microsoft's 64-bit library implementation produces 1.0473040763868076 and there's pretty much nothing you can do about it, other than not use it. (e.g. find your own acos() implementation and use it.) But FP determinism is hard, even if you limit yourself to just x86 with SSE. See Does any floating point-intensive code produce bit-exact results in any x86-based architecture?. If you limit yourself to a single compiler, it can be possible if you avoid complicated library functions like acos().
You might be able to get the 32-bit library version to produce the same value as the 64-bit version, if it uses x87 and changing the x87 precision setting affects it. But the other way around is not possible: SSE2 has separate instructions for 64-bit double and 32-bit float, and always rounds after every instruction, so there is no setting you can change that will increase the precision of the result. (You could change the SSE rounding mode, and that will change the result, but not in a good way!)
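If the 32-bit build goes through x87, one knob you can experiment with is the x87 precision control word. Here is a minimal sketch using MSVC's _controlfp_s (whether this actually changes what the 32-bit acos() computes internally is an assumption; it only matters if the library's code path honours the x87 precision setting):

#include <cmath>
#include <cstdio>
#include <float.h>

int main()
{
    unsigned int prev = 0;
    // Set the x87 precision control to a 53-bit significand (_PC_53).
    // This only affects x87 code paths, so it is relevant to 32-bit builds;
    // the precision-control mask is not supported on x64.
    _controlfp_s(&prev, _PC_53, _MCW_PC);
    std::printf("%.17g\n", std::acos(0.49990774364240564));
    return 0;
}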
See also:
Intermediate Floating-Point Precision and the rest of Bruce Dawson's excellent series of articles about floating point (table of contents).
The linked article describes how some versions of VC++'s CRT runtime startup set the x87 FP register precision to a 53-bit mantissa instead of 80-bit full precision. Also, D3D9 will set it to 24 bits, so with x87 even double only has the precision of float.
https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma
What Every Computer Scientist Should Know About Floating-Point Arithmetic
You may have reached the precision limit. Double precision is approximately 16 digits. Beyond that, there is no guarantee the digits are valid. If so, you cannot change this behavior, except by changing the type from double to something that supports higher precision.
E.g. long double, if your machine and compiler support the extended 80-bit double or 128-bit quadruple precision (this is also machine dependent; see here for example).
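As a rough sketch of that suggestion (this assumes a toolchain where long double is the 80-bit extended type, e.g. GCC on x86; on MSVC long double is the same 64-bit format as double, so it buys nothing there):

#include <cmath>
#include <cstdio>

int main()
{
    // std::acos on a long double argument uses the long double overload;
    // with GCC on x86 that is 80-bit extended precision.
    long double r = std::acos(0.49990774364240564L);
    std::printf("%.21Lg\n", r);  // print enough digits to see the extra precision
    return 0;
}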
I disagree that there isn't much you can do about it.
For example, you could try changing the floating point model compiler options.
Here are my results with different floating point models (note /fp:precise is the default):
/fp:precise 1.04730407638680755866289473488
/fp:strict 1.04730407638680755866289473488
/fp:fast 1.04730407638680778070749965991
So it seems you are looking for /fp:fast. Whether that gives the most accurate result remains to be seen though.
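If you want to reproduce this comparison, a minimal test program could look like the one below (assuming MSVC's cl.exe; build the same source once per /fp option and diff the output):

#include <cmath>
#include <cstdio>

int main()
{
    // Build e.g. with:  cl /fp:precise test.cpp   and   cl /fp:fast test.cpp
    std::printf("%.30g\n", std::acos(0.49990774364240564));
    return 0;
}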
I am coding in Qt C++.
I am declaring a literal double value of 0.1 in my main routine. I then pass this value to a function that takes type double. When I hover the mouse over the variable in debug mode I see 0.100000000000001.
What is the cause of this change and how do I stop it please?
Method Definition:
void MyClass::MyMethod(double fNewCellSize)
{
// variable fNewCellSize appears as 0.10000000000001
}
Method Call with literal value of 0.1:
MyObj.MyMethod(0.1);
My environment is Windows 64 bit OS using Qt 5.2.1 and compiling using Microsoft 2010 Visual Studio compiler in 32 bit.
Many decimal numbers are not exactly representable in IEEE floating point (i.e. the binary representation used for double), and 0.1 falls into this camp. When you type 0.1, the C++ compiler is required to convert that into a double to be represented on your hardware. It does so by computing the nearest representable approximation, and so you see a bit of error.
If you try something like a power of two, e.g. 0.5 or 0.25, these will be exactly represented.
See this link for a much more in-depth description of the idea.
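A small sketch that makes this visible (assuming a compiler with IEEE 754 doubles):

#include <cstdio>

int main()
{
    // 0.1 has no exact binary representation, so extra digits show up.
    std::printf("%.17g\n", 0.1);   // prints 0.10000000000000001
    // Powers of two are stored exactly.
    std::printf("%.17g\n", 0.5);   // prints 0.5
    std::printf("%.17g\n", 0.25);  // prints 0.25
    return 0;
}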
This is normal behaviour in any computer. It's caused by the fact that decimal numbers are being stored in fixed-size binary memory. The extra digits come from the inherent error introduced when the decimal value is converted to binary and back.
You will not get rid of them. The best you can do is choose a precision (either float or double) that is big enough so that the extra error digits will not make any difference to your solution, and then chop them off when you display the number.
Questions like "Why isn't 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1 = 0.8?" got me thinking that...
... It would probably be nice to have the compiler warn about floating-point constants that it rounds to the nearest representable value in the binary floating-point type (e.g. 0.1 and 0.8 are rounded in radix-2 floating-point; otherwise they'd need an infinite amount of space to store their infinite number of digits).
I've looked up gcc warnings and so far found none for this purpose (-Wall, -Wextra, -Wfloat-equal, -Wconversion, -Wcoercion (unsupported or C only?), -Wtraditional (C only) don't appear to be doing what I want).
I haven't found such a warning in Microsoft Visual C++ compiler either.
Am I missing a hidden or rarely-used option?
Is there any compiler at all that has this kind of warning?
EDIT: This warning could be useful for educational purposes and serve as a reminder to those new to floating-point.
There is no technical reason the compiler could not issue such warnings. However, they would be useful only for students (who ought to be taught how floating-point arithmetic works before they start doing any serious work with it) and people who do very fine work with floating-point. Unfortunately, most floating-point work is rough; people throw numbers at the computer without much regard for how the computer works, and they accept whatever results they get.
The warning would have to be off by default to support the bulk of existing floating-point code. Were it available, I would turn it on for my code in the Mac OS X math library. Certainly there are points in the library where we depend on every bit of the floating-point value, such as places where we use extended-precision arithmetic, and values are represented across more than one floating-point object (e.g., we would have one object with the high bits of 1/π, another object with 1/π minus the first object, and a third object with 1/π minus the first two objects, giving us about 150 bits of 1/π). Some such values are represented in hexadecimal floating-point in the source text, to avoid any issues with compiler conversion of decimal numerals, and we could readily convert any remaining numerals to avoid the new compiler warning.
However, I doubt we could convince the compiler developers that enough people would use this warning or that it would catch enough bugs to make it worth their time. Consider the case of libm. Suppose we generally wrote exact numerals for all constants but, on one occasion, wrote some other numeral. Would this warning catch a bug? Well, what bug is there? Most likely, the numeral is converted to exactly the value we wanted anyway. When writing code with this warning turned on, we are likely thinking about how the floating-point calculations will be performed, and the value we have written is one that is suitable for our purpose. E.g., it may be a coefficient of some minimax polynomial we calculated, and the coefficient is as good as it is going to get, whether represented approximately in decimal or converted to some exactly-representable hexadecimal floating-point numeral.
So, this warning will rarely catch bugs. Perhaps it would catch an occasion where we mistyped a numeral, accidentally inserting an extra digit into a hexadecimal floating-point numeral, causing it to extend beyond the representable significand. But that is rare. In most cases, the numerals we use are either simple and short or are copied and pasted from software that has calculated them. On some occasions, we will hand-type special values, such as 0x1.fffffffffffffp0. A warning when an extra “f” slips into that numeral might catch a bug during compilation, but that error would almost certainly be caught quickly in testing, since it drastically alters the special value.
So, such a compiler warning has little utility: Very few people will use it, and it will catch very few bugs for the people who do use it.
The warning is in the source: when you write float, double, or long double, including any of their respective literals. Obviously, some literals are exact, but even this doesn't help much: the sum of two exact values may be inexact, e.g., if they have rather different scales. Having the compiler warn about inexact floating point constants would generate a false sense of security. Also, what are you meant to do about rounded constants? Writing the exact closest value explicitly would be error prone and obfuscate the intent. Writing them differently, e.g., writing 1.0 / 10.0 instead of 0.1, also obfuscates the intent and could yield different values.
There will be no such compiler switch and the reason is obvious.
We are writing down the binary components in decimal:
First fractional bit is 0.5
Second fractional bit is 0.25
Third fractional bit is 0.125
....
Do you see it? Because each of these ends in the digit 5, every additional bit needs another decimal digit to represent it exactly: one bit needs one decimal digit, two bits need two decimal digits, and so on.
So for the fractional part of a floating point number this means that, for most decimal numbers, you need 24(!) decimal digits for single precision floats and 53(!!) decimal digits for double precision.
Worse, those exact digits carry no extra information; they are pure artifacts caused by the base change.
No one is going to write down 3.141592653589793115997963468544185161590576171875 for pi just to avoid a compiler warning.
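If you want to see those long exact expansions for yourself, here is a small sketch (assuming IEEE 754 float and double, and a C library such as glibc that converts the stored value exactly instead of rounding to 17 digits and padding):

#include <cstdio>

int main()
{
    // The stored values are exact binary fractions; printing them with many
    // digits reveals their full decimal expansions.
    std::printf("float  0.1 = %.40f\n", 0.1f);
    std::printf("double 0.1 = %.60f\n", 0.1);
    return 0;
}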
I don't see how a compiler would know, or how it could warn you, about something like that. It is only a coincidence when a number can be exactly represented by something that is inherently inexact.
I am trying to make equivalence tests on an algorithm written in C++ and in Matlab.
The algorithm contains some kind of a loop in time and runs more than 1000 times. It has arithmetic operations and some math functions.
I feed the initial inputs to both platforms by hand (like a=1.767, b=6.65, ...), and when I check the hexadecimal representations of those inputs they are the same. So no problem with the inputs. I pass the outputs of the C++ code to MATLAB via a text file with 16 decimal digits (I use a "setprecision(32)" statement).
But here comes the problem: although through the 614th step both codes give exactly the same results, at step 615 I get a difference of about 2.xxx..xxe-19. And after this step the error becomes larger and larger, and at the end of the runs it is about 5.xx..xxe-14.
0x3ff1 3e42 a211 6cca--->[C++ function]--->0x3ff4 7619 7005 5a42
0x3ff1 3e42 a211 6cca--->[MATLAB function]--->ans
ans - 0x3ff4 7619 7005 5a42
= 2.xxx..xxe-19
I looked into how MATLAB handles numbers and found really interesting things, like the "denormalized mantissa". While realmin is about 1e-308, by denormalizing the mantissa MATLAB can represent real numbers down to about 1e-324. Further, MATLAB holds many more digits for "pi" or "exp(1)" than C++ does.
On the other hand, the MATLAB help says that whatever format it displays, MATLAB uses double precision internally.
So, I'd really appreciate it if someone could explain what the exact reason is for these differences. How can we make equivalence tests between MATLAB and C++?
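For reference, the hexadecimal check described above can be done by dumping the raw bit pattern of each double. A minimal sketch of the C++ side (MATLAB's num2hex(x) should produce the matching 16 hex digits):

#include <cstdio>
#include <cstring>
#include <cstdint>

// Print the exact bit pattern of a double as 16 hex digits,
// directly comparable with MATLAB's num2hex(x).
void print_bits(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    std::printf("%016llx\n", static_cast<unsigned long long>(bits));
}

int main()
{
    print_bits(0.1);  // prints 3fb999999999999a
    return 0;
}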
There is one thing about floating-point numbers on x86 CPUs. Internally, the floating point unit uses registers that are 10 bytes, i.e. 80 bits. Furthermore, the CPU has a setting that tells whether floating point calculations should be made with 32-bit (float), 64-bit (double) or 80-bit precision. Less precision means faster floating point operations. (The 32-bit mode used to be popular for video games, where speed wins over precision.)
From this I remember tracking a bug in a calculation library (DLL) that, given the same input, did not give the same result depending on whether it was called from a test C++ executable or from MATLAB. Furthermore, this did not happen in Debug mode, only in Release!
The final conclusion was that MATLAB set the CPU floating point precision to 80 bits, whereas our test executable did not (and left the default 64-bit precision). Furthermore, this calculation mismatch did not happen in Debug mode because all the variables were written to memory into 64-bit double variables and reloaded from there afterward, nullifying the additional 16 bits. In Release mode, some variables were optimized out (not written to memory), and all calculations were done with floating point registers only, on 80 bits, keeping the additional 16 bits non-zero.
Don't know if this helps, but maybe worth knowing.
A similar discussion occurred before; the conclusion was that IEEE 754 tolerates error in the last bit for transcendental functions (cos, sin, exp, etc.). So you can't expect exactly the same results between MATLAB and C (not even for the same C code compiled with different compilers).
I may be way off track here, and you may already have investigated this possibility, but it could be that C++ and MATLAB differ in how the mathematical library functions you mention (sin(), cos() and exp()) are implemented internally. Ultimately, some kind of functional approximation must be used to generate the function values, and if there is some difference between these methods, it could manifest itself as accumulated rounding error over a large number of iterations.
This question basically covers what I am trying to suggest How does C compute sin() and other math functions?
The manual of a program written in Fortran 90 says, "All real variables and parameters are specified in 64-bit precision (i.e. real*8)."
According to Wikipedia, single precision corresponds to 32-bit precision, whereas double precision corresponds to 64-bit precision, so apparently the program uses double precision.
But what does real*8 mean?
I thought that the 8 meant that 8 digits follow the decimal point. However, Wikipedia seems to say that single precision typically provides 6-9 digits whereas double precision typically provides 15-17 digits. Does this mean that the statement "64-bit precision" is inconsistent with real*8?
The 8 refers to the number of bytes that the data type uses.
So a 32-bit integer is integer*4 along the same lines.
A quick search found this guide to Fortran data types, which includes:
The "real*4" statement specifies the variable names to be single precision 4-byte real numbers which has 7 digits of accuracy and a magnitude range of 10 from -38 to +38. The "real" statement is the same as "real*4" statement in nearly all 32-bit computers.
and
The "real*8" statement specifies the variable names to be double precision 8-byte real numbers which has 15 digits of accuracy and a magnitude range of 10 from -308 to +308. The "double precision" statement is the same as "real*8" statement in nearly all 32-bit computers.
There are now at least 4 ways to specify precision in Fortran.
As already answered, real*8 specifies the number of bytes. It is somewhat obsolete but should be safe.
The new way is with "kinds". One should use the intrinsic functions to obtain the kind that has the precision that you need. Specifying the kind by a specific numeric value is risky, because different compilers use different values.
Yet another way is to use the named types of the ISO_C_Binding. This question discusses the kind system for integers -- it is very similar for reals.
The star notation (as TYPE*n is called) is a non-standard Fortran construct if used with TYPE other than CHARACTER.
If applied to the character type, it specifies a character string of length n.
If applied to another type, it specifies the storage size in bytes. This should be avoided at all costs in Fortran 90+, where the concept of type KIND is introduced. Specifying the storage size creates non-portable applications.