Are the "reals" in Fortran the same as "floats" in C++? - c++

I have translated some code from Fortran to C++ and both codes give me the same result for a given input with the exception of two data points in the middle of my data set.
My code calculates the distance between points and does some interesting things with that information. Two points in the C++ code are found to be at one distance from each other and at a different distance in Fortran. The code is long, so I won't post it.
This strikes me as weird because the two "strange points" are right in the middle of my data set, whereas all of the other 106 points behave the same.
I have already read the Goldberg paper, and it makes me believe that real and float ought to be the same on my 32-bit system.

A real in Fortran may be float (which is kind 4) or double (kind 8) in C++.
It also may depend on your compiler options (e.g. math extensions, optimization, square root implementation, etc.).

In most C/C++ implementations you'll encounter, float corresponds to REAL*4, and double corresponds to REAL*8.
This StackOverflow answer (and related comments) describe Fortran 90's types: Fortran 90 kind parameter.
Differences in floating point computations may arise due to different evaluation order. Floating point arithmetic is very sensitive to evaluation order, especially where addition and subtraction among values with a wide dynamic range is involved.
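To illustrate (a minimal sketch, not taken from either of the translated codes; the values are made up), reordering a sum of values with very different magnitudes can change a single-precision result:
#include <iostream>

int main() {
    float big = 1.0e8f;   // large value
    float small = 1.0f;   // small value

    float sum1 = (small + big) - big;   // small is absorbed when added to big: result is 0
    float sum2 = small + (big - big);   // result is exactly 1

    std::cout << sum1 << " vs " << sum2 << '\n';   // typically prints "0 vs 1"
}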
Also, C/C++ math and math libraries default to double precision in many surprising places, unless you explicitly ask for a float. For example, the constant 1.0 is a double precision constant. Consider the following code fragment:
float x;
x = 1.0 + 2.0;
x = x + 3.0;
The expression 1.0 + 2.0 is computed at double precision, and the result cast back to single precision. Likewise, the second statement x + 3.0 promotes x to double, does the arithmetic, and then casts back to float.
To get single precision constants and keep your arithmetic at single precision, you need to add the suffix f, as follows:
float x;
x = 1.0f + 2.0f;
x = x + 3.0f;
Now this arithmetic will all be done at single precision.
For math library calls, the single-precision variant also usually has an f suffix, such as cosf or sqrtf.
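For instance (a small sketch, not from the question's code), keeping a computation entirely in single precision might look like this:
#include <cmath>
#include <iostream>

int main() {
    float angle = 0.5f;
    float a = std::cosf(angle);                   // C-style single-precision function
    float b = std::cos(angle);                    // C++ overload resolves to the float version
    float c = static_cast<float>(std::cos(0.5));  // computed in double, then rounded back to float
    std::cout << a << ' ' << b << ' ' << c << '\n';
}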

Related

Is it ok to compare floating points to 0.0 without epsilon?

I am aware that to compare two floating point values one needs to use some epsilon precision, as they are not exact. However, I wonder if there are edge cases where I don't need that epsilon.
In particular, I would like to know if it is always safe to do something like this:
#include <iostream>

double somethingelse(double x); // assumed to be defined elsewhere; never returns 0.0

double foo(double x) {
    if (x < 0.0) return 0.0;
    else return somethingelse(x); // somethingelse(x) != 0.0
}

int main() {
    double x = -3.0;
    if (foo(x) == 0.0) {
        std::cout << "^- is this comparison ok?" << std::endl;
    }
}
I know that there are better ways to write foo (e.g. returning a flag in addition), but I wonder whether, in general, it is ok to assign 0.0 to a floating point variable and later compare it to 0.0.
Or more generally, does the following comparison always yield true?
double x = 3.3;
double y = 3.3;
if (x == y) { std::cout << "is an epsilon required here?" << std::endl; }
When I tried it, it seems to work, but it might be that one should not rely on that.
Yes, in this example it is perfectly fine to check for == 0.0. This is not because 0.0 is special in any way, but because you only assign a value and compare it afterwards. You could also set it to 3.3 and compare for == 3.3; this would be fine too. You're storing a bit pattern and comparing for that exact same bit pattern, as long as the values are not promoted to another type for the comparison.
However, calculation results that would mathematically equal zero would not always equal 0.0.
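For example (a small sketch of that distinction, not part of the question):
#include <iostream>

int main() {
    double computed = 0.1 + 0.2 - 0.3;       // mathematically zero, but rounded at each step
    std::cout << (computed == 0.0) << '\n';  // prints 0: the result is about 5.55e-17

    double stored = 0.0;                     // assigned, not computed
    std::cout << (stored == 0.0) << '\n';    // prints 1: the same bit pattern compares equal
}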
This Q/A has evolved to also include cases where different parts of the program are compiled by different compilers. The question does not mention this; my answer applies only when the same compiler is used for all relevant parts.
C++ 11 Standard,
§5.10 Equality operators
6 If both operands are of arithmetic or enumeration type, the usual
arithmetic conversions are performed on both operands; each of the
operators shall yield true if the specified relationship is true and
false if it is false.
The relationship is not defined further, so we have to use the common meaning of "equal".
§2.13.4 Floating literals
1 [...] If the scaled value is in the range of representable values
for its type, the result is the scaled value if representable, else
the larger or smaller representable value nearest the scaled value,
chosen in an implementation-defined manner. [...]
The compiler has to choose between exactly two values when converting a literal, when the value is not representable. If the same value is chosen for the same literal consistently, you are safe to compare values such as 3.3, because == means "equal".
Yes, if you return 0.0 you can compare it to 0.0; 0 is representable exactly as a floating-point value. If you return 3.3 you have to be much more careful, since 3.3 is not exactly representable, so a conversion from double to float, for example, will produce a different value.
Correction: 0 as a floating-point value is not unique, but IEEE 754 defines the comparison 0.0 == -0.0 to be true (for any zero, for that matter).
So with 0.0 this works - for every other number it does not. The literal 3.3 in one compilation unit (e.g. a library) and another (e.g. your application) might differ. The standard only requires the compiler to use the same rounding it would use at runtime - but different compilers / compiler settings might use different rounding.
It will work most of the time (for 0), but is very bad practice.
As long as you are using the same compiler with the same settings (e.g. one compilation unit) it will work, because the literal 0.0 or 0.0f will translate to the same bit pattern every time. The representation of zero is not unique, though. So if foo were defined in a library and your call to it sat in some application, the same comparison might fail.
You can rescue this very case by using std::fpclassify to check whether the returned value represents a zero. For every finite (non-zero) value you will have to use an epsilon-comparison though unless you stay within one compilation unit and perform no operations on the values.
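A minimal sketch of that rescue (is_zero is a made-up helper name, not from the question):
#include <cmath>
#include <iostream>

// Classify the value instead of comparing bit patterns: FP_ZERO matches both +0.0 and -0.0.
bool is_zero(double v) {
    return std::fpclassify(v) == FP_ZERO;
}

int main() {
    std::cout << is_zero(0.0) << ' ' << is_zero(-0.0) << ' ' << is_zero(1e-300) << '\n'; // 1 1 0
}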
As written, in both cases you are using identical constants in the same file, fed to the same compiler. The string-to-float conversion the compiler uses should return the same bit pattern, so the values are not merely equal in the plus-or-minus-zero sense; they are equal bit for bit.
Were the bit pattern for one constant produced at compile time and the other generated at run time by the operating system's C library (strtof or the like), and the binary then run on a computer other than the one it was compiled on, with a different C library, you might have a problem.
Certainly if one 3.3 is computed at run time and the other is computed at compile time, you can and will get failures on the equality comparison. Some constants are obviously more likely to work than others.
Of course, as written, your 3.3 comparison involves only compile-time constants, so the compiler can simply fold it away when optimizations are enabled.
You didn't specify the floating-point format, nor which standard (if any) for that format you were interested in. Some formats have the +/- zero problem, some don't, for example.
It is a common misconception that floating point values are "not exact". In fact each of them is perfectly exact (except maybe some special cases such as -0.0 or Inf) and equal to s·2^(e − (p − 1)), where s, e, and p are the significand, exponent, and precision correspondingly, each of them an integer. E.g. in the IEEE 754-2008 binary32 format (aka float32) p = 24 and 1 is represented as 0x800000·2^(0 − 23). There are two things that are really not exact when you deal with floating point values:
Representation of a real value using a FP one. Obviously, not all real numbers can be represented using a given FP format, so they have to be somehow rounded. There are several rounding modes, but the most commonly used is the "Round to nearest, ties to even". If you always use the same rounding mode, which is almost certainly the case, the same real value is always represented with the same FP one. So you can be sure that if two real values are equal, their FP counterparts are exactly equal too (but not the reverse, obviously).
Operations with FP numbers are (mostly) inexact. So if you have some real-valued function φ(ξ) implemented in the computer as a function of an FP argument f(x), and you want to compare its result with some "true" value y, you need to use some ε in the comparison, because it is very hard (sometimes even impossible) to write a function giving exactly y. And the value of ε strongly depends on the nature of the FP operations involved, so in each particular case there may be a different optimal value.
For more details see D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, and J.-M. Muller et al., Handbook of Floating-Point Arithmetic. Both texts can be found on the Internet.

Should I combine multiplication and division steps when working with floating point values?

I am aware of the precision problems with floats and doubles, which is why I am asking this:
If I have a formula such as: (a/PI)*180.0 (where PI is a constant)
Should I combine the division and multiplication, so I can use only one division: a/0.017453292519943295769236, in order to avoid loss of precision?
Does this make the result more precise, since there are fewer steps in which to lose it?
Short answer
Yes, you should in general combine as many multiplications and divisions by constants as possible into one operation. It is (in general(*)) faster and more accurate at the same time.
Neither π nor π/180 nor their inverses are representable exactly as floating-point. For this reason, the computation will involve at least one approximate constant (in addition to the approximation of each of the operations involved).
Because two operations introduce one approximation each, it can be expected to be more accurate to do the whole computation in one operation.
In the case at hand, is division or multiplication better?
Apart from that, it is a question of “luck” whether the relative accuracy to which π/180 can be represented in the floating-point format is better or worse than that of 180/π.
My compiler provides additional precision with the long double type, so I am able to use it as a reference for answering this question for double:
~ $ cat t.c
#define PIL 3.141592653589793238462643383279502884197L
#include <stdio.h>
int main() {
    long double heop = 180.L / PIL;
    long double pohe = PIL / 180.L;
    printf("relative acc. of π/180: %Le\n", (pohe - (double) pohe) / pohe);
    printf("relative acc. of 180/π: %Le\n", (heop - (double) heop) / heop);
}
~ $ gcc t.c && ./a.out
relative acc. of π/180: 1.688893e-17
relative acc. of 180/π: -3.469703e-17
In usual programming practice, one wouldn't bother and simply multiply by (the floating-point representation of) 180/π, because multiplication is so much faster than division.
As it turns out, in the case of the binary64 floating-point type double almost always maps to, π/180 can be represented with better relative accuracy than 180/π, so π/180 is the constant one should use to optimize accuracy: a / ((double) (π / 180)). With this formula, the total relative error would be approximately the sum of the relative error of the constant (1.688893e-17) and of the relative error of the division (which will depend on the value of a but never be more than 2-53).
Alternative methods for faster and more accurate results
Note that division is so expensive that you could get an even more accurate result faster by using one multiplication and one fma: let heop1 be the best double approximation of 180/π, and heop2 the best double approximation of 180/π - heop1. Then the best value for the result can be computed as:
double r = fma(a, heop1, a * heop2);
The fact that the above is the absolute best possible double approximation to the real computation is a theorem (in fact, it is a theorem with exceptions. The details can be found in the “Handbook of Floating-Point Arithmetic”). But even when the real constant you want to multiply a double by in order to get a double result is one of the exceptions to the theorem, the above computation is still clearly very accurate and only differs from the best double approximation for a few exceptional values of a.
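One way such a constant pair might be produced (a sketch only; it assumes long double carries more precision than double, as discussed next, and reuses the names heop1 and heop2 from the text):
#include <cmath>
#include <cstdio>

int main() {
    const long double HEOP = 180.0L / 3.141592653589793238462643383279502884197L;
    const double heop1 = static_cast<double>(HEOP);          // best double approximation of 180/pi
    const double heop2 = static_cast<double>(HEOP - heop1);  // double approximation of the remainder
    double a = 1.0;                                           // example input
    double r = std::fma(a, heop1, a * heop2);                 // one multiplication plus one fma
    std::printf("%.17g\n", r);                                // prints the degrees-per-radian result
}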
If, like mine, your compiler provides more precision for long double than for double, you can also use one long double multiplication:
// this is more accurate than double division:
double r = (double)((long double) a * 57.295779513082320876798L);
This is not as good as the solution based on fma, but it is good enough that for most values of a, it produces the optimal double approximation to the real computation.
A counter-example to the general claim that operations should be grouped as one
(*) The claim that it is better to group constants into one is only statistically true for most constants.
If you happened to wish to multiply a by, say, the real constant 0.0000001 * DBL_MIN, you would be better off multiplying first by 0.0000001, then by DBL_MIN, and the end result (which can be a normalized number if a is larger than 1000000 or so) would be more precise than if you had multiplied by the best double representation of 0.0000001 * DBL_MIN. This is because the relative accuracy when representing 0.0000001 * DBL_MIN as a single double value is much worse than the accuracy for representing 0.0000001.
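A rough sketch of that effect (the value of a is chosen here just for illustration; it assumes IEEE 754 double with gradual underflow):
#include <cfloat>
#include <cstdio>

int main() {
    double a = 123456789.0;                       // large enough that the end result is normalized
    double combined = 0.0000001 * DBL_MIN;        // a subnormal double: much of its precision is already lost
    double one_step = a * combined;               // multiply by the pre-rounded combined constant
    double two_step = (a * 0.0000001) * DBL_MIN;  // multiply by the two factors separately
    std::printf("%a\n%a\n", one_step, two_step);  // the low-order bits typically differ
}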

Precision of floating point operations

Floating point operations are typically approximations to the corresponding arithmetic operations, because in many cases the precise arithmetic result cannot be represented by the internal number format. But what happens if I have a program where all numbers can actually be represented exactly by IEEE 754 single precision? Consider for example this:
#include <cassert>

int main() {
    float a = 7;
    float b = a / 32.0;
    float c = a + b;
    float d = c - a;
    assert( d == b );
}
Is it safe to assume that within this and similar cases, the result of the numerical calculation is identical to the exact arithmetic result? I think this sort of code would work on every actual machine. Is this correct?
Edit This question is not about what the standard says, but rather about the real world. I'm well aware that in theory one could create a conforming engine where this would fail. I don't care, but rather I wonder if this works on existing machines.
No, as the C++ standard does not require floating point values to be stored in the IEEE 754 format.
"Real" machines are quite careful in implementing the standard exactly (just remember the Pentium division bug). But be careful to check, the i386 line of machines did use some extra bits in its registers, which were cut off when asigning to/from memory, so computations done only in registers gave different results than if some intermediate results where spilled to memory.
Also check out David Goldberg's What every computer scientist should know about floating point arithmetic.
In any case, it doesn't make any sense to "round" (or otherwise mangle) a number that can be represented exactly.

How come some people append f to the end of variables?

In the tutorial I'm reading for OGRE3d here, the programmer constantly adds f to the end of any value he initializes with, like 200.00f or 0.00f, so I decided to erase the f and see if it compiles, and it compiles just fine. What is the point of adding f at the end of the value?
EDIT: So you're saying that if I initialize a variable with 200.03 it won't be initialized as a floating-point value, but if I were to do so with 200.03f it would? If not, where does the f become useful?
It's a way to specify that the number has to be interpreted as a "float", not a "double" (which is the default for floating-point literals in C++ and uses up twice the memory).
This discussion could be of help:
http://www.cplusplus.com/forum/beginner/24483/
Quoted from http://msdn.microsoft.com/en-us/library/w9bk1wcy.aspx
A floating-point constant without an f, F, l, or L suffix has type double. If the letter f or F is the suffix, the constant has type float. If suffixed by the letter l or L, it has type long double.
200.00f is not a variable. It can't vary.
It's a compile-time constant, with float representation. The f signifies that it's a float.
By comparison, 200.00 would be interpreted as a double.
The C standard states that unsuffixed floating constants are doubles, which promotes the operation to a double-precision operation.
float a,b,c;
...
a = b + 7.1;  // this is a double precision operation
...
a = b + 7.1f; // this is a single precision operation
...
c = 7.1;      // double
a = b + c;    // single all the way
The double precision version requires more storage for the constant, plus a conversion from single to double for the variable operand, then a conversion from double back to single to assign the result. With all the conversions going on, if you are not in tune with how floating point works (rounding and so on), you might not get the result you were expecting. The compiler may optimize some of this behavior away, making it harder to see the real problem, and the FPU in the hardware might accept mixed-mode operands, also hiding what is really going on.
It is not just a speed problem but also an accuracy one. There was a recent SO question with pretty much the same problem: why does this comparison work with one number and not another? Take the fraction 5/11 for example, 0.454545.... Let's say, hypothetically, you had a base-10 FPU with a single precision of 3 significant digits and a double precision of 6 significant digits.
float a = 0.45454545454;
...
if(a>0.4545454545) b=1;
...
Well, in our hypothetical system we can only store three digits in a, so a = .455, because by default we are using a round-up rounding mode. But our comparison will be treated as double precision because we didn't put the f at the end of the number. The double version of the constant is 0.454545. a is converted to a double, which results in 0.455000, so:
if(0.455000>0.454545) b = 1;
0.455 is greater than 0.454545 so b would be a 1.
float a = 0.45454545454;
...
if(a>0.4545454545f) b=1;
...
So now the comparison is single precision: we are comparing 0.455 to 0.455, which is not greater, so b=1 does not happen.
When you write floating point constants, that is base-10 decimal, but the floating point numbers in the computer are base 2, and they don't always convert smoothly, just as 5/11 would work fine in base 11 but gives an infinitely repeating digit in base 10. 0.1 in decimal, for example, creates a repeating pattern in binary. Depending on where the mantissa cuts off, the rounding can make the least significant bit of the mantissa round up or not (it also depends on the rounding mode you are using, if the floating point format you are using even has rounding). That by itself creates problems, depending on how you use the variable, as the comparison above shows.
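For example (a tiny sketch showing the decimal 0.1 after conversion to the two binary formats):
#include <cstdio>

int main() {
    std::printf("%.20f\n", 0.1f);  // 0.10000000149011611938... (nearest float, 24-bit significand)
    std::printf("%.20f\n", 0.1);   // 0.10000000000000000555... (nearest double, 53-bit significand)
}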
For non-floating point the compiler usually saves you, but sometimes doesn't:
unsigned long long a;
...
a = ~3;
a = ~(3ULL);
...
Depending on the compiler and computer, the two assignments can give you different results: one MIGHT give you 0x00000000FFFFFFFC, another MIGHT give 0xFFFFFFFFFFFFFFFC.
If you want something specific, you should be quite clear when you tell the compiler what you want; otherwise the compiler takes a guess and doesn't always make the guess that you wanted.
It means that the value is to be interpreted as a single-precision floating point value (type float). Without the f suffix, it is interpreted as a double-precision floating point value (type double).
This is usually done to shut up compiler warnings about possible loss of precision when assigning a double value to a float variable. If you didn't receive such a warning, you may have switched off warnings in your compiler settings (which is bad!).
But it can also have a subtle semantic meaning. As you know, C++ allows overloaded functions, which have the same name but differ in the types of their parameters. In that case the f suffix can determine which function is called.
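For example (a sketch; print here is a made-up pair of overloads, not something from the tutorial):
#include <iostream>

void print(float)  { std::cout << "float overload\n"; }
void print(double) { std::cout << "double overload\n"; }

int main() {
    print(200.00f);  // the f suffix selects the float overload
    print(200.00);   // without it, the double overload is chosen
}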

Double versus float

I have a constant for pi in my code:
const float PI = acos(-1);
Would it be better to declare it as a double? An answer to another question on this site said floating point operations aren't exactly precise, and I'd like the constant to be accurate.
"precise" is not a boolean concept. float provides a certain amount of precision. Whether or not that amount is sufficient for your application depends on, well, your application.
Most applications don't need more precision than float provides, though many prefer to use double to (try and) gloss over problems with unstable algorithms, or "just because", due to misconceptions like "floating point operations aren't exactly precise".
In most cases when a float is "not precise enough", the problem is not float, it's the code that uses it.
Edit: That being said, most modern CPUs only do calculations in double precision or greater anyway, so you might as well use double unless you're working with large arrays and memory usage is an issue.
From the standard:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
Of the three (notice that this goes hand in hand with the three versions of acos) you should choose long double if what you are aiming for is precision (but you should also know that beyond a certain point, further precision may be redundant for your purposes).
So you should use this to get the most precise result from acos
long double result = acos(-1.0L);
(Note: There might be some platform specific types or some user defined types which provide more precision)
I'd like the constant to be accurate.
There is no such thing as a generally exact floating point value: most values cannot be stored with perfect precision, because of their representation in memory. That is only possible with integers. double gives you double the precision a float offers (who would have guessed). double should fit your needs in almost every case.
I would recommend using M_PI from <cmath>, which should be available in all POSIX compliant implementations of the standard.
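For example (a sketch; whether you need the _USE_MATH_DEFINES macro depends on your compiler, since M_PI is a POSIX/common extension rather than part of standard C++):
#define _USE_MATH_DEFINES  // needed by some compilers (e.g. MSVC) before <cmath> to expose M_PI
#include <cmath>
#include <iostream>

int main() {
    const double PI = M_PI;                // pi rounded to double precision
    std::cout.precision(17);
    std::cout << PI << '\n';               // 3.1415926535897931
    std::cout << std::acos(-1.0) << '\n';  // the same value on typical platforms
}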
It depends on exactly how precise you need to be. I've never had to use doubles, because floats were precise enough for my purposes.
The most accurate representation of pi is M_PI from math.h
The question boils down to: how much accuracy do you need?
Let's quote wikipedia:
For example, the decimal representation of π truncated to 11 decimal places is good enough to estimate the circumference of any circle that fits inside the Earth with an error of less than one millimetre, and the decimal representation of π truncated to 39 decimal places is sufficient to estimate the circumference of any circle that fits in the observable universe with precision comparable to the radius of a hydrogen atom.
I've written a small java program, here's its output:
As string: 3.14159265358979323846264338327950288419716939937510
As double: 3.141592653589793
As float: 3.1415927
Remember that if you want the full precision of a double, all the numbers in your calculation also need to be doubles. (That is not entirely true, but it is close enough.)
For most applications, float would do just fine for PI. Double definitely has more precision, but it doesn't guarantee exactness any more than float can. By that I mean that a decimal value such as 0.1, written in binary, is an infinitely repeating fraction. Therefore, if you try to represent such a number, you'll only succeed to the nth digit, where n is determined by how many bytes you use to represent that number.
Unfortunately, to hold many digits of PI you'd probably need to keep it in a string, but at that point we're talking about the kind of impressive number crunching you might see in molecular simulations, and you're probably not going to need that level of precision.
As this site says, there are three overloaded versions of the acos function.
Therefore the call acos(-1) is ambiguous.
Having said that, you should declare PI as long double to avoid any loss of precision, by using
const long double PI = acos(-1.0L);