This question already has answers here:
Comparing float and double
(3 answers)
Closed 7 years ago.
According to this post, when comparing a float and a double, the float should be treated as double.
The following program, does not seem to follow this statement. The behaviour looks quite unpredictable.
Here is my program:
void main(void)
{
double a = 1.1; // 1.5
float b = 1.1; // 1.5
printf("%X %X\n", a, b);
if ( a == b)
cout << "success " <<endl;
else
cout << "fail" <<endl;
}
When I run the following program, I get "fail" displayed.
However, when I change a and b to 1.5, it displays "success".
I have also printed the hex notations of the values. They are different in both the cases. My compiler is Visual Studio 2005
Can you explain this output ? Thanks.
float f = 1.1;
double d = 1.1;
if (f == d)
In this comparison, the value of f is promoted to type double. The problem you're seeing isn't in the comparison, but in the initialization. 1.1 can't be represented exactly as a floating-point value, so the values stored in f and d are the nearest value that can be represented. But float and double are different sizes, so have a different number of significant bits. When the value in f is promoted to double, there's no way to get back the extra bits that were lost when the value was stored, so you end up with all zeros in the extra bits. Those zero bits don't match the bits in d, so the comparison is false. And the reason the comparison succeeds with 1.5 is that 1.5 can be represented exactly as a float and as a double; it has a bunch of zeros in its low bits, so when the promotion adds zeros the result is the same as the double representation.
I found a decent explanation of the problem you are experiencing as well as some solutions.
See How dangerous is it to compare floating point values?
Just a side note, remember that some values can not be represented EXACTLY in IEEE 754 floating point representation. Your same example using a value of say 1.5 would compare as you expect because there is a perfect representation of 1.5 without any loss of data. However, 1.1 in 32-bit and 64-bit are in fact different values because the IEEE 754 standard can not perfectly represent 1.1.
See http://www.binaryconvert.com
double a = 1.1 --> 0x3FF199999999999A
Approximate representation = 1.10000000000000008881784197001
float b = 1.1 --> 0x3f8ccccd
Approximate representation = 1.10000002384185791015625
As you can see, the two values are different.
Also, unless you are working in some limited memory type environment, it's somewhat pointless to use floats. Just use doubles and save yourself the headaches.
If you are not clear on why some values can not be accurately represented, consult a tutorial on how to covert a decimal to floating point.
Here's one: http://class.ece.iastate.edu/arun/CprE281_F05/ieee754/ie5.html
I would regard code which directly performs a comparison between a float and a double without a typecast to be broken; even if the language spec says that the float will be implicitly converted, there are two different ways that the comparison might sensibly be performed, and neither is sufficiently dominant to really justify a "silent" default behavior (i.e. one which compiles without generating a warning). If one wants to perform a conversion by having both operands evaluated as double, I would suggest adding an explicit type cast to make one's intentions clear. In most cases other than tests to see whether a particular double->float conversion will be reversible without loss of precision, however, I suspect that comparison between float values is probably more appropriate.
Fundamentally, when comparing floating-point values X and Y of any sort, one should regard comparisons as indicating that X or Y is larger, or that the numbers are "indistinguishable". A comparison which shows X is larger should be taken to indicate that the number that Y is supposed to represent is probably smaller than X or close to X. A comparison that says the numbers are indistinguishable means exactly that. If one views things in such fashion, comparisons performed by casting to float may not be as "informative" as those done with double, but are less likely to yield results that are just plain wrong. By comparison, consider:
double x, y;
float f = x;
If one compares f and y, it's possible that what one is interested in is how y compares with the value of x rounded to a float, but it's more likely that what one really wants to know is whether, knowing the rounded value of x, whether one can say anything about the relationship between x and y. If x is 0.1 and y is 0.2, f will have enough information to say whether x is larger than y; if y is 0.100000001, it will not. In the latter case, if both operands are cast to double, the comparison will erroneously imply that x was larger; if they are both cast to float, the comparison will report them as indistinguishable. Note that comparison results when casting both operands to double may be erroneous not only when values are within a part per million; they may be off by hundreds of orders of magnitude, such as if x=1e40 and y=1e300. Compare f and y as float and they'll compare indistinguishable; compare them as double and the smaller value will erroneously compare larger.
The reason why the rounding error occurs with 1.1 and not with 1.5 is due to the number of bits required to accurately represent a number like 0.1 in floating point format. In fact an accurate representation is not possible.
See How To Represent 0.1 In Floating Point Arithmetic And Decimal for an example, particularly the answer by #paxdiablo.
Related
Today in my C++ programming lessons, my proff told me that one should never compare two floating point values directly.
So I tried this piece of code and found out the reason for his statement.
double l_Value=94.9;
print("%.20lf",l_Value);
And I found the results as 94.89999999 ( some relative error )
I understand that floating numbers are not stored in the way one presents it to the code. Squeezing those ones and zeros in binary form involves some relative rounding errors.
Iam looking for solutions to two problems.
1. Efficient way to compare two floating values.
2. How to add a floating value to another one. Example. Add 0.1111 to 94.4345 to get the exact value as 94.5456
Thanks in advance.
Efficient way to compare two floating values.
A simple double a,b; if (a == b) is an efficient way to compare two floating values. Yet as OP noticed, this may not meet the overall coding goal. Better ways depend on the context of the compare, something not supplied by OP. See far below.
How to add a floating value to another one. Example. Add 0.1111 to 94.4345 to get the exact value as 94.5456
Floating values as source code have effective unlimited range and precision such as 1.23456789012345678901234567890e1234567. Conversion of this text to a double is limited typically to one of 264 different values. The closest is selected, but that may not be an exact match.
Neither 0.1111, 94.4345, 94.5456 can be representably exactly as a typical double.
OP has choices:
1.) Use another type other than double, float. Various libraries offer decimal floating point types.
2) Limit code to rare platforms that support double to a base 10 form such that FLT_RADIX == 10.
3) Write your own code to handle user input like "0.1111" into a structure/string and perform the needed operations.
4) Treat user input as strings and the convert to some integer type, again with supported routines to read/compute/and write.
5) Accept that floating point operations are not mathematically exact and handle round-off error.
double a = 0.1111;
printf("a: %.*e\n", DBL_DECIMAL_DIG -1 , a);
double b = 94.4345;
printf("b: %.*e\n", DBL_DECIMAL_DIG -1 , b);
double sum = a + b;
printf("sum: %.*e\n", DBL_DECIMAL_DIG -1 , sum);
printf("%.4f\n", sum);
Output
a: 1.1110000000000000e-01
b: 9.4434500000000000e+01
sum: 9.4545599999999993e+01
94.5456 // Desired textual output based on a rounded `sum` to the nearest 0.0001
More on #1
If an exact compare is not sought but some sort of "are the two values close enough?", a definition of "close enough" is needed - of which there are many.
The following "close enough" compares the distance by examining the ULP of the two numbers. It is a linear difference when the values are in the same power-of-two and becomes logarithmic other wise. Of course, change of sign is an issue.
float example:
Consider all finite float ordered from most negative to most positive. The following, somewhat-portable code, returns an integer for each float with that same order.
uint32_t sequence_f(float x) {
union {
float f;
uint32_t u32;
} u;
assert(sizeof(float) == sizeof(uint32_t));
u.f = x;
if (u.u32 & 0x80000000) {
u.u32 ^= 0x80000000;
return 0x80000000 - u.u32;
}
return u.u3
}
Now, to determine if two float are "close enough", simple compare two integers.
static bool close_enough(float x, float y, uint32_t ULP_delta) {
uint32_t ullx = sequence_f(x);
uint32_t ully = sequence_f(y);
if (ullx > ully) return (ullx - ully) <= ULP_delta;
return (ully - ullx) <= ULP_delta;
}
The way I've usually done this is is to have a custom equality comparison function. The basic idea, is you have a certain tolerance, say 0.0001 or something. Then you subtract your two numbers and take their absolute value, and if it is less than your tolerance you treat it as equal. There are other strategies that may be more appropriate for certain situations, of course.
Define for yourself a tolerance level e (for example, e=.0001) and check if abs(a-b) <= e
You aren't going to get an "exact" value with floating point. Ever. If you know in advance that you are using four decimals, and you want "exact", then you need to internally treat your numbers as integers and only display them as decimals. 944345 + 1111 = 945456
I have a class that internally represents some quantity in fixed point as 32-bit integer with somewhat arbitrary denominator (it is neither power of 2 nor power of 10).
For communicating with other applications the quantity is converted to plain old double on output and back on input. As code inside the class it looks like:
int32_t quantity;
double GetValue() { return double(quantity) / DENOMINATOR; }
void SetValue(double x) { quantity = x * DENOMINATOR; }
Now I need to ensure that if I output some value as double and read it back, I will always get the same value back. I.e. that
x.SetValue(x.GetValue());
will never change x.quantity (x is arbitrary instance of the class containing the above code).
The double representation has more digits of precision, so it should be possible. But it will almost certainly not be the case with the simplistic code above.
What rounding do I need to use and
How can I find the critical would-be corner cases to test that the rounding is indeed correct?
Any 32 bits will be represented exactly when you convert to a double, but when you divide then multiply by an arbitrary value you will get a similar value but not exactly the same. You should lose at most one bit per operations, which means your double will be almost the same, prior to casting back to an int.
However, since int casts are truncations, you will get the wrong result when very minor errors turn 2.000 into 1.999, thus what you need to do is a simple rounding task prior to casting back.
You can use std::lround() for this if you have C++11, else you can write you own rounding function.
You probably don't care about fairness much here, so the common int(doubleVal+0.5) will work for positives. If as seems likely, you have negatives, try this:
int round(double d) { return d<0?d-0.5:d+0.5; }
The problem you describe is the same problem which exists with converting between binary and decimal representation just with different bases. At least it exists if you want to have the double representation to be a good approximation of the original value (otherwise you could just multiply the 32 bit value you have with your fixed denominator and store the result in a double).
Assuming you want the double representation be a good approximation of your actual value the conversions are nontrivial! The conversion from your internal representation to double can be done using Dragon4 ("How to print floating point numbers accurately", Steele & White) or Grisu ("How to print floating point numbers quickly and accurately", Loitsch; I'm not sure if this algorithm is independent from the base, though). The reverse can be done using Bellerophon ("How to read floating point numbers accurately", Clinger). These algorithms aren't entirely trivial, though...
I just wrote the following code in C++:
double variable1;
double variable2;
variable1=numeric_limits<double>::max()-50;
variable2=variable1;
variable1=variable1+5;
cout<<"\nVariable1==Variable2 ? "<<(variable1==variable2);
The answer to the cout statement comes out 1, even when variable2 and variable1 are not equal.Can someone help me with this? Why is this happening?
I knew the concept of imprecise floating point math but didn't think this would happen with comparing two doubles directly. Also I am getting the same resuklt when I replace variable1 with:
double variable1=(numeric_limits<double>::max()-10000000000000);
The comparison still shows them as equal. How much would I have to subtract to see them start differing?
The maximum value for a double is 1.7976931348623157E+308. Due to lack of precision, adding and removing small values such as 50 and 5 does not actually changes the values of the variable. Thus they stay the same.
There isn't enough precision in a double to differentiate between M and M-45 where M is the largest value that can be represented by a double.
Imagine you're counting atoms to the nearest million. "123,456 million atoms" plus 1 atom is still "123,456 million atoms" because there's no space in the "millions" counting system for the 1 extra atom to make any difference.
numeric_limits<double>::max()
is a huuuuuge number. But the greater the absolute value of a double, the smaller is its precision. Apparently in this case max-50and max-5 are indistinguishable from double's point of view.
You should read the floating point comparison guide. In short, here are some examples:
float a = 0.15 + 0.15
float b = 0.1 + 0.2
if(a == b) // can be false!
if(a >= b) // can also be false!
The comparison with an epsilon value is what most people do.
#define EPSILON 0.00000001
bool AreSame(double a, double b)
{
return fabs(a - b) < EPSILON;
}
In your case, that max value is REALLY big. Adding or subtracting 50 does nothing. Thus they look the same because of the size of the number. See #RichieHindle's answer.
Here are some additional resources for research.
See this blog post.
Also, there was a stack overflow question on this very topic (language agnostic).
From the C++03 standard:
3.9.1/ [...] The value representation of floating-point types is
implementation-defined
and
5/ [...] If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined, unless such an expression is a
constant expression (5.19), in which case the program is ill-formed.
and
18.2.1.2.4/ (about numeric_limits<T>::max()) Maximum finite value.
This implies that once you add something to std::numeric_limits<T>::max(), the behavior of the program is implementation defined if T is floating point, perfectly defined if T is an unsigned type, and undefined otherwise.
If you happen to have std::numeric_limits<T>::is_iec559 == true, in this case the behavior is defined by IEEE 754. I don't have it handy, so I cannot tell whether variable1 is finite or infinite in this case. It seems (according to some lecture notes on IEEE 754 on the internet) that it depends on the rounding mode..
Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
I have a double of 3.4. However, when I multiply it with 100, it gives 339 instead of 340. It seems to be caused by the precision of double. How could I get around this?
Thanks
First what is going on:
3.4 can't be represented exactly as binary fraction. So the implementation chooses closest binary fraction that is representable. I am not sure whether it always rounds towards zero or not, but in your case the represented number is indeed smaller.
The conversion to integer truncates, that is uses the closest integer with smaller absolute value.
Since both conversions are biased in the same direction, you can always get a rounding error.
Now you need to know what you want, but probably you want to use symmetrical rounding, i.e. find the closest integer be it smaller or larger. This can be implemented as
#include <cmath>
int round(double x) { std::floor(x + 0.5); } // floor is provided, round not
or
int round(double x) { return x < 0 ? x - 0.5 : x + 0.5; }
I am not completely sure it's indeed rounding towards zero, so please verify the later if you use it.
If you need full precision, you might want to use something like Boost.Rational.
You could use two integers and multiply the fractional part by multiplier / 10.
E.g
int d[2] = {3,4};
int n = (d[0] * 100) + (d[1] * 10);
If you really want all that precision either side of the decimal point. Really does depend on the application.
Floating-point values are seldom exact. Unfortunately, when casting a floating-point value to an integer in C, the value is rounded towards zero. This mean that if you have 339.999999, the result of the cast will be 339.
To overcome this, you could add (or subtract) "0.5" from the value. In this case 339.99999 + 0.5 => 340.499999 => 340 (when converted to an int).
Alternatively, you could use one of the many conversion functions provided by the standard library.
You don't have a double with the value of 3.4, since 3.4 isn't
representable as a double (at least on the common machines, and
most of the exotics as well). What you have is some value very
close to 3.4. After multiplication, you have some value very
close to 340. But certainly not 399.
Where are you seeing the 399? I'm guessing that you're simply
casting to int, using static_cast, because this operation
truncates toward zero. Other operations would likely do what
you want: outputting in fixed format with 0 positions after the
decimal, for example, rounds (in an implementation defined
manner, but all of the implementations I know use round to even
by default); the function round rounds to nearest, rounding
away from zero in halfway cases (but your results will not be
anywhere near a halfway case). This is the rounding used in
commercial applications.
The real question is what are you doing that requires an exact
integral value. Depending on the application, it may be more
appropriate to use int or long, scaling the actual values as
necessary (i.e. storing 100 times the actual value, or
whatever), or some sort of decimal arithmetic package, rather
than to use double.
I had a small WTF moment this morning. Ths WTF can be summarized with this:
float x = 0.2f;
float y = 0.1f;
float z = x + y;
assert(z == x + y); //This assert is triggered! (Atleast with visual studio 2008)
The reason seems to be that the expression x + y is promoted to double and compared with the truncated version in z. (If i change z to double the assert isn't triggered).
I can see that for precision reasons it would make sense to perform all floating point arithmetics in double precision before converting the result to single precision. I found the following paragraph in the standard (which I guess I sort of already knew, but not in this context):
4.6.1.
"An rvalue of type float can be converted to an rvalue of type double. The value is unchanged"
My question is, is x + y guaranteed to be promoted to double or is at the compiler's discretion?
UPDATE: Since many people has claimed that one shouldn't use == for floating point, I just wanted to state that in the specific case I'm working with, an exact comparison is justified.
Floating point comparision is tricky, here's an interesting link on the subject which I think hasn't been mentioned.
You can't generally assume that == will work as expected for floating point types. Compare rounded values or use constructs like abs(a-b) < tolerance instead.
Promotion is entirely at the compiler's discretion (and will depend on target hardware, optimisation level, etc).
What's going on in this particular case is almost certainly that values are stored in FPU registers at a higher precision than in memory - in general, modern FPU hardware works with double or higher precision internally whatever precision the programmer asked for, with the compiler generating code to make the appropriate conversions when values are stored to memory; in an unoptimised build, the result of x+y is still in a register at the point the comparison is made but z will have been stored out to memory and fetched back, and thus truncated to float precision.
The Working draft for the next standard C++0x section 5 point 11 says
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby
So at the compiler's discretion.
Using gcc 4.3.2, the assertion is not triggered, and indeed, the rvalue returned from x + y is a float, rather than a double.
So it's up to the compiler. This is why it's never wise to rely on exact equality between two floating point values.
The C++ FAQ lite has some further discussion on the topic:
Why is cos(x) != cos(y) even though x == y?
It is the problem since float number to binary conversion does not give accurate precision.
And within sizeof(float) bytes it cant accomodate precise value of float number and arithmetic operation may lead to approximation and hence equality fails.
See below e.g.
float x = 0.25f; //both fits within 4 bytes with precision
float y = 0.50f;
float z = x + y;
assert(z == x + y); // it would work fine and no assert
I would think it would be at the compiler's discretion, but you could always force it with a cast if that was your thinking?
Yet another reason to never directly compare floats.
if (fabs(result - expectedResult) < 0.00001)