How does the IEEE754 specification compare the size relationship between two floating-point numbers? [duplicate] - ieee-754

This question already has answers here:
What is the range in which IEEE754 can correctly compare the size of floating point numbers
(3 answers)
Closed 8 months ago.
It seems that it only compares the size relationship of the first few significant digits? If the number is too long, some strange phenomena will appear.
double a = 1.9;
double b = 1.8;
System.out.println(a > b); // return true
double c = 9007199254740990.9;
double d = 9007199254740990.8;
System.out.println(c > d); // return false

When program source code is compiled or otherwise translated or analyzed, the numerals of floating-point constants in the source code are converted to a floating-point format.
When a numeral is converted to IEEE-754 “double precision” format (also called binary64), the result should be the nearest value representable in that format in the direction determined by the rounding rule in use. The default rounding rule is round-to-nearest ties-to-even.
When 1.9 is converted in this way, the result is 1.899999999999999911182158029987476766109466552734375, because that is the closest representable value to 1.9. The second closest number is 1.9000000000000001332267629550187848508358001708984375, and that is a little farther from 1.9 than 1.899999999999999911182158029987476766109466552734375 is.
When 1.8 is converted, the result is 1.8000000000000000444089209850062616169452667236328125. The converted value of a is greater than the converted value of b, so a > b evaluates as true.
When 9007199254740990.9 is converted, the result is 9007199254740991. When 9007199254740990.8 is converted, the result is also 9007199254740991, because representable values of this magnitude are spaced 1 apart. These are equal, so c > d evaluates as false.
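The spacing argument can be checked with a few lines of code. This is only a sketch in C++ (not the Java from the question), assuming double is IEEE-754 binary64 and that the compiler rounds decimal literals to nearest; the helper name spacing_above is made up for illustration.

```cpp
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE-754 binary64 doubles");

// Distance from x to the next representable double above it.
// (Hypothetical helper name, not part of any standard API.)
double spacing_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

Under those assumptions spacing_above(9007199254740990.0) is 1.0, so the literals ending in .9 and .8 both round to 9007199254740991, while near 1.9 the spacing is far smaller than the 0.1 that separates 1.9 from 1.8.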


Is negative zero always equal zero [duplicate]

In my code,
float f = -0.0; // Negative
and compared with negative zero
f == -0.0f
result will be true.
float f = 0.0; // Positive
and compared with negative zero
f == -0.0f
also, result will be true instead of false
Why in both cases result to be true?
Here is a MCVE to test it (live on coliru):
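#include <iostream>
int main()
{
    float f = -0.0;
    std::cout << "==== > " << f << std::endl << std::endl;
    if (f == -0.0f)
        std::cout << "true" << std::endl;   // branch bodies reconstructed; the original lines were lost
    else
        std::cout << "false" << std::endl;
}
Output:
==== > -0 // Here print negative zero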
C++11 introduced functions like std::signbit() which can detect signed zeros, and std::copysign() which can copy the sign bit between floating point values, if the implementation supports signed zero (e.g. due to using IEEE floating point). The specifications of those functions don't actually require that an implementation support distinct positive and negative zeros. That sort of thing aside, I'm unaware of any references in a C++ standard that even mentions signed zeros, let alone what should be the result of comparing them.
The C++ standards also do not stipulate any floating point representation - that is implementation-defined.
Although not definitive, these observations suggest that support of signed zeros, or the result of comparing them, would be determined by what floating point representation the implementation supports.
IEEE-754 is a common (albeit not the only) floating point representation used by modern implementations (i.e. compilers on their host systems). The 2008 version of IEEE-754, "IEEE Standard for Floating-Point Arithmetic", Section 5.11, second paragraph, says (bold emphasis mine)
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself. Comparisons shall ignore the sign of zero (so +0 = −0). Infinite operands of the same sign shall compare equal.
Floating point arithmetic in C++ is often IEEE-754. This standard differs from the mathematical definition of the real number set.
It defines two different representations for the value zero: positive zero and negative zero. It also specifies that those two representations must compare equal, so by definition:
+0.0 == -0.0
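To see that the two zeros compare equal and are still distinguishable, here is a small sketch using std::signbit and std::copysign. It assumes an implementation with signed zeros (e.g. IEEE-754); the helper names are invented for this example.

```cpp
#include <cmath>

// True only for negative zero: compares equal to zero, but the sign bit is set.
bool is_negative_zero(double x) {
    return x == 0.0 && std::signbit(x);
}

// True when == cannot tell the two values apart although their sign bits differ.
bool equal_but_differently_signed(double a, double b) {
    return a == b && std::signbit(a) != std::signbit(b);
}
```

On such an implementation is_negative_zero(-0.0) is true, is_negative_zero(0.0) is false, and equal_but_differently_signed(0.0, -0.0) is true, which is exactly the situation in the question: == ignores the sign of zero, the sign bit does not.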
As to why it is so, in his paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg, 1991-03 (linked in the IEEE-754 page on the IEEE website) writes:
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
That's because the signed negative zero must compare equal to zero: i.e. -0.0 == 0.0, -0.0f == 0.0f, and -0.0L == 0.0L are all true.
It's a requirement of any floating point scheme supported by a C++ compiler.
(Note that most platforms these days use IEEE754 floating point, and this behaviour is explicitly documented in that specification.)
Because 0.0f and -0.0f are the same value: the negative of zero is zero.

Why it is so ? And It is without you explicitly do casting [duplicate]

This question already has answers here:
How dangerous is it to compare floating point values?
(12 answers)
Closed 7 years ago.
float a = 4.2 ;
double b = 4.2 ;
It will give result as "False" .
But when I declare a and b as ,
float a = 4.5;
double b =4.5 ;
It gives the result "True"
What is happening here? Can anybody explain it, please? It's not a duplicate, since values such as 4.5 or 4.0 give "True".
A comparison between double and float gets float implicitly converted to double. Which means that your a==b is interpreted as (double) a == b.
The fractional part of 4.5 is an exact power of 2: 0.5 == 2^-1. This value is represented precisely in binary floating-point format:
4.5 dec = 100.1 bin
And it is represented precisely in both double and float. So, the comparison compares 4.5 to 4.5 and the result is true.
4.2 requires an infinite periodic sum of powers of 2 to represent.
4.2 dec = 100.0011001100110011... bin = 100.(0011) bin
For this reason it is represented in double and float only approximately - the infinite sequence gets trimmed in one way or another. Since float has less precision than double, the value stored in float ends up being different from the value stored in double. In float you get something like 4.1999998092651367, while in double you get something like 4.2000000000000002. The conversion to double in (double) a == b cannot make the values match, and the result is false.
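A quick way to test whether a particular value behaves like 4.5 or like 4.2 is to push it through a float and back. This is a sketch with made-up function names, assuming the usual IEEE-754 float and double.

```cpp
// True when d can be stored in a float and widened back without change.
bool survives_float_round_trip(double d) {
    return static_cast<double>(static_cast<float>(d)) == d;
}

// Mirrors the question: the float operand is converted to double
// before the comparison is made.
bool float_equals_double(float a, double b) {
    return a == b;
}
```

With those assumptions survives_float_round_trip(4.5) is true and survives_float_round_trip(4.2) is false, so float_equals_double(4.5f, 4.5) holds while float_equals_double(4.2f, 4.2) does not.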
Floating point variables cannot accurately represent all values. In particular, the values 0.1 and 0.2 (in decimal) cannot be exactly represented, and only an approximation is stored in the variable.
A variable of type double will store such values to greater precision than will a float (i.e. it will give a better approximation). So the two values will not be equal.
To understand this, note that floating point variables typically work in base 2 (binary fractions). Then try to represent 1/10 (decimal) in base 2. The result is an infinite series. It is the same phenomenon that results in 1/3 being represented in decimal as 0.33333.... (an infinite number of digits). The only difference is that, in binary, a different set of values has an infinite number of digits. One of the affected values is 1/10.

double and float comparison [duplicate]

This question already has answers here:
Comparing float and double
(3 answers)
Closed 7 years ago.
According to this post, when comparing a float and a double, the float should be treated as double.
The following program, does not seem to follow this statement. The behaviour looks quite unpredictable.
Here is my program:
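#include <cstdio>
#include <iostream>
using namespace std;
int main()
{
    double a = 1.1; // 1.5
    float b = 1.1; // 1.5
    printf("%X %X\n", a, b); // as posted; %X expects an unsigned int, so this call is not well-defined
    if (a == b)
        cout << "success " << endl;
    else
        cout << "fail" << endl;
}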
When I run the following program, I get "fail" displayed.
However, when I change a and b to 1.5, it displays "success".
I have also printed the hex notations of the values. They are different in both the cases. My compiler is Visual Studio 2005
Can you explain this output ? Thanks.
float f = 1.1;
double d = 1.1;
if (f == d)
In this comparison, the value of f is promoted to type double. The problem you're seeing isn't in the comparison, but in the initialization. 1.1 can't be represented exactly as a floating-point value, so the values stored in f and d are the nearest value that can be represented. But float and double are different sizes, so have a different number of significant bits. When the value in f is promoted to double, there's no way to get back the extra bits that were lost when the value was stored, so you end up with all zeros in the extra bits. Those zero bits don't match the bits in d, so the comparison is false. And the reason the comparison succeeds with 1.5 is that 1.5 can be represented exactly as a float and as a double; it has a bunch of zeros in its low bits, so when the promotion adds zeros the result is the same as the double representation.
I found a decent explanation of the problem you are experiencing as well as some solutions.
See How dangerous is it to compare floating point values?
Just a side note, remember that some values can not be represented EXACTLY in IEEE 754 floating point representation. Your same example using a value of say 1.5 would compare as you expect because there is a perfect representation of 1.5 without any loss of data. However, 1.1 in 32-bit and 64-bit are in fact different values because the IEEE 754 standard can not perfectly represent 1.1.
double a = 1.1 --> 0x3FF199999999999A
Approximate representation = 1.10000000000000008881784197001
float b = 1.1 --> 0x3f8ccccd
Approximate representation = 1.10000002384185791015625
As you can see, the two values are different.
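If you want to reproduce those hex patterns yourself, copying the bytes into an integer avoids the printf("%X") problem from the question. This is a minimal sketch assuming 32-bit IEEE-754 float and 64-bit IEEE-754 double; bits_of is a name chosen for this example.

```cpp
#include <cstdint>
#include <cstring>

// Raw bit pattern of a double (assumes a 64-bit IEEE-754 double).
std::uint64_t bits_of(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

// Raw bit pattern of a float (assumes a 32-bit IEEE-754 float).
std::uint32_t bits_of(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

bits_of(1.1) gives 0x3FF199999999999A and bits_of(1.1f) gives 0x3F8CCCCD. Widening the float first, bits_of(static_cast<double>(1.1f)), gives 0x3FF19999A0000000: the low bits are padded with zeros, which is why it no longer matches the double.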
Also, unless you are working in some limited memory type environment, it's somewhat pointless to use floats. Just use doubles and save yourself the headaches.
If you are not clear on why some values can not be accurately represented, consult a tutorial on how to convert a decimal to floating point.
Here's one:
I would regard code which directly performs a comparison between a float and a double without a typecast to be broken; even if the language spec says that the float will be implicitly converted, there are two different ways that the comparison might sensibly be performed, and neither is sufficiently dominant to really justify a "silent" default behavior (i.e. one which compiles without generating a warning). If one wants to perform a conversion by having both operands evaluated as double, I would suggest adding an explicit type cast to make one's intentions clear. In most cases other than tests to see whether a particular double->float conversion will be reversible without loss of precision, however, I suspect that comparison between float values is probably more appropriate.
Fundamentally, when comparing floating-point values X and Y of any sort, one should regard comparisons as indicating that X or Y is larger, or that the numbers are "indistinguishable". A comparison which shows X is larger should be taken to indicate that the number that Y is supposed to represent is probably smaller than X or close to X. A comparison that says the numbers are indistinguishable means exactly that. If one views things in such fashion, comparisons performed by casting to float may not be as "informative" as those done with double, but are less likely to yield results that are just plain wrong. By comparison, consider:
double x, y;
float f = x;
If one compares f and y, it's possible that what one is interested in is how y compares with the value of x rounded to a float, but it's more likely that what one really wants to know is whether, knowing the rounded value of x, one can say anything about the relationship between x and y. If x is 0.1 and y is 0.2, f will have enough information to say whether x is larger than y; if y is 0.100000001, it will not. In the latter case, if both operands are cast to double, the comparison will erroneously imply that x was larger; if they are both cast to float, the comparison will report them as indistinguishable. Note that comparison results when casting both operands to double may be erroneous not only when values are within a part per million; they may be off by hundreds of orders of magnitude, such as if x=1e40 and y=1e300. Compare f and y as float and they'll compare indistinguishable; compare them as double and the smaller value will erroneously compare larger.
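The 0.1 versus 0.100000001 case can be tried out with the sketch below. The two helper functions are hypothetical names for the two ways of performing the comparison, and the result assumes IEEE-754 float and double. The 1e40 case is left out because converting an out-of-range double to float is not well-defined in standard C++.

```cpp
// Compare after widening the float to double (what == and < do implicitly).
// Returns 1 if f is larger, -1 if y is larger, 0 if they compare equal.
int compare_as_double(float f, double y) {
    const double fd = static_cast<double>(f);
    return (fd > y) - (fd < y);
}

// Compare after narrowing the double to float.
int compare_as_float(float f, double y) {
    const float yf = static_cast<float>(y);
    return (f > yf) - (f < yf);
}
```

With x = 0.1, y = 0.100000001 and f = static_cast<float>(x), x is smaller than y, yet compare_as_double(f, y) returns 1, while compare_as_float(f, y) returns 0 (indistinguishable).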
The reason why the rounding error occurs with 1.1 and not with 1.5 is due to the number of bits required to accurately represent a number like 0.1 in floating point format. In fact an accurate representation is not possible.
See How To Represent 0.1 In Floating Point Arithmetic And Decimal for an example, particularly the answer by @paxdiablo.

double variables in c++ are showing equal even when they are not

I just wrote the following code in C++:
double variable1;
double variable2;
cout<<"\nVariable1==Variable2 ? "<<(variable1==variable2);
The answer to the cout statement comes out 1, even when variable2 and variable1 are not equal. Can someone help me with this? Why is this happening?
I knew the concept of imprecise floating point math but didn't think this would happen when comparing two doubles directly. Also, I am getting the same result when I replace variable1 with:
double variable1=(numeric_limits<double>::max()-10000000000000);
The comparison still shows them as equal. How much would I have to subtract to see them start differing?
The maximum value for a double is 1.7976931348623157E+308. Due to lack of precision, adding or subtracting small values such as 50 and 5 does not actually change the value of the variable. Thus they stay the same.
There isn't enough precision in a double to differentiate between M and M-45 where M is the largest value that can be represented by a double.
Imagine you're counting atoms to the nearest million. "123,456 million atoms" plus 1 atom is still "123,456 million atoms" because there's no space in the "millions" counting system for the 1 extra atom to make any difference.
numeric_limits<double>::max() is a huuuuuge number. But the greater the absolute value of a double, the lower its absolute precision. Apparently in this case max-50 and max-5 are indistinguishable from double's point of view.
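As for how much you would have to subtract: std::nextafter gives the neighbouring representable value, so the gap can be measured directly. This is a sketch assuming IEEE-754 binary64 and the default round-to-nearest mode; gap_below and subtraction_is_lost are illustrative names.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable double toward zero.
double gap_below(double x) {
    return x - std::nextafter(x, 0.0);
}

// True when subtracting delta from x leaves x unchanged after rounding.
bool subtraction_is_lost(double x, double delta) {
    return x - delta == x;
}
```

For numeric_limits<double>::max() the gap is 2^971, roughly 2e292. Subtracting 10000000000000 is therefore lost entirely, and you need to subtract something on the order of 1e292 before the result changes.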
You should read the floating point comparison guide. In short, here are some examples:
float a = 0.15 + 0.15;
float b = 0.1 + 0.2;
if(a == b) // can be false!
if(a >= b) // can also be false!
The comparison with an epsilon value is what most people do.
#include <cmath>
#define EPSILON 0.00000001
bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON; // absolute tolerance
}
In your case, that max value is REALLY big. Adding or subtracting 50 does nothing. Thus they look the same because of the size of the number. See @RichieHindle's answer.
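A fixed epsilon like the one above only makes sense for values of moderate size. One common alternative, sketched here as a different technique and not as the AreSame function above, scales the tolerance with the operands. The name nearly_equal and the rel_tol parameter are assumptions for this example, and the sketch ignores NaN and infinities.

```cpp
#include <algorithm>
#include <cmath>

// Relative comparison: the allowed difference scales with the larger magnitude.
// rel_tol is an illustrative parameter; choose it to suit your own data.
bool nearly_equal(double a, double b, double rel_tol) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

With rel_tol = 1e-12, nearly_equal(0.1 + 0.2, 0.3, 1e-12) is true even though 0.1 + 0.2 == 0.3 is false for IEEE-754 doubles, and tiny values such as 1e-20 and 2e-20 are still told apart, which a fixed EPSILON of 0.00000001 would not do.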
Here are some additional resources for research.
See this blog post.
Also, there was a stack overflow question on this very topic (language agnostic).
From the C++03 standard:
3.9.1/ [...] The value representation of floating-point types is implementation-defined.
5/ [...] If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined, unless such an expression is a
constant expression (5.19), in which case the program is ill-formed.
and (about numeric_limits<T>::max()) Maximum finite value.
This implies that once you add something to std::numeric_limits<T>::max(), the behavior of the program is implementation defined if T is floating point, perfectly defined if T is an unsigned type, and undefined otherwise.
If you happen to have std::numeric_limits<T>::is_iec559 == true, the behavior is defined by IEEE 754. I don't have it handy, so I cannot tell whether variable1 is finite or infinite in this case. It seems (according to some lecture notes on IEEE 754 on the internet) that it depends on the rounding mode.
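On an implementation where is_iec559 is true, the question can be settled empirically. The sketch below assumes the default round-to-nearest mode (other rounding modes can give different answers); stays_finite is a hypothetical helper.

```cpp
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE-754 doubles");

// True when adding delta to the largest finite double still yields a finite value.
bool stays_finite(double delta) {
    return std::isfinite(std::numeric_limits<double>::max() + delta);
}
```

Under those assumptions stays_finite(1.0) and stays_finite(1e13) are true, because the sum rounds back to max(), while stays_finite(std::numeric_limits<double>::max()) is false, because the sum overflows to infinity.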
Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic.

C++ double operator+ [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Incorrect floating point math?
Float compile-time calculation not happening?
Strange stuff going on today, I'm about to lose it...
#include <iomanip>
#include <iostream>
using namespace std;
int main()
{
    cout << setprecision(14);
    cout << (1/9+1/9+4/9) << endl;
}
This code outputs 0 on MSVC 9.0 x64 and x86 and on GCC 4.4 x64 and x86 (default options and strict math...). And as far as I remember, 1/9+1/9+4/9 = 6/9 = 2/3 != 0
1/9 is zero, because 1 and 9 are integers and divided by integer division. The same applies to 4/9.
If you want to express floating-point division through arithmetic literals, you have to either use floating-point literals 1.0/9 + 1.0/9 + 4.0/9 (or 1/9. + 1/9. + 4/9. or 1.f/9 + 1.f/9 + 4.f/9) or explicitly cast one operand to the desired floating-point type (double) 1/9 + (double) 1/9 + (double) 4/9.
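Side by side, the two variants look like this. It is just a sketch, and the function names are invented for the comparison.

```cpp
// Integer division: each quotient is truncated to 0 before the additions.
int sum_with_int_division() {
    return 1 / 9 + 1 / 9 + 4 / 9;
}

// Floating-point division: the double operand forces the other operand
// to be converted to double before dividing.
double sum_with_double_division() {
    return 1.0 / 9 + 1.0 / 9 + 4.0 / 9;
}
```

sum_with_int_division() returns 0, as in the question, while sum_with_double_division() returns a value very close to 2/3.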
P.S. Finally my chance to answer this question :)
Use a decimal point in your calculations to force floating point math, optionally along with one of these suffixes on your numbers: f l F L. A number alone, without a decimal point or an exponent, is not considered a floating point literal.
C++03 2.13.3-1 on Floating literals:
A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits. Either the integer part or the fraction part (not both) can be omitted; either the decimal point or the letter e (or E) and the exponent (not both) can be omitted. The integer part, the optional decimal point and the optional fraction part form the significant part of the floating literal. The exponent, if present, indicates the power of 10 by which the significant part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.
They are all integers. So 1/9 is 0. 4/9 is also 0. And 0 + 0 + 0 = 0. So the result is 0. If you want fractions, cast your fractions to floats.
1/9(=0)+1/9(=0)+4/9(=0) = 0
Well, in C++ (and many other languages), 1/9+1/9+4/9 is zero, because it is integer arithmetic.
You probably want to write 1/9.0+1/9.0+4/9.0
Unless you specifically specify the decimal, the numbers C++ uses are integers, so 1/9 = 4/9 = 0 and 0 + 0 + 0 = 0.
You should simply add the decimal 1.0 etc...
By the C rules of types, you're doing all integer math there. 1/9 and 4/9 are both truncated to 0 (as integers). If you wrote 1.0/9.0 etc, it would use double precision math and do what you want.
You might make it a habit to use more parentheses. They cost little time, make clear what you intend, and ensure you get what you wanted. Well mostly... ;)