IEEE 754 Floating Point - ieee-754

Could someone explain to me why these do not equal in IEEE 754 floating point:
(1 + 1e300) - 1e100 and 1 + (1e300 - 1e100)
Many thanks!

It depends on precisely which IEEE floating point format you're working in, and the rounding mode that you are using.
F64
The comments saying that they are equal are most likely checking this in at least double (F64 binary) precision with e.g. round-to-nearest-even (RNE) rounding. This is what usually happens when using double in C-like languages.
In that case, all of your numbers get converted and rounded to F64 values. The other operands (1 and 1e100) are so much smaller than 1e300 that adding or subtracting them rounds straight back to 1e300. To display this number in decimal, it gets rounded again and shown as 1e300.
If your rounding mode is not RNE, your final answers may be slightly different from 1e300, although still probably equal.
F32
However, if you are working in single (F32) precision, 1e300 and 1e100 are much too large to represent and will likely be converted to Inf.
Following the IEEE 754 rules, you end up calculating Inf - Inf in both cases (followed by 1 + NaN in the second), which results in NaN. Then, if you compare the values: NaN == NaN is always false, which appears to be consistent with what you are seeing.
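A minimal C++ sketch of both situations, assuming an IEEE 754 target where float is binary32, double is binary64 and the default round-to-nearest-even mode is in effect (the overflowing F32 case is simulated directly with infinity):

#include <iostream>
#include <limits>

int main()
{
    // Double: 1 and 1e100 are far below the ~16 significant decimal digits
    // that 1e300 carries, so both sides round straight back to 1e300.
    double lhs = (1 + 1e300) - 1e100;
    double rhs = 1 + (1e300 - 1e100);
    std::cout << (lhs == rhs) << '\n';       // prints 1 (true)

    // Float: 1e300 and 1e100 overflow to Inf, the arithmetic degenerates
    // to Inf - Inf, which is NaN, and NaN never compares equal to anything.
    float inf = std::numeric_limits<float>::infinity();
    float flhs = (1.0f + inf) - inf;         // Inf - Inf -> NaN
    float frhs = 1.0f + (inf - inf);         // 1 + NaN   -> NaN
    std::cout << (flhs == frhs) << '\n';     // prints 0 (false)
}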

For Python,
>>> (1 + 1e300) - 1e100 == 1 + (1e300 - 1e100)
True

Related

Is a floating-point value of 0.0 represented differently from other floating-point values?

I've been going back through my C++ book, and I came across a statement that says zero can be represented exactly as a floating-point number. I was wondering how this is possible unless the value of 0.0 is stored as a type other than a floating point value. I wrote the following code to test this:
#include <iomanip>
#include <iostream>
int main()
{
    float value1 {0.0};
    float value2 {0.1};
    std::cout << std::setprecision(10) << std::fixed;
    std::cout << value1 << '\n'
              << value2 << std::endl;
}
Running this code gave the following output:
0.0000000000
0.1000000015
To 10 digits of precision, 0.0 is still 0, and 0.1 has some inaccuracies (which is to be expected). Is a value of 0.0 different from other floating point numbers in the way it is represented, and is this a feature of the compiler or the computer's architecture?
How can 2 be represented as an exact number? 4? 15? 0.5? The answer is just that some numbers can be represented exactly in the floating-point format (which is based on base-2/binary) and others can't.
This is no different from in decimal. You can't represent 1/3 exactly in decimal, but that doesn't mean you can't represent 0.
Zero is special in a way, because (like other small whole numbers) it's easier to see that it can be represented exactly than it is for some arbitrary fractional number. But that's about it.
So:
What is it about these values (0, 1/16, 1/2048, ...) that allows them to be represented exactly?
Simple mathematics. In any given base, in the sort of representation we're talking about, some numbers can be written out with a finite number of digits after the radix point; others can't. That's it.
You can play online with H. Schmidt's IEEE-754 Floating Point Converter for different numbers to see a bunch of different representations, and what errors come about as a result of encoding into those representations. For starters, try 0.5, 0.2 and 0.1.
It was my (perhaps naive) understanding that all floating point values contained some instability.
No, absolutely not.
You want to treat every floating point value in your program as potentially having some small error on it, because you generally don't know what sequence of calculations led to it. You can't trust it, in general. I expect someone half-taught this to you in the past, and that's what led to your misunderstanding.
But, if you do know the error (or lack thereof) involved at each step in the creation of the value (e.g. "all I've done is initialised it to zero"), then that's fine! No need to worry about it then.
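As a small illustration (a sketch assuming ordinary IEEE 754 float/double), exact comparisons are perfectly safe when you know the whole history of a value, and only become suspect when some step along the way was inexact:

#include <cassert>

int main()
{
    // Known history, every step exact: comparing exactly is fine.
    float a = 0.0f;            // zero is representable exactly
    assert(a == 0.0f);

    float b = 0.5f - 0.25f;    // operands and result are all powers of two
    assert(b == 0.25f);

    // Unknown or inexact history: this is the case to distrust.
    double c = 0.1 + 0.2;      // neither 0.1 nor 0.2 is representable exactly
    assert(c != 0.3);          // 0.30000000000000004 vs 0.29999999999999999
}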
Here is one way to look at the situation: with 64 bits to store a number, there are 2^64 bit patterns. Some of these are "not-a-number" representations, but most of the 2^64 patterns represent numbers. The number that is represented is represented exactly, with no error. This might seem strange after learning about floating point math; a caveat lurks ahead.
However, as huge as 2^64 is, there are infinitely many more real numbers. When a calculation produces a non-integer result, the odds are pretty good that the answer will not be a number represented by one of the 2^64 patterns. There are exceptions. For example, 1/2 is represented by one of the patterns. If you store 0.5 in a floating point variable, it will actually store 0.5. Let's try that for other single-digit denominators. (Note: I am writing fractions for their expressive power; I do not intend integer arithmetic.)
1/1 – stored exactly
1/2 – stored exactly
1/3 – not stored exactly
1/4 – stored exactly
1/5 – not stored exactly
1/6 – not stored exactly
1/7 – not stored exactly
1/8 – stored exactly
1/9 – not stored exactly
So with these simple examples, over half are not stored exactly. When you get into more complicated calculations, any one piece of the calculation can throw you off the islands of exact representation. Do you see why the general rule of thumb is that floating point values are not exact? It is incredibly easy to fall into that realm. It is possible to avoid it, but don't count on it.
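You can see the same pattern by printing each quotient with more digits than a double actually carries; the exactly-stored ones come out clean, and the others show their rounding error in the trailing digits (a quick sketch, assuming IEEE 754 double):

#include <iomanip>
#include <iostream>

int main()
{
    std::cout << std::setprecision(25);
    for (int n = 1; n <= 9; ++n)
        std::cout << "1/" << n << " = " << 1.0 / n << '\n';
    // 1/1, 1/2, 1/4 and 1/8 print as 1, 0.5, 0.25 and 0.125; the rest trail
    // off into digits that betray the nearest-representable approximation.
}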
Some numbers can be represented exactly by a floating point value. Most cannot.

Why does printf("%.2f", (double) 12.555) return 12.55?

I was writing a program where I had to round a double to the second decimal place. I noticed that printf("%.2f", (double) 12.555) prints 12.55, while printf("%.2f", (float) 12.555) prints 12.56. Can anyone explain why this happens?
12.555 is a number that is not representable precisely in binary floating point. It so happens that the closest value to 12.555 that is representable in double precision floating point on your system is slightly less than 12.555, while the closest value to 12.555 that is representable in single precision floating point is slightly more than 12.555.
Assuming the rounding mode used by the conversion is round to nearest (ties to even), which is the default in the IEEE 754 standard, the described output is to be expected.
Floats and doubles are stored internally using IEEE 754 representation. The part that is relevant to your question is that both floats and doubles store the closest to the decimal number they can, given the limits of their representation. Roughly speaking, those limits are to do with the conversion of the decimal part of the original number to a binary number with a finite number of bits.
It turns out, the closest float to 12.555 is actually 12.55500030517578125 while the closest double to 12.555 is 12.554999999999999715782905696. Notice how the double provides more accuracy, but the error is negative.
At this point, it's probably obvious why the rounding goes up for float but down for double: in both cases, %.2f rounds to the two-decimal value closest to the underlying stored representation.
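You can make this visible by printing both stored values with more digits than %.2f shows (a short sketch; the exact trailing digits assume the usual IEEE 754 binary32/binary64 formats):

#include <cstdio>

int main()
{
    // The stored values straddle 12.555 in opposite directions.
    std::printf("%.20f\n", (double)12.555f);   // 12.55500030517578125000 (float, above)
    std::printf("%.20f\n", 12.555);            // 12.55499999999999971578 (double, below)

    // %.2f then rounds each stored value to the nearest hundredth,
    // so the float goes up and the double goes down.
    std::printf("%.2f\n", (double)12.555f);    // 12.56
    std::printf("%.2f\n", 12.555);             // 12.55
}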

Is hardcode float precise if it can be represented by binary format in IEEE 754?

For example, 0, 0.5, 0.15625, 1, 2, 3... are values that can be represented exactly in IEEE 754. Are their hardcoded versions precise?
For example, does
float a = 0;
if (a == 0) {
    return true;
}
always return true? Another example:
float a = 0.5;
float b = 0.25;
float c = 0.125;
Is a * b always exactly equal to 0.125, and is a * b == c always true? And one more example:
int a = 123;
float b = 0.5;
Is a * b always 61.5? Or in general, is integer multiply by IEEE 754 binary float precise?
Or, a more general question: if a value is hardcoded and both the value and the result can be represented in IEEE 754 binary format (e.g. 0.5 - 0.125), is the result precise?
There is no inherent fuzzyness in floating-point numbers. It's just that some, but not all, real numbers can't be exactly represented.
Compare with a fixed-width decimal representation, let's say with three digits. The integer 1 can be represented, using 1.00, and 1/10 can be represented, using 0.10, but 1/3 can only be approximated, using 0.33.
If we instead use binary digits, the integer 1 would be represented as 1.00 (binary digits), 1/2 as 0.10, 1/4 as 0.01, but 1/3 can (again) only be approximated.
There are some things to remember, though:
It's not the same numbers as with decimal digits. 1/10 can be written exactly as 0.1 using decimal digits, but not using binary digits, no matter how many you use (short of infinity).
In practice, it is difficult to keep track of which numbers can be represented and which can't. 0.5 can, but 0.4 can't. So when you need exact numbers, such as (often) when working with money, you shouldn't use floating-point numbers.
According to some sources, some processors do strange things internally when performing floating-point calculations on numbers that can't be exactly represented, causing results to vary in a way that is, in practice, unpredictable.
(My opinion is that it's actually a reasonable first approximation to say that yes, floating-point numbers are inherently fuzzy, so unless you are sure your particular application can handle that, stay away from them.)
For more details than you probably need or want, read the famous What Every Computer Scientist Should Know About Floating-Point Arithmetic. Also, this somewhat more accessible website: The Floating-Point Guide.
No, but as Thomas Padron-McCarthy says, some numbers can be exactly represented using binary but not all of them can.
This is the way I explain it to non-developers who I work with (like Mahmut Ali, I also work on a very old financial package): imagine having a very large cake that is cut into 256 slices. You can give 1 person the whole cake, or 2 people half of the slices each, but as soon as you decide to split it between 3 you can't - it's either 85 or 86 slices - you can't split the cake any further. The same is true of floating point. You can only represent some numbers exactly - the others can only be closely approximated.
C++ does not require binary floating point representation. Built-in integers are required to have a binary representation, commonly two's complement, but one's complement and sign and magnitude are also supported. But floating point can be e.g. decimal.
This leaves open the question of whether C++ floating point can have a radix that does not have 2 as a prime factor (unlike 2 and 10, which both do). Are other radixes permitted? I don't know, and the last time I tried to check that, I failed.
However, assuming that the radix is 2 or 10, all of your example values are sums of a few powers of 2 and can therefore be exactly represented.
This means that the single answer to most of your questions is “yes”. The exception is the question “is integer multiply by IEEE 754 binary float [exact]”. If the result exceeds the precision available, then it can't be exact, but otherwise it is.
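To make the "yes" concrete, here is a small sketch assuming the usual IEEE 754 binary formats; every literal and every result below is a small sum of powers of 2, so nothing gets rounded and the comparisons are exact:

#include <cassert>

int main()
{
    float a = 0.5f;
    float b = 0.25f;
    float c = 0.125f;
    assert(a * b == c);        // 0.5 * 0.25 is exactly 0.125

    int   i = 123;
    float h = 0.5f;
    assert(i * h == 61.5f);    // 123 converts to float exactly (it is far below
                               // 2^24), and 123 * 0.5 = 61.5 is representable,
                               // so no rounding occurs anywhere

    float zero = 0;
    assert(zero == 0);         // 0 is always representable exactly
}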
See the classic “What Every Computer Scientist Should Know About Floating-Point Arithmetic” for background info about floating point representation & properties in general.
If a value can be exactly represented in 32-bit or 64-bit IEEE 754, then that doesn't mean that it can be exactly represented with some other floating point representation. That's because different 32-bit representations and different 64-bit representations use different number of bits to hold the mantissa and have different exponent ranges. So a number that can be exactly represented in one way, can be beyond the precision or range of some other representation.
You can use std::numeric_limits<T>::is_iec559 (where e.g. T is double) to check whether your implementation claims to be IEEE 754 compatible. However, when floating point optimizations are turned on, at least the g++ compiler (1) erroneously claims to be IEEE 754 while not treating e.g. NaN values correctly according to that standard. In effect, is_iec559 only tells you whether the number representation is IEEE 754, not whether the semantics conform.
(1) Essentially, instead of providing different types for different semantics, gcc and g++ try to accommodate different semantics via compiler options. And with separate compilation of parts of a program, that can't conform to the C++ standard.
In principle, yes, this should be possible, if you restrict yourself to exactly this class of numbers that have a finite base-2 representation.
But it is dangerous: what if someone takes your code and changes your 0.5 to 0.4 or your 0.0625 to 0.065 for whatever reason? Then your code is broken. And no, even extensive comments won't help with that - someone will always ignore them.

Should I worry about precision when I use C++ mathematical functions with integers?

For example, the code below will give an undesirable result due to the precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem while programming for an algorithm task. There I want to get
the largest integer which is a power of 2 and is less than or equal to an integer N.
So a round function cannot solve my problem. I know I can solve this with a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give the correct result. For example, if N == 8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result becomes 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) from being 3 does not exist for the basic operations (including *). But it does exist for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so there is no guarantee that a * 3 produces exactly 1.0. (With strict IEEE 754 double arithmetic the product does in fact happen to round back to exactly 1.0; the result 0 shows up when the intermediate is kept in higher precision, as another answer below explains.) However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of a basic operation is the representable value nearest to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is: if you have a reasonable quality implementation of log2, you can expect the result to be 3.0.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), so that truncating the result to an integer produced 99 instead of 100.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question in the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end result. These functions do not even come with the guarantees that the basic operations come with.
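For the concrete task in the question (the largest power of 2 that is <= N), here is a sketch of both routes; largest_pow2_fp and largest_pow2_int are just illustrative names, and the floating-point version rounds the log2 result to the nearest integer and then guards against an implementation that landed slightly on the wrong side:

#include <cmath>
#include <iostream>

// Floating-point route: round log2's result to the nearest integer
// instead of truncating it, then double-check against n.
long long largest_pow2_fp(long long n)
{
    long long e = std::llround(std::log2(static_cast<double>(n)));
    long long p = 1LL << e;
    return (p <= n) ? p : p / 2;   // guard against a log2 that rounded up
}

// Integer-only route: no floating point involved, so nothing to worry about.
long long largest_pow2_int(long long n)
{
    long long p = 1;
    while (p <= n / 2)
        p *= 2;
    return p;
}

int main()
{
    for (long long n : {1LL, 7LL, 8LL, 9LL, 1LL << 40})
        std::cout << n << " -> " << largest_pow2_fp(n)
                  << " / " << largest_pow2_int(n) << '\n';
}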
Unfortunately the default conversion from a floating point number to an integer in C++ is really crazy, as it works by simply dropping the fractional part (truncating towards zero).
This is bad for two reasons:
1. a floating point number really, really close to a positive integer, but below it, will be converted to the previous integer instead (e.g. 3 - 1e-10 = 2.9999999999 will be converted to 2)
2. a floating point number really, really close to a negative integer, but above it, will be converted to the next integer instead (e.g. -3 + 1e-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) also means that using int(x + 0.5) will not work reasonably, as it will round negative numbers up instead of to the nearest integer.
There is a reasonable round function, but unfortunately it returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented exactly in binary floating point are quotients whose denominator is an integral power of 2 (and even those only within the precision and range limits of the format).
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.
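A short demonstration of both points (the printed values assume IEEE 754 double precision and no extended-precision intermediates, so they can differ under the compiler options mentioned above):

#include <cmath>
#include <cstdio>

int main()
{
    double sum = 0.0;
    for (int i = 0; i < 10; ++i)
        sum += 0.1;                            // each 0.1 is already an approximation

    std::printf("%.17g\n", sum);               // typically 0.99999999999999989
    std::printf("%d\n", (int)sum);             // truncation: 0
    std::printf("%ld\n", std::lround(sum));    // proper rounding: 1
    std::printf("%d\n", (int)std::round(sum)); // also 1
}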

Rounding a float upward to an integer, how reliable is that?

I've seen static_cast<int>(std::ceil(floatValue)); before.
My question though, is can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the miniscule "error" will trick ceil() into rounding upwards when it logically shouldn't. Not only that, but once rounded up, I worry it may be possible for a small "error" in representation to cause the number to be slightly less than a whole number, causing the cast to int to truncate it.
Is this worry unfounded? I remember a while back, an example in python where printing a specific whole number would cause it to print something very slightly less (like x.999, though I can't remember the exact number)
The reason I need to be sure is that I'm writing a texture buffer. The common case is whole numbers stored as floating point, but it will occasionally get in-between values that need to be rounded up to the smallest integer width and height that contains them. It increments in steps of powers of 2, so needlessly rounding up can cause what should only have taken a 256x256 texture to need a 512x512 texture.
If floatValue is exact, then there is no problem with rounding in your code. The only possible problem is overflow (if the result doesn't fit inside an int). Of course with such large values, the float will typically not have enough precision to distinguish adjacent integers anyway.
However, the danger usually lies in floatValue itself not being exact. For example, if it is the result of some computation whose exact answer is a whole number, it may end up a tiny amount greater than a whole number due to floating point rounding errors in the computation.
So whether you have a problem depends on how you got floatValue.
can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the miniscule "error" will trick ceil()
Yes, some large numbers are impossible to represent exactly as floating-point numbers. In the zone where this happens, all floating-point numbers are integers. The error is not minuscule: the error in representing an integer by a floating-point, if error there is, is at least one. And, obviously, in the zone where some integers cannot be represented as floating-point and where all floating-point numbers are integers, ceil(f) == f.
The zone in question is |f| > 2^24 (16*1024*1024) for IEEE 754 single-precision and |f| > 2^53 for IEEE 754 double-precision.
A problem you are more likely to come across does not come from the impossibility of representing integers in floating-point format, but from the cumulative effect of rounding errors. If your compiler offers IEEE 754 semantics (the floating-point standard implemented exactly by the SSE2 instructions of modern and not-so-modern Intel processors), then any +, -, *, / and sqrt operation that results in a number exactly representable as floating-point is guaranteed to produce that result. But if several of the operations you apply do not have exactly representable results, the floating-point computation may drift away from the mathematical computation, even when the final result is an integer and is exactly representable. Then you may end up with a floating-point result slightly above the target integer and cause ceil() to return something other than what you would have obtained with exact mathematical computations.
There are ways to be confident that some floating-point operations are exact (because the result is always representable). For instance (double)float1 * (double)float2, where float1 and float2 are two single-precision variables, is always exact, because the mathematical result of the multiplication of two single-precision numbers is always representable as a double. By doing the computation the “right” way, it is possible to minimize or eliminate the error in the end result.
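To make the distinction concrete, here is a sketch assuming IEEE 754 float/double arithmetic; the middle case is a computation whose mathematical result is exactly 3, but whose floating-point result lands one ULP above 3, so ceil() jumps to 4:

#include <cmath>
#include <iostream>

int main()
{
    // Exact input: ceil never bumps a representable whole number upward.
    float exact = 256.0f;
    std::cout << std::ceil(exact) << '\n';                      // 256

    // Accumulated error: (0.1 + 0.2) * 10 is mathematically 3, but in
    // IEEE 754 double it evaluates to 3.0000000000000004.
    double drifted = (0.1 + 0.2) * 10.0;
    std::cout << static_cast<int>(std::ceil(drifted)) << '\n';  // 4, not 3

    // Beyond 2^24 every float is an integer, so ceil(f) == f there.
    float big = 16777217.0f;            // 2^24 + 1 rounds to 2^24 on storage
    std::cout << (std::ceil(big) == big) << '\n';               // 1 (true)
}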
The range is 0.0 to ~1024.0
All integers in this range can be represented exactly as float, so you'll be fine.
You'll only start having issues once you stray beyond the 24 bits of mantissa afforded by float.