Rounding error using the floor function in C++ - c++

I was asked what will be the output of the following code:
floor((0.7+0.6)*10);
It returns 12.
I know that the floating point representation does not allow to represent all numbers with infinite precision and that I should expect some discrepancies.
My questions are:
How should I know that this piece of code returns 12, not 13? Why is (0.7+0.6)*10 a bit less than 13, not a bit more?
When can I expect the floor function to work incorrectly and when it works correctly for sure?
Note: I'm not asking how floating representation looks like or why the output isn't exactly 13. I'd like to know how should I infer that (0.7+0.6)*10 is a bit less than 13.

How should I know that this piece of code returns 12, not 13? Why is (0.7+0.6)*10 a bit less than 13, not a bit more?
Assume that your compilation platform uses strictly the IEEE 754 standard formats and operations. Then, convert all the constants involved to binary, keeping 53 significant digits, and apply the basic operations, as defined in IEEE 754, by computing the mathematical result and rounding to 53 significant binary digits at each step. A computer does not need to be involved at any stage, but you can make your life easier by using C99's hexadecimal floating-point format for input and output.
When can I expect the floor function to work incorrectly and when it works correctly for sure?
floor() is exact for all positive arguments. It is working correctly in your example. The behavior that surprises you does not originate with floor and has nothing to do with floor. The surprising behavior starts with the fact that 6/10 and 7/10 are not representable exactly as binary floating-point values, and continues with the fact that since these values have long expansions, floating-point operations + and * can produce a slightly rounded result wrt the mathematical result you could expect from the arguments they are actually applied to. floor() is the only place in your code that does not involve approximation.
Example program to see what is happening:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("%a\n%a\n%a\n%a\n%a\n",
0.7,
0.6,
0.7 + 0.6,
(0.7+0.6)*10,
floor((0.7+0.6)*10));
}
Result:
0x1.6666666666666p-1
0x1.3333333333333p-1
0x1.4ccccccccccccp+0
0x1.9ffffffffffffp+3
0x1.8p+3
IEEE 754 double-precision is really defined with respect to binary, but for conciseness the significand is written in hexadecimal. The exponent after p represents a power of two. For instance the last two results are both of the form <number roughly halfway between 1 and 2>*23.
0x1.8p+3 is 12. The next integer, 13, is 0x1.ap+3, but the computation does not quite reach that value, and so the behavior of floor() is to round down to 12.

How should I know that this piece of code returns 12, not 13?
You should know that it can and may be either 12 or 13. You can verify by testing on a given cpu.
You can not know what the value will be, in general, because the C++ standard does not specify the representation of floating point numbers. If you know the format on given architecture (let's say IEEE 754), then you can perform the calculation by hand, but that result would only apply to that particular representation.
Why is (0.7+0.6)*10 a bit less than 13, not a bit more?
It's an implementation detail and not useful knowledge to the programmer. All you need to know that it may be either. Relying on the knowledge that it's one or the other, would make you depend on the implementation detail.
When can I expect the floor function to work incorrectly and when it works correctly for sure?
It always works correctly, that is accroding to how it's specified to work.
Now, speaking of the value that you are expecting to see. If you know that your number is very close to an integer, but might be off a little bit due to representation error, you can add 0.5 before flooring.
double calculated_integer = (0.7+0.6)*10;
floor(calculated_integer + 0.5);
That way, you will always get the expected value, unless the error exceeds 0.5, which would be quite a big error.
If you don't know that the result should be an integer, then you simply have to accept the fact that floor and ceil operations increase the maximum error of your calculation to 1.0.

There are standard like the IEEE floating point standard which try to make floating point calculations at least a little bit predictive
by defining rules how operations like additions and rounding should be implemented.
To know the result, you need to compute the expression
according to the standard rules. Then you can be sure, that
it gives the same result on every machine, that implements the standard.

How should I know that this piece of code returns 12, not 13?
Since that depends on the numbers involved, by trying.
Why is (0.7+0.6)*10 a bit less than 13, not a bit more?
Well, because that's the result of the calculation.
When can I expect the floor function to work incorrectly and when it works correctly for sure?
Correctly for sure: on multiples of powers of two only, iff your floating point number is represented in binary.
To really take all the confusion out of this:
You cannot know the result without calculating it; it depends on both the machine/algorithmics involved and the numbers.

Very short answer: You can not. It depends on the platform and the float iso that is used on this platform.

In general, you can't. The fundamental problem is that the conversion from text representation to floating-point value is often not implemented as accurately as it could be. That's in part momentum, and in part because getting the floating-point value that's closest to the value expressed in text can be expensive, in some cases requiring large integer calculations. So conversions are often off by a few ULPs (i.e., low-end bits) from the ideal value, in ways that you can't predict a priori. So the question of what that code will produce is unanswerable. The question of what it should produce may be a bit more tractable, but it's still an exercise in time-wasting.

Related

I'm trying to round a float to two decimal points but it's incorrect. How to fix this rounding error in C++?

I'm having trouble with rounding floats. I'm solving a task where you need to round your result to two decimal points. But I can't do it when the third decimal point is 5 because it's stored incorrectly.
For example: My result is equal to 1.005 and that should be rounded to 1.01. But C++ rounds it to 1.00 because the original float is stored as 1.0049999... and not 1.005.
I've already tried always adding a very small float to the result but there are some other test cases which are then rounded up but should be rounded down.
I know how floating-point works and that it is often not completely accurate. I'm just wondering whether anyone has found a way around this specific problem.
When you say "my result is equal to 1.005", you are assuming some count of true decimal digits. This can be 1.005 (three digits of fractional part), 1.0050 (four digits), 1.005000, and so on.
So, you should first round, using some usual rounding, to that count of digits. It is simpler to do this in integers: for example, with 6 fractional digits, it means some usual round(), rint(), etc. after multiplication by 1,000,000. With this step, you are getting exact decimal number. After this, you are able to make the required final rounding to what you need.
In your example, this will round 1,004,999.99... to 1,005,000. Then, divide by 10000 and round again.
(Notice that there are suggestions to make this rounding in yet specific way. The General Decimal Arithmetic specification and IBM arithmetic manuals suggest this rounding is done in the way that exact fractional part 0.5 shall be rounded away from zero unless least significant result bit becomes 0 or 5, in that case it is rounded toward zero. But, if you have no such rounding available, a general away-from-zero is also suitable.)
If you are implementing arithmetic for money accounting, it is reasonable to avoid floating point at all and use fixed-point arithmetic (emulated with integers, if needed). This is better because you the methods I've described for rounding are inevitably containing conversion to integers (and back), so, it's cheaper to use such integers directly. You will get inexact operation checking as well (by cost of explicit integer overflow).
If you can use a library like boost with its Multiprecision support.
Another option would be to use a long double, maybe that's precise enough for you.

`std::sin` is wrong in the last bit

I am porting some program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**).
I am facing different results for this operation:
std::sin(0.497418836818383950) = 0.477158760259608410 (C++)
sin(0.497418836818383950) = 0.47715876025960846000 (Matlab)
N[Sin[0.497418836818383950], 20] = 0.477158760259608433 (Mathematica)
So, as far as I know both C++ and Matlab are using IEEE754 defined double arithmetic. I think I have read somewhere that IEEE754 allows differents results in the last bit. Using mathematica to decide, seems like C++ is more close to the result. How can I force Matlab to compute the sin with precision to the last bit included, so that the results are the same?
In my program this behaviour leads to big errors because the numerical differential equation solver keeps increasing this error in the last bit. However I am not sure that C++ ported version is correct. I am guessing that even if the IEEE754 allows the last bit to be different, somehow guarantees that this error does not get bigger when using the result in more IEEE754 defined double operations (because otherwise, two different programs correct according to the IEEE754 standard could produce completely different outputs). So the other question is Am I right about this?
I would like get an answer to both bolded questions. Edit: The first question is being quite controversial, but is the less important, can someone comment about the second one?
Note: This is not an error in the printing, just in case you want to check, this is how I obtained these results:
http://i.imgur.com/cy5ToYy.png
Note (**): What I mean by this is that the final output, which are the results of some calculations showing some real numbers with 4 decimal places, need to be exactly the same. The error I talk about in the question gets bigger (because of more operations, each of one is different in Matlab and in C++) so the final differences are huge) (If you are curious enough to see how the difference start getting bigger, here is the full output [link soon], but this has nothing to do with the question)
Firstly, if your numerical method depends on the accuracy of sin to the last bit, then you probably need to use an arbitrary precision library, such as MPFR.
The IEEE754 2008 standard doesn't require that the functions be correctly rounded (it does "recommend" it though). Some C libms do provide correctly rounded trigonometric functions: I believe that the glibc libm does (typically used on most linux distributions), as does CRlibm. Most other modern libms will provide trig functions that are within 1 ulp (i.e. one of the two floating point values either side of the true value), often termed faithfully rounded, which is much quicker to compute.
None of those values you printed could actually arise as IEEE 64bit floating point values (even if rounded): the 3 nearest (printed to full precision) are:
0.477158760259608 405451814405751065351068973541259765625
0.477158760259608 46096296563700889237225055694580078125
0.477158760259608 516474116868266719393432140350341796875
The possible values you could want are:
The exact sin of the decimal .497418836818383950, which is
0.477158760259608 433132061388630377105954125778369485736356219...
(this appears to be what Mathematica gives).
The exact sin of the 64-bit float nearest .497418836818383950:
0.477158760259608 430531153841011107415427334794384396325832953...
In both cases, the first of the above list is the nearest (though only barely in the case of 1).
The sine of the double constant you wrote is about 0x1.e89c4e59427b173a8753edbcb95p-2, whose nearest double is 0x1.e89c4e59427b1p-2. To 20 decimal places, the two closest doubles are 0.47715876025960840545 and 0.47715876025960846096.
Perhaps Matlab is displaying a truncated value? (EDIT: I now see that the fourth-last digit is a 6, not a 0. Matlab is giving you a result that's still faithfully-rounded, but it's the farther of the two closest doubles to the desired result. And it's still printing out the wrong number.
I should also point out that Mathematica is probably trying to solve a different problem---compute the sine of the decimal number 0.497418836818383950 to 20 decimal places. You should not expect this to match either the C++ code's result or Matlab's result.

Should I worry about precision when I use C++ mathematical functions with integers?

For example, The code below will give undesirable result due to precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So round function can not solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0`”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question the the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not come even with the guarantees that basic operations come with.
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it will be converted to the previous integer instead (e.g. 3-1×10-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it will be converted to the next integer instead (e.g. -3+1×10-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.

Warning for inexact floating-point constants

Questions like "Why isn't 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1 = 0.8?" got me thinking that...
... It would probably be nice to have the compiler warn about the floating-point constants that it rounds to the nearest representable in the binary floating-point type (e.g. 0.1 and 0.8 are rounded in radix-2 floating-point, otherwise they'd need an infinite amount of space to store the infinite number of digits).
I've looked up gcc warnings and so far found none for this purpose (-Wall, -Wextra, -Wfloat-equal, -Wconversion, -Wcoercion (unsupported or C only?), -Wtraditional (C only) don't appear to be doing what I want).
I haven't found such a warning in Microsoft Visual C++ compiler either.
Am I missing a hidden or rarely-used option?
Is there any compiler at all that has this kind of warning?
EDIT: This warning could be useful for educational purposes and serve as a reminder to those new to floating-point.
There is no technical reason the compiler could not issue such warnings. However, they would be useful only for students (who ought to be taught how floating-point arithmetic works before they start doing any serious work with it) and people who do very fine work with floating-point. Unfortunately, most floating-point work is rough; people throw numbers at the computer without much regard for how the computer works, and they accept whatever results they get.
The warning would have to be off by default to support the bulk of existing floating-point code. Were it available, I would turn it on for my code in the Mac OS X math library. Certainly there are points in the library where we depend on every bit of the floating-point value, such as places where we use extended-precision arithmetic, and values are represented across more than one floating-point object (e.g., we would have one object with the high bits of 1/π, another object with 1/π minus the first object, and a third object with 1/π minus the first two objects, giving us about 150 bits of 1/π). Some such values are represented in hexadecimal floating-point in the source text, to avoid any issues with compiler conversion of decimal numerals, and we could readily convert any remaining numerals to avoid the new compiler warning.
However, I doubt we could convince the compiler developers that enough people would use this warning or that it would catch enough bugs to make it worth their time. Consider the case of libm. Suppose we generally wrote exact numerals for all constants but, on one occasion, wrote some other numeral. Would this warning catch a bug? Well, what bug is there? Most likely, the numeral is converted to exactly the value we wanted anyway. When writing code with this warning turned on, we are likely thinking about how the floating-point calculations will be performed, and the value we have written is one that is suitable for our purpose. E.g., it may be a coefficient of some minimax polynomial we calculated, and the coefficient is as good as it is going to get, whether represented approximately in decimal or converted to some exactly-representable hexadecimal floating-point numeral.
So, this warning will rarely catch bugs. Perhaps it would catch an occasion where we mistyped a numeral, accidentally inserting an extra digit into a hexadecimal floating-point numeral, causing it to extend beyond the representable significand. But that is rare. In most cases, the numerals we use are either simple and short or are copied and pasted from software that has calculated them. On some occasions, we will hand-type special values, such as 0x1.fffffffffffffp0. A warning when an extra “f” slips into that numeral might catch a bug during compilation, but that error would almost certainly be caught quickly in testing, since it drastically alters the special value.
So, such a compiler warning has little utility: Very few people will use it, and it will catch very few bugs for the people who do use it.
The warning is in the source: when you write float, double, or long double including any of their respective literals. Obviously, some literals are exact but even this doesn't help much: the sum of two exact values may inexact, e.g., if the have rather different scales. Having the compiler warn about inexact floating point constants would generate a false sense of security. Also, what are you meant to do about rounded constants? Writing the exact closest value explicitly would be error prone and obfuscate the intent. Writing them differently, e.g., writing 1.0 / 10.0 instead of 0.1 also obfuscates the intent and could yield different values.
There will be no such compiler switch and the reason is obvious.
We are writing down the binary components in decimal:
First fractional bit is 0.5
Second fractional bit is 0.25
Third fractional bit is 0.125
....
Do you see it ? Due to the odd endings with the number 5 every bit needs
another decimal to represent it exactly. One bit needs one decimal, two bits
needs two decimals and so on.
So for fractional floating points it would mean that for most decimal numbers
you need 24(!) decimal digits for single precision floats and
53(!!) decimal digits for double precision.
Worse, the exact digits carry no extra information, they are pure artifacts
caused by the base change.
Noone is going to write down 3.141592653589793115997963468544185161590576171875
for pi to avoid a compiler warning.
I don't see how a compiler would know or that the compiler can warn you about something like that. It is only a coincidence that a number can be exactly represented by something that is inherently inaccurate.

How to convert a double to a string without using the CRT

My question has no practical application. I'm just interested. Suppose, I have a double value and I want to obtain its string representation similarly to the printf function. How would I do that without the C runtime library? Let's suppose I'm on the x86 architecture.
Given that you state your question has no practical application, I figure you're trying to learn about floating point number representations.
Thus, if you're looking for a solution without using any library support, start with the format specification. From that you can discern the various "special" values (Infinity, NAN, etc) as well as decoding/calculating the actual numeric value. Once you have the significand and exponent, you know where to put the decimal point. You'll have to write your own itoa type routine. For radices which are a power of two, this can be as simple as a lookup table. For decimal, you'll have to do a little extra math.
you can get all values on left side by (double % 10) and then divide by 10 every time.
they will be in right to left.
to get values on right of dot you have to multiply by 10 and then (double % 10). they will be in left-to-right.
If you want to do it simply with a "close enough" result, see my article http://www.exploringbinary.com/quick-and-dirty-floating-point-to-decimal-conversion/ . It describes a simple program that uses floating-point to convert from floating-point to decimal, and explains why that approach can never be accurate for all conversions. (The program doesn't do decimal rounding like printf, but that should be easy enough to add.)