Float rounding off error - c++

#include<iostream>
long myround(float f)
{
if (f >= UINT_MAX) return f;
return f + 0.5f;
}
int main()
{
f = 8388609.0f;
std:cout.precision(16);
std::cout << myround(f) << std::endl;
}
Output: 8388610.0
I'm trying to make sense of the output. The next floating number larger than
8388609.0 is 8388610 But why the rounded value is not 8388609?

IEEE-754 defines several possible rounding modes, but in practice, the one almost always used is "round to nearest, ties to even". This is also known as "Banker's rounding", for no reason anybody can discern.
"Ties to even" means that if the to-be-rounded result of a floating point computation is precisely halfway between two representable numbers, the rounding will be in whichever direction makes the LSB of the result zero. In your case, 8388609.5 is halfway between 8388609 and 8388610, but only the latter has a zero in the last bit, so the rounding is upwards. If you had passed in 8388610.0 instead, the result would be rounded downwards; if you had passed in 8388611.0, it would be rounded upwards.

If you change your example to use double then the error disappears. The problem is float is more limited than double in the number of significant digits it can store. Adding 0.5 to your value simply exceeds the precision limits for a float, causing it to preform some rounding. In this case, 8388609.0f + 0.5f == 8388610.0f.
#include<iostream>
long myround(double f)
{
if (f >= UINT_MAX) return f;
return f + 0.5;
}
int main()
{
double f = 8388609.0;
std::cout.precision(16);
std::cout << myround(f) << std::endl;
}
If you continue to add digits to your number, it will also eventually fail for double.
Edit:
You can test this easily with a static_assert. This compiles on my platform static_assert(8388609.0f + 0.5f == 8388610.0f, "");. It will likely compile on yours.

Related

Rounding off floating numbers in cpp

For a particular question, I need to perform calculations on a floating number, round it off to 2 digits after the decimal place, and assign it to a variable for comparison purposes. I tried to find a solution to this but all I keep finding is how to print those rounded numbers (using printf or setprecision) instead of assigning them to a variable.
Please help.
I usually do something like that:
#include <cmath> // 'std::floor'
#include <limits> // 'std::numeric_limits'
// Round value to granularity
template<typename T> inline T round(const T x, const T gran)
{
//static_assert(gran!=0);
return gran * std::floor( x/gran + std::numeric_limits<T>::round_error() );
}
double rounded_to_cent = round(1.23456, 0.01); // Gives something near 1.23
Be sure you know how floating point types work though.
Addendum: I know that this topic has already been extensively covered in other questions, but let me put this small paragraph here.
Given a real number, you can represent it with -almost- arbitrary accuracy with a (base10) literal like 1.2345, that's a string that you can type with your keyboard.
When you store that value in a floating point type, let's say a double, you -almost- always loose accuracy because probably your number won't have an exact representation in the finite set of the numbers representable by that type.
Nowadays double uses 64 bits, so it has 2^64 symbols to represent the not numerable infinity of real numbers: that's a H2O molecule in an infinity of infinite oceans.
The representation error is relative to the value; for example in a IEEE 754 double, over 2^53 not all the integer values can be represented.
So when someone tells that the "result is wrong" they're technically right; the "acceptable" result is application dependent.
round it off to 2 digits after the decimal place, and assign it to a variable for comparison purposes
To avoid errors that creep in when using binary floating point in a decimal problem, consider alternatives.
Direct approach has corner errors due to double rounding and overflow. These errors may be tolerable for OP larger goals
// Errors:
// 1) x*100.0, round(x*100.0)/100.0 inexact.
// Select `x` values near "dddd.dd5" form an inexact product `x*100.0`
// and may lead to a double rounding error and then incorrect result when comparing.
// 2) x*100.0 may overflow.
int compare_hundredth1(double x, double ref) {
x = round(x*100.0)/100.0;
return (x > ref) - (x < ref);
}
We can do better.
When a wider floating point type exist:
int compare_hundredth2(double x, double ref) {
auto x_rounded = math::round(x*100.0L);
auto ref_rounded = ref*100.0L;
return (x_rounded > ref_rounded) - (x_rounded < ref_rounded);
}
To use the same width floating point type takes more work:
All finite large larges of x, ref are whole numbers and need no rounding to the nearest 0.01.
int compare_hundredth3(double x, double ref) {
double x_whole;
auto x_fraction = modf(x, &x_whole);
// If rounding needed ...
if (x_fraction != 0.0) {
if (x - 0.01 > ref) return 1; // x much more than ref
if (x + 0.01 < ref) return -1; // x much less than ref
// x, ref nearly the same
double ref_whole;
auto ref_fraction = modf(x, &ref_whole);
x -= ref_whole;
auto x100 = (x - ref_whole)*100; // subtraction expected to be exact here.
auto ref100 = ref_fraction*100;
return (x100 > ref100) - (x100 < ref100);
}
return (x > ref) - (x < ref);
}
The above assume ref is without error. If this is not so, consider using a scaled ref.
Note: The above sets aside not-a-number concerns.
More clean-up later.
Here's an example with a custom function that rounds up the floating number f to n decimal places. Basically, it multiplies the floating number by 10 to the power of N to separate the decimal places, then uses roundf to round the decimal places up or down, and finally divides back the floating number by 10 to the power of N (N is the amount of decimal places). Works for C and C++:
#include <stdio.h>
#include <math.h>
float my_round(float f, unsigned int n)
{
float p = powf(10.0f, (float)n);
f *= p;
f = roundf(f);
f /= p;
return f;
}
int main()
{
float f = 0.78901f;
printf("%f\n", f);
f = my_round(f, 2); /* Round with 2 decimal places */
printf("%f\n", f);
return 0;
}
Output:
0.789010
0.790000

How to round a floating point type to two decimals or more in C++? [duplicate]

How can I round a float value (such as 37.777779) to two decimal places (37.78) in C?
If you just want to round the number for output purposes, then the "%.2f" format string is indeed the correct answer. However, if you actually want to round the floating point value for further computation, something like the following works:
#include <math.h>
float val = 37.777779;
float rounded_down = floorf(val * 100) / 100; /* Result: 37.77 */
float nearest = roundf(val * 100) / 100; /* Result: 37.78 */
float rounded_up = ceilf(val * 100) / 100; /* Result: 37.78 */
Notice that there are three different rounding rules you might want to choose: round down (ie, truncate after two decimal places), rounded to nearest, and round up. Usually, you want round to nearest.
As several others have pointed out, due to the quirks of floating point representation, these rounded values may not be exactly the "obvious" decimal values, but they will be very very close.
For much (much!) more information on rounding, and especially on tie-breaking rules for rounding to nearest, see the Wikipedia article on Rounding.
Using %.2f in printf. It only print 2 decimal points.
Example:
printf("%.2f", 37.777779);
Output:
37.77
Assuming you're talking about round the value for printing, then Andrew Coleson and AraK's answer are correct:
printf("%.2f", 37.777779);
But note that if you're aiming to round the number to exactly 37.78 for internal use (eg to compare against another value), then this isn't a good idea, due to the way floating point numbers work: you usually don't want to do equality comparisons for floating point, instead use a target value +/- a sigma value. Or encode the number as a string with a known precision, and compare that.
See the link in Greg Hewgill's answer to a related question, which also covers why you shouldn't use floating point for financial calculations.
How about this:
float value = 37.777779;
float rounded = ((int)(value * 100 + .5) / 100.0);
printf("%.2f", 37.777779);
If you want to write to C-string:
char number[24]; // dummy size, you should take care of the size!
sprintf(number, "%.2f", 37.777779);
Always use the printf family of functions for this. Even if you want to get the value as a float, you're best off using snprintf to get the rounded value as a string and then parsing it back with atof:
#include <math.h>
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
double dround(double val, int dp) {
int charsNeeded = 1 + snprintf(NULL, 0, "%.*f", dp, val);
char *buffer = malloc(charsNeeded);
snprintf(buffer, charsNeeded, "%.*f", dp, val);
double result = atof(buffer);
free(buffer);
return result;
}
I say this because the approach shown by the currently top-voted answer and several others here -
multiplying by 100, rounding to the nearest integer, and then dividing by 100 again - is flawed in two ways:
For some values, it will round in the wrong direction because the multiplication by 100 changes the decimal digit determining the rounding direction from a 4 to a 5 or vice versa, due to the imprecision of floating point numbers
For some values, multiplying and then dividing by 100 doesn't round-trip, meaning that even if no rounding takes place the end result will be wrong
To illustrate the first kind of error - the rounding direction sometimes being wrong - try running this program:
int main(void) {
// This number is EXACTLY representable as a double
double x = 0.01499999999999999944488848768742172978818416595458984375;
printf("x: %.50f\n", x);
double res1 = dround(x, 2);
double res2 = round(100 * x) / 100;
printf("Rounded with snprintf: %.50f\n", res1);
printf("Rounded with round, then divided: %.50f\n", res2);
}
You'll see this output:
x: 0.01499999999999999944488848768742172978818416595459
Rounded with snprintf: 0.01000000000000000020816681711721685132943093776703
Rounded with round, then divided: 0.02000000000000000041633363423443370265886187553406
Note that the value we started with was less than 0.015, and so the mathematically correct answer when rounding it to 2 decimal places is 0.01. Of course, 0.01 is not exactly representable as a double, but we expect our result to be the double nearest to 0.01. Using snprintf gives us that result, but using round(100 * x) / 100 gives us 0.02, which is wrong. Why? Because 100 * x gives us exactly 1.5 as the result. Multiplying by 100 thus changes the correct direction to round in.
To illustrate the second kind of error - the result sometimes being wrong due to * 100 and / 100 not truly being inverses of each other - we can do a similar exercise with a very big number:
int main(void) {
double x = 8631192423766613.0;
printf("x: %.1f\n", x);
double res1 = dround(x, 2);
double res2 = round(100 * x) / 100;
printf("Rounded with snprintf: %.1f\n", res1);
printf("Rounded with round, then divided: %.1f\n", res2);
}
Our number now doesn't even have a fractional part; it's an integer value, just stored with type double. So the result after rounding it should be the same number we started with, right?
If you run the program above, you'll see:
x: 8631192423766613.0
Rounded with snprintf: 8631192423766613.0
Rounded with round, then divided: 8631192423766612.0
Oops. Our snprintf method returns the right result again, but the multiply-then-round-then-divide approach fails. That's because the mathematically correct value of 8631192423766613.0 * 100, 863119242376661300.0, is not exactly representable as a double; the closest value is 863119242376661248.0. When you divide that back by 100, you get 8631192423766612.0 - a different number to the one you started with.
Hopefully that's a sufficient demonstration that using roundf for rounding to a number of decimal places is broken, and that you should use snprintf instead. If that feels like a horrible hack to you, perhaps you'll be reassured by the knowledge that it's basically what CPython does.
Also, if you're using C++, you can just create a function like this:
string prd(const double x, const int decDigits) {
stringstream ss;
ss << fixed;
ss.precision(decDigits); // set # places after decimal
ss << x;
return ss.str();
}
You can then output any double myDouble with n places after the decimal point with code such as this:
std::cout << prd(myDouble,n);
There isn't a way to round a float to another float because the rounded float may not be representable (a limitation of floating-point numbers). For instance, say you round 37.777779 to 37.78, but the nearest representable number is 37.781.
However, you can "round" a float by using a format string function.
You can still use:
float ceilf(float x); // don't forget #include <math.h> and link with -lm.
example:
float valueToRound = 37.777779;
float roundedValue = ceilf(valueToRound * 100) / 100;
In C++ (or in C with C-style casts), you could create the function:
/* Function to control # of decimal places to be output for x */
double showDecimals(const double& x, const int& numDecimals) {
int y=x;
double z=x-y;
double m=pow(10,numDecimals);
double q=z*m;
double r=round(q);
return static_cast<double>(y)+(1.0/m)*r;
}
Then std::cout << showDecimals(37.777779,2); would produce: 37.78.
Obviously you don't really need to create all 5 variables in that function, but I leave them there so you can see the logic. There are probably simpler solutions, but this works well for me--especially since it allows me to adjust the number of digits after the decimal place as I need.
Use float roundf(float x).
"The round functions round their argument to the nearest integer value in floating-point format, rounding halfway cases away from zero, regardless of the current rounding direction." C11dr §7.12.9.5
#include <math.h>
float y = roundf(x * 100.0f) / 100.0f;
Depending on your float implementation, numbers that may appear to be half-way are not. as floating-point is typically base-2 oriented. Further, precisely rounding to the nearest 0.01 on all "half-way" cases is most challenging.
void r100(const char *s) {
float x, y;
sscanf(s, "%f", &x);
y = round(x*100.0)/100.0;
printf("%6s %.12e %.12e\n", s, x, y);
}
int main(void) {
r100("1.115");
r100("1.125");
r100("1.135");
return 0;
}
1.115 1.115000009537e+00 1.120000004768e+00
1.125 1.125000000000e+00 1.129999995232e+00
1.135 1.134999990463e+00 1.139999985695e+00
Although "1.115" is "half-way" between 1.11 and 1.12, when converted to float, the value is 1.115000009537... and is no longer "half-way", but closer to 1.12 and rounds to the closest float of 1.120000004768...
"1.125" is "half-way" between 1.12 and 1.13, when converted to float, the value is exactly 1.125 and is "half-way". It rounds toward 1.13 due to ties to even rule and rounds to the closest float of 1.129999995232...
Although "1.135" is "half-way" between 1.13 and 1.14, when converted to float, the value is 1.134999990463... and is no longer "half-way", but closer to 1.13 and rounds to the closest float of 1.129999995232...
If code used
y = roundf(x*100.0f)/100.0f;
Although "1.135" is "half-way" between 1.13 and 1.14, when converted to float, the value is 1.134999990463... and is no longer "half-way", but closer to 1.13 but incorrectly rounds to float of 1.139999985695... due to the more limited precision of float vs. double. This incorrect value may be viewed as correct, depending on coding goals.
Code definition :
#define roundz(x,d) ((floor(((x)*pow(10,d))+.5))/pow(10,d))
Results :
a = 8.000000
sqrt(a) = r = 2.828427
roundz(r,2) = 2.830000
roundz(r,3) = 2.828000
roundz(r,5) = 2.828430
double f_round(double dval, int n)
{
char l_fmtp[32], l_buf[64];
char *p_str;
sprintf (l_fmtp, "%%.%df", n);
if (dval>=0)
sprintf (l_buf, l_fmtp, dval);
else
sprintf (l_buf, l_fmtp, dval);
return ((double)strtod(l_buf, &p_str));
}
Here n is the number of decimals
example:
double d = 100.23456;
printf("%f", f_round(d, 4));// result: 100.2346
printf("%f", f_round(d, 2));// result: 100.23
I made this macro for rounding float numbers.
Add it in your header / being of file
#define ROUNDF(f, c) (((float)((int)((f) * (c))) / (c)))
Here is an example:
float x = ROUNDF(3.141592, 100)
x equals 3.14 :)
Let me first attempt to justify my reason for adding yet another answer to this question. In an ideal world, rounding is not really a big deal. However, in real systems, you may need to contend with several issues that can result in rounding that may not be what you expect. For example, you may be performing financial calculations where final results are rounded and displayed to users as 2 decimal places; these same values are stored with fixed precision in a database that may include more than 2 decimal places (for various reasons; there is no optimal number of places to keep...depends on specific situations each system must support, e.g. tiny items whose prices are fractions of a penny per unit); and, floating point computations performed on values where the results are plus/minus epsilon. I have been confronting these issues and evolving my own strategy over the years. I won't claim that I have faced every scenario or have the best answer, but below is an example of my approach so far that overcomes these issues:
Suppose 6 decimal places is regarded as sufficient precision for calculations on floats/doubles (an arbitrary decision for the specific application), using the following rounding function/method:
double Round(double x, int p)
{
if (x != 0.0) {
return ((floor((fabs(x)*pow(double(10.0),p))+0.5))/pow(double(10.0),p))*(x/fabs(x));
} else {
return 0.0;
}
}
Rounding to 2 decimal places for presentation of a result can be performed as:
double val;
// ...perform calculations on val
String(Round(Round(Round(val,8),6),2));
For val = 6.825, result is 6.83 as expected.
For val = 6.824999, result is 6.82. Here the assumption is that the calculation resulted in exactly 6.824999 and the 7th decimal place is zero.
For val = 6.8249999, result is 6.83. The 7th decimal place being 9 in this case causes the Round(val,6) function to give the expected result. For this case, there could be any number of trailing 9s.
For val = 6.824999499999, result is 6.83. Rounding to the 8th decimal place as a first step, i.e. Round(val,8), takes care of the one nasty case whereby a calculated floating point result calculates to 6.8249995, but is internally represented as 6.824999499999....
Finally, the example from the question...val = 37.777779 results in 37.78.
This approach could be further generalized as:
double val;
// ...perform calculations on val
String(Round(Round(Round(val,N+2),N),2));
where N is precision to be maintained for all intermediate calculations on floats/doubles. This works on negative values as well. I do not know if this approach is mathematically correct for all possibilities.
...or you can do it the old-fashioned way without any libraries:
float a = 37.777779;
int b = a; // b = 37
float c = a - b; // c = 0.777779
c *= 100; // c = 77.777863
int d = c; // d = 77;
a = b + d / (float)100; // a = 37.770000;
That of course if you want to remove the extra information from the number.
this function takes the number and precision and returns the rounded off number
float roundoff(float num,int precision)
{
int temp=(int )(num*pow(10,precision));
int num1=num*pow(10,precision+1);
temp*=10;
temp+=5;
if(num1>=temp)
num1+=10;
num1/=10;
num1*=10;
num=num1/pow(10,precision+1);
return num;
}
it converts the floating point number into int by left shifting the point and checking for the greater than five condition.

Numerical accuracy of pow(a/b,x) vs pow(b/a,-x)

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions, to see how floating-point behaves. This is not a 100% correct way to analyze such question, but it gives a general idea.
Let's generate random numbers. Calculate v0=pow(a/b, n) and v1=pow(b/a, -n) in float precision. And calculate ref=pow(a/b, n) in double precision, and round it to float. We use ref as a reference value (we suppose that double has much more precision than float, so we can trust that ref can be considered the best possible value. This is true for IEEE-754 for most of the time). Then sum the difference between v0-ref and v1-ref. The difference should calculated with "the number of floating point numbers between v and ref".
Note, that the results may be depend on the range of a, b and n (and on the random generator quality. If it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that the algorithm of float/double division/pow is the same kind, have the same characteristics.
For my computer, the summed differences are: 2604828 2603684, it means that there is no significant precision difference between the two.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <string.h>
long long int diff(float a, float b) {
unsigned int ai, bi;
memcpy(&ai, &a, 4);
memcpy(&bi, &b, 4);
long long int diff = (long long int)ai - bi;
if (diff<0) diff = -diff;
return diff;
}
int main() {
long long int e0 = 0;
long long int e1 = 0;
for (int i=0; i<10000000; i++) {
float a = 1.0f*rand()/RAND_MAX;
float b = 1.0f*rand()/RAND_MAX;
float n = 4.0f*rand()/RAND_MAX - 2.0f;
if (a==0||b==0) continue;
float v0 = std::pow(a/b, n);
float v1 = std::pow(b/a, -n);
float ref = std::pow((double)a/b, n);
e0 += diff(ref, v0);
e1 += diff(ref, v1);
}
printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more arcuate.
Consider z = xy = 2y * log2(x).
Roughly: The error in y * log2(x) is magnified by the value of z to form the error in z. xy is very sensitive to the error in x. The larger the |log2(x)|, the greater concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and same z and similar errors in z. It is a question of how x, y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact and it is that approach with the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b•(1+e0))x•(1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a•(1+e2))−x•(1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors, e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)x•(1+e0)x•(1+e1) and (a/b)x•(1+e2)−x•(1+e3). This reveals the primary difference is in (1+e0)x versus (1+e2)−x. The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)x and (1+e2)−x.The potential values of the first expression span [(1−u/2)x, (1+u/2)x], while the second spans [(1+u/2)−x, (1−u/2)−x]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)x−(1+u/2)x.
The length of the second is (1/(1−u/2))x−(1/(1+u/2))x.
Multiplying the latter by (1−u2/22)x produces ((1−u2/22)/(1−u/2))x−( (1−u2/22)/(1+u/2))x = (1+u/2)x−(1+u/2)x, which is the length of the first interval.
1−u2/22 < 1, so (1−u2/22)x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
For evaluation of rounding errors like in your case, it might be useful to use some multi-precision library, such as Boost.Multiprecision. Then, you can compare results for various precisions, e.g, such as with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
namespace mp = boost::multiprecision;
template <typename FLOAT>
void comp() {
FLOAT a = 8.72138221;
FLOAT b = 1.761329479;
FLOAT c = 1.51231;
FLOAT e = mp::pow(a / b, -c);
FLOAT f = mp::pow(b / a, c);
std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}
int main() {
std::cout << "Double: " << std::endl;
comp<mp::cpp_bin_float_double>();
td::cout << std::endl;
std::cout << "Double extended: " << std::endl;
comp<mp::cpp_bin_float_double_extended>();
std::cout << std::endl;
std::cout << "Quad: " << std::endl;
comp<mp::cpp_bin_float_quad>();
std::cout << std::endl;
std::cout << "Dec-100: " << std::endl;
comp<mp::cpp_dec_float_100>();
std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate, however, I guess one cannot make any generic conclusions here.
Also, note that your input numbers are not accurately representable with the IEEE 754 double precision floating-point type (none of them). The question is whether you care about the accuracy of calculations with either those exact numbers of their closest representations.

c++ float subtraction rounding error

I have a float value between 0 and 1. I need to convert it with -120 to 80.
To do this, first I multiply with 200 after 120 subtract.
When subtract is made I had rounding error.
Let's look my example.
float val = 0.6050f;
val *= 200.f;
Now val is 121.0 as I expected.
val -= 120.0f;
Now val is 0.99999992
I thought maybe I can avoid this problem with multiplication and division.
float val = 0.6050f;
val *= 200.f;
val *= 100.f;
val -= 12000.0f;
val /= 100.f;
But it didn't help. I have still 0.99 on my hand.
Is there a solution for it?
Edit: After with detailed logging, I understand there is no problem with this part of code. Before my log shows me "0.605", after I had detailed log and I saw "0.60499995946884155273437500000000000000000000000000"
the problem is in different place.
Edit2: I think I found the guilty. The initialised value is 0.5750.
std::string floatToStr(double d)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(15) << d;
return ss.str();
}
int main()
{
float val88 = 0.57500000000f;
std::cout << floatToStr(val88) << std::endl;
}
The result is 0.574999988079071
Actually I need to add and sub 0.0025 from this value every time.
Normally I expected 0.575, 0.5775, 0.5800, 0.5825 ....
Edit3: Actually I tried all of them with double. And it is working for my example.
std::string doubleToStr(double d)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(15) << d;
return ss.str();
}
int main()
{
double val88 = 0.575;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
return 0;
}
The results are:
0.575000000000000
0.577500000000000
0.580000000000000
0.582500000000000
But I bound to float unfortunately. I need to change lots of things.
Thank you for all to help.
Edit4: I have found my solution with strings. I use ostringstream's rounding and convert to double after that. I can have 4 precision right numbers.
std::string doubleToStr(double d, int precision)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(precision) << d;
return ss.str();
}
double val945 = (double)0.575f;
std::cout << doubleToStr(val945, 4) << std::endl;
std::cout << doubleToStr(val945, 15) << std::endl;
std::cout << atof(doubleToStr(val945, 4).c_str()) << std::endl;
and results are:
0.5750
0.574999988079071
0.575
Let us assume that your compiler implements IEEE 754 binary32 and binary64 exactly for float and double values and operations.
First, you must understand that 0.6050f does not represent the mathematical quantity 6050 / 10000. It is exactly 0.605000019073486328125, the nearest float to that. Even if you write perfect computations from there, you have to remember that these computations start from 0.605000019073486328125 and not from 0.6050.
Second, you can solve nearly all your accumulated roundoff problems by computing with double and converting to float only in the end:
$ cat t.c
#include <stdio.h>
int main(){
printf("0.6050f is %.53f\n", 0.6050f);
printf("%.53f\n", (float)((double)0.605f * 200. - 120.));
}
$ gcc t.c && ./a.out
0.6050f is 0.60500001907348632812500000000000000000000000000000000
1.00000381469726562500000000000000000000000000000000000
In the above code, all computations and intermediate values are double-precision.
This 1.0000038… is a very good answer if you remember that you started with 0.605000019073486328125 and not 0.6050 (which doesn't exist as a float).
If you really care about the difference between 0.99999992 and 1.0, float is not precise enough for your application. You need to at least change to double.
If you need an answer in a specific range, and you are getting answers slightly outside that range but within rounding error of one of the ends, replace the answer with the appropriate range end.
The point everybody is making can be summarised: in general, floating point is precise but not exact.
How precise is governed by the number of bits in the mantissa -- which is 24 for float, and 53 for double (assuming IEEE 754 binary formats, which is pretty safe these days ! [1]).
If you are looking for an exact result, you have to be ready to deal with values that differ (ever so slightly) from that exact result, but...
(1) The Exact Binary Fraction Problem
...the first issue is whether the exact value you are looking for can be represented exactly in binary floating point form...
...and that is rare -- which is often a disappointing surprise.
The binary floating point representation of a given value can be exact, but only under the following, restricted circumstances:
the value is an integer, < 2^24 (float) or < 2^53 (double).
this is the simplest case, and perhaps obvious. Since you are looking a result >= -120 and <= 80, this is sufficient.
or:
the value is an integer which divides exactly by 2^n and is then (as above) < 2^24 or < 2^53.
this includes the first rule, but is more general.
or:
the value has a fractional part, but when the value is multiplied by the smallest 2^n necessary to produce an integer, that integer is < 2^24 (float) or 2^53 (double).
This is the part which may come as a surprise.
Consider 27.01, which is a simple enough decimal value, and clearly well within the ~7 decimal digit precision of a float. Unfortunately, it does not have an exact binary floating point form -- you can multiply 27.01 by any 2^n you like, for example:
27.01 * (2^ 6) = 1728.64 (multiply by 64)
27.01 * (2^ 7) = 3457.28 (multiply by 128)
...
27.01 * (2^10) = 27658.24
...
27.01 * (2^20) = 28322037.76
...
27.01 * (2^25) = 906305208.32 (> 2^24 !)
and you never get an integer, let alone one < 2^24 or < 2^53.
Actually, all these rules boil down to one rule... if you can find an 'n' (positive or negative, integer) such that y = value * (2^n), and where y is an exact, odd integer, then value has an exact representation if y < 2^24 (float) or if y < 2^53 (double) -- assuming no under- or over-flow, which is another story.
This looks complicated, but the rule of thumb is simply: "very few decimal fractions can be represented exactly as binary fractions".
To illustrate how few, let us consider all the 4 digit decimal fractions, of which there are 10000, that is 0.0000 up to 0.9999 -- including the trivial, integer case 0.0000. We can enumerate how many of those have exact binary equivalents:
1: 0.0000 = 0/16 or 0/1
2: 0.0625 = 1/16
3: 0.1250 = 2/16 or 1/8
4: 0.1875 = 3/16
5: 0.2500 = 4/16 or 1/4
6: 0.3125 = 5/16
7: 0.3750 = 6/16 or 3/8
8: 0.4375 = 7/16
9: 0.5000 = 8/16 or 1/2
10: 0.5625 = 9/16
11: 0.6250 = 10/16 or 5/8
12: 0.6875 = 11/16
13: 0.7500 = 12/16 or 3/4
14: 0.8125 = 13/16
15: 0.8750 = 14/16 or 7/8
16: 0.9375 = 15/16
That's it ! Just 16/10000 possible 4 digit decimal fractions (including the trivial 0 case) have exact binary fraction equivalents, at any precision. All the other 9984/10000 possible decimal fractions give rise to recurring binary fractions. So, for 'n' digit decimal fractions only (2^n) / (10^n) can be represented exactly -- that's 1/(5^n) !!
This is, of course, because your decimal fraction is actually the rational x / (10^n)[2] and your binary fraction is y / (2^m) (for integer x, y, n and m), and for a given binary fraction to be exactly equal to a decimal fraction we must have:
y = (x / (10^n)) * (2^m)
= (x / ( 5^n)) * (2^(m-n))
which is only the case when x is an exact multiple of (5^n) -- for otherwise y is not an integer. (Noting that n <= m, assuming that x has no (spurious) trailing zeros, and hence n is as small as possible.)
(2) The Rounding Problem
The result of a floating point operation may need to be rounded to the precision of the destination variable. IEEE 754 requires that the operation is done as if there were no limit to the precision, and the ("true") result is then rounded to the nearest value at the precision of the destination. So, the final result is as precise as it can be... given the limitations on how precise the arguments are, and how precise the destination is... but not exact !
(With floats and doubles, 'C' may promote float arguments to double (or long double) before performing an operation, and the result of that will be rounded to double. The final result of an expression may then be a double (or long double), which is then rounded (again) if it is to be stored in a float variable. All of this adds to the fun ! See FLT_EVAL_METHOD for what your system does -- noting the default for a floating point constant is double.)
So, the other rules to remember are:
floating point values are not reals (they are, in fact, rationals with a limited denominator).
The precision of a floating point value may be large, but there are lots of real numbers that cannot be represented exactly !
floating point expressions are not algebra.
For example, converting from degrees to radians requires division by π. Any arithmetic with π has a problem ('cos it's irrational), and with floating point the value for π is rounded to whatever floating precision we are using. So, the conversion of (say) 27 (which is exact) degrees to radians involves division by 180 (which is exact) and multiplication by our "π". However exact the arguments, the division and the multiplication may round, so the result is may only approximate. Taking:
float pi = 3.14159265358979 ; /* plenty for float */
float x = 27.0 ;
float y = (x / 180.0) * pi ;
float z = (y / pi) * 180.0 ;
printf("z-x = %+6.3e\n", z-x) ;
my (pretty ordinary) machine gave: "z-x = +1.907e-06"... so, for our floating point:
x != (((x / 180.0) * pi) / pi) * 180 ;
at least, not for all x. In the case shown, the relative difference is small -- ~ 1.2 / (2^24) -- but not zero, which simple algebra might lead us to expect.
hence: floating point equality is a slippery notion.
For all the reasons above, the test x == y for two floating values is problematic. Depending on how x and y have been calculated, if you expect the two to be exactly the same, you may very well be sadly disappointed.
[1] There exists a standard for decimal floating point, but generally binary floating point is what people use.
[2] For any decimal fraction you can write down with a finite number of digits !
Even with double precision, you'll run into issues such as:
200. * .60499999999999992 = 120.99999999999997
It appears that you want some type of rounding so that 0.99999992 is rounded to 1.00000000 .
If the goal is to produce values to the nearest multiple of 1/1000, try:
#include <math.h>
val = (float) floor((200000.0f*val)-119999.5f)/1000.0f;
If the goal is to produce values to the nearest multiple of 1/200, try:
val = (float) floor((40000.0f*val)-23999.5f)/200.0f;
If the goal is to produce values to the nearest integer, try:
val = (float) floor((200.0f*val)-119.5f);

How do I do floating point rounding with a bias (always round up or down)?

I want to round floats with a bias, either always down or always up. There is a specific point in the code where I need this, the rest of the program should round to the nearest value as usual.
For example, I want to round to the nearest multiple of 1/10. The closest floating point number to 7/10 is approximately 0.69999998807, but the closest number to 8/10 is approximately 0.80000001192. When I round off numbers, these are the two results I get. I'd rather get them rounded the same way. 7/10 should round to 0.70000004768 and 8/10 should round to 0.80000001192.
In this example I am always rounding up, but I have some places where I want to always round down. Fortunately, I am only dealing with positive values in each of these places.
The line I am using to round is floor(val * 100 + 0.5) / 100. I am programming in C++.
I think the best way to achieve this is to rely on the fact that according to the IEEE 754 floating point standard, the integer representation of floating point bits are lexicographically ordered as a 2-complement integer.
I.e. you could simply add one ulp (units in the last place) to get the next floating point representation (which will always be slightly larger than your treshold if it was smaller, since the round error is at most 1/2 ulp)
e.g.
float floatValue = 7.f/10;
std::cout << std::setprecision(20) << floatValue << std::endl;
int asInt = *(int*)&floatValue;
asInt += 1;
floatValue = *(float*)&asInt;
std::cout << floatValue << std::endl;
prints (on my system)
0.69999998807907104492
0.70000004768371582031
To know when you need to add one ulp, you'll have to rely on the difference of floor and a rounded floor
if (std::floor(floatValue * 100.) != std::floor(floatValue * 100. + 0.5)) {
int asInt = *(int*)&floatValue;
asInt += 1;
floatValue = *(float*)&asInt;
}
Would correctly convert 0.69.. to 0.70.. but leave 0.80.. alone.
Note that the float gets promoted to a double via the multiplication with 100. before the floor is applied.
If you don't do this you risk getting in the situation that for
7.f/10.f * 100.f
The (limited in precision) float representation would be 70.00...