How to calculate 32-bit floating-point epsilon? - c++

In the book Game Engine Architecture: "..., let’s say we use a floating-point variable to track absolute game time in seconds. How long can we run our game before the magnitude of our clock variable gets so large that adding 1/30th of a second to it no longer changes its value? The answer is roughly 12.9 days."
Why 12.9 days, and how do you calculate it?

When the result of a floating point computation can't be represented exactly, it is rounded to the nearest value. So you want to find the smallest value x such that the increment f = 1/30 is less than half the width h between x and the next largest float, which means that x+f will round back to x.
Since the gap is the same for all elements in the same binade, we know that x must be the smallest element in its binade, which is a power of 2.
So if x = 2^k, then h = 2^(k-23), since a float has a 24-bit significand. So we need to find the smallest integer k such that
2^(k-23)/2 > 1/30
which implies k > 19.09, hence k = 20, and x = 2^20 = 1048576 (seconds).
Note that x / (60 × 60 × 24) = 12.14 (days), which is a little bit less than what your answer proposes, but checks out empirically: in Julia
julia> x = 2f0^20
1.048576f6
julia> f = 1f0/30f0
0.033333335f0
julia> x+f == x
true
julia> p = prevfloat(x)
1.04857594f6
julia> p+f == p
false
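For reference, here is a minimal C++ check of the same boundary (a sketch assuming IEEE-754 float, using std::nextafter to get the float just below 2^20):
#include <cmath>
#include <iostream>
int main() {
    const float f = 1.0f / 30.0f;
    const float x = 1048576.0f;                // 2^20, the smallest float in its binade
    const float p = std::nextafter(x, 0.0f);   // the float just below 2^20
    std::cout << std::boolalpha
              << (x + f == x) << '\n'    // true:  1/30 is absorbed at 2^20
              << (p + f == p) << '\n';   // false: just below 2^20 it still changes
}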
UPDATE: Okay, so where did the 12.9 come from? The 12.14 is in game time, not actual time: these will have diverged due to the rounding error involved in floating point (especially near the end, when the rounding error is actually quite large relative to f). As far as I know, there's no way to calculate this directly, but it's actually fairly quick to iterate through 32-bit floats.
Again, in Julia:
julia> function timestuff(f)
           t = 0
           x = 0f0
           while true
               t += 1
               xp = x
               x += f
               if x == xp
                   return (t,x)
               end
           end
       end
timestuff (generic function with 1 method)
julia> t,x = timestuff(1f0/30f0)
(24986956,1.048576f6)
x matches the result we calculated earlier, and t is the clock time in 30ths of a second. Converting to days:
julia> t/(30*60*60*24)
9.640029320987654
which is even further away. So I don't know where the 12.9 came from...
UPDATE 2: My guess is that the 12.9 comes from the calculation
y = 4 × f / ε = 1118481.125 (seconds)
where ε is the standard machine epsilon (the gap between 1 and the next largest floating point number). Scaling this to days gives 12.945. This provides an upper bound on x, but it is not the correct answer as explained above.
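A quick check of that guess, as a sketch (assuming IEEE-754 float, where FLT_EPSILON is 2^-23, and using the float value of 1/30 as above):
#include <cfloat>
#include <iomanip>
#include <iostream>
int main() {
    const float f = 1.0f / 30.0f;                    // the float nearest 1/30
    const double y = 4.0 * (double)f / FLT_EPSILON;  // upper bound, in seconds
    std::cout << std::setprecision(10)
              << y << " seconds = "
              << y / (60.0 * 60.0 * 24.0) << " days\n";   // 1118481.125 s ~ 12.945 days
}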

#include <iostream>
#include <iomanip>
#include <cstdint>   // int32_t
#include <cstdlib>   // EXIT_SUCCESS
/*
https://en.wikipedia.org/wiki/Machine_epsilon#How_to_determine_machine_epsilon
*/
// Note: type punning through a union is technically undefined behaviour in
// C++ (though widely supported); it is used here to step to the next
// representable float.
typedef union
{
    int32_t i32;
    float   f32;
} fi32_t;

// Distance from nbr to the next representable float above it.
float float_epsilon(float nbr)
{
    fi32_t flt;
    flt.f32 = nbr;
    flt.i32++;
    return (flt.f32 - nbr);
}

int main()
{
    // How to calculate 32-bit floating-point epsilon?
    const float one {1.}, ten_mills {10e6};
    std::cout << "epsilon for number " << one << " is:\n"
              << std::fixed << std::setprecision(25)
              << float_epsilon(one)
              << std::defaultfloat << "\n\n";

    std::cout << "epsilon for number " << ten_mills << " is:\n"
              << std::fixed << std::setprecision(25)
              << float_epsilon(ten_mills)
              << std::defaultfloat << "\n\n";

    // In book Game Engine Architecture : "..., let’s say we use a
    // floating-point variable to track absolute game time in seconds.
    // How long can we run our game before the magnitude of our clock
    // variable gets so large that adding 1/30th of a second to it no
    // longer changes its value? The answer is roughly 12.9 days."
    // Why 12.9 days, how to calculate it ?
    const float one_30th {1.f/30}, day_sec {60*60*24};
    float time_sec {}, time_sec_old {};
    while ((time_sec += one_30th) > time_sec_old)
    {
        time_sec_old = time_sec;
    }
    std::cout << "We can run our game for "
              << std::fixed << std::setprecision(5)
              << (time_sec / day_sec)
              << std::defaultfloat << " days.\n";
    return EXIT_SUCCESS;
}
This outputs
epsilon for number 1 is:
0.0000001192092895507812500
epsilon for number 10000000 is:
1.0000000000000000000000000
We can run our game for 12.13630 days.

This is due to zones of expressibility in floating point representation.
Check out this lecture from my uni.
As the exponent gets larger, the jump on the real number line between the values actually represented increases; when the exponent is low, the density of representation is high. To give an example, imagine decimal numbers with a fixed number of significand digits: 1.0001e1 and 1.0002e1 differ by 0.001, but if the exponent increases to 10, then 1.0001e10 and 1.0002e10 differ by 1,000,000. Obviously the gap grows with the exponent. In the case you talk about, the jump eventually becomes so large that adding 1/30th of a second no longer changes even the least significant bit.
Interestingly, towards the limits of the representation, the absolute accuracy of a larger float type is worse! That is simply because a one-bit step in the mantissa jumps much further along the number line when more bits are given to the exponent, as is the case for double over float.
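To make the spacing effect concrete, here is a small sketch (assuming IEEE-754 float and double; gap_above/gap_below are helper names invented here) that prints the gap to the neighbouring representable value at a few magnitudes:
#include <cmath>
#include <iostream>
#include <limits>

// Gap between x and the next representable value of the same type.
template <typename T>
T gap_above(T x) { return std::nextafter(x, std::numeric_limits<T>::infinity()) - x; }

// Gap between x and the previous representable value of the same type.
template <typename T>
T gap_below(T x) { return x - std::nextafter(x, T(0)); }

int main() {
    std::cout << "float gap above 1.0f     : " << gap_above(1.0f) << '\n'             // ~1.19e-07
              << "float gap above 2^20     : " << gap_above(1048576.0f) << '\n'       // 0.125
              << "float gap below FLT_MAX  : " << gap_below(std::numeric_limits<float>::max()) << '\n'    // ~2e31
              << "double gap below DBL_MAX : " << gap_below(std::numeric_limits<double>::max()) << '\n';  // ~2e292
}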

Numerical accuracy of pow(a/b,x) vs pow(b/a,-x)

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions, to see how floating-point behaves. This is not a 100% correct way to analyze such question, but it gives a general idea.
Let's generate random numbers. Calculate v0 = pow(a/b, n) and v1 = pow(b/a, -n) in float precision, and calculate ref = pow(a/b, n) in double precision, rounded to float. We use ref as a reference value (we suppose that double has much more precision than float, so we can trust that ref is the best possible value; this is true for IEEE-754 most of the time). Then sum the differences between v0 and ref, and between v1 and ref. The difference should be measured as "the number of floating point numbers between v and ref".
Note that the results may depend on the range of a, b and n (and on the quality of the random generator: if it's really bad, it may give a biased result). Here, I've used a = [0..1], b = [0..1] and n = [-2..2]. Furthermore, this answer supposes that the float and double division/pow algorithms are of the same kind and have the same characteristics.
For my computer, the summed differences are: 2604828 2603684, it means that there is no significant precision difference between the two.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdlib.h>   // rand, RAND_MAX
#include <stdio.h>
#include <string.h>

// Distance between a and b measured as "number of floats between them",
// using their bit patterns (assumes IEEE-754 binary32, same-sign values).
long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff < 0) diff = -diff;
    return diff;
}

int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i = 0; i < 10000000; i++) {
        float a = 1.0f * rand() / RAND_MAX;
        float b = 1.0f * rand() / RAND_MAX;
        float n = 4.0f * rand() / RAND_MAX - 2.0f;
        if (a == 0 || b == 0) continue;
        float v0 = std::pow(a / b, n);
        float v1 = std::pow(b / a, -n);
        float ref = std::pow((double)a / b, n);   // double-precision reference, rounded to float
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y * log2(x)).
Roughly: the error in y * log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger the |log2(x)|, the greater the concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and same z and similar errors in z. It is a question of how x, y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact and it is that approach with the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
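A small sketch of that last point (the exponent p here is just illustrative, and it assumes long double is wider than double, as with gcc on x86_64): because 4 is a power of two, 7777777.0/4.0 is exact, while 4.0/7777777.0 must round before pow ever sees it.
#include <cmath>
#include <cstdio>
int main() {
    const double p = 1.51231;                     // arbitrary exponent for illustration
    double v0 = std::pow(7777777.0 / 4.0, -p);    // exact quotient
    double v1 = std::pow(4.0 / 7777777.0, p);     // rounded quotient
    long double ref = std::pow(7777777.0L / 4.0L, -(long double)p);
    std::printf("v0 = %.20g  error %.3Lg\n", v0, (long double)v0 - ref);  // typically the smaller error
    std::printf("v1 = %.20g  error %.3Lg\n", v1, (long double)v1 - ref);
}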
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b · (1+e0))^x · (1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a · (1+e2))^-x · (1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors e0…e3 lies in the interval [-u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be -u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x · (1+e0)^x · (1+e1) and (a/b)^x · (1+e2)^-x · (1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^-x. The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^-x. The potential values of the first expression span [(1-u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^-x, (1-u/2)^-x]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x - (1-u/2)^x.
The length of the second is (1/(1-u/2))^x - (1/(1+u/2))^x.
Multiplying the latter by (1-u^2/4)^x produces ((1-u^2/4)/(1-u/2))^x - ((1-u^2/4)/(1+u/2))^x = (1+u/2)^x - (1-u/2)^x, which is the length of the first interval.
1 - u^2/4 < 1, so (1-u^2/4)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
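As a rough numeric illustration of those two intervals (not part of the analysis above): the sketch below uses a deliberately exaggerated u so the difference is visible in double arithmetic; the real float u is 2^-23.
#include <cmath>
#include <cstdio>
int main() {
    const double u = 0.01;   // exaggerated "ULP" for visibility only
    const double x = 3.0;    // some positive exponent
    double len_pos = std::pow(1 + u/2, x) - std::pow(1 - u/2, x);          // span of (1+e0)^x
    double len_neg = std::pow(1/(1 - u/2), x) - std::pow(1/(1 + u/2), x);  // span of (1+e2)^-x
    std::printf("positive-exponent span: %.12f\n", len_pos);   // the shorter one
    std::printf("negative-exponent span: %.12f\n", len_neg);
}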
For evaluation of rounding errors like in your case, it might be useful to use some multi-precision library, such as Boost.Multiprecision. Then, you can compare results for various precisions, e.g., with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>

namespace mp = boost::multiprecision;

template <typename FLOAT>
void comp() {
    FLOAT a = 8.72138221;
    FLOAT b = 1.761329479;
    FLOAT c = 1.51231;
    FLOAT e = mp::pow(a / b, -c);
    FLOAT f = mp::pow(b / a, c);
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}

int main() {
    std::cout << "Double: " << std::endl;
    comp<mp::cpp_bin_float_double>();
    std::cout << std::endl;

    std::cout << "Double extended: " << std::endl;
    comp<mp::cpp_bin_float_double_extended>();
    std::cout << std::endl;

    std::cout << "Quad: " << std::endl;
    comp<mp::cpp_bin_float_quad>();
    std::cout << std::endl;

    std::cout << "Dec-100: " << std::endl;
    comp<mp::cpp_dec_float_100>();
    std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate; however, I guess one cannot make any generic conclusions here.
Also, note that your input numbers are not accurately representable with the IEEE 754 double-precision floating-point type (none of them). The question is whether you care about the accuracy of calculations with either those exact numbers or their closest representations.

How many floats can be added together before floating point precision becomes an issue

I am currently recording some frame times in MS instead of ticks. I know this can be an issue as we are adding all the frame times (in MS) together and then dividing by the number of frames. This could cause bad results due to floating point precision.
It would make more sense to add all the tick counts together then convert to MS once at the end.
However, I am wondering what the actual difference would be for a small number of samples? I expect to have between 900-1800 samples. Would this be an issue at all?
I have made this small example and run it on GCC 4.9.2:
// Example program
#include <iostream>
#include <string>
#include <cstdlib>   // rand, RAND_MAX

int main()
{
    float total = 0.0f;
    double total2 = 0.0;
    for (int i = 0; i < 1000000; ++i)
    {
        float r = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        total += r;
        total2 += r;
    }
    std::cout << "Total: " << total << std::endl;
    std::cout << "Total2: " << total2 << std::endl;
}
Result:
Total: 500004
Total2: 500007
So as far as I can tell with 1 million values we do not lose a lot of precision. Though I am not sure if what I have written is a reasonable test or actually testing what I want to test.
So my question is, how many floats can I add together before precision becomes an issue? I expect my values to be between 1 and 60 MS. I would like the end precision to be within 1 millisecond. I have 900-1800 values.
Example Value: 15.1345f for 15 milliseconds.
Counterexample
Using the assumptions below about the statement of the problem (times are effectively given as values such as .06 for 60 milliseconds), if we convert .06 to float and add it 1800 times, the computed result is 107.99884796142578125. This differs from the mathematical result, 108.000, by more than .001. Therefore, the computed result will sometimes differ from the mathematical result by more than 1 millisecond, so the goal desired in the question is not achievable in these conditions. (Further refinement of the problem statement and alternate means of computation may be able to achieve the goal.)
Original Analysis
Suppose we have 1800 integer values in [1, 60] that are converted to float using float y = x / 1000.f;, where all operations are implemented using IEEE-754 basic 32-bit binary floating-point with correct rounding.
The conversions of 1 to 60 to float are exact. The division by 1000 has an error of at most ½ ULP(.06), which is ½ • 2^-5 • 2^-23 = 2^-29. 1800 such errors amount to at most 1800 • 2^-29.
As the resulting float values are added, there may be an error of at most ½ ULP in each addition, where the ULP is that of the current result. For a loose analysis, we can bound this with the ULP of the final result, which is at most around 1800 • .06 = 108, which has an ULP of 2^6 • 2^-23 = 2^-17. So each of the 1799 additions has an error of at most ½ • 2^-17 = 2^-18, and the total error in the additions is at most 1799 • 2^-18.
Thus, the total error during divisions and additions is at most 1800 • 2^-29 + 1799 • 2^-18, which is about .006866.
That is a problem. I expect a better analysis of the errors in the additions would halve the error bound, as it is an arithmetic progression from 0 to the total, but that still leaves a potential error above .003, which means there is a possibility the sum could be off by several milliseconds.
Note that if the times are added as integers, the largest potential sum is 1800•60 = 108,000, which is well below the first integer not representable in float (16,777,217). Addition of these integers in float would be error-free.
This bound of .003 is small enough that some additional constraints on the problem and some additional analysis might, just might, push it below .0005, in which case the computed result will always be close enough to the correct mathematical result that rounding the computed result to the nearest millisecond would produce the correct answer.
For example, if it were known that, while the times range from 1 to 60 milliseconds, the total is always less than 7.8 seconds, that could suffice.
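A sketch contrasting the two accumulation strategies discussed above (assuming IEEE-754 float): integer millisecond counts stay exact in float because every partial sum is an integer below 2^24, while the fractional-second form drifts past the 1 ms goal within 1800 samples.
#include <cstdio>
int main() {
    float ms_sum = 0.0f, sec_sum = 0.0f;
    for (int i = 0; i < 1800; ++i) {
        ms_sum  += 60.0f;    // integer milliseconds: every addition is exact
        sec_sum += 0.06f;    // float nearest 0.06 s: rounding error accumulates
    }
    std::printf("milliseconds: %.17g (exact: 108000)\n", ms_sum);
    std::printf("seconds     : %.17g (exact: 108)\n", sec_sum);   // ~107.9988...
}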
As much as possible, reduce the errors caused by floating point calculations
Since you've already described measuring your individual timings in milliseconds, it's far better if you accumulate those timings using integer values before you finally divide them:
std::chrono::milliseconds duration{};
for (Timing const& timing : timings) {
    // Lossless integer accumulation, in a scenario where overflow is extremely
    // unlikely or possibly even impossible for your problem domain
    duration += std::chrono::milliseconds(timing.getTicks());
}
// Only one floating-point calculation performed, error is minimal
float averageTiming = duration.count() / float(timings.size());
The Errors that accumulate are highly particular to the scenario
Consider these two ways of accumulating values:
#include <iostream>

int main() {
    // Make them volatile to prevent compilers from optimizing away the additions
    volatile float sum1 = 0, sum2 = 0;
    for (float i = 0.0001; i < 1000; i += 0.0001) {
        sum1 += i;
    }
    for (float i = 1000; i > 0; i -= 0.0001) {
        sum2 += i;
    }
    std::cout << "Sum1: " << sum1 << std::endl;
    std::cout << "Sum2: " << sum2 << std::endl;
    std::cout << "% Difference: " << (sum2 - sum1) / (sum1 > sum2 ? sum1 : sum2) * 100 << "%" << std::endl;
    return 0;
}
Results may vary on some machines (particularly machines that don't have IEEE754 floats), but in my tests, the second value was 3% different than the first value, a difference of 13 million. That can be pretty significant.
Like before, the best option is to minimize the number of calculations performed using floating point values until the last possible step before you need them as floating point values. That will minimize accuracy losses.
Just for what it's worth, here's some code to demonstrate that yes, after 1800 items, a simple accumulation can be incorrect by more than 1 millisecond, but Kahan summation maintains the required level of accuracy.
#include <iostream>
#include <iterator>
#include <iomanip>
#include <vector>
#include <numeric>
#include <algorithm>   // std::fill_n

// Kahan (compensated) summation: carry the running rounding error along
// and subtract it back out of the next addend.
template <class InIt>
typename std::iterator_traits<InIt>::value_type accumulate(InIt begin, InIt end)
{
    typedef typename std::iterator_traits<InIt>::value_type real;
    real sum = real();
    real running_error = real();
    for (; begin != end; ++begin)
    {
        real difference = *begin - running_error;
        real temp = sum + difference;
        running_error = (temp - sum) - difference;
        sum = temp;
    }
    return sum;
}

int main()
{
    const float addend = 0.06f;
    const float count = 1800.0f;

    std::vector<float> d;
    std::fill_n(std::back_inserter(d), count, addend);

    float result = std::accumulate(d.begin(), d.end(), 0.0f);
    float result2 = accumulate(d.begin(), d.end());
    float reference = count * addend;

    std::cout << "   simple: " << std::setprecision(20) << result << "\n";
    std::cout << "    Kahan: " << std::setprecision(20) << result2 << "\n";
    std::cout << "Reference: " << std::setprecision(20) << reference << "\n";
}
For this particular test, it appears that double precision is sufficient, at least for the input values I tried--but to be honest, I'm still a bit leery of it, especially when exhaustive testing isn't reasonable, and better techniques are easily available.

Is it correct to state that the first number that collide in single precision is 131072.02? (positive, considering 2 digits as mantissa)

I was trying to figure out for my audio application whether float can represent correctly the range of parameters I'll use.
The "biggest" mask it needs is for frequency params, which are positive and allow at most two digits after the decimal point (i.e. from 20.00 Hz to 22000.00 Hz). Conceptually, the following digits will be rounded away, so I don't care about them.
So I made this script to check the first number that collides in single precision:
#include <iostream>

int main() {
    float temp = 0.0;
    double valueDouble = 0.0;
    double increment = 1e-2;
    bool found = false;

    while (!found) {
        double oldValue = valueDouble;
        valueDouble += increment;
        float value = valueDouble;

        // found
        if (temp == value) {
            std::cout << "collision found: " << valueDouble << std::endl;
            std::cout << "   collide with: " << oldValue << std::endl;
            std::cout << "float stored as: " << value << std::endl;
            found = true;
        }
        temp = value;
    }
}
and it seems it's 131072.02 (with 131072.01, both stored as the same 131072.015625 value), which is far beyond 22000.00. So it seems I would be OK using float.
But I'd like to understand whether that reasoning is correct. Is it?
The whole problem would be if I set a param of XXXXX.YY (7 digits) and it collided with some other number having fewer digits (because single precision only guarantees 6 digits).
Note: of course numbers such as 1024.0002998145910169114358723163604736328125 or 1024.000199814591042013489641249179840087890625 collide, and they are within the interval, but they collide at more significant digits than my required precision, so I don't care.
IEEE 754 Single precision is defined as
1 sign bit
8 exponent bits: range 2^-126 to 2^127 (~10^-38 to 10^38)
23 fraction (mantissa) bits: decimal precision depends on the exponent
At 22k the exponent will represent an offset of 16384=2^14, so the 23-bit mantissa will give you a precision of 2^14/2^23= 1/2^9 = 0.001953125... which is sufficient for your case.
For 131072.01, the exponent will represent an offset of 131072 = 2^17, so the mantissa will give a precision of 2^17/2^23 = 1/2^6 = 0.015625, which is larger than your target precision of 0.01.
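You can check those gap sizes directly with std::nextafter (a quick sketch, assuming IEEE-754 float):
#include <cmath>
#include <cstdio>
#include <limits>
int main() {
    const float inf = std::numeric_limits<float>::infinity();
    float a = 22000.0f;
    float b = 131072.02f;
    std::printf("gap at %.2f: %.9f\n", a, std::nextafter(a, inf) - a);   // 0.001953125
    std::printf("gap at %.2f: %.9f\n", b, std::nextafter(b, inf) - b);   // 0.015625
}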
Your program does not verify exactly what you want, but your underlying reasoning should be ok.
The problem with the program is that valueDouble will accumulate slight errors (since 0.01 isn't represented accurately) - and converting the string "20.01" to a floating point number will introduce slight round-off errors.
But those errors should be on the order of DBL_EPSILON and be much smaller than the error you see.
If you really wanted to test it you would have to write "20.00" to "22000.00" and scan them all using the scanf-variant you plan to use and verify that they differ.
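If you did want that exhaustive check, a brute-force sketch could look like this (it formats every value from "20.00" to "22000.00" and parses it with strtof; substitute whichever scan routine you actually plan to use):
#include <cstdio>
#include <cstdlib>
int main() {
    float previous = 0.0f;
    for (long i = 2000; i <= 2200000; ++i) {     // 20.00 .. 22000.00 in 0.01 steps
        char buf[32];
        std::snprintf(buf, sizeof buf, "%ld.%02ld", i / 100, i % 100);
        float value = std::strtof(buf, nullptr);
        if (value == previous)
            std::printf("collision at %s\n", buf);
        previous = value;
    }
    std::puts("done");   // expected: no collisions in this range
}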
Is it correct to state that the first number that collide in single precision is 131072.02? (positive, considering 2 digits as mantissa after the decimal point)
Yes.
I'd like to understand if that reasoning is correct. It is?
For values just less than 131072.0f, each successive representable float value is 1/128th apart.
For values in the range [131072.0f ... 2*131072.0f), each successive representable float value is 1/64th apart.
With values of the decimal textual form "131072.xx", there are 100 combinations, yet only 64 different float values. It is not surprising that 100-64, or 36, collisions occur - see below. For numbers of this form, this is the first place where the density of float is too sparse: the least significant bit of a float in this range is greater than 0.01.
#include <math.h>    // nextafterf
#include <stdio.h>

int main(void) {
    volatile float previous = 0.0;
    for (long i = 1; i <= 99999999; i++) {
        volatile float f1 = i / 100.0;
        if (previous == f1) {
            volatile float f0 = nextafterf(f1, 0);
            volatile float f2 = nextafterf(f1, f1 * 2);
            printf("%f %f %f delta fraction:%f\n", f0, f1, f2, 1.0 / (f1 - f0));
            static int count = 100 - 64;
            if (--count == 0) return 0;
        }
        previous = f1;
    }
    printf("Done\n");
}
Output
131072.000000 131072.015625 131072.031250 delta fraction:64.000000
131072.031250 131072.046875 131072.062500 delta fraction:64.000000
131072.046875 131072.062500 131072.078125 delta fraction:64.000000
...
131072.921875 131072.937500 131072.953125 delta fraction:64.000000
131072.937500 131072.953125 131072.968750 delta fraction:64.000000
131072.968750 131072.984375 131073.000000 delta fraction:64.000000
Why floating-point numbers' significant digits are 7 or 6 may also help.

How to remove last significant digits/mantissa bits for floating numbers in C++

I would like to remove the last 2 or 3 significant digits of a floating-point number in C++ in an efficient way. To formulate the question more accurately: I would like to discard the last few mantissa bits of the floating-point representation.
Some background: I can arrive at the same floating-point number in different ways. For example, if I use bilinear interpolation over a rectangle with equal values in its corners, the results will vary in the last couple of digits at different points of the rectangle due to machine accuracy limits. The absolute magnitude of these deviations depends on the magnitude of the interpolated values. For example, if p[i] ~ 1e10 (i = 1..4, the values at the corners of the rectangle) then the interpolation error caused by machine accuracy is ~1e-4 (for 8-byte floats). If p[i] ~ 1e-10 then the error would be ~1e-24. As the interpolated values are used to calculate 1st or 2nd order derivatives, I need to 'smooth out' these differences. One idea is to remove the last couple of digits from the final result. Below is my take on it:
#include <iostream> // std::cout
#include <limits>   // std::numeric_limits
#include <math.h>   // fabs

template<typename real>
real remove_digits(const real& value, const int num_digits)
{
    //return value;
    if (value == 0.) return 0;
    const real corrector_power =
        (std::numeric_limits<real>::digits10 - num_digits)
        - round(log10(fabs(value)));
    //the value is too close to the limits of the double minimum value
    //to be corrected -- return 0 instead.
    if (corrector_power > std::numeric_limits<real>::max_exponent10 -
        num_digits - 1)
    {
        return 0.;
    }
    const real corrector = pow(10, corrector_power);
    const real result =
        (value > 0) ? floor(value * corrector + 0.5) / corrector :
                      ceil(value * corrector - 0.5) / corrector;
    return result;
}//remove_digits

int main() {
    // g++ (Debian 4.7.2-5) 4.7.2 --- Debian GNU/Linux 7.8 (wheezy) 64-bit
    std::cout << remove_digits<float>(12345.1234, 1) << std::endl; // 12345.1
    std::cout << remove_digits<float>(12345.1234, 2) << std::endl; // 12345
    std::cout << remove_digits<float>(12345.1234, 3) << std::endl; // 12350
    std::cout << std::numeric_limits<float>::digits10 << std::endl; // 6
}
It works, but uses 2 expensive operations -- log10 and pow. Is there some smarter way of doing this? As follows from above to achieve my goals I do not need to remove actual decimal digits, but just set 3-4 bits in mantissa representation of floating number to 0.
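Since the real goal is to clear a few low-order significand bits, a bit-level sketch might avoid log10/pow entirely (this assumes IEEE-754 binary64; drop_mantissa_bits is a name invented here, and the operation truncates toward zero rather than rounding):
#include <cstdint>
#include <cstring>
#include <iostream>
// Zero the lowest `bits` bits of the stored significand.
double drop_mantissa_bits(double value, int bits) {
    std::uint64_t u;
    std::memcpy(&u, &value, sizeof u);            // well-defined type punning
    u &= ~((std::uint64_t(1) << bits) - 1);       // clear the low `bits` bits
    std::memcpy(&value, &u, sizeof u);
    return value;
}
int main() {
    std::cout.precision(17);
    std::cout << drop_mantissa_bits(12345.123456789, 12) << '\n';   // low 12 bits zeroed
}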
"Some background: I can arrive to the same floating point number using different ways."
That's not a problem. The "local" resolution near x is approximately x-std::next_after(x,0.0) which over a wide range is linear in x. Your problem with p[i] varying by 20 orders of magnitude doesn't matter as the relative error varies linearly with p[i], which implies it would also vary linearly with the first expression.
More in general, if you want to compare whether a and b are similar enough, just test whether a/b is approximately 1.00000000. (You have bigger problems when b is exactly zero - there's no meaningful way to say whether 1E-10 is "almost equal" to 0)
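A minimal sketch of that relative test (the name, default tolerance, and zero policy here are illustrative choices, not prescriptions):
#include <cmath>
#include <iostream>

// Treat a and b as "similar enough" when a/b is within `tol` of 1.
bool nearly_equal(double a, double b, double tol = 1e-9) {
    if (b == 0.0) return a == 0.0;   // the zero case needs its own policy
    return std::fabs(a / b - 1.0) <= tol;
}

int main() {
    std::cout << std::boolalpha
              << nearly_equal(0.1 + 0.2, 0.3) << '\n'   // true
              << nearly_equal(1e-10, 0.0) << '\n';      // false
}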

c++ float subtraction rounding error

I have a float value between 0 and 1. I need to map it to the range -120 to 80.
To do this, I first multiply by 200 and then subtract 120.
When the subtraction is done I get a rounding error.
Let's look at my example.
float val = 0.6050f;
val *= 200.f;
Now val is 121.0 as I expected.
val -= 120.0f;
Now val is 0.99999992
I thought maybe I could avoid this problem with multiplication and division.
float val = 0.6050f;
val *= 200.f;
val *= 100.f;
val -= 12000.0f;
val /= 100.f;
But it didn't help; I still end up with 0.99.
Is there a solution for it?
Edit: After more detailed logging, I understand there is no problem with this part of the code. Before, my log showed me "0.605"; with detailed logging I saw "0.60499995946884155273437500000000000000000000000000".
The problem is in a different place.
Edit 2: I think I found the culprit. The initialised value is 0.5750.
#include <iostream>
#include <sstream>
#include <iomanip>
#include <string>

std::string floatToStr(double d)
{
    std::stringstream ss;
    ss << std::fixed << std::setprecision(15) << d;
    return ss.str();
}

int main()
{
    float val88 = 0.57500000000f;
    std::cout << floatToStr(val88) << std::endl;
}
The result is 0.574999988079071
Actually I need to add and subtract 0.0025 to/from this value every time.
Normally I expect 0.575, 0.5775, 0.5800, 0.5825, ...
Edit 3: Actually I tried all of this with double, and it works for my example.
#include <iostream>
#include <sstream>
#include <iomanip>
#include <string>

std::string doubleToStr(double d)
{
    std::stringstream ss;
    ss << std::fixed << std::setprecision(15) << d;
    return ss.str();
}

int main()
{
    double val88 = 0.575;
    std::cout << doubleToStr(val88) << std::endl;
    val88 += 0.0025;
    std::cout << doubleToStr(val88) << std::endl;
    val88 += 0.0025;
    std::cout << doubleToStr(val88) << std::endl;
    val88 += 0.0025;
    std::cout << doubleToStr(val88) << std::endl;
    return 0;
}
The results are:
0.575000000000000
0.577500000000000
0.580000000000000
0.582500000000000
But I am bound to float, unfortunately; I would need to change lots of things.
Thank you all for the help.
Edit 4: I have found my solution with strings. I use the stringstream's rounding and convert back to double after that. That gives me numbers correct to 4 decimal places.
#include <iostream>
#include <sstream>
#include <iomanip>
#include <cstdlib>   // atof

std::string doubleToStr(double d, int precision)
{
    std::stringstream ss;
    ss << std::fixed << std::setprecision(precision) << d;
    return ss.str();
}

int main()
{
    double val945 = (double)0.575f;
    std::cout << doubleToStr(val945, 4) << std::endl;
    std::cout << doubleToStr(val945, 15) << std::endl;
    std::cout << atof(doubleToStr(val945, 4).c_str()) << std::endl;
}
and results are:
0.5750
0.574999988079071
0.575
Let us assume that your compiler implements IEEE 754 binary32 and binary64 exactly for float and double values and operations.
First, you must understand that 0.6050f does not represent the mathematical quantity 6050 / 10000. It is exactly 0.605000019073486328125, the nearest float to that. Even if you write perfect computations from there, you have to remember that these computations start from 0.605000019073486328125 and not from 0.6050.
Second, you can solve nearly all your accumulated roundoff problems by computing with double and converting to float only in the end:
$ cat t.c
#include <stdio.h>
int main(){
printf("0.6050f is %.53f\n", 0.6050f);
printf("%.53f\n", (float)((double)0.605f * 200. - 120.));
}
$ gcc t.c && ./a.out
0.6050f is 0.60500001907348632812500000000000000000000000000000000
1.00000381469726562500000000000000000000000000000000000
In the above code, all computations and intermediate values are double-precision.
This 1.0000038… is a very good answer if you remember that you started with 0.605000019073486328125 and not 0.6050 (which doesn't exist as a float).
If you really care about the difference between 0.99999992 and 1.0, float is not precise enough for your application. You need to at least change to double.
If you need an answer in a specific range, and you are getting answers slightly outside that range but within rounding error of one of the ends, replace the answer with the appropriate range end.
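That last suggestion is just a clamp after the computation; for example (a sketch, using C++17 std::clamp and a helper name invented here):
#include <algorithm>
#include <iostream>
// Map [0,1] to [-120,80]; results pushed marginally outside the range by
// rounding error are snapped back to the nearest end.
float to_range(float val) {
    return std::clamp(val * 200.0f - 120.0f, -120.0f, 80.0f);
}
int main() {
    std::cout << to_range(0.0f) << ' ' << to_range(0.6050f) << ' ' << to_range(1.0f) << '\n';
}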
The point everybody is making can be summarised: in general, floating point is precise but not exact.
How precise is governed by the number of bits in the mantissa -- which is 24 for float, and 53 for double (assuming IEEE 754 binary formats, which is pretty safe these days ! [1]).
If you are looking for an exact result, you have to be ready to deal with values that differ (ever so slightly) from that exact result, but...
(1) The Exact Binary Fraction Problem
...the first issue is whether the exact value you are looking for can be represented exactly in binary floating point form...
...and that is rare -- which is often a disappointing surprise.
The binary floating point representation of a given value can be exact, but only under the following, restricted circumstances:
the value is an integer, < 2^24 (float) or < 2^53 (double).
this is the simplest case, and perhaps obvious. Since you are looking for a result >= -120 and <= 80, this is sufficient.
or:
the value is an integer which divides exactly by 2^n and is then (as above) < 2^24 or < 2^53.
this includes the first rule, but is more general.
or:
the value has a fractional part, but when the value is multiplied by the smallest 2^n necessary to produce an integer, that integer is < 2^24 (float) or < 2^53 (double).
This is the part which may come as a surprise.
Consider 27.01, which is a simple enough decimal value, and clearly well within the ~7 decimal digit precision of a float. Unfortunately, it does not have an exact binary floating point form -- you can multiply 27.01 by any 2^n you like, for example:
27.01 * (2^ 6) = 1728.64 (multiply by 64)
27.01 * (2^ 7) = 3457.28 (multiply by 128)
...
27.01 * (2^10) = 27658.24
...
27.01 * (2^20) = 28322037.76
...
27.01 * (2^25) = 906305208.32 (> 2^24 !)
and you never get an integer, let alone one < 2^24 or < 2^53.
Actually, all these rules boil down to one rule... if you can find an 'n' (positive or negative, integer) such that y = value * (2^n), and where y is an exact, odd integer, then value has an exact representation if y < 2^24 (float) or if y < 2^53 (double) -- assuming no under- or over-flow, which is another story.
This looks complicated, but the rule of thumb is simply: "very few decimal fractions can be represented exactly as binary fractions".
To illustrate how few, let us consider all the 4 digit decimal fractions, of which there are 10000, that is 0.0000 up to 0.9999 -- including the trivial, integer case 0.0000. We can enumerate how many of those have exact binary equivalents:
1: 0.0000 = 0/16 or 0/1
2: 0.0625 = 1/16
3: 0.1250 = 2/16 or 1/8
4: 0.1875 = 3/16
5: 0.2500 = 4/16 or 1/4
6: 0.3125 = 5/16
7: 0.3750 = 6/16 or 3/8
8: 0.4375 = 7/16
9: 0.5000 = 8/16 or 1/2
10: 0.5625 = 9/16
11: 0.6250 = 10/16 or 5/8
12: 0.6875 = 11/16
13: 0.7500 = 12/16 or 3/4
14: 0.8125 = 13/16
15: 0.8750 = 14/16 or 7/8
16: 0.9375 = 15/16
That's it ! Just 16/10000 possible 4 digit decimal fractions (including the trivial 0 case) have exact binary fraction equivalents, at any precision. All the other 9984/10000 possible decimal fractions give rise to recurring binary fractions. So, for 'n' digit decimal fractions only (2^n) / (10^n) can be represented exactly -- that's 1/(5^n) !!
This is, of course, because your decimal fraction is actually the rational x / (10^n)[2] and your binary fraction is y / (2^m) (for integer x, y, n and m), and for a given binary fraction to be exactly equal to a decimal fraction we must have:
y = (x / (10^n)) * (2^m)
= (x / ( 5^n)) * (2^(m-n))
which is only the case when x is an exact multiple of (5^n) -- for otherwise y is not an integer. (Noting that n <= m, assuming that x has no (spurious) trailing zeros, and hence n is as small as possible.)
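That count is easy to confirm by brute force: since 10000 = 2^4 * 5^4, the fraction k/10000 reduces to a power-of-two denominator (and so has an exact binary form) exactly when 5^4 = 625 divides k. A quick sketch reproducing the table above:
#include <cstdio>
int main() {
    int exact = 0;
    for (int k = 0; k < 10000; ++k)
        if (k % 625 == 0)                        // k/10000 is an exact binary fraction
            std::printf("%2d: 0.%04d = %d/16\n", ++exact, k, k / 625);
    std::printf("%d of 10000\n", exact);         // 16
}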
(2) The Rounding Problem
The result of a floating point operation may need to be rounded to the precision of the destination variable. IEEE 754 requires that the operation is done as if there were no limit to the precision, and the ("true") result is then rounded to the nearest value at the precision of the destination. So, the final result is as precise as it can be... given the limitations on how precise the arguments are, and how precise the destination is... but not exact !
(With floats and doubles, 'C' may promote float arguments to double (or long double) before performing an operation, and the result of that will be rounded to double. The final result of an expression may then be a double (or long double), which is then rounded (again) if it is to be stored in a float variable. All of this adds to the fun ! See FLT_EVAL_METHOD for what your system does -- noting the default for a floating point constant is double.)
So, the other rules to remember are:
floating point values are not reals (they are, in fact, rationals with a limited denominator).
The precision of a floating point value may be large, but there are lots of real numbers that cannot be represented exactly !
floating point expressions are not algebra.
For example, converting from degrees to radians requires division by π. Any arithmetic with π has a problem ('cos it's irrational), and with floating point the value for π is rounded to whatever floating precision we are using. So, the conversion of (say) 27 (which is exact) degrees to radians involves division by 180 (which is exact) and multiplication by our "π". However exact the arguments, the division and the multiplication may round, so the result is may only approximate. Taking:
float pi = 3.14159265358979 ; /* plenty for float */
float x = 27.0 ;
float y = (x / 180.0) * pi ;
float z = (y / pi) * 180.0 ;
printf("z-x = %+6.3e\n", z-x) ;
my (pretty ordinary) machine gave: "z-x = +1.907e-06"... so, for our floating point:
x != (((x / 180.0) * pi) / pi) * 180 ;
at least, not for all x. In the case shown, the relative difference is small -- ~ 1.2 / (2^24) -- but not zero, which simple algebra might lead us to expect.
hence: floating point equality is a slippery notion.
For all the reasons above, the test x == y for two floating values is problematic. Depending on how x and y have been calculated, if you expect the two to be exactly the same, you may very well be sadly disappointed.
[1] There exists a standard for decimal floating point, but generally binary floating point is what people use.
[2] For any decimal fraction you can write down with a finite number of digits !
Even with double precision, you'll run into issues such as:
200. * .60499999999999992 = 120.99999999999997
It appears that you want some type of rounding so that 0.99999992 is rounded to 1.00000000 .
If the goal is to produce values to the nearest multiple of 1/1000, try:
#include <math.h>
val = (float) floor((200000.0f*val)-119999.5f)/1000.0f;
If the goal is to produce values to the nearest multiple of 1/200, try:
val = (float) floor((40000.0f*val)-23999.5f)/200.0f;
If the goal is to produce values to the nearest integer, try:
val = (float) floor((200.0f*val)-119.5f);
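Plugging the drifted value from the question's log through those three forms (a quick sketch, assuming IEEE-754 float) shows each of them snapping back to 1:
#include <cmath>
#include <cstdio>
int main() {
    float val   = 0.60499995946884155f;              // ~0.605 after accumulated error
    float naive = 200.0f * val - 120.0f;             // lands just below 1.0
    float k1000 = (float) std::floor((200000.0f*val)-119999.5f)/1000.0f;   // nearest 1/1000
    float k200  = (float) std::floor((40000.0f*val)-23999.5f)/200.0f;      // nearest 1/200
    float k1    = (float) std::floor((200.0f*val)-119.5f);                 // nearest integer
    std::printf("naive=%.8f  1/1000=%.8f  1/200=%.8f  integer=%.8f\n",
                naive, k1000, k200, k1);
}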