how to improve the precision of computing float numbers?

I wrote a code snippet in Microsoft Visual Studio Community 2019 in C++ like this:
int m = 11;
int p = 3;
float step = 1.0 / (m - 2 * p);
The variable step ends up as 0.200003, but 0.2 is what I wanted. Is there any way to improve the precision?
This problem comes from a uniform knot vector. A knot vector is a concept in NURBS. You can think of it as just an array of numbers like this: U[] = {0, 0.2, 0.4, 0.6, 0.8, 1.0}; The span between two adjacent numbers is constant. The size of the knot vector can change according to some conditions, but the range is always [0, 1].
The whole function is:
typedef float NURBS_FLOAT;
void CreateKnotVector(int m, int p, bool clamped, NURBS_FLOAT* U)
{
if (clamped)
{
for (int i = 0; i <= p; i++)
{
U[i] = 0;
}
NURBS_FLOAT step = 1.0 / (m - 2 * p);
for (int i = p+1; i < m-p; i++)
{
U[i] = U[i - 1] + step;
}
for (int i = m-p; i <= m; i++)
{
U[i] = 1;
}
}
else
{
U[0] = 0;
NURBS_FLOAT step = 1.0 / m;
for (int i = 1; i <= m; i++)
{
U[i] = U[i - 1] + step;
}
}
}

Let's follow what's going on in your code:
The expression 1.0 / (m - 2 * p) yields 0.2, to which the closest representable double value is 0.200000000000000011102230246251565404236316680908203125. Notice how precise it is – to 16 significant decimal digits. It's because, due to 1.0 being a double literal, the denominator is promoted to double, and the whole calculation is done in double precision, thus yielding a double value.
The value obtained in the previous step is written to step, which has type float. So the value has to be rounded to the closest representable value, which happens to be 0.20000000298023223876953125.
So your cited result of 0.200003 is not what you should get. Instead, it should be closer to 0.200000003.
Is there any suggestion to improve the precision?
Yes. Store the value in a higher-precision variable. E.g., instead of float step, use double step. In this case the value you've calculated won't be rounded once more, so precision will be higher.
Can you get the exact 0.2 value to work with it in the subsequent calculations? With binary floating-point arithmetic, unfortunately, no. In binary, the number 0.2 is a periodic fraction:
$0.2_{10} = 0.\overline{0011}_2 = 0.0011\,0011\,0011\ldots_2$
See Is floating point math broken? question and its answers for more details.
If you really need decimal calculations, you should use a library solution, e.g. Boost's cpp_dec_float. Or, if you need arbitrary-precision calculations, you can use e.g. cpp_bin_float from the same library. Note that both variants will be orders of magnitude slower than using built-in C++ binary floating-point types.
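To make the double rounding described above concrete, here is a minimal sketch. The function names step_as_float, step_as_double and abs_error are made up for illustration; the expression is the one from the question, stored once as float and once as double.

```cpp
#include <cassert>
#include <cmath>

// The expression from the question, stored at two different precisions.
// The division itself is done in double in both cases, because 1.0 is a
// double literal.
float step_as_float(int m, int p)
{
    return static_cast<float>(1.0 / (m - 2 * p)); // rounded a second time, to float
}

double step_as_double(int m, int p)
{
    return 1.0 / (m - 2 * p); // kept at double precision
}

// Absolute distance between a stored value and a target value, in double.
double abs_error(double stored, double target)
{
    return std::fabs(stored - target);
}
```

With m = 11 and p = 3, abs_error(step_as_float(11, 3), 0.2) is roughly 3e-9, while the double version is already the double closest to 0.2. Neither one is exactly 0.2, for the reason given above.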

When dealing with floating-point math, a certain amount of rounding error is expected.
For starters, values like 0.2 aren't exactly represented by a float, or even a double:
std::cout << std::setprecision(60) << 0.2 << '\n';
// ^^^ It outputs something like: 0.200000000000000011102230246251565404236316680908203125
Besides, the errors may accumulate when a sequence of operations is performed on imprecise values. Some operations, like addition and subtraction, are more sensitive to this kind of error than others, so it's better to avoid them if possible.
That seems to be the case here, where we can rewrite OP's function into something like the following:
#include <iostream>
#include <iomanip>
#include <vector>
#include <algorithm>
#include <cassert>
#include <type_traits>
template <typename T = double>
auto make_knots(int m, int p = 0) // <- Note that I've changed the signature.
{
static_assert(std::is_floating_point_v<T>);
std::vector<T> knots(m + 1);
int range = m - 2 * p;
assert(range > 0);
for (int i = 1; i < m - p; i++)
{
knots[i + p] = T(i) / range; // <- Less prone to accumulate rounding errors
}
std::fill(knots.begin() + m - p, knots.end(), 1.0);
return knots;
}
template <typename T>
void verify(std::vector<T> const& v)
{
bool sum_is_one = true;
for (int i = 0, j = v.size() - 1; i <= j; ++i, --j)
{
if (v[i] + v[j] != 1.0) // <- That's a bold request for a floating point type
{
sum_is_one = false;
break;
}
}
std::cout << (sum_is_one ? "\n" : "Rounding errors.\n");
}
int main()
{
// For presentation purposes only
std::cout << std::setprecision(60) << 0.2 << '\n';
std::cout << std::setprecision(60) << 0.4 << '\n';
std::cout << std::setprecision(60) << 0.6 << '\n';
std::cout << std::setprecision(60) << 0.8 << "\n\n";
auto k1 = make_knots(11, 3);
for (auto i : k1)
{
std::cout << std::setprecision(60) << i << '\n';
}
verify(k1);
auto k2 = make_knots<float>(10);
for (auto i : k2)
{
std::cout << std::setprecision(60) << i << '\n';
}
verify(k2);
}
Testable here.

One solution to avoid drift (which I guess is your worry?) is to manually use rational numbers, for example in this case you might have:
// your input values for determining step
int m = 11;
int p = 3;
// pre-calculate any intermediate values, which won't have rounding issues
int divider = (m - 2 * p); // could be float or double instead of int
// input
int stepnumber = 1234; // could also be float or double instead of int
// output
float stepped_value = stepnumber * 1.0f / divider;
In other words, formulate your problem so that the step of your original code is always 1 (or whatever rational number you can represent exactly using two integers) internally, so there is no rounding issue. If you need to display the value to the user, compute it just for display: 1.0 / divider, rounded to a suitable number of digits.

Related

Why using double and then cast to float?

I'm trying to improve the performance of surf.cpp. From line 140, you can find this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
return (float)d;
}
Running an Intel Advisor Vectorization analysis, it shows that "1 Data type conversions present" which could be inefficient (especially in vectorization).
But my question is: looking at this function, why would the authors have declared d as double and then cast it to float? If they wanted a fractional number, float would be fine. The only reason that comes to mind is that double is more precise than float, so it can represent smaller numbers during the calculation, while the final value is large enough to be stored in a float. I didn't run any tests on the value of d, though.
Any other possible reason?

Because the author wants higher precision during the calculation and rounds only the final result. This is the same as keeping more significant digits during the calculation.
More precisely, errors can accumulate during addition and subtraction. The accumulated error can be considerable when a large number of floating-point values is involved.

You questioned the answer saying it's to use higher precision during the summation, but I don't see why. That answer is correct. Consider this simplified version with completely made-up numbers:
#include <iostream>
#include <iomanip>
float w = 0.012345;
float calcFloat(const int* origin, int n )
{
float d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
float calcDouble(const int* origin, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
int main()
{
int o[] = { 1111, 22222, 33333, 444444, 5555 };
std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}
The results are:
6254.77979
6254.7793
So even though the inputs are the same in both cases, you get a different result using double for the intermediate summation. Changing calcDouble to use (double)w doesn't change the output.
This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.
This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:
In general:
Multiplication and division are “safe” operations
Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.
So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double instead of a float doesn't matter nearly as much: you will get something that is approximately as accurate as the float value you start with (as long as the result isn't very, very large or very, very small). But summing float values that could have very different orders of magnitude, even when the individual numbers themselves are representable as float, will accumulate errors and deviate further and further from the true answer.
To see that in action:
float f1 = 1e4, f2 = 1e-4;
std::cout << (f1 + f2) << '\n';
std::cout << (double(f1) + f2) << '\n';
Or equivalently, but closer to the original code:
float f1 = 1e4, f2 = 1e-4;
float f = f1;
f += f2;
double d = f1;
d += f2;
std::cout << f << '\n';
std::cout << d << '\n';
The result is:
10000
10000.0001
Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float.

Ensure float to be smaller than exact value

I want to calculate a sum of the following form in C++
float result = float(x1)/y1+float(x2)/y2+....+float(xn)/yn
xi,yi are all integers. The result will be an approximation of the actual value. It is crucial that this approximation is smaller or equal to the actual value. I can assume that all my values are finite and positive.
I tried using nextafterf(x, 0) as in this code snippet.
cout.precision( 15 );
float a = 1.0f / 3.0f * 10; //3 1/3
float b = 2.0f / 3.0f * 10; //6 2/3
float af = nextafterf( a , 0 );
float bf = nextafterf( b , 0 );
cout << a << endl;
cout << b << endl;
cout << af << endl;
cout << bf << endl;
float sumf = 0.0f;
for ( int i = 1; i <= 3; i++ )
{
sumf = sumf + bf;
}
sumf = sumf + af;
cout << sumf << endl;
As one can see, the correct result would be 3*6.666... + 3.333... = 23.3333...
But as output I get:
3.33333349227905
6.66666698455811
3.33333325386047
6.66666650772095
23.3333339691162
Even though my summands are smaller than what they should represent, their sum is not. In this case applying nextafterf to sumf will give me 23.3333320617676 which is smaller. But does this always work? Is it possible that the rounding error gets so big that nextafterf still leaves me above the correct value?
I know that I could avoid this by implementing a class for fractions and calculating everything exactly. But I'm curious whether it is possible to achieve my goal with floats.

Try changing the float rounding mode to FE_TOWARDZERO.
See code example here:
Change floating point rounding mode
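A minimal sketch of what that could look like, using std::fesetround from the cfenv header. The function name sum_of_quotients is made up for illustration, and two assumptions are built in: all inputs are positive, and they are small enough (below 2^24) that converting them to float is exact. Otherwise a denominator rounded toward zero would make its quotient too large. The volatile qualifiers are only there to keep the compiler from folding or moving the arithmetic across the rounding-mode change; depending on the compiler, an option such as GCC's -frounding-math may also be needed.

```cpp
#include <cassert>
#include <cfenv>
#include <cstddef>

// Sums x[i] / y[i] in float under the given rounding mode
// (e.g. FE_TOWARDZERO or FE_TONEAREST), then restores the previous mode.
// Assumes positive inputs below 2^24, so the int -> float conversions are exact.
float sum_of_quotients(const int* x, const int* y, std::size_t n, int rounding_mode)
{
    const int saved_mode = std::fegetround();
    std::fesetround(rounding_mode);

    volatile float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        volatile float fx = static_cast<float>(x[i]);
        volatile float fy = static_cast<float>(y[i]);
        volatile float q = fx / fy; // rounded according to the current mode
        sum = sum + q;              // rounded according to the current mode
    }
    const float result = sum;

    std::fesetround(saved_mode);
    return result;
}
```

For the example in the question, calling it with x = {10, 20, 20, 20}, y = {3, 3, 3, 3} and FE_TOWARDZERO should give a value that does not exceed 70/3, because every division and every addition is rounded down.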

My immediate reaction is that the approach you're taking is fundamentally flawed.
The problem is that with floating point numbers, the size of step that nextafter will take will depend on the magnitude of the numbers involved. Let's consider a somewhat extreme example:
#include <iostream>
#include <iomanip>
#include <cmath>
int main() {
float num = 1.0e-10f;
float denom = 1.0e10f;
std::cout << std::setprecision(7) << num - std::nextafterf(num, 0) << "\n";
std::cout << std::setprecision(7) << denom - std::nextafterf(denom, 0) << "\n";
}
Result:
6.938894e-018
1024
So, since the numerator is a lot smaller than the denominator, the increment is also much smaller.
The result seems fairly clear: instead of the result being slightly smaller than the input, the result should be quite a bit larger than the input.
If you want to ensure the result is smaller than the correct number, the obvious choice would be to round the numerator down, but the denominator up (i.e. nextafterf(denom, positive_infinity)). This way, you get a smaller numerator and a larger denominator, so the result is always smaller than the unmodified version would have been.

float result = float(x1)/y1+float(x2)/y2+....+float(xn)/yn has 3 places where rounding may occur.
Conversion of int to float - it is not always exact.
Division floating point x/floating point y
Addition: floating point quotient + floating point quotient.
By stepping each intermediate value to the next representable float (up or down, as the equation needs), the result will certainly be less than the exact mathematical value. This approach may not produce the float closest to the exact answer, but it will be close and certainly smaller.
#include <float.h>  // FLT_MAX
#include <math.h>   // nextafterf
#include <stddef.h> // size_t
float foo(const int *x, const int *y, size_t n) {
float sum = 0.0;
for (size_t i=0; i<n; i++) { // assume x[0] is x1, x[1] is x2 ...
float fx = nextafterf(x[i], 0.0);
float fy = nextafterf(y[i], FLT_MAX);
// divide by slightly smaller over slightly larger
float q = nextafterf(fx / fy, 0.0);
sum = nextafterf(sum + q, 0.0);
}
return sum;
}

What is more accurate way to average, ARR[0]/N+ARR[1]/N...+ARR[N-1]/N or (ARR[0]+ARR[1]...+ARR[N-1])/N in double?

What is more accurate way to calculate average of set of numbers, ARR[0]/N+ARR[1]/N...+ARR[N-1]/N or (ARR[0]+ARR[1]...+ARR[N-1])/N? (ARR is the set of numbers and N is the count of the numbers in that set)
Consider I have set of numbers that each ranges from 0.0 to 1.0 (they are double\floating-point numbers) and there are thousands of them or even millions.
I am open to new methods like recursive average (average twin-cells into array and then again average it until it outputs one-cell array).

If some of the values are very close to zero, you'll have an issue with rounding in the summation (the error could go up or down), and the same applies to any range of values when summing a large set of numbers. One way around this issue is to use a summation function that only adds numbers with the same exponent (until you call getsum() to get the total sum, it keeps the exponents as close as possible). Here is an example C++ class that does this (note that the code was compiled using Visual Studio and written before uint64_t was available).
// SUM contains an array of 2048 IEEE 754 doubles, indexed by exponent,
// used to minimize rounding / truncation issues when doing
// a large number of summations
class SUM{
double asum[2048];
public:
SUM(){for(int i = 0; i < 2048; i++)asum[i] = 0.;}
void clear(){for(int i = 0; i < 2048; i++)asum[i] = 0.;}
// getsum returns the current sum of the array
double getsum(){double d = 0.; for(int i = 0; i < 2048; i++)d += asum[i];
return(d);}
void addnum(double);
};
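#include <cstring> // std::memcpy
void SUM::addnum(double d) // add a number into the array
{
size_t i;
unsigned long long bits;
while(1){
// i = exponent of d (memcpy reads the bits without violating aliasing rules)
std::memcpy(&bits, &d, sizeof bits);
i = ((size_t)(bits>>52))&0x7ff;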
if(i == 0x7ff){ // max exponent, could be overflow
asum[i] += d;
return;
}
if(asum[i] == 0.){ // if empty slot store d
asum[i] = d;
return;
}
d += asum[i]; // else add slot to d, clear slot
asum[i] = 0.; // and continue until empty slot
}
}
Example program that uses the sum class:
#include <iostream>
#include <iomanip>
using namespace std;
static SUM sum;
int main()
{
double dsum = 0.;
double d = 1./5.;
unsigned long i;
for(i = 0; i < 0xffffffffUL; i++){
sum.addnum(d);
dsum += d;
}
cout << "dsum = " << setprecision(16) << dsum << endl;
cout << "sum.getsum() = " << setprecision(16) << sum.getsum() << endl;
cout << "0xffffffff * 1/5 = " << setprecision(16) << d * (double)0xffffffffUL << endl;
return(0);
}

(ARR[0]+ARR[1]...+ARR[N-1])/N is faster and more accurate because you omit useless divisions with N that both slow down the process and add error in the calculations.

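If you have a bunch of floating-point numbers, a more accurate way to get the mean is to add them in order of increasing magnitude (note that this sorts the input array in place, and needs <algorithm> and <cmath>):
template<class T> T mean(T* arr, size_t N) {
std::sort(arr, arr+N, [](T a, T b){return std::abs(a) < std::abs(b);});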
T r = 0;
for(size_t n = 0; n < N; n++)
r += arr[n];
return r / N;
}
Important points:
The numbers of least magnitude are added first to preserve significant digits.
Only one division, to reduce rounding error there.
Still, the intermediate sum might become too big.
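The "recursive average" idea from the question is close to what is usually called pairwise summation. Below is a rough sketch of it, not a tuned implementation; the names pairwise_sum and pairwise_mean are made up for illustration. It sums the two halves of the array recursively and divides only once at the end, so it also works when N is not a power of two (averaging pairs of cells directly would weight the elements unevenly in that case).

```cpp
#include <cassert>
#include <cstddef>

// Recursively sums arr[0..n) by halves, so that each addition combines
// partial sums built from a similar number of elements. Assumes n >= 1.
double pairwise_sum(const double* arr, std::size_t n)
{
    assert(n >= 1);
    if (n == 1) return arr[0];
    if (n == 2) return arr[0] + arr[1];
    const std::size_t half = n / 2;
    return pairwise_sum(arr, half) + pairwise_sum(arr + half, n - half);
}

// Mean with a single division at the end.
double pairwise_mean(const double* arr, std::size_t n)
{
    return pairwise_sum(arr, n) / static_cast<double>(n);
}
```

For values in [0, 1] the partial sums on both sides of each addition tend to have similar magnitude, which is the situation where floating-point addition loses the fewest digits. The recursion depth is only about log2(N), so millions of elements are not a problem.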

Loop with float iteration

First Situation
for (int i = 0 ; i <=2 ; i++)
{
cout << i << endl ;
}
output:
1
2
Second Situation
for (float i = 0 ; i <= 2 ; i+=.2)
{
cout << i << endl;
}
output
1
1.2
1.4
1.6
1.8
The question is: why doesn't the second loop print 2, even though I wrote <=?
And the funny thing is that if I remove the =, the output stays the same.
Constraints
I have to use the float data type,
and I want to use the <= operator.

Because 0.2 cannot be represented exactly in a float, and the rounding errors accumulate in your loop. On my computer, adding 0.2 ten times gives a value 2.38419e-07 above 2.0f, so the <= 2 test fails for the last step.
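One way to sidestep the accumulation is to drive the loop with an integer counter and compute each float from it. This is only a sketch, and the function name stepped_values is made up for illustration. The values are still float, and the loop still uses <=, just on the counter.

```cpp
#include <cassert>
#include <vector>

// Returns i / steps_per_unit for i = 0 .. last_index.
// Each value is computed directly from the integer counter, so no rounding
// error is carried over from one iteration to the next.
std::vector<float> stepped_values(int last_index, float steps_per_unit)
{
    std::vector<float> values;
    for (int i = 0; i <= last_index; ++i) {
        values.push_back(static_cast<float>(i) / steps_per_unit);
    }
    return values;
}
```

For the loop in the question, stepped_values(10, 5.0f) yields eleven values from 0 to exactly 2, because 10 / 5.0f has no rounding error at all.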

You should not rely on exact comparisons of float or double values (== and, at the boundary, <= as well) because of possible arithmetic rounding errors. Allow a small tolerance (epsilon) instead:
const float EPSILON = 0.00001f;
for (float f = 0.0f; f <= 2.0f + EPSILON; f += 0.2f)
{
std::cout << f << std::endl;
}
Also use the f suffix on literals when you are working with the float type (float my_float = 12.4f;).

Rounding double values in C++ like MS Excel does it

I've searched all over the net, but I could not find a solution to my problem. I simply want a function that rounds double values like MS Excel does. Here is my code:
#include <iostream>
#include "math.h"
using namespace std;
double Round(double value, int precision) {
return floor(((value * pow(10.0, precision)) + 0.5)) / pow(10.0, precision);
}
int main(int argc, char *argv[]) {
/* The way MS Excel does it:
1.27815 1.27840 -> 1.27828
1.27813 1.27840 -> 1.27827
1.27819 1.27843 -> 1.27831
1.27999 1.28024 -> 1.28012
1.27839 1.27866 -> 1.27853
*/
cout << Round((1.27815 + 1.27840)/2, 5) << "\n"; // *
cout << Round((1.27813 + 1.27840)/2, 5) << "\n";
cout << Round((1.27819 + 1.27843)/2, 5) << "\n";
cout << Round((1.27999 + 1.28024)/2, 5) << "\n"; // *
cout << Round((1.27839 + 1.27866)/2, 5) << "\n"; // *
if(Round((1.27815 + 1.27840)/2, 5) == 1.27828) {
cout << "Hurray...\n";
}
system("PAUSE");
return EXIT_SUCCESS;
}
I found the function here on Stack Overflow; the answer states that it works like the built-in Excel rounding routine, but it does not. Could you tell me what I'm missing?

In a sense what you are asking for is not possible:
Floating point values on most common platforms do not have a notion of a "number of decimal places". Numbers like 2.3 or 8.71 simply cannot be represented precisely. Therefore, it makes no sense to ask for any function that will return a floating point value with a given number of non-zero decimal places -- such numbers simply do not exist.
The only thing you can do with floating point types is to compute the nearest representable approximation, and then print the result with the desired precision, which will give you the textual form of the number that you desire. To compute the representation, you can do this:
double round(double x, int n)
{
double d;
if (std::modf(x, &d) == 0.0) return x; // x is already an integer, nothing to do
double const f = std::pow(10.0, n);
std::modf(x * f, &d); // d == integral part of 10^n * x (truncated toward zero)
return d / f;
}
(Note that this truncates toward zero rather than rounding to the nearest value. You should also check that n is non-negative, or otherwise define semantics for negative "precision".)
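For the "print the result with the desired precision" part, a small sketch using the standard stream manipulators could look like this. The helper name format_fixed is made up for illustration.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Formats value with exactly `decimals` digits after the decimal point.
// The rounding happens when the text is produced; the double itself is
// never changed.
std::string format_fixed(double value, int decimals)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(decimals) << value;
    return out.str();
}
```

Keep in mind that the text is produced from the binary value actually stored, so a decimal input that looks like an exact tie (such as 1.278275) may print rounded either way, depending on which side of the tie its closest double falls.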
Alternatively to using floating point types, you could perform fixed point arithmetic. That is, you store everything as integers, but you treat them as units of, say, 1/1000. Then you could print such a number as follows:
std::cout << n / 1000 << "." << std::setw(3) << std::setfill('0') << n % 1000; // zero-pad the fraction (needs <iomanip>; assumes n >= 0)
Addition works as expected, though you have to write your own multiplication function.
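A rough sketch of such fixed-point helpers, under the assumptions that values are non-negative, stored in units of 1/1000, and small enough that the intermediate product does not overflow. The names SCALE, fixed_mul and fixed_to_string are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

const std::int64_t SCALE = 1000; // one stored unit == 1/1000

// Product of two fixed-point values. The raw product carries the scale
// factor twice, so divide it out once. Truncates toward zero and assumes
// a * b does not overflow.
std::int64_t fixed_mul(std::int64_t a, std::int64_t b)
{
    return a * b / SCALE;
}

// Text form with a zero-padded three-digit fraction. Assumes v >= 0.
std::string fixed_to_string(std::int64_t v)
{
    const std::string frac = std::to_string(v % SCALE);
    return std::to_string(v / SCALE) + "." + std::string(3 - frac.size(), '0') + frac;
}
```

Here 1.5 is stored as 1500 and 2.5 as 2500; fixed_mul(1500, 2500) gives 3750, which fixed_to_string turns into "3.750". Addition needs no helper, since 1500 + 2500 is simply 4000.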

To compare double values, you must specify a range of comparison, where the result could be considered "safe". You could use a macro for that.
Here is one example of what you could use:
#define COMPARE( A, B, PRECISION ) ( ( (A) >= (B) - (PRECISION) ) && ( (A) <= (B) + (PRECISION) ) ) // arguments parenthesized so that expressions can be passed safely
int main()
{
double a = 12.34567;
bool equal = COMPARE( a, 12.34567F, 0.0002 );
equal = COMPARE( a, 15.34567F, 0.0002 );
return 0;
}
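The same check can also be written as a plain function, which avoids the usual macro pitfalls (arguments evaluated more than once, no type checking). This is only a sketch, and the name nearly_equal is made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// True when a and b differ by no more than tolerance (an absolute tolerance).
bool nearly_equal(double a, double b, double tolerance)
{
    return std::fabs(a - b) <= tolerance;
}
```

An absolute tolerance like this works when the magnitude of the values is known in advance, as with the five-decimal numbers in the question. For values of widely varying magnitude, a tolerance scaled by the operands is usually a better fit.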

Thank you all for your answers! After considering the possible solutions I changed the original Round() function in my code to adding 0.6 instead of 0.5 to the value.
The value "127827.5" (I do understand that this is not an exact representation!) becomes "127828.1" and finally through floor() and dividing it becomes "1.27828" (or something more like 1.2782800..001). Using COMPARE suggested by Renan Greinert with a correctly chosen precision I can safely compare the values now.
Here is the final version:
#include <iostream>
#include "math.h"
#define COMPARE(A, B, PRECISION) ((A >= B-PRECISION) && (A <= B+PRECISION))
using namespace std;
double Round(double value, int precision) {
return floor(value * pow(10.0, precision) + 0.6) / pow(10.0, precision);
}
int main(int argc, char *argv[]) {
/* The way MS Excel does it:
1.27815 1.27840 -> 1.27828
1.27813 1.27840 -> 1.27827
1.27819 1.27843 -> 1.27831
1.27999 1.28024 -> 1.28012
1.27839 1.27866 -> 1.27853
*/
cout << Round((1.27815 + 1.27840)/2, 5) << "\n";
cout << Round((1.27813 + 1.27840)/2, 5) << "\n";
cout << Round((1.27819 + 1.27843)/2, 5) << "\n";
cout << Round((1.27999 + 1.28024)/2, 5) << "\n";
cout << Round((1.27839 + 1.27866)/2, 5) << "\n";
//Comparing the rounded value against a fixed one
if(COMPARE(Round((1.27815 + 1.27840)/2, 5), 1.27828, 0.000001)) {
cout << "Hurray!\n";
}
//Comparing two rounded values
if(COMPARE(Round((1.27815 + 1.27840)/2, 5), Round((1.27814 + 1.27841)/2, 5), 0.000001)) {
cout << "Hurray!\n";
}
system("PAUSE");
return EXIT_SUCCESS;
}
I've tested it by rounding a hundred double values and then comparing the results to what Excel gives. They were all the same.

I'm afraid the answer is that Round cannot perform magic.
Since 1.27828 is not exactly representable as a double, you cannot compare some double with 1.27828 and hope it will match.

You need to do the math on the scaled values, without the decimal part, to get those numbers. Something like this:
double dPow = pow(10.0, 5.0);
double a = 1.27815;
double b = 1.27840;
double a2 = 1.27815 * dPow;
double b2 = 1.27840 * dPow;
double c = (a2 + b2) / 2 + 0.5;
Using your function...
double c = (Round(a) + Round(b)) / 2 + 0.5;
