I want to calculate a sum of the following form in C++
float result = float(x1)/y1 + float(x2)/y2 + ... + float(xn)/yn
The xi and yi are all integers. The result will be an approximation of the actual value. It is crucial that this approximation is smaller than or equal to the actual value. I can assume that all my values are finite and positive.
I tried using nextafterf(x, 0), as in this code snippet.
cout.precision( 15 );
float a = 1.0f / 3.0f * 10; // 3 1/3
float b = 2.0f / 3.0f * 10; // 6 2/3
float af = nextafterf( a , 0 );
float bf = nextafterf( b , 0 );
cout << a << endl;
cout << b << endl;
cout << af << endl;
cout << bf << endl;
float sumf = 0.0f;
for ( int i = 1; i <= 3; i++ )
{
    sumf = sumf + bf;
}
sumf = sumf + af;
cout << sumf << endl;
As one can see, the correct result would be 3 * 6.666... + 3.333... = 23.333...
But as output I get:
3.33333349227905
6.66666698455811
3.33333325386047
6.66666650772095
23.3333339691162
Even though my summands are smaller than the values they should represent, their sum is not. In this case, applying nextafterf to sumf gives me 23.3333320617676, which is smaller. But does this always work? Is it possible that the rounding error gets so big that nextafterf still leaves me above the correct value?
I know that I could avoid this by implementing a class for fractions and calculating everything exactly. But I'm curious whether it is possible to achieve my goal with floats.
Try changing the float rounding mode to FE_TOWARDZERO.
See code example here:
Change floating point rounding mode
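For instance, a minimal sketch (mine, not from the linked example; note that the standard requires #pragma STDC FENV_ACCESS ON for this to be fully reliable, and the volatile is there only to stop the compiler from folding the division at compile time with the default rounding mode):

#include <cfenv>
#include <iostream>

int main() {
    std::fesetround(FE_TOWARDZERO);     // round all subsequent results toward zero
    volatile float n = 1.0f, d = 3.0f;  // volatile blocks compile-time constant folding
    float q = n / d;                    // rounded toward zero, i.e. down for positive values
    std::cout.precision(15);
    std::cout << q << "\n";             // should print the largest float not above 1/3
    std::fesetround(FE_TONEAREST);      // restore the default mode
}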
My immediate reaction is that the approach you're taking is fundamentally flawed.
The problem is that with floating point numbers, the size of step that nextafter will take will depend on the magnitude of the numbers involved. Let's consider a somewhat extreme example:
#include <iostream>
#include <iomanip>
#include <cmath>
int main() {
    float num = 1.0e-10f;
    float denom = 1.0e10f;
    std::cout << std::setprecision(7) << num - std::nextafterf(num, 0) << "\n";
    std::cout << std::setprecision(7) << denom - std::nextafterf(denom, 0) << "\n";
}
Result:
6.938894e-018
1024
So, since the numerator is a lot smaller than the denominator, its nextafter step is also much smaller.
The consequence seems fairly clear: if you step both the numerator and the denominator toward zero, the result does not end up slightly smaller than the true quotient; the shrinking denominator dominates, so the result ends up quite a bit larger.
If you want to ensure the result is smaller than the correct number, the obvious choice would be to round the numerator down, but the denominator up (i.e., nextafterf(denom, positive_infinity)). This way, you get a smaller numerator and a larger denominator, so the result is always smaller than the unmodified version would have been.
float result = float(x1)/y1 + float(x2)/y2 + ... + float(xn)/yn has three places where rounding may occur:
Conversion of int to float, which is not always exact.
Division: floating-point x / floating-point y.
Addition: floating-point quotient + floating-point quotient.
By stepping each intermediate result in the safe direction with nextafter (either up or down, as the equation needs), the result will certainly be less than the exact mathematical value. This approach may not generate the float closest to the exact answer, yet it will be close and certainly smaller.
#include <cfloat>   // FLT_MAX
#include <cmath>    // nextafterf
#include <cstddef>  // size_t

float foo(const int *x, const int *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) { // assume x[0] is x1, x[1] is x2 ...
        float fx = nextafterf(x[i], 0.0f);    // numerator rounded down
        float fy = nextafterf(y[i], FLT_MAX); // denominator rounded up
        // divide slightly smaller by slightly larger, then round the quotient down
        float q = nextafterf(fx / fy, 0.0f);
        sum = nextafterf(sum + q, 0.0f);      // round each partial sum down
    }
    return sum;
}
Related
Consider the following function:
auto f(double a, double b) -> int
{
    return std::floor(a / b);
}
So I want to compute the largest integer k such that k * b <= a in a mathematical sense.
As there could be rounding errors, I am unsure whether the above function really computes this k. I do not worry about the case that k could be out of range.
What is the proper way to determine this k for sure?
It depends how strict you are. Take a double b and an integer n, and calculate a = b * n. The product will be rounded. If it is rounded down, then a is less than the mathematical value of n*b, and a/b is mathematically less than n; you will get a result of n-1 instead of n.
On the other hand, a == b*n will be true. So the “correct” result could be surprising.
Your condition was that “k*b <= a”. If we interpret this as “the result of multiplying k by b in double precision is <= a”, then you’re fine. If we interpret it as “the mathematically exact product of k and b is <= a”, then you need to calculate k*b - a using the fma function and check the sign of the result. This will tell you the truth, but might return a result of 4 if a was calculated as 5.0 * b and was rounded down.
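For example, here is a sketch of that check (my own code, not from the answer; it assumes b > 0, that k fits in a double exactly, and it ignores subnormal edge cases in fma's sign):

#include <cmath>

// Largest integer k such that k*b <= a in the exact mathematical sense.
long long exactFloorDiv(double a, double b) {
    long long k = (long long)std::floor(a / b);  // good first guess, may be off by one
    // fma(k, b, -a) computes k*b - a with a single rounding, so its sign
    // is trustworthy: a positive value means k*b > a exactly.
    while (std::fma((double)k, b, -a) > 0.0) --k;         // guess was too large
    while (std::fma((double)(k + 1), b, -a) <= 0.0) ++k;  // guess was too small
    return k;
}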
The problem is that float division is not exact.
a/b can give 1.9999 instead of 2, and std::floor can then give 1.
One simple solution is to add a small value prior to calling std::floor:
std::floor (a/b + 1.0e-10);
Result:
result = 10 while 11 was expected
With eps added, result = 11
Test code:
#include <iostream>
#include <cmath>
int main() {
    double b = atan(1.0);  // pi/4
    int x = 11;
    double a = x * b;
    int y = std::floor(a / b);
    std::cout << "result = " << y << " while " << x << " was expected\n";
    double eps = 1.0e-10;
    int z = std::floor(a / b + eps);
    std::cout << "With eps added, result = " << z << "\n";
    return 0;
}
I have a function which takes two strings (floating point), an operation, and a floating-point bit width:
EvaluateFloat(const string &str1, const string &str2, enum operation /*add, subtract, multiply, div*/, unsigned int bit_width, string &output)
The inputs str1 and str2 could be float (32 bit) or double (64 bit).
Is it fine if I store the inputs in double, perform the operation in double irrespective of bit width, and then, depending on the bit width, typecast the result to float if it was 32 bit?
e.g
double num1 = atof(str1.c_str());
double num2 = atof(str2.c_str());
double result = num1 operation num2; //! operation will be resolved using a switch
if (32 == bit_width)
{
    float f_result = result;
    output = std::to_string(f_result);
}
else
{
    output = std::to_string(result);
}
Can I safely assume f_result will be exactly the same as if I had performed the operation using the float type, i.e.
float f_num1 = num1;
float f_num2 = num2;
float f_result = f_num1 operation f_num2;
PS:
We assume there won't be any cascaded operations, i.e. out = a + b + c;
instead it will be transformed into: temp = a + b; out = temp + c.
I'm not concerned by inf and nan values.
I'm trying to avoid code redundancy; otherwise I have to do the same operation
twice, once for float and once for double.
C++ does not specify which formats are used for float or double. If IEEE-754 binary32 and binary64 are used, then double-rounding errors do not occur for +, -, *, /, or sqrt. Given float x and float y, the following hold (float arithmetic on the left, double on the right):
x+y = (float) ((double) x + (double) y).
x-y = (float) ((double) x - (double) y).
x*y = (float) ((double) x * (double) y).
x/y = (float) ((double) x / (double) y).
sqrt(x) = (float) sqrt((double) x).
This is per the dissertation A Rigorous Framework for Fully Supporting the IEEE Standard for Floating-Point Arithmetic in High-Level Programming Languages by Samuel A. Figueroa del Cid, January 2000, New York University. Essentially, double has so many digits (bits) beyond float that the rounding to double never conceals the information needed to round correctly to float for results of these operations. (This cannot hold for operations in general; it depends on properties of these operations.) On page 57, Figueroa del Cid gives a table showing that, if the float format has p bits, then, to avoid double rounding errors, double must have 2p+1 bits for addition or subtraction, 2p for multiplication and division, and 2p+2 for sqrt. Since binary32 has 24 bits in the significand and double has 53, these are satisfied. (See the paper for details. There are some caveats, such as that p must be at least 2 or 4 for the various operations.)
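As a quick illustration of these identities, here is a small spot check (my own sketch, not from the dissertation; it assumes FLT_EVAL_METHOD == 0, i.e. float expressions are really evaluated in float, as on typical x86-64):

#include <cassert>
#include <cmath>
#include <iostream>

int main() {
    float xs[] = {1.0f, 0.1f, 3.0e7f, 1.5e-20f};
    float ys[] = {3.0f, 0.7f, 1.0e-7f, 2.5e10f};
    for (float x : xs)
        for (float y : ys) {
            assert(x + y == (float)((double)x + (double)y));
            assert(x - y == (float)((double)x - (double)y));
            assert(x * y == (float)((double)x * (double)y));
            assert(x / y == (float)((double)x / (double)y));
            assert(std::sqrt(x) == (float)std::sqrt((double)x));
        }
    std::cout << "all identities held for these samples\n";
}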
According to the standard, floating-point operations on double behave as if each operation were done in infinite precision and then rounded. If we then convert the result to float, we have rounded it twice. In general this is not equivalent to rounding to a float in the first place. For example, in decimal: 0.47 rounds to 0.5, which rounds to 1, but 0.47 rounds directly to 0. As mentioned by chtz, the product of two floats is always exactly representable as some double (using IEEE math, where double has more than twice the precision of float), so when we cast to a float we have still only lost precision once, and the result should be the same. Likewise, addition and subtraction are not a problem.
The exact quotient of two floats generally cannot be represented in a double (not even 1/3 can), so we might think there is a problem with division. However, I have run the sample code overnight, trying over 3 trillion cases, and have not found any case where running the original division in double gives a different answer.
#include <iostream>
#include <cstdlib>  // rand, RAND_MAX

int main() {
    long i = 0;
    while (1) {
        float x = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        float y = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        float f = x / y;
        double d = (double)x / (double)y;
        if (++i % 10000000 == 0) { std::cout << i << "\t" << x << "," << y << std::endl; }
        if (float(d) != f) {
            std::cout << std::endl;
            std::cout << x << "," << y << std::endl;
            std::cout << std::hex << *(int*)&x << "," << *(int*)&y << std::endl;
            std::cout << float(d) - f << std::endl;
            return 1;
        }
    }
}
I wrote a code snippet in Microsoft Visual Studio Community 2019 in C++ like this:
int m = 11;
int p = 3;
float step = 1.0 / (m - 2 * p);
The variable step is 0.200003, but 0.2 is what I wanted. Is there any suggestion to improve the precision?
This problem comes from a UNIFORM KNOT VECTOR. A knot vector is a concept in NURBS. You can think of it as just an array of numbers like this: U[] = {0, 0.2, 0.4, 0.6, 0.8, 1.0}; The span between two adjacent numbers is a constant. The size of the knot vector can change according to some condition, but the range is in [0, 1].
the whole function is:
typedef float NURBS_FLOAT;
void CreateKnotVector(int m, int p, bool clamped, NURBS_FLOAT* U)
{
if (clamped)
{
for (int i = 0; i <= p; i++)
{
U[i] = 0;
}
NURBS_FLOAT step = 1.0 / (m - 2 * p);
for (int i = p+1; i < m-p; i++)
{
U[i] = U[i - 1] + step;
}
for (int i = m-p; i <= m; i++)
{
U[i] = 1;
}
}
else
{
U[0] = 0;
NURBS_FLOAT step = 1.0 / m;
for (int i = 1; i <= m; i++)
{
U[i] = U[i - 1] + step;
}
}
}
Let's follow what's going on in your code:
The expression 1.0 / (m - 2 * p) mathematically yields 0.2, to which the closest representable double value is 0.200000000000000011102230246251565404236316680908203125. Notice how precise it is: correct to 16 significant decimal digits. That's because, due to 1.0 being a double literal, the denominator is promoted to double, and the whole calculation is done in double precision, thus yielding a double value.
The value obtained in the previous step is written to step, which has type float. So the value has to be rounded to the closest representable value, which happens to be 0.20000000298023223876953125.
So your cited result of 0.200003 is not what you should get. Instead, it should be closer to 0.200000003.
Is there any suggestion to improve the precision?
Yes. Store the value in a higher-precision variable. E.g., instead of float step, use double step. In this case the value you've calculated won't be rounded once more, so precision will be higher.
Can you get the exact 0.2 value to work with it in the subsequent calculations? With binary floating-point arithmetic, unfortunately, no. In binary, the number 0.2 is a periodic fraction:
0.2₁₀ = 0.0011 0011 0011 ...₂ (the block 0011 repeats forever)
See the question Is floating point math broken? and its answers for more details.
If you really need decimal calculations, you should use a library solution, e.g. Boost's cpp_dec_float. Or, if you need arbitrary-precision binary calculations, you can use e.g. cpp_bin_float from the same library. Note that both variants will be orders of magnitude slower than using built-in C++ binary floating-point types.
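A minimal sketch of the decimal variant (mine; it assumes Boost.Multiprecision is installed):

#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iomanip>
#include <iostream>

int main() {
    using boost::multiprecision::cpp_dec_float_50;   // 50 decimal digits
    cpp_dec_float_50 step = cpp_dec_float_50(1) / 5; // exactly 0.2 in decimal
    std::cout << std::setprecision(50) << step << '\n';  // prints 0.2
}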
When dealing with floating point math a certain amount of rounding errors are expected.
For starters, values like 0.2 aren't exactly represented by a float, or even a double:
std::cout << std::setprecision(60) << 0.2 << '\n';
// ^^^ It outputs something like: 0.200000000000000011102230246251565404236316680908203125
Besides, the errors may accumulate when a sequence of operations is performed on imprecise values. Some operations, like summation and subtraction, are more sensitive to this kind of error than others, so it'd be better to avoid them if possible.
That seems to be the case here, where we can rewrite the OP's function into something like the following:
#include <iostream>
#include <iomanip>
#include <vector>
#include <algorithm>
#include <cassert>
#include <type_traits>
template <typename T = double>
auto make_knots(int m, int p = 0) // <- Note that I've changed the signature.
{
static_assert(std::is_floating_point_v<T>);
std::vector<T> knots(m + 1);
int range = m - 2 * p;
assert(range > 0);
for (int i = 1; i < m - p; i++)
{
knots[i + p] = T(i) / range; // <- Less prone to accumulate rounding errors
}
std::fill(knots.begin() + m - p, knots.end(), 1.0);
return knots;
}
template <typename T>
void verify(std::vector<T> const& v)
{
bool sum_is_one = true;
for (int i = 0, j = v.size() - 1; i <= j; ++i, --j)
{
if (v[i] + v[j] != 1.0) // <- That's a bold request for a floating point type
{
sum_is_one = false;
break;
}
}
std::cout << (sum_is_one ? "\n" : "Rounding errors.\n");
}
int main()
{
// For presentation purposes only
std::cout << std::setprecision(60) << 0.2 << '\n';
std::cout << std::setprecision(60) << 0.4 << '\n';
std::cout << std::setprecision(60) << 0.6 << '\n';
std::cout << std::setprecision(60) << 0.8 << "\n\n";
auto k1 = make_knots(11, 3);
for (auto i : k1)
{
std::cout << std::setprecision(60) << i << '\n';
}
verify(k1);
auto k2 = make_knots<float>(10);
for (auto i : k2)
{
std::cout << std::setprecision(60) << i << '\n';
}
verify(k2);
}
One solution to avoid drift (which I guess is your worry?) is to manually use rational numbers, for example in this case you might have:
// your input values for determining step
int m = 11;
int p = 3;
// pre-calculate any intermediate values, which won't have rounding issues
int divider = (m - 2 * p); // could be float or double instead of int
// input
int stepnumber = 1234; // could also be float or double instead of int
// output
float stepped_value = stepnumber * 1.0f / divider;
In other words, formulate your problem so that the step of your original code is always 1 internally (or whatever rational number you can represent exactly using two integers), so there is no rounding issue. If you need to display the value for the user, do it just for display: compute 1.0 / divider and round to a suitable number of digits.
I'm trying to improve the performance of surf.cpp. From line 140, you can find this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for (int k = 0; k < n; k++)
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2]) * f[k].w;
    return (float)d;
}
Running an Intel Advisor Vectorization analysis, it shows that "1 Data type conversions present" which could be inefficient (especially in vectorization).
But my question is: looking at this function, why would the authors have created d as double and then cast it to float? If they wanted a decimal number, float would be ok. The only reason that comes to my mind is that since double is more precise than float, it can represent smaller numbers; but the final value is big enough to be stored in a float. I didn't run any test on the value of d, though.
Any other possible reason?
Because the author wants higher precision during the calculation, rounding only the final result. This is the same as preserving more significant digits during the calculation.
More precisely, with addition and subtraction, errors can accumulate. This error can be considerable when a large number of floating-point values are involved.
You questioned the answer that says it's to use higher precision during the summation, saying you don't see why. That answer is correct. Consider this simplified version with completely made-up numbers:
#include <iostream>
#include <iomanip>
float w = 0.012345;
float calcFloat(const int* origin, int n)
{
    float d = 0;
    for (int k = 0; k < n; k++)
        d += origin[k] * w;
    return (float)d;
}
float calcDouble(const int* origin, int n)
{
    double d = 0;
    for (int k = 0; k < n; k++)
        d += origin[k] * w;
    return (float)d;
}
int main()
{
    int o[] = { 1111, 22222, 33333, 444444, 5555 };
    std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
    std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}
The results are:
6254.77979
6254.7793
So even though the inputs are the same in both cases, you get a different result using double for the intermediate summation. Changing calcDouble to use (double)w doesn't change the output.
This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.
This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:
In general:
Multiplication and division are “safe” operations
Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.
So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double instead of a float doesn't matter nearly as much: you will get something that is approximately as accurate as the float value you start with (as long as the result isn't very, very large or very, very small). But summing float values that can have very different orders of magnitude, even when the individual numbers themselves are representable as float, will accumulate errors and deviate further and further from the true answer.
To see that in action:
float f1 = 1e4, f2 = 1e-4;
std::cout << (f1 + f2) << '\n';
std::cout << (double(f1) + f2) << '\n';
Or equivalently, but closer to the original code:
float f1 = 1e4, f2 = 1e-4;
float f = f1;
f += f2;
double d = f1;
d += f2;
std::cout << f << '\n';
std::cout << d << '\n';
The result is:
10000
10000.0001
Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float.
I'm doing a school project. I do not understand why the sine comes out to -NaN after sin(90) and cos(120).
Can anyone help me understand this?
Also, when I put this in an online C++ editor it totally works, but when compiled on Linux it does not.
// Nick Garver
// taylorSeries
// taylorSeries.cpp
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;
const double PI = atan(1.0)*4.0;
double angle_in_degrees = 0;
double radians = 0;
double degreesToRadians(double d);
double factorial(double factorial);
double mySine(double x);
double myCosine(double x);
int main()
{
    cout << "\033[2J\033[1;1H";
    cout.width(4);  cout << left << "Deg";
    cout.width(9);  cout << left << "Radians";
    cout.width(11); cout << left << "RealSine";
    cout.width(11); cout << left << "MySin";
    cout.width(12); cout << left << "RealCos";
    cout.width(11); cout << left << "MyCos" << endl;
    while (angle_in_degrees <= 360)
    {
        double sine = sin(degreesToRadians(angle_in_degrees));
        double cosine = cos(degreesToRadians(angle_in_degrees));
        // output
        cout.width(4);  cout << left << angle_in_degrees;
        cout.width(9);  cout << left << degreesToRadians(angle_in_degrees);
        cout.width(11); cout << left << sine;
        cout.width(11); cout << left << mySine(degreesToRadians(angle_in_degrees));
        cout.width(12); cout << left << cosine;
        cout.width(11); cout << left << myCosine(degreesToRadians(angle_in_degrees)) << endl;
        angle_in_degrees = angle_in_degrees + 15;
    }
    cout << endl;
    return 0;
}
double degreesToRadians(double d)
{
    double answer;
    answer = (d * PI) / 180;
    return answer;
}
double mySine(double x)
{
    double result = 0;
    for (int i = 1; i <= 1000; i++) {
        if (i % 2 == 1)
            result += pow(x, i * 2 - 1) / factorial(i * 2 - 1);
        else
            result -= pow(x, i * 2 - 1) / factorial(i * 2 - 1);
    }
    return result;
}
double myCosine(double x)
{
    double positive = 0.0;
    double negative = 0.0;
    double result = 0.0;
    for (int i = 4; i <= 1000; i += 4)
    {
        positive = positive + (pow(x, i) / factorial(i));
    }
    for (int i = 2; i <= 1000; i += 4)
    {
        negative = negative + (pow(x, i) / factorial(i));
    }
    result = (1 - (negative) + (positive));
    return result;
}
double factorial(double factorial)
{
    float x = 1;
    for (float counter = 1; counter <= factorial; counter++)
    {
        x = x * counter;
    }
    return x;
}
(Marcus has good points; I am going to ramble in other directions...)
Look at the terms in a Taylor series. They become too small to make any difference after fewer than 10 terms. Asking for 1000 is asking for trouble.
Instead of going for 1000, go until the next term does not add anything, something like:
term = pow(x, i * 2 - 1) / factorial(i * 2 - 1);
if (result + term == result) { break; }
result += term;
The series would run much faster if you iteratively calculated the pow and factorial rather than starting over each time. (But, probably speed is not an issue at this point.)
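Something along these lines (my own sketch of the iterative idea, using the stopping test above):

double mySineIterative(double x) {
    double term = x;     // first term: x^1 / 1!
    double result = 0.0;
    int i = 1;
    while (result + term != result) {  // stop once the term no longer contributes
        result += term;
        // turn x^(2i-1)/(2i-1)! into x^(2i+1)/(2i+1)! -- no pow, no factorial
        term *= -x * x / ((2 * i) * (2 * i + 1));
        i++;
    }
    return result;
}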
Float has 24 bits of binary precision. Beginning perhaps with 13!, you will get roundoff errors in float. Double, on the other hand, has 53 bits of precision and will last until about 22! without roundoff errors. My point is that you should have done factorial() in double.
Another problem is that the computation of the Taylor series gets somewhat 'unstable' for bigger arguments. Intermediate terms become bigger than the end result, thereby leading to other roundoff errors. To avoid this, a common way to compute sine and cosine is to first fold to between -45 and +45 degrees. No unfolding, except maybe for the sign, is needed later.
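A rough sketch of such a fold (mine, and coarser than the ±45° fold just described: it only reduces the argument to [-pi, pi]):

#include <cmath>

double foldedSine(double x) {
    const double PI = atan(1.0) * 4.0;
    x = remainder(x, 2 * PI);    // std::remainder lands in [-pi, pi]
    return mySineIterative(x);   // the series converges quickly for small |x|
}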
As for why you had trouble on one system but not the other -- Different implementations handle NaN differently.
Once you have gotten the NaN out of the way, try computing the series in reverse order. This will lead to a different set of roundoff errors. Will it make your sin() closer to the real sin?
The 'real' sin is probably computed in hardware with 64-bit fixed-point arithmetic, and will be "correctly rounded" to 53 or 24 bits well over 99% of the time. (This, of course, depends on the chip manufacturer, hence my 'hand-waving' statement.)
To judge how 'close' your value is, you need to compute ULPs (units in the last place). This involves looking at the bits in the float/double. (Beyond the scope of this question.)
Sorry about the TMI.
Before I answer this, a few remarks:
It's always helpful for your own debugging to keep your code tidy. Remove unnecessary empty lines, make sure your bracketing style is uniform, and properly indent. I did this for you, but believe me, you'll avoid a lot of bugs if you keep up a consistent style!
you have functions that take double as input and return double, but internally just use float; that should be a red flag!
your whole degreesToRadians would be easier to read and only one third as long if you just used return (d*PI)/180;
Answers now:
in your factorial function, you calculate a factorial for values up to 1999. Hint: try to figure out the value of 1999! and look up the maximum number that float on your machine can hold. Then look up double's maximum. How many orders of magnitude is 1999! larger?
1999! is ca. 10^5732. That is a large number, with about 150 times as many orders of magnitude as a 32-bit float can hold, or still about 18 times as many as a 64-bit double can hold. To compare, storing 1999! in a double would be like trying to fit the distance from the sun's center to the earth's center into the typical 0.1 µm diameter of a bacterium.
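You can watch this happen with a few lines (a quick sketch of mine): float's running factorial becomes inf already at 35!, long before 1999!.

#include <iostream>

int main() {
    float f = 1.0f;
    for (int i = 1; i <= 40; i++) {
        f *= i;
        if (i >= 33) std::cout << i << "! = " << f << "\n";  // 34! still fits; 35! is inf
    }
}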