C++: Solution to this floating point error problem? - c++

This is an example of my code:
float a = 0.f;
float b = 5.f;
float increment = 0.1f;
while(a != b)
a+=increment;
This results in an infinite loop. Is there a solution to it, or is setting a tolerance the only way to solve this?

Avoid floating-point calculation when possible. In this case you can treat the numbers as integers by multiplying them by 10, and divide by 10 at the end.
float a, b, increment;
int a_i = 0;
int b_i = 50;
int increment_i = 1;
while(a_i != b_i)
a_i+=increment_i;
a = a_i / 10.f;
b = b_i / 10.f;
increment = increment_i / 10.f;
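A minimal sketch of the same idea, assuming the float value is needed inside the loop: the loop is driven by an exact integer counter and the float is derived from it on every iteration, so rounding never builds up in the loop condition. The names `LoopResult` and `walk_tenths` are hypothetical and only serve the illustration.

```cpp
#include <cassert>

// Hypothetical helper type: how many steps ran and the last float visited.
struct LoopResult {
    int steps;
    float last_value;
};

// Walks from start to end in integer "tenths" (5.0 is represented as 50).
// The termination test uses integers only, so it is exact.
LoopResult walk_tenths(int start_tenths, int end_tenths, int step_tenths) {
    assert(step_tenths > 0 && end_tenths >= start_tenths &&
           (end_tenths - start_tenths) % step_tenths == 0);
    LoopResult r{0, start_tenths / 10.0f};
    for (int i = start_tenths; i != end_tenths; i += step_tenths) {
        r.last_value = i / 10.0f;  // derived from the exact counter each time
        ++r.steps;
    }
    return r;
}
```

Calling `walk_tenths(0, 50, 1)` runs the body for the values 0.0, 0.1, ..., 4.9 and then stops, because the comparison never involves a rounded float.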

Related

Composite Simpson's Rule in C++

I've been trying to write a function to approximate the value of an integral using the Composite Simpson's Rule.
template <typename func_type>
double simp_rule(double a, double b, int n, func_type f){
int i = 1; double area = 0;
double n2 = n;
double h = (b-a)/(n2-1), x=a;
while(i <= n){
area = area + f(x)*pow(2,i%2 + 1)*h/3;
x+=h;
i++;
}
area -= (f(a) * h/3);
area -= (f(b) * h/3);
return area;
}
What I do is multiply each value of the function by either 2 or 4 (and h/3) with pow(2,i%2 + 1) and subtract off the edges as these should only have a weight of 1.
At first, I thought it worked just fine, however, when I compared it to my Trapezoidal Method function it was way more inaccurate which shouldn't be the case.
This is a simpler version of code I previously wrote, which had the same problem. I thought that if I cleaned it up a little the problem would go away, but alas. From another post, I get the idea that there's something going on with the types and the operations I'm doing on them which results in loss of precision, but I just don't see it.
Edit:
For completeness, I was running it for e^x from 0 to 1:
// function to be approximated
double f(double x){ double a = exp(x); return a; }
int main() {
int n = 11; //this method works best for odd values of n
double e = exp(1);
double exact = e-1; //value of integral of e^x from 0 to 1
cout << simp_rule(0,1,n,f) - exact;
}
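Simpson's Rule uses this approximation to estimate a definite integral:
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \dots + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n)\right]$$
where
$$h = \frac{b - a}{n}$$
and
$$x_i = a + i\,h,$$
so that there are n + 1 equally spaced sample points x_i.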
In the posted code, the parameter n passed to the function appears to be the number of points where the function is sampled (while in the previous formula n is the number of intervals, that's not a problem).
The (constant) distance between the points is calculated correctly
double h = (b - a) / (n - 1);
The while loop used to sum the weighted contributions of all the points iterates from x = a up to a point with an abscissa close to b, but probably not exactly b, due to rounding errors. This implies that the last calculated value of f, f(x_n), may be slightly different from the expected f(b).
This is nothing, though, compared to the error caused by the fact that the end points are summed inside the loop with a starting weight of 4 and then subtracted after the loop with weight 1, while all the inner points have their weights swapped. As a matter of fact, this is what the code calculates:
$$\frac{h}{3}\left[3f(x_0) + 2f(x_1) + 4f(x_2) + \dots + 4f(x_{n-2}) + 2f(x_{n-1}) + 3f(x_n)\right]$$
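The weights that the posted loop applies can be tabulated directly and compared with the Simpson weights. The sketch below is only an illustration: `posted_weights` and `simpson_weights` are hypothetical helpers, and `posted_weights` mirrors the `pow(2, i%2 + 1)` expression and the two subtractions from the question, expressed in units of h/3.

```cpp
#include <cassert>
#include <vector>

// Weights (in units of h/3) applied by the loop in the question,
// where n_points is the number of sample points passed as n.
std::vector<int> posted_weights(int n_points) {
    assert(n_points >= 2);
    std::vector<int> w;
    for (int i = 1; i <= n_points; ++i)
        w.push_back(i % 2 == 1 ? 4 : 2);  // same values as pow(2, i%2 + 1)
    w.front() -= 1;  // area -= f(a) * h/3
    w.back() -= 1;   // area -= f(b) * h/3
    return w;
}

// Weights (in units of h/3) of the composite Simpson's Rule
// for an even number of intervals.
std::vector<int> simpson_weights(int n_intervals) {
    assert(n_intervals >= 2 && n_intervals % 2 == 0);
    std::vector<int> w(n_intervals + 1, 2);
    for (int i = 1; i < n_intervals; i += 2) w[i] = 4;
    w.front() = 1;
    w.back() = 1;
    return w;
}
```

For five points, `posted_weights(5)` yields 3, 2, 4, 2, 3, whereas `simpson_weights(4)` yields 1, 4, 2, 4, 1.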
Also, using
pow(2, i%2 + 1)
to generate the sequence 4, 2, 4, 2, ..., 4 is a waste in terms of efficiency, and may add (depending on the implementation) other unnecessary rounding errors.
The following algorithm shows how to obtain the same (fixed) result, without a call to that library function.
template <typename func_type>
double simpson_rule(double a, double b,
int n, // Number of intervals
func_type f)
{
double h = (b - a) / n;
// Internal sample points, there should be n - 1 of them
double sum_odds = 0.0;
for (int i = 1; i < n; i += 2)
{
sum_odds += f(a + i * h);
}
double sum_evens = 0.0;
for (int i = 2; i < n; i += 2)
{
sum_evens += f(a + i * h);
}
return (f(a) + f(b) + 2 * sum_evens + 4 * sum_odds) * h / 3;
}
Note that this function requires the number of intervals to be passed, not the number of points (e.g. use 10 instead of 11 to sample the same points as OP's function).
The excellent accepted solution above could benefit from liberal use of std::fma() and from being templated on the floating-point type.
https://en.cppreference.com/w/cpp/numeric/math/fma
#include <cmath>
template <typename fptype, typename func_type>
fptype simpson_rule(fptype a, fptype b,
int n, // Number of intervals
func_type f)
{
fptype h = (b - a) / n;
// Internal sample points, there should be n - 1 of them
fptype sum_odds = 0.0;
for (int i = 1; i < n; i += 2)
{
sum_odds += f(std::fma(i,h,a));
}
fptype sum_evens = 0.0;
for (int i = 2; i < n; i += 2)
{
sum_evens += f(std::fma(i,h,a));
}
return (std::fma(2,sum_evens,f(a)) +
std::fma(4,sum_odds,f(b))) * h / 3;
}

Limited float precision and infinitely harmonic signal generation problem

Suppose we need to generate a very long harmonic signal, ideally infinitely long. At first glance, the solution seems trivial:
Sample1:
float t = 0;
while (runned)
{
float v = sinf(w * t);
t += dt;
}
Unfortunately, this solution does not work. Once t >> dt, the limited precision of float produces incorrect values. Fortunately, sin(2*PI*n + x) = sin(x) for any integer n, so it is not difficult to modify the example into an "infinite" version:
Sample2:
float t = 0;
float tau = 2 * M_PI / w;
while (runned)
{
float v = sinf(w * t);
t += dt;
if (t > tau) t -= tau;
}
For one physical simulation, I needed an infinite signal that is the sum of harmonic signals, like this:
Sample3:
float getSignal(float x)
{
float ret = 0;
for (int i = 0; i < modNum; i++)
ret += sin(w[i] * x);
return ret;
}
float t = 0;
while (runned)
{
float v = getSignal(t);
t += dt;
}
In this form, the code does not work correctly for large t, for the same reason as Sample1. The question is: how can I get an "infinite" implementation of the Sample3 algorithm? I assume that the solution should look like Sample2. A very important note: generally speaking, the w[i] are arbitrary and not harmonics, that is, the frequencies are not multiples of some base frequency, so I can't find a common tau. Using types with greater precision (double, long double) is not allowed.
Thanks for your advice!
You can choose an arbitrary tau and store the phase remainder for each mode when subtracting it from t (as @Damien suggested in the comments).
Also, representing the time as t = dt * it, where it is an integer, can improve numerical stability (I think).
Maybe something like this:
int ndt = 1000; // accumulate phase every 1000 steps for example
float tau = dt * ndt;
std::vector<float> phases(modNum, 0.0f);
int it = 0;
float t = 0.0f;
while (runned)
{
t = dt * it;
float v = 0.0f;
for (int i = 0; i < modNum; i++)
{
v += sinf(w[i] * t + phases[i]);
}
if (++it >= ndt)
{
it = 0;
for (int i = 0; i < modNum; ++i)
{
phases[i] = fmod(w[i] * tau + phases[i], 2 * M_PI);
}
}
}
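A closely related variant, sketched here under my own assumptions rather than taken from the answer, keeps one wrapped phase per mode and advances it by w[i] * dt on every sample, so no time variable grows at all. The class name `MultiToneGenerator` and its members are hypothetical. Each phase stays below 2*pi, so its resolution does not degrade over time, although the rounding error of every phase addition still accumulates slowly as a small frequency offset.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr float kTwoPi = 6.28318530718f;

// Hypothetical generator: sum of sines with arbitrary angular frequencies w[i].
class MultiToneGenerator {
public:
    MultiToneGenerator(const std::vector<float>& w, float dt)
        : phase_(w.size(), 0.0f) {
        assert(!w.empty());
        step_.reserve(w.size());
        for (float wi : w) step_.push_back(wi * dt);  // phase advance per sample
    }

    // Returns the current sample, then advances every phase by one step.
    float next() {
        float v = 0.0f;
        for (std::size_t i = 0; i < phase_.size(); ++i) {
            v += std::sin(phase_[i]);
            // Wrap so the stored phase never grows with elapsed time.
            phase_[i] = std::fmod(phase_[i] + step_[i], kTwoPi);
        }
        return v;
    }

    const std::vector<float>& phases() const { return phase_; }

private:
    std::vector<float> phase_;
    std::vector<float> step_;
};
```

A generator built as `MultiToneGenerator gen({1.0f, 2.5f, 7.3f}, 0.01f);` can then be polled with `gen.next()` inside the `while (runned)` loop.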

Integration with variable limits

I need to evaluate a double integral where the inner upper bound is variable:
integral2 between -5 and 5 ( integral1 between 0 and y f(x)dx )dy.
I'm stuck on the calculation of the outer loop, which depends on the inner loop. My code runs for a really long time but returns zero.
How can I calculate an integral with variable limits?
First I created a function doubleIntegrate. To begin with, the function holds the arrays with coefficients for the trapezoidal rule.
double NumericIntegrationDouble::doubleIntegrate(double (*doubleFunc)
(const double &x), double dy, const double &innerLowBound, const double
&outerLowBound)
{
double innerValue = 0.0;
double outerValue = 0.0;
// arrays which store function values for the inner (X) and the outer (Y) integration loop
// vector filled with coefficients for the inner loop (trapezoidal rule)
std::vector<double> vecCoeffsX(numberOfIntervalsDouble+1, 2);
vecCoeffsX[0] = 1; // first coeff = 1
vecCoeffsX[vecCoeffsX.size()-1] = 1; // last coeff = 1
std::vector<double> funcValuesX(numberOfIntervalsDouble+1);
// vector filled with coefficients for the outer loop (trapezoidal rule)
std::vector<double> vecCoeffsY(numberOfIntervalsDouble+1, 2);
vecCoeffsY[0] = 1; // same as above
vecCoeffsY[vecCoeffsY.size()-1] = 1; // same as above
std::vector<double> funcValuesY(numberOfIntervalsDouble+1);
// Then I created a loop in a loop, where dx and dy stand for the integration step sizes. The variables xi and yi stand for the current x and y values.
// outer integration loop dy
for(int i=0; i<=numberOfIntervalsDouble; i++)
{
double yi = outerLowBound + dy*i;
funcValuesY[i] = (*doubleFunc)(yi);
// inner integration loop dx
for(int j=0; j<=numberOfIntervalsDouble; j++)
{
double dx = abs(yi - innerLowBound) / (double)numberOfIntervalsDouble;
double xi = innerLowBound + j*dx;
funcValuesX[j] = (*doubleFunc)(xi);
double multValueX = std::inner_product(vecCoeffsX.begin(), vecCoeffsX.end(), funcValuesX.begin(), 0.0);
double innerValue = 0.5 * dx * multValueX;
suminnerValue = suminnerValue + innerValue;
}
//auto multValueY = std::inner_product(vecCoeffsY.begin(), vecCoeffsY.end(), funcValuesY.begin(), 0.0);
outerValue = 0.5 * dy * suminnerValue;
}
return outerValue;
}
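One way to organise such a computation, sketched here under my own assumptions and not as a fix of the code above, is to treat the inner integral as an ordinary function of y and integrate that function with the same rule. The helpers `trapezoid` and `integrate_variable_limit` are hypothetical names. The step is kept signed, so an upper limit below the lower limit gives the negated integral instead of being forced positive.

```cpp
#include <cassert>
#include <functional>

// Composite trapezoidal rule on [lo, hi] with n intervals.
// The step h is signed: if hi < lo the result is the negated integral.
double trapezoid(const std::function<double(double)>& g,
                 double lo, double hi, int n) {
    assert(n > 0);
    const double h = (hi - lo) / n;
    double sum = 0.5 * (g(lo) + g(hi));
    for (int i = 1; i < n; ++i) sum += g(lo + i * h);
    return sum * h;
}

// Integral over y in [outerLow, outerHigh] of the integral of f(x)
// over x in [innerLow, y]. Costs roughly n * n evaluations of f.
double integrate_variable_limit(const std::function<double(double)>& f,
                                double innerLow, double outerLow,
                                double outerHigh, int n) {
    auto inner = [&](double y) { return trapezoid(f, innerLow, y, n); };
    return trapezoid(inner, outerLow, outerHigh, n);
}
```

For f(x) = x, inner lower bound 0 and outer limits -5 and 5, the exact value is 125/3, which gives a simple check for the sketch. Note that for an integrand such as f(x) = 1 the exact result over these symmetric limits really is zero.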

c++ dividing two floats results in an int

I created a little program that is supposed to calculate pi using the first 26 iterations of the Leibniz formula in C++, just to see if it would work. When I ran the code, it output 4 instead of a floating point number. What is going on and how can I fix it? Here is the code:
#include <iostream>
#include <math.h>
using namespace std;
int main ()
{
float a = 1/1;
float b = 1/3;
float c = 1/5;
float d = 1/7;
float e = 1/9;
float f = 1/11;
float g = 1/13;
float h = 1/15;
float i = 1/17;
float j = 1/19;
float k = 1/21;
float l = 1/23;
float m = 1/25;
float n = 1/27;
float o = 1/29;
float p = 1/31;
float q = 1/33;
float r = 1/35;
float s = 1/37;
float t = 1/39;
float u = 1/41;
float v = 1/43;
float w = 1/45;
float x = 1/47;
float y = 1/49;
float z = 1/51;
float a1 = a-b+c-d+e-f+g-h+i-j+k-l+m-n+o-p+q-r+s-t+u-v+w-x+y-z;
float b1 = a1*4;
cout << b1;
}
Yes, I know there are much simpler ways to do this, but this is just a proof of concept.
When you use:
float b = 1/3;
the RHS of the assignment is evaluated using integer division, which results in 0. Every other variable therefore holds 0, except a, which holds 1.
In order to avoid that, use
float b = 1.0f/3;
or
float b = 1.0/3;
Make similar changes to all other statements.
Another way is to use a cast:
float a = (float)1/1; // C-style cast
or
float a = float(1)/1;
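The same 26 terms can also be summed in a loop, which keeps the division in floating point in a single place. This is only a sketch and `leibniz_pi` is a hypothetical name; the numerator is a float, so the int denominator is converted before the division happens.

```cpp
#include <cassert>

// With two int operands the division truncates, which is the original problem.
static_assert(1 / 3 == 0, "int / int is integer division");

// Approximates pi with the first `terms` terms of the Leibniz series
// 4 * (1 - 1/3 + 1/5 - 1/7 + ...).
float leibniz_pi(int terms) {
    assert(terms >= 0);
    float sum = 0.0f;
    float sign = 1.0f;
    for (int k = 0; k < terms; ++k) {
        sum += sign / (2 * k + 1);  // float / int: performed as a float division
        sign = -sign;
    }
    return 4.0f * sum;
}
```

With `leibniz_pi(26)` the result lands a few hundredths below pi, since the series converges slowly.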

An accumulated computing error in SSE version of algorithm of the sum of squared differences

I was trying to optimize the following code (sum of squared differences for two arrays):
inline float Square(float value)
{
return value*value;
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
float sum = 0;
for(size_t i = 0; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
So I optimized it using the CPU's SSE instructions:
#include <xmmintrin.h> // SSE intrinsics: __m128 and the _mm_*_ps functions
inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
{
__m128 _a = _mm_loadu_ps(a + i);
__m128 _b = _mm_loadu_ps(b + i);
__m128 _d = _mm_sub_ps(_a, _b);
sum = _mm_add_ps(sum, _mm_mul_ps(_d, _d));
}
inline float ExtractSum(__m128 a)
{
float _a[4];
_mm_storeu_ps(_a, a);
return _a[0] + _a[1] + _a[2] + _a[3];
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
size_t i = 0, alignedSize = size/4*4;
__m128 sums = _mm_setzero_ps();
for(; i < alignedSize; i += 4)
SquaredDifferenceSum(a, b, i, sums);
float sum = ExtractSum(sums);
for(; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
This code works fine if the arrays are not too large.
But if they are big enough, there is a large discrepancy between the results of the base function and its optimized version.
So my question is: where is the bug in the SSE-optimized code that leads to this computing error?
The error follows from the finite precision of floating point numbers.
Each addition of two floating point numbers has a rounding error that grows with the difference in magnitude between them.
In your scalar version of the algorithm the running sum becomes much greater than each term (if the arrays are big enough, of course).
So a large computing error accumulates.
The SSE version of the algorithm actually accumulates the result in four partial sums, so the gap between each of these sums and each term is about four times smaller than in the scalar code.
This leads to a smaller computing error.
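The effect can be reproduced without SSE. The sketch below is an illustration under my own assumptions: `naive_sum` and `four_lane_sum` are hypothetical helpers that add the same value repeatedly, once into a single float accumulator and once into four accumulators that mimic the four SSE lanes. It assumes IEEE-754 single precision arithmetic without extended-precision intermediates and without fast-math style reassociation.

```cpp
#include <cstddef>

// Adds `value` to a single float accumulator `count` times.
float naive_sum(std::size_t count, float value) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += value;
    return sum;
}

// Same additions, spread round-robin over four accumulators
// and combined at the end, like the four lanes of an __m128.
float four_lane_sum(std::size_t count, float value) {
    float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < count; ++i) lanes[i % 4] += value;
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Adding 1.0f twenty million times shows the difference: the single accumulator stops growing at 2^24 = 16777216, because 16777217 is not representable in float, while each of the four lanes only has to reach five million and stays exact.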
There are two ways to reduce this error:
1) Use double precision floating point numbers for the accumulated sum.
2) Use the Kahan summation algorithm (also known as compensated summation), which significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach.
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
Using the Kahan summation algorithm, your scalar code will look like this:
inline void KahanSum(float value, float & sum, float & correction)
{
float term = value - correction;
float temp = sum + term;
correction = (temp - sum) - term;
sum = temp;
}
float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
float sum = 0, correction = 0;
for(size_t i = 0; i < size; ++i)
KahanSum(Square(a[i] - b[i]), sum, correction);
return sum;
}
And the SSE-optimized code will look as follows:
inline void SquaredDifferenceKahanSum(const float * a, const float * b, size_t i,
__m128 & sum, __m128 & correction)
{
__m128 _a = _mm_loadu_ps(a + i);
__m128 _b = _mm_loadu_ps(b + i);
__m128 _d = _mm_sub_ps(_a, _b);
__m128 term = _mm_sub_ps(_mm_mul_ps(_d, _d), correction);
__m128 temp = _mm_add_ps(sum, term);
correction = _mm_sub_ps(_mm_sub_ps(temp, sum), term);
sum = temp;
}
float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
size_t i = 0, alignedSize = size/4*4;
__m128 sums = _mm_setzero_ps(), corrections = _mm_setzero_ps();
for(; i < alignedSize; i += 4)
SquaredDifferenceKahanSum(a, b, i, sums, corrections);
float sum = ExtractSum(sums), correction = 0;
for(; i < size; ++i)
KahanSum(Square(a[i] - b[i]), sum, correction);
return sum;
}