Floating point math rounding weird in C++ compared to Mathematica - c++

The following post is solved; the problem occurred because of a misinterpretation of the formula on http://www.cplusplus.com/reference/random/piecewise_constant_distribution/ The reader is strongly encouraged to consult this page instead: http://en.cppreference.com/w/cpp/numeric/random/piecewise_constant_distribution
I have the following strange phenomenon which puzzles me:
I have a piecewise constant probability density given as
using RandomGenType = std::mt19937_64;
RandomGenType gen(51651651651);

using PREC = long double;
std::array<PREC,5> intervals {0.59, 0.7, 0.85, 1, 1.18};
std::array<PREC,4> weights   {1.36814, 1.99139, 0.29116, 0.039562};

// integral over the pdf to normalize:
PREC normalization = 0;
for(unsigned int i = 0; i < 4; i++){
    normalization += weights[i]*(intervals[i+1]-intervals[i]);
}
std::cout << std::setprecision(30) << "Normalization: " << normalization << std::endl;

// normalize all weights (such that the integral gives 1)!
for(auto & w : weights){
    w /= normalization;
}

std::piecewise_constant_distribution<PREC>
    distribution(intervals.begin(), intervals.end(), weights.begin());
When I draw n random numbers (sphere radii in millimeters) from this distribution, compute the mass of each sphere, and sum them up like this:
unsigned int n = 1000000;
double density = 2400;
double mass = 0;
for(unsigned int i = 0; i < n; i++){
    auto d = 2 * distribution(gen) * 1e-3;    // diameter in meters
    mass += d*d*d/3.0*M_PI_2*density;         // (pi/6) * d^3 * density
}
I get mass = 4.3283 kg (see LIVE here)
Doing the exact same thing in Mathematica (code below)
gives the presumably correct value of 4.5287 kg (see mathematica).
The two do not match, and with different seeds C++ and Mathematica never match either. Is that numeric inaccuracy? I doubt it is...
Question: What the heck is wrong with the sampling in C++?
Simple Mathematica Code:
pdf[r_] = 2*Piecewise[{{0, r < 0.59}, {1.36814, 0.59 <= r <= 0.7},
{1.99139, Inequality[0.7, Less, r, LessEqual, 0.85]},
{0.29116, Inequality[0.85, Less, r, LessEqual, 1]},
{0.039562, Inequality[1, Less, r, LessEqual, 1.18]},
{0, r > 1.18}}];
pdfr[r_] = pdf[r] / Integrate[pdf[r], {r, 0, 3}];(*normalize*)
p = Plot[pdf[r], {r, 0.4, 1.3}, Filling -> Axis]
PDFr = ProbabilityDistribution[pdfr[r], {r, 0, 1.18}];
(*if you put 1.18=2 then we dont get 4.52??*)
SeedRandom[100, Method -> "MersenneTwister"]
dataR = RandomVariate[PDFr, 1000000, WorkingPrecision -> MachinePrecision];
Fold[#1 + (2*#2*10^-3)^3 Pi/6 2400 &, 0, dataR]
(*Analytical Solution*)
PDFr = ProbabilityDistribution[pdfr[r], {r, 0, 3}];
1000000 Integrate[ 2400 (2 InverseCDF[PDFr, p] 10^-3)^3 Pi/6, {p, 0, 1}]
Update:
I did some analysis:
Read in the numbers (64-bit doubles) generated by Mathematica into C++, calculated the sum, and it gives the same result as Mathematica:
Mass computed by reduction: 4.52528010260687096888432279229
Read in the numbers generated by C++ (64-bit doubles) into Mathematica, calculated the sum, and it gives the same value, 4.32402.
I almost conclude that the sampling with std::piecewise_constant_distribution is inaccurate (or as accurate as it gets with 64-bit floats) or has a bug... OR there is something wrong with my weights?
Densities are calculated wrongly by std::piecewise_constant_distribution in http://coliru.stacked-crooked.com/a/ca171bf600b5148f ===> It seems to be a bug!
Histogram plot of the C++-generated values compared to the wanted distribution:
file = NotebookDirectory[] <> "numbersCpp.bin";
dataCPP = BinaryReadList[file, "Real64"];
Hpdf = HistogramDistribution[dataCPP];
h = DiscretePlot[ PDF[ Hpdf, x], {x, 0.4, 1.2, 0.001},
PlotStyle -> Red];
Show[h, p, PlotRange -> All]
The file is generated here: Number generation CPP

It seems that the formula for the probabilities is written wrongly for std::piecewise_constant_distribution on
http://www.cplusplus.com/reference/random/piecewise_constant_distribution/
There, the weights are summed without being multiplied by the interval lengths. The correct formula is given on
http://en.cppreference.com/w/cpp/numeric/random/piecewise_constant_distribution
the density on [b_i, b_{i+1}) is w_i / S, where S = sum_k w_k * (b_{k+1} - b_k).
This explains every quirk previously suspected to be a bug or a floating-point error.
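To see which normalization an implementation actually applies, you can inspect the distribution's densities() member; it should return w_i / S with S as above. A minimal check (my own snippet, reusing the intervals and raw weights from the question):

#include <array>
#include <cstddef>
#include <iostream>
#include <random>

int main()
{
    std::array<double,5> intervals {0.59, 0.7, 0.85, 1, 1.18};
    std::array<double,4> weights   {1.36814, 1.99139, 0.29116, 0.039562};

    std::piecewise_constant_distribution<double>
        dist(intervals.begin(), intervals.end(), weights.begin());

    // S = sum_k w_k * (b_{k+1} - b_k), the normalization required by the standard.
    double S = 0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        S += weights[i] * (intervals[i+1] - intervals[i]);

    // Compare the densities the distribution actually uses with w_i / S.
    auto d = dist.densities();
    for (std::size_t i = 0; i < d.size(); ++i)
        std::cout << d[i] << " vs " << weights[i] / S << "\n";
}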

Mathematica may or may not use IEEE 754 floating point numbers. From the Wolfram documentation:
The Wolfram Language has sophisticated built-in automatic numerical precision and accuracy control. But for special-purpose optimization of numerical computations, or for studying numerical analysis, the Wolfram Language also allows detailed control over precision and accuracy.
and
The Wolfram Language handles both integers and real numbers with any number of digits, automatically tagging numerical precision when appropriate. The Wolfram Language internally uses several highly optimized number representations, but nevertheless provides a uniform interface for digit and precision manipulation, while allowing numerical analysts to study representation details when desired.

Related

Calculating sine and cosine in one shot

I have a scientific code that uses both the sine and the cosine of the same argument (I basically need the complex exponential of that argument). I was wondering whether it is possible to do this faster than calling the sine and cosine functions separately.
Also, I only need about 0.1% precision. So is there any way I can take the default trig functions and truncate the power series for speed?
One other thing I have in mind: is there any way to perform the remainder operation such that the result is always positive? In my own algorithm I used x = fmod(x, 2*pi); but then I would need to add 2*pi if x is negative (a smaller domain means I can use a shorter power series).
EDIT: A LUT turned out to be the best approach for this; however, I am glad I learned about the other approximation techniques. I also advise using an explicit midpoint approximation. This is what I ended up doing:
const double pi = 3.14159265358979323846;
const int N = 10000;   // about 3e-4 error for 1000, 3e-5 for 10 000, 3e-6 for 100 000
double *cs = new double[N];
double *sn = new double[N];
for(int i = 0; i < N; i++){
    double A = (i + 0.5) * 2 * pi / N;   // midpoint of the i-th table cell
    cs[i] = cos(A);
    sn[i] = sin(A);
}
The following part approximates (midpoint) sincos(2*pi*(wc2+t[j]*(cotp*t[j]-wc))):
double A = (wc2 + t[j]*(cotp*t[j] - wc));
int B = (int)(N*(A - floor(A)));   // table index for the fractional part of A
re += cs[B]*f[j];
im += sn[B]*f[j];
Another approach could have been using the Chebyshev decomposition. You can use the orthogonality property to find the coefficients. Optimized for the exponential, it looks like this:
double fastsin(double x){
    x = x - floor(x/2/pi)*2*pi - pi;   // this line can be improved, both inside this
                                       // function and before you input x into the function
    double x2 = x*x;
    return (((0.00015025063885163012*x2 - 0.008034350857376128)*x2
             + 0.1659789684145034)*x2 - 0.9995812174943602)*x;   // 7th order Chebyshev approx
}
If you seek fast evaluation with good (but not high) accuracy from a power series, you should use an expansion in Chebyshev polynomials: tabulate the coefficients (you'll need VERY few for 0.1% accuracy) and evaluate the expansion with the recursion relations for these polynomials (it's really very easy).
References:
Tabulated coefficients: http://www.ams.org/mcom/1980-34-149/S0025-5718-1980-0551302-5/S0025-5718-1980-0551302-5.pdf
Evaluation of chebyshev expansion: https://en.wikipedia.org/wiki/Chebyshev_polynomials
You'll need to (a) get the "reduced" argument into the range -pi/2..+pi/2 and (b) handle the sign in your results when the argument actually should have been in the "other" half of the full elementary interval -pi..+pi. These aspects should not pose a major problem:
1. Determine (and "remember" as an integer 1 or -1) the sign of the original angle and proceed with the absolute value.
2. Use a modulo function to reduce to the interval 0..2*pi.
3. Determine (and "remember" as an integer 1 or -1) whether the angle is in the "second" half and, if so, subtract pi*3/2, otherwise subtract pi/2. Note: this effectively interchanges sine and cosine (apart from signs); take this into account in the final evaluation.
This completes the steps to get an angle in -pi/2..+pi/2.
After evaluating sine and cosine with the Chebyshev expansions, apply the "flags" of steps 1 and 3 above to get the right signs in the values.
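To make the recipe concrete, here is a minimal C++ sketch of steps 1-3 as I read them (my own illustration, not code from the answer). cheb_sin and cheb_cos stand in for your Chebyshev approximations on [-pi/2, +pi/2]; here they are just the library functions so the sketch can be checked as-is:

#include <cmath>

static double cheb_sin(double y) { return std::sin(y); }   // placeholder for the Chebyshev sine
static double cheb_cos(double y) { return std::cos(y); }   // placeholder for the Chebyshev cosine

void sincos_reduced(double x, double& s_out, double& c_out)
{
    const double pi = 3.14159265358979323846;

    // Step 1: remember the sign, continue with the absolute value.
    int sign = (x < 0.0) ? -1 : 1;
    x = std::fabs(x);

    // Step 2: reduce to [0, 2*pi).
    x = std::fmod(x, 2.0 * pi);

    // Step 3: remember which half we are in, shift into [-pi/2, +pi/2).
    int half = (x < pi) ? 1 : -1;
    double y = (half == 1) ? x - 0.5 * pi : x - 1.5 * pi;

    // The shift interchanges sine and cosine: sin(x) = half*cos(y), cos(x) = -half*sin(y).
    // The original sign only affects the sine, since cosine is even.
    s_out = sign * half * cheb_cos(y);
    c_out = -half * cheb_sin(y);
}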
Just create a lookup table. The following will let you look up the sin and cos of any radian value between -2*PI and 2*PI.
// LOOK UP TABLE
var LUT_SIN_COS = [];
var N = 14400;
var HALF_N = N >> 1;
var STEP = 4 * Math.PI / N;
var INV_STEP = 1 / STEP;
// BUILD LUT
for(var i = 0, r = -2*Math.PI; i < N; i++, r += STEP) {
    LUT_SIN_COS[2*i] = Math.sin(r);
    LUT_SIN_COS[2*i + 1] = Math.cos(r);
}
You index into the lookup table by:
var index = ((r * INV_STEP) + HALF_N) << 1;
var sin = LUT_SIN_COS[index];
var cos = LUT_SIN_COS[index + 1];
Here's a fiddle that displays the % error you can expect from different sized LUTS http://jsfiddle.net/77h6tvhj/
EDIT: Here's an ideone (C++) with a ~benchmark~ vs the float sin and cos: http://ideone.com/SGrFVG For whatever a benchmark on ideone.com is worth, the LUT is 5 times faster.
One way to go would be to learn how to implement the CORDIC algorithm. It is not difficult and pretty interesting intellectually. This gives you both the cosine and the sine. Wikipedia gives a MATLAB example that should be easy to adapt to C++.
Note that you can increase speed and reduce precision simply by lowering the number of iterations n.
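For reference, here is a minimal and deliberately unoptimized C++ sketch of CORDIC in rotation mode (my own illustration, valid for |theta| <= pi/2, so reduce the argument first). A real implementation would precompute the atan(2^-i) table and the gain instead of calling atan and sqrt inside the loop:

#include <cmath>

void cordic_sincos(double theta, int n, double& s, double& c)
{
    double x = 1.0, y = 0.0, z = theta;
    double K = 1.0;                               // accumulates prod 1/sqrt(1 + 2^-2i)
    for (int i = 0; i < n; ++i) {
        double d = (z >= 0.0) ? 1.0 : -1.0;       // rotate toward z = 0
        double xn = x - d * std::ldexp(y, -i);    // x - d*y*2^-i
        double yn = y + d * std::ldexp(x, -i);    // y + d*x*2^-i
        z -= d * std::atan(std::ldexp(1.0, -i));  // z - d*atan(2^-i)
        x = xn;
        y = yn;
        K /= std::sqrt(1.0 + std::ldexp(1.0, -2 * i));
    }
    c = K * x;   // ~cos(theta), error shrinks roughly as 2^-n
    s = K * y;   // ~sin(theta)
}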
About your second question, it has already been asked here (in C). It seems that there is no simple way.
You can also calculate sine using a square root, given the angle and the cosine.
The example below assumes the angle ranges from 0 to 2π:
double c = cos(angle);
double s = sqrt(1.0 - c*c);
if(angle > pi) s = -s;   // sine is negative on (pi, 2*pi)
For single-precision floats, Microsoft uses a degree-11 polynomial approximation for sine and degree-10 for cosine: XMScalarSinCos.
They also have a faster version, XMScalarSinCosEst, that uses lower-degree polynomials.
If you aren't on Windows, you'll find the same code and coefficients on geometrictools.com under the Boost license.

The result of my own double precision cos() implementation in a shader is NaN, but it works well on the CPU. What is going wrong?

As I said, I want to implement my own double precision cos() function in a compute shader with GLSL, because there is only a built-in version for float.
This is my code:
double faculty[41];   // values are calculated at the beginning of main()

double myCOS(double x)
{
    double sum, tempExp, sign;
    sum = 1.0;
    tempExp = 1.0;
    sign = -1.0;
    for(int i = 1; i <= 30; i++)
    {
        tempExp *= x;
        if(i % 2 == 0){
            sum = sum + (sign * (tempExp / faculty[i]));
            sign *= -1.0;
        }
    }
    return sum;
}
The result of this code is that the sum turns out to be NaN on the shader, but on the CPU the algorithm works well.
I tried to debug this code and I got the following information:
faculty[i] is positive and not zero for all entries
tempExp is positive in each step
none of the other variables is NaN during any step
the first time sum is NaN is at the step with i=4
And now my question: what exactly can go wrong if each variable is a number and nothing is divided by zero, especially when the algorithm works on the CPU?
Let me guess:
First, you determined that the problem is in the loop, and you use only the following operations: +, *, /.
The rules for generating NaN from these operations are:
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
You ruled out the possibility for 0/0 and ±∞/±∞ by stating that faculty[] is correctly initialized.
The variable sign is always 1.0 or -1.0 so it cannot generate the NaN through the * operation.
What remains is the + operation, if tempExp ever becomes ±∞.
So probably x is too large on entry to your function; tempExp becomes ±∞, and this makes sum become ±∞ too. At the next iteration you trigger the NaN-generating operation ∞ + (−∞), because one side of the addition is multiplied by sign, and sign switches between positive and negative for each term.
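A tiny stand-alone illustration of this failure mode (not the shader code itself): once a value overflows to infinity, the alternating-sign accumulation produces ∞ + (−∞), which is NaN:

#include <cstdio>

int main()
{
    double tempExp = 1e308;
    tempExp *= 10.0;                       // overflows to +inf
    double sum = 1.0;
    sum += tempExp;                        // sum is now +inf
    sum += -1.0 * tempExp;                 // inf + (-inf) -> NaN
    std::printf("%g %g\n", tempExp, sum);  // tempExp is inf, sum is NaN
}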
You're trying to approximate cos(x) around 0.0, so you should use the properties of the cos() function to reduce your input value to something near 0.0, ideally in the range [0, pi/4]. For instance, remove multiples of 2*pi, get the values of cos() on [pi/4, pi/2] by computing sin() around 0.0, and so on.
What can go dramatically wrong is a loss of precision. cos(x) usually is implemented by range reduction followed by a dedicated implementation for the range [0, pi/2]. Range reduction uses cos(x+2*pi) = cos(x). But this range reduction isn't perfect. For starters, pi cannot be exactly represented in finite math.
Now what happens if you try something as absurd as cos(1<<30) ? It's quite possible that the range reduction algorithm introduces an error in x that's larger than 2*pi, in which case the outcome is meaningless. Returning NaN in such cases is reasonable.

Oh where has my precision gone with OpenMesh vector arithmetic?

Using doubles, I would expect about 15 decimal digits of precision. I know that many decimal numbers are not exactly representable in floating point notation, so I would get an approximation for 1/3, for example. However, using a double I would expect an approximation that is correct to about 15 decimal digits. I would also expect to retain that level of accuracy when doing arithmetic.
However, in the following example I try to calculate the area of a triangle using Heron's formula and OpenMesh::Vec3d, which is backed by OpenMesh::VectorDataT<double,3>, and end up with a result that is only accurate to 5 decimal digits.
The correct result is area = 8.19922e-8, but I'm getting area = 8.1992238711962083e-8. Any ideas where this is coming from?
The suggestion that this might result from the instability in Heron's formula is a good one, but unfortunately that is not the case in this example. I have added code which calculates the stable variant of Heron's formula, for those who might be interested. In this example, u.norm() > v.norm() > w.norm().
#include <OpenMesh/Core/Mesh/PolyMesh_ArrayKernelT.hh>

int main()
{
    // triangle vertices
    OpenMesh::Vec3d x(0.051051, 0.057411, 0.001355);
    OpenMesh::Vec3d y(0.050981, 0.057337, -0.000678);
    OpenMesh::Vec3d z(0.050949, 0.057303, 0.0);

    // edge vectors
    OpenMesh::Vec3d u = x-y;
    OpenMesh::Vec3d v = x-z;
    OpenMesh::Vec3d w = y-z;

    // Heron's formula
    double semiP = (u.norm() + v.norm() + w.norm())/2.0;
    double area = sqrt(semiP * (semiP - u.norm()) * (semiP - v.norm()) * (semiP - w.norm()));

    // Heron's formula for small angles
    double areaSmall = sqrt((u.norm() + (v.norm() + w.norm()))
                          * (w.norm() - (u.norm() - v.norm()))
                          * (w.norm() + (u.norm() - v.norm()))
                          * (u.norm() + (v.norm() - w.norm())))/4.0;
}
Heron's formula is numerically unstable. If you have a very "flat" triangle with small angles, the sum of the two small sides is almost the long side, so one of the terms gets very small. If, for example, a and b are the small sides,
(s - c)
will be very small, because
s = (a + b + c)/2
is nearly equal to c.
The Wikipedia article about Heron's formula mentions a stable alternative:
Arrange the sides such that a > b > c and use
A = 1/4*sqrt((a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c)))
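For illustration, a minimal C++ version of that stable form (my own sketch; the question's areaSmall line is the same expression written out with the OpenMesh norms):

#include <algorithm>
#include <cmath>

// Stable Heron: sort so that a >= b >= c, then evaluate the grouped expression.
double stable_heron(double a, double b, double c)
{
    if (a < b) std::swap(a, b);
    if (b < c) std::swap(b, c);
    if (a < b) std::swap(a, b);
    return 0.25 * std::sqrt((a + (b + c)) * (c - (a - b))
                          * (c + (a - b)) * (a + (b - c)));
}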
To 75 decimal places, the correct area of your triangle is
0.000000081992238711963087279421583920293974467992093148008322378721298327364.
If I replace the nine double constants you have with their decimal equivalents, I get
0.000000081992238711965902754749500279615357792172906541206211853522524016959
It would appear that you are not getting what you're expecting because you're expecting something unreasonable.
Any calculation involving the subtraction of values that are close to each other will result in a loss of precision. How many significant digits do you expect from this subtraction?
1.23456789012345
- 1.23456789000000
----------------
0.00000000012345
Both operands have 15 digits of precision, but the result only has 5.
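A quick way to see this in C++, using the two constants from the example above (my own snippet):

#include <iomanip>
#include <iostream>

int main()
{
    double a = 1.23456789012345;
    double b = 1.23456789000000;
    // Only the leading ~5-6 digits of the printed difference are meaningful;
    // the rest is rounding noise from representing a and b as doubles.
    std::cout << std::setprecision(17) << a - b << "\n";
}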

Quadrature routines for probability densities

I want to integrate a probability density function from (-\infty, a] because the cdf is not available in closed form. But I'm not sure how to do this in C++.
This task is pretty simple in Mathematica; all I need to do is define the function,
f[x_, lambda_, alpha_, beta_, mu_] :=
Module[{gamma},
gamma = Sqrt[alpha^2 - beta^2];
(gamma^(2*lambda)/((2*alpha)^(lambda - 1/2)*Sqrt[Pi]*Gamma[lambda]))*
Abs[x - mu]^(lambda - 1/2)*
BesselK[lambda - 1/2, alpha Abs[x - mu]] E^(beta (x - mu))
];
and then call the NIntegrate Routine to numerically integrate it.
F[x_, lambda_, alpha_, beta_, mu_] :=
NIntegrate[f[t, lambda, alpha, beta, mu], {t, -\[Infinity], x}]
Now I want to achieve the same thing in C++. I am using the routine gsl_integration_qagil from the GSL numerics library. It is designed to integrate functions over semi-infinite intervals (-\infty, a], which is just what I want. But unfortunately I can't get it to work.
This is the density function in C++,
double density(double x)
{
    using namespace boost::math;
    if(x == _mu)
        return std::numeric_limits<double>::infinity();
    return pow(_gamma, 2*_lambda) / (pow(2*_alpha, _lambda - 0.5)*sqrt(_pi)*tgamma(_lambda))
           * pow(abs(x - _mu), _lambda - 0.5)
           * cyl_bessel_k(_lambda - 0.5, _alpha*abs(x - _mu))
           * exp(_beta*(x - _mu));
}
Then I try to integrate to obtain the cdf by calling the GSL routine.
double cdf(double x)
{
    gsl_integration_workspace * w = gsl_integration_workspace_alloc(1000);

    double result, error;
    gsl_function F;
    F.function = &density;

    double epsabs = 0;
    double epsrel = 1e-12;
    gsl_integration_qagil(&F, x, epsabs, epsrel, 1000, w, &result, &error);

    printf("result = % .18f\n", result);
    printf("estimated error = % .18f\n", error);
    printf("intervals = %zu\n", w->size);

    gsl_integration_workspace_free(w);
    return result;
}
However, gsl_integration_qagil returns an error: the number of iterations was insufficient.
double mu = 0.0f;
double lambda = 3.0f;
double alpha = 265.0f;
double beta = -5.0f;
cout << cdf(0.01) << endl;
If I increase the size of the workspace, then the Bessel function will not evaluate.
I was wondering if anyone could give me some insight into my problem. A call to the corresponding Mathematica function F above with x = 0.01 returns 0.904384.
Could it be that the density is concentrated in a very small interval (i.e. outside of [-0.05, 0.05] the density is almost 0, as a plot of the pdf shows)? If so, what can be done about this? Thanks for reading.
Re: integrating to +/- infinity:
I would use Mathematica to find an empirical bound for |x - μ| >> K, where K represents the "width" around the mean, and K is a function of alpha, beta, and lambda -- for example, F is less than and approximately equal to a*(x-μ)^(-2) or a*e^(-b*(x-μ)^2) or whatever. These functions have known integrals out to infinity, which you can evaluate empirically. Then you can integrate numerically out to K, and use the bounded approximation to get from K to infinity.
Figuring out K may be a bit tricky; I'm not very familiar with Bessel functions so I can't help you much there.
In general, I've found that for numerical calculation that's not obvious, the best way is to do as much analytical math as you can before you do numerical evaluation. (Kind of like an autofocus camera -- get it close to where you want, then let the camera do the rest.)
I haven't tried the C++ code, but from checking out the function in Mathematica, it does seem extremely peaked around mu, with the spread of the peak determined by the parameters lambda, alpha, beta.
What I would do is a preliminary search of the pdf: look to the right and to the left of x = mu until you find the first value below a given tolerance. Use these as the bounds for your cdf, instead of negative infinity.
Pseudo code follows:
xtest = mu
step = 0.000001
adaptive_step(y_value) -> returns a small step size if y_value is close to 0, and larger if far
pdf_current = pdf(xtest)
while (pdf_current > tolerance):
    step = adaptive_step(pdf_current)
    xtest = xtest - step
    pdf_current = pdf(xtest)
left_bound = xtest
// repeat, stepping to the right of mu, to find the right bound
Given how tightly peaked this function seems to be, tightening the bounds would probably save you a lot of computer time that's currently wasted calculating zeros. Also, you'd be able to use a bounded integration routine, rather than integrating from -\infty to b.
Just a thought...
PS: Mathematica gives me F[0.01, 3, 265, -5, 0] = 0.884505
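As a sketch of that suggestion (my own illustration, untested against the original code): once left_bound has been found, integrate over the finite interval [left_bound, x] with the adaptive routine gsl_integration_qags instead of the semi-infinite gsl_integration_qagil. The pdf below is only a placeholder with GSL's expected double(double, void*) signature, and the tolerances are arbitrary:

#include <cmath>
#include <cstdio>
#include <gsl/gsl_integration.h>

// Placeholder pdf so the sketch is self-contained; substitute the real density.
static double density(double x, void* /*params*/)
{
    return 0.3989422804014327 * std::exp(-0.5 * x * x);   // standard normal pdf
}

double cdf_bounded(double x, double left_bound)
{
    gsl_integration_workspace* w = gsl_integration_workspace_alloc(1000);

    gsl_function F;
    F.function = &density;
    F.params   = nullptr;

    double result = 0.0, error = 0.0;
    gsl_integration_qags(&F, left_bound, x, 0.0, 1e-8, 1000, w, &result, &error);

    gsl_integration_workspace_free(w);
    return result;
}

int main()
{
    // With the placeholder normal pdf and left_bound = -8, this approximates Phi(0.01).
    std::printf("%.12f\n", cdf_bounded(0.01, -8.0));
}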
I found a complete description of this GSL routine here: http://linux.math.tifr.res.in/manuals/html/gsl-ref-html/gsl-ref_16.html; you may find useful information there.
Since I'm not a GSL expert I did not focus on your problem from the math point of view, but rather I want to remind you of some key aspects of floating point programming.
You can't represent all numbers exactly using the IEEE 754 standard. Mathematica can hide this fact by using arbitrary-precision number representations in order to give you results that appear free of rounding error; this is also the reason why it is slow compared to native code.
I strongly recommend this link to anyone involved in scientific computation using an FPU:
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Assuming you've enjoyed that article, I noticed this on the GSL link above: "The routines will fail to converge if the error bounds are too stringent".
Your bounds may be too stringent if the requested tolerance asks for more accuracy than double precision can deliver; compare it against the machine epsilon of double, std::numeric_limits<double>::epsilon().
In addition, remember from the second link that every floating point operation rounds its result (and conversions from floating point to integer truncate); this introduces subtle errors that can lead to wrong results. I had this problem with a simple Liang-Barsky line clipper, which is only first order! So imagine the mess in this line:
return pow(_gamma, 2*_lambda)/(pow(2*_alpha, _lambda-0.5)*sqrt(_pi)*tgamma(_lambda))* pow(abs(x-_mu), _lambda - 0.5) * cyl_bessel_k(_lambda-0.5, _alpha*abs(x - _mu)) * exp(_beta*(x - _mu));
As a general rule in C/C++, it is wise to add additional variables holding intermediate results, so you can debug step by step and spot any rounding error; you shouldn't cram an expression like this one into a single statement in any native programming language. You can't hand-optimize variable usage better than the compiler can, so the extra temporaries cost nothing.
Finally, as a general rule, you should multiply everything first and then divide, unless you are confident about the dynamic range of your computation.
Good luck.

Accurate summation of nested products

I would like to reduce numerical floating-point errors in the following computation.
I have an equation of the following form:
b_3+w_3*(b_2+w_2*(b_1+w_1*(b_0+w_0)))
where the variable w represents some floating-point number in the range [0,1] and b represents a floating-point constant in the range [1,~1000000]. b increases monotonically with subscript (though this may not be important). Naturally, this could be extended to any number of terms:
b_4+w_4*(b_3+w_3*(b_2+w_2*(b_1+w_1*(b_0+w_0))))
This can be defined recursively as:
func(x, n):
    if (n == MAX)
        return x
    else
        return func(b[n] + x*w[n], n+1)

func(1, 0)
If I were doing an online summation, I could use the Kahan Summation Algorithm (Kahan 1965), or one of several other methods à la Higham 1993 or McNamee 2004, to bound the size of my errors. If I were doing online repeated products, I could use some sort of conversion technique to reduce the problem to summation.
As it is, I'm not sure how to approach this particular problem. Does anyone have thoughts (and citations to go with them)?
Thanks!
Higham 1993. "The accuracy of floating point summation". SIAM Journal on Scientific Computing.
Kahan 1965. "Pracniques: further remarks on reducing truncation errors". CACM. doi:10.1145/363707.363723.
McNamee 2004. "A comparison of methods for accurate summation". SIGSAM Bull. doi:10.1145/980175.980177.
Your computation looks similar to a Horner scheme, except that instead of a single variable x, there are different weights w[i] being used at every stage.
There are algorithms for compensated Horner schemes which I think you could adapt for your purposes. See for example theorem 3 and algorithm 2 in the following paper.
P. Langlois, How to Ensure a Faithful Polynomial Evaluation with the Compensated Horner Algorithm. 18th IEEE Symposium on Computer Arithmetic, 25 - 27 June 2007, ARITH '07, pp. 141-149,
http://www.acsel-lab.com/arithmetic/papers/ARITH18/ARITH18_Langlois.pdf
If in Algorithm 2 you replace TwoProd(s[i+1], x) with TwoProd(s[i+1], w[i+1]), it seems you would get the desired result, but I have not tried it.
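For illustration, here is a minimal sketch of that idea (my own adaptation, not the paper's exact Algorithm 2). The error-free transformations TwoSum and TwoProd (via FMA) capture the rounding error of each stage, and the accumulated error is propagated through the same weighted recurrence:

#include <cmath>
#include <cstddef>
#include <vector>

struct EFT { double value, error; };

// Error-free sum: a + b == value + error exactly (Knuth's TwoSum).
static EFT TwoSum(double a, double b)
{
    double s = a + b;
    double bp = s - a;
    double e = (a - (s - bp)) + (b - bp);
    return {s, e};
}

// Error-free product via FMA: a * b == value + error exactly.
static EFT TwoProd(double a, double b)
{
    double p = a * b;
    double e = std::fma(a, b, -p);
    return {p, e};
}

// Compensated evaluation of s = b[N-1] + w[N-1]*( ... (b[0] + w[0]*x) ... ),
// i.e. the recursion func(x, 0) from the question with MAX = N.
double compensated_nested(double x,
                          const std::vector<double>& b,
                          const std::vector<double>& w)
{
    double s = x;     // running value
    double c = 0.0;   // running compensation (accumulated rounding error)
    for (std::size_t i = 0; i < b.size(); ++i) {
        EFT p  = TwoProd(s, w[i]);       // s*w[i] and its rounding error
        EFT su = TwoSum(b[i], p.value);  // b[i] + product and its rounding error
        s = su.value;
        c = c * w[i] + (p.error + su.error);  // previous error is also scaled by w[i]
    }
    return s + c;   // corrected result
}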
The way you have defined func, it evaluates to the following expression:
For MAX = n+1, func(1,0) == 1 + Σ_{i=0}^{n} Π_{j=n-i}^{n} w[j], i.e. 1 plus the sum of the trailing products w[n-i]*...*w[n].
So, the way I would resolve the sum would be:
double s = 0.0;
double a = 1.0;
for (int i = 1; i <= MAX; ++i) {
    a *= w[MAX-i];
    s += a;
}
return 1.0 + s;
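Since this reduces the problem to a plain running sum, the Kahan compensation cited in the question can be bolted on directly; a minimal sketch (my own addition, same assumptions as the loop above):

double s = 0.0, comp = 0.0;   // running sum and Kahan compensation
double a = 1.0;
for (int i = 1; i <= MAX; ++i) {
    a *= w[MAX-i];            // next product term
    double y = a - comp;      // compensated s += a
    double t = s + y;
    comp = (t - s) - y;       // rounding error of s + y, fed back next time
    s = t;
}
return 1.0 + s;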
Even if we treat the x input value to func as a variable, it only affects the final term. But because of its range, you should take care in calculating it.
double s = 0.0;
double a = 1.0;
double ax = x;
for (int i = 1; i < MAX; ++i) {
    a *= w[MAX-i];
    ax *= w[MAX-i];
    s += a;
}
ax *= w[0];
s += ax;
return 1.0 + s;