Eigen LDLT slower than LLT?

Eigen LDLT slower than LLT? - c++

I'm using the Cholesky module of Eigen 3 for solving a linear equation system. The Eigen documentation states, that using LDLT instead of LLT would be faster for this purpose, but my benchmarks show a different result.
I using the following code for benchmarking:
#include <iostream>
#include <chrono>
#include <Eigen/Core>
#include <Eigen/Cholesky>
using namespace std;
using namespace std::chrono;
using namespace Eigen;
int main()
{
MatrixXf cov = MatrixXf::Random(4200, 4200);
cov = (cov + cov.transpose()) + 1000 * MatrixXf::Identity(4200, 4200);
VectorXf b = VectorXf::Random(4200), r1, r2;
r1 = b;
LLT<MatrixXf> llt;
auto start = high_resolution_clock::now();
llt.compute(cov);
if (llt.info() != Success)
{
cout << "Error on LLT!" << endl;
return 1;
}
auto middle = high_resolution_clock::now();
llt.solveInPlace(r1);
auto stop = high_resolution_clock::now();
cout << "LLT decomposition & solving in " << duration_cast<milliseconds>(middle - start).count()
<< " + " << duration_cast<milliseconds>(stop - middle).count() << " ms." << endl;
r2 = b;
LDLT<MatrixXf> ldlt;
start = high_resolution_clock::now();
ldlt.compute(cov);
if (ldlt.info() != Success)
{
cout << "Error on LDLT!" << endl;
return 1;
}
middle = high_resolution_clock::now();
ldlt.solveInPlace(r2);
stop = high_resolution_clock::now();
cout << "LDLT decomposition & solving in " << duration_cast<milliseconds>(stop - start).count()
<< " + " << duration_cast<milliseconds>(stop - middle).count() << " ms." << endl;
cout << "Total result difference: " << (r2 - r1).cwiseAbs().sum() << endl;
return 0;
}
I've compiled it with g++ -std=c++11 -O2 -o llt.exe llt.cc on Windows and this is what I get:
LLT decomposition & solving in 6515 + 15 ms.
LDLT decomposition & solving in 8562 + 15 ms.
Total result difference: 1.27354e-006
So, why is LDLT slower than LLT? Am I doing something wrong or do I missunderstand the documentation?

This sentence of the documentation is outdated. With a recent version of Eigen, LLT should be much faster than LDLT for quite large matrices because the LLT implementation leverage cache-friendly matrix-matrix operations, while the LDLT implementation involves pivoting and matrix-vector operations only. With the devel branch your example gives me:
LLT decomposition & solving in 380 + 4 ms.
LDLT decomposition & solving in 2746 + 4 ms.

Related

Parallel version of the `std::generate` performs worse than the sequential one

I'm trying to parallelize some old code using the Execution Policy from the C++ 17. My sample code is below:
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <algorithm>
#include <execution>
#include <vector>
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::duration<double>;
constexpr auto NUM = 100'000'000U;
double func()
{
return rand();
}
int main()
{
std::vector<double> v(NUM);
// ------ feature testing
std::cout << "__cpp_lib_execution : " << __cpp_lib_execution << std::endl;
std::cout << "__cpp_lib_parallel_algorithm: " << __cpp_lib_parallel_algorithm << std::endl;
// ------ fill the vector with random numbers sequentially
auto const startTime1 = Clock::now();
std::generate(std::execution::seq, v.begin(), v.end(), func);
Duration const elapsed1 = Clock::now() - startTime1;
std::cout << "std::execution::seq: " << elapsed1.count() << " sec." << std::endl;
// ------ fill the vector with random numbers in parallel
auto const startTime2 = Clock::now();
std::generate(std::execution::par, v.begin(), v.end(), func);
Duration const elapsed2 = Clock::now() - startTime2;
std::cout << "std::execution::par: " << elapsed2.count() << " sec." << std::endl;
}
The program output on my Linux desktop:
__cpp_lib_execution : 201902
__cpp_lib_parallel_algorithm: 201603
std::execution::seq: 0.971162 sec.
std::execution::par: 25.0349 sec.
Why does the parallel version performs 25 times worse than the sequential one?
Compiler: g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0

The thread-safety of rand is implementation-defined. Which means either:
Your code is wrong in the parallel case, or
It's effectively serial, with a highly contended lock, which would dramatically increase the overhead in the parallel case and get incredibly poor performance.
Based on your results, I'm guessing #2 applies, but it could be both.
Either way, the answer is: rand is a terrible test case for parallelism.

How to get Real value precision in C++ Same as Fortran (Pararel Studio XE Compiler)

I have an almost large Fortran 77 code which I'm trying to write it in c++.
the Fortran code has too many math formulas and i have to get same parameter value in c++.
I have a code like this in Fortran :
implicit real*8 (a-h,o-z)
real *8 test
test=3.14159**2
print *,test
And output is : 9.86958772810000
In the c++ code (i use pow for just a sample i have this problem in every math formula):
// 1st Try
double test=pow(3.14159,2);
cout <<std::setprecision(std::numeric_limits<double>::digits10 + 1) <<fixed <<test;
And output is : 9.86958885192871
I know that i can specify the kind of a f-p number by suffixing the kind-selector like this (but it's for fortran i need to get same value in c++
0:
real test=3.14159_8**2
As is described in this question Different precision in C++ and Fortran
i also tried this in c++ and the output was :
// 2nd Try as users suggested in the comments
float test2 = pow(3.14159, 2);
the output 9.8695878982543945
and if i try :
// 3rd Try as users suggested in the comments
float test2 = pow(3.14159f, 2);
output will be : 9.8695888519287109
which still has differences.
** I need to get same value in c++ not Fortran** because the Fortran project uses this parameter all over the project and i have to get same output.
So is there anyway i get same Float/Double precision in c++?
For Fortran i use Pararel Studio XE Compiler 2017
For c++ Visual Studio 2017
Any help would be appreciated.(thank you all for helping).
as Kerndog73 Asked i tried
std::numeric_limits<double>::digits // value is 53
std::numeric_limits<double>::is_iec559 //value is 1
P.S: More Detail
It's one part of my original FORTRAN code, as you can see i need to have all 10 precision in c++ to get same values (this code draws a shape in a text file at the end of the code, and my c++ code is not similar to that shape because precision values are not the same):
// in the last loop i have a value like this 9292780397998.33
// all precision have used
dp=p2-p1
dr=(r2-r1)/(real(gx-1))
dfi=2*3.14159*zr/(real(gy-1))
test=3.14159**2
print *,test
r11=r1
print *,'dp , dr , dfi'
print *,dp,dr,dfi
do 11 i=1,gx
r(i)=r11
st(i)=dr*r(i)*dfi
r11=r11+dr
print *, r11,r(i),st(i)
11 continue
dh=h02-h01
do 1 i=1,gx
do 2 j=1,gy
h0=h01+dh*(r(i)-r1)/(r2-r1)
hkk=hk(i,j)
if (hkk.eq.10) then
hk(i,j)=hkkk
end if
h00=h0+hk(i,j)
h(i,j)=h00/1000000.
!print *, i,j, h(i,j)
!print*, h(i,j)
2 continue
1 continue
!
! write(30,501) ' '
do 12 i=1,gx
do 22 j=1,gy
h3=h(i,j)**3
h3r(i,j)=h3*r(i)
h3ur(i,j)=h3/r(i)
!print *,i,j, h3ur(i,j)
p0(i,j)=p1+dp*(r(i)-r1)/(r2-r1)
!print *,i,j, p0(i,j)
22 continue
12 continue
drfi=dr/(dfi*48*zmu)
dfir=dfi/(dr*48*zmu)
omr=om*dr/8.
print *,'drfi,dfir,omr,zmu'
print *,drfi,dfir,omr,zmu
!p1 = 10000
!do 100 k=1,giter
do 32 i=1,gx
do 42 j=1,gy
if (i.eq.1) then
pp(i,j)=p1**2
goto 242
end if
if (i.eq.gx) then
pp(i,j)=p2**2
goto 242
end if
if (j.eq.1.) then
temp1=drfi*(2*h3ur(i,1)+h3ur(i,(gy-1))+h3ur(i,2))
a=drfi*(2*h3ur(i,1)+h3ur(i,(gy-1))+h3ur(i,2))+
& dfir*(2*h3r(i,1)+h3r(i-1,1)+h3r(i+1,1))
& -omr*r(i)*(h(i,(gy-1))-h(i,2))/p0(i,1)
b=drfi*(h3ur(i,1)+h3ur(i,(gy-1)))+
& omr*r(i)*(h(i,(gy-1))+h(i,1))/p0(i,(gy-1))
c=drfi*(h3ur(i,1)+h3ur(i,2))-
& omr*r(i)*(h(i,1)+h(i,2))/p0(i,2)
d=dfir*(h3r(i,1)+h3r(i-1,1))
e=dfir*(h3r(i,1)+h3r(i+1,1))
pp(i,j)=(b*p0(i,(gy-1))**2+c*p0(i,2)**2+
& d*p0(i-1,1)**2+e*p0(i+1,1)**2)/a
goto 242
end if
if (j.eq.gy) then
a=drfi*(2*h3ur(i,gy)+h3ur(i,(gy-1))+h3ur(i,2))+
& dfir*(2*h3r(i,gy)+h3r(i-1,gy)+h3r(i+1,gy))
& -omr*r(i)*(h(i,(gy-1))-h(i,2))/p0(i,gy)
b=drfi*(h3ur(i,gy)+h3ur(i,(gy-1)))+
& omr*r(i)*(h(i,(gy-1))+h(i,gy))/p0(i,(gy-1))
c=drfi*(h3ur(i,gy)+h3ur(i,2))-
& omr*r(i)*(h(i,gy)+h(i,2))/p0(i,2)
d=dfir*(h3r(i,gy)+h3r(i-1,gy))
e=dfir*(h3r(i,gy)+h3r(i+1,gy))
pp(i,j)=(b*p0(i,(gy-1))**2+c*p0(i,2)**2+
& d*p0(i-1,gy)**2+e*p0(i+1,gy)**2)/a
goto 242
end if
a=drfi*(2*h3ur(i,j)+h3ur(i,j-1)+h3ur(i,j+1))+
& dfir*(2*h3r(i,j)+h3r(i-1,j)+h3r(i+1,j))
& -omr*r(i)*(h(i,j-1)-h(i,j+1))/p0(i,j)
b=drfi*(h3ur(i,j)+h3ur(i,j-1))+
& omr*r(i)*(h(i,j-1)+h(i,j))/p0(i,j-1)
c=drfi*(h3ur(i,j)+h3ur(i,j+1))-
& omr*r(i)*(h(i,j)+h(i,j+1))/p0(i,j+1)
d=dfir*(h3r(i,j)+h3r(i-1,j))
e=dfir*(h3r(i,j)+h3r(i+1,j))
pp(i,j)=(b*p0(i,j-1)**2+c*p0(i,j+1)**2+
& d*p0(i-1,j)**2+e*p0(i+1,j)**2)/a
242 continue
ppp=pp(i,j)
print *,ppp
pneu=sqrt(ppp)
palt=p0(i,j)
p0(i,j)=palt+(pneu-palt)/2.
!print *,p0(i,j)
wt(i,j)=zmu*om*om*((r(i)+dr)**2+r(i)**2)/(2*h(i,j))
!print *,r(i)
p00(i,j)=p0(i,j)/100000.
!print *, p00(i,j)
42 continue
32 continue

I wrote a program to output all possible results in the 3 formats, with casting done to each type at the various possible times:
#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>
// use `volatile` extensively to inhibit "float store" optimizations
template<class T>
void pp(volatile T val)
{
const size_t prec = std::numeric_limits<T>::digits10 + 1;
std::cout << std::setprecision(prec);
std::cout << std::left;
std::cout << std::setfill('0');
std::cout << std::setw(prec+2) << val;
}
int main()
{
using L = long double;
using D = double;
using F = float;
volatile L lp = 3.14159l;
volatile D dp = 3.14159;
volatile F fp = 3.14159f;
volatile L lpl = lp;
volatile D dpl = lp;
volatile F fpl = lp;
volatile L lpd = dp;
volatile D dpd = dp;
volatile F fpd = dp;
volatile L lpf = fp;
volatile D dpf = fp;
volatile F fpf = fp;
volatile L lpl2 = powl(lpl, 2);
volatile D dpl2 = pow(dpl, 2);
volatile F fpl2 = powf(fpl, 2);
volatile L lpd2 = powl(lpd, 2);
volatile D dpd2 = pow(dpd, 2);
volatile F fpd2 = powf(fpd, 2);
volatile L lpf2 = powl(lpf, 2);
volatile D dpf2 = pow(dpf, 2);
volatile F fpf2 = powf(fpf, 2);
std::cout << "lpl2: "; pp((L)lpl2); std::cout << " "; pp((D)lpl2); std::cout << " "; pp((F)lpl2); std::cout << '\n';
std::cout << "dpl2: "; pp((L)dpl2); std::cout << " "; pp((D)dpl2); std::cout << " "; pp((F)dpl2); std::cout << '\n';
std::cout << "fpl2: "; pp((L)fpl2); std::cout << " "; pp((D)fpl2); std::cout << " "; pp((F)fpl2); std::cout << '\n';
std::cout << "lpd2: "; pp((L)lpd2); std::cout << " "; pp((D)lpd2); std::cout << " "; pp((F)lpd2); std::cout << '\n';
std::cout << "dpd2: "; pp((L)dpd2); std::cout << " "; pp((D)dpd2); std::cout << " "; pp((F)dpd2); std::cout << '\n';
std::cout << "fpd2: "; pp((L)fpd2); std::cout << " "; pp((D)fpd2); std::cout << " "; pp((F)fpd2); std::cout << '\n';
std::cout << "lpf2: "; pp((L)lpf2); std::cout << " "; pp((D)lpf2); std::cout << " "; pp((F)lpf2); std::cout << '\n';
std::cout << "dpf2: "; pp((L)dpf2); std::cout << " "; pp((D)dpf2); std::cout << " "; pp((F)dpf2); std::cout << '\n';
std::cout << "fpf2: "; pp((L)fpf2); std::cout << " "; pp((D)fpf2); std::cout << " "; pp((F)fpf2); std::cout << '\n';
return 0;
}
On my Linux system, this outputs:
long double double float
lpl2: 9.869587728100000000 9.869587728100001 9.869588
dpl2: 9.869587728099999069 9.869587728099999 9.869588
fpl2: 9.869588851928710938 9.869588851928711 9.869589
lpd2: 9.869587728099999262 9.869587728099999 9.869588
dpd2: 9.869587728099999069 9.869587728099999 9.869588
fpd2: 9.869588851928710938 9.869588851928711 9.869589
lpf2: 9.869588472080067731 9.869588472080068 9.869589
dpf2: 9.869588472080067731 9.869588472080068 9.869589
fpf2: 9.869588851928710938 9.869588851928711 9.869589
Base on this, it's possible that you're showing too few digits but Intel's 80-bit format, which is long double on Linux (and, I believe, most x86 OSes), but normally unavailable on Windows.
It's also possible that you're using decimal floats.
But it's also possible your Fortran runtime was just plain broken, many float<->string libraries can generously be described as COMPLETE AND UTTER CRAP.
It's a good habit to use hexadecimal float I/O for reliability.

Use a multiprecision arithmetic library for C++ that gives you more control over the format of numeric values than float, double, etc. in C++.
For example, using Boost.Multiprecision the following code
#include <iostream>
#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iomanip>
int main() {
using number = boost::multiprecision::number<boost::multiprecision::cpp_dec_float<8>>;
number pi_five_sig_digits{ 3.14159 };
number value = boost::multiprecision::pow(pi_five_sig_digits, number(2));
std::cout << "3.14159 ** 2 = " << std::setprecision(15) << value << std::endl;
return 0;
}
yields
3.14159 ** 2 = 9.8695877281

The way that can work in cout until now is what #kerndog73 suggested.(thank you so much)
but my poblem would not solve with cout
#include <cmath>
#include <iomanip>
#include <iostream>
constexpr int prec = std::numeric_limits<double>::digits10 - 1;
int main() {
const double testD = std::pow(3.14159, 2.0);
const float testF = std::pow(3.14159f, 2.0f);
std::cout << "Double: " << std::setprecision(prec) << std::fixed << testD << '\n';
std::cout << "Float: " << std::setprecision(prec) << std::fixed << testF << '\n';
}
and outputs are :
Double
9.86958772810000 // exactly same as FORTRAN 77 output
Float
9.86958885192871

Why do I get the wrong answer when using eigen3 to inverse matrix

I'm using eigen3 to take the inverse of the matrix,but the inverse is wrong.
I tried several examples,but the following this is wrong.
#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
using namespace std;
int main(){
Matrix3d Mat1;
Mat1 << 99.999999999999972 ,-29024.672261149386 ,29024.848775176863,-29024.672261149386, 8629880.2300641891 ,-8629930.2299046051,29024.848775176863,-8629930.2299046051 , 8629980.2300641891 ;
cout << "Mat1=\n" << Mat1 << endl;
Matrix3d Mat2=Mat1.inverse();
cout << "Mat1逆矩阵：\n" << Mat2 << endl;
cout << "Mat1*Mat2：\n" << Mat1*Mat2 << endl;
cout << "Mat2*Mat1：\n" << Mat2*Mat1 << endl;
cout << "Mat1行列式：\n" << Mat1.determinant() << endl;
the result is:
Mat1=
100 -29024.7 29024.8
-29024.7 8.62988e+06 -8.62993e+06
29024.8 -8.62993e+06 8.62998e+06
Mat1逆矩阵：
44.3313 -12557.7 -12557.8
-12557.7 3.58199e+06 3.58201e+06
-12557.8 3.58201e+06 3.58204e+06
Mat1*Mat2：
1 -0.000198364 0.000823975
-80.0958 0.785156 -0.242188
80.0963 -0.0634151 0.972687
Mat2*Mat1：
1 -80.0958 80.0963
-0.000198364 0.785156 -0.0625
0.000818345 -0.243301 0.972687
Mat1行列式：
5.73875
Shouldn't mat1*mat2 be a unit matrix?

Try using pseudo-inverse instead. That problem maybe a precision issues as #paddy said.
Got that code from here
#include <Eigen/QR>
Eigen::MatrixXd A = ... // fill in A
Eigen::MatrixXd pinv = A.completeOrthogonalDecomposition().pseudoInverse();
My result:
Mat3*Mat1：
1 3.05176e-05 -3.05176e-05
0 1 -0.0078125
2.88524e-05 -0.0137121 1.00454
Mat1*Mat3：
1.00004 -0.0101929 -0.0101624
-3.05176e-05 1.00781 0
5.83113e-05 -0.0134087 0.996313

Floating point error in c++

I have found a simple floating point error and I was wondering if there is a way around it. I'm compiling in Haiku OS. I'm in Haiku r1 Alpha 4
#include <iostream>
#include <cmath>
float P(float PV, float r, int t){
return(r * PV/1-pow((1+r),-t));
}
int main(void){
float PV = 100000;
float r = .005;
int t = 350;
std::cout << "A loan valued at $" << PV << " at a rate of %" << r << " with a payment period of " << t << "months would be $" << P(PV,r,t) << ", per-payment.\n";
return 0;
}
When I run it P(PV,r,t) comes out as 499.834 it should be 500. Though if I set r = 0.06 P is correct and comes out as P = 6000.
Maybe it's a compiler error. I'm using gcc version 2.95.3-haiku-121101.

The code:
return(r * PV/1-pow((1+r),-t));
should be:
return(r * PV/(1-pow((1+r),-t)));
and the expected result is about 605.718, not 500.

Results of tbb::parallel_reduce and std::accumulate differ

I am learning Intel's TBB library. When summing all values in a std::vector the result of tbb::parallel_reduce differs from std::accumulate in the case of more than 16.777.220 elements in the vector (errors experienced at 16.777.320 elements). Here is my minimum-working-example:
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include "tbb/tbb.h"
int main(int argc, const char * argv[]) {
int count = std::numeric_limits<int>::max() * 0.0079 - 187800; // - 187900 works
std::vector<float> heights(size);
std::fill(heights.begin(), heights.end(), 1.0f);
float ssum = std::accumulate(heights.begin(), heights.end(), 0);
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0,
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
std::cout << std::endl << " Heights serial sum: " << ssum << " parallel sum: " << psum;
return 0;
}
which outputs on my OSX 10.10.3 with XCode 6.3.1 and tbb stable 4.3-20141023 (poured from Brew):
Heights serial sum: 1.67772e+07 parallel sum: 1.67773e+07
Why is that? Should I report an error to TBB developers?
Additional testing, applying your answers:
correct value is: 1949700403
cause we add 1.0f to zero 1949700403 times
using (int) init values:
Runtime: 17.407 sec. Heights serial sum: 16777216.000, wrong
Runtime: 8.482 sec. Heights parallel sum: 131127368.000, wrong
using (float) init values:
Runtime: 12.594 sec. Heights serial sum: 16777216.000, wrong
Runtime: 5.044 sec. Heights parallel sum: 303073632.000, wrong
using (double) initial values:
Runtime: 13.671 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 5.343 sec. Heights parallel sum: 263690016.000, wrong
using (double) initial values and tbb::parallel_deterministic_reduce:
Runtime: 13.463 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 99.031 sec. Heights parallel sum: 1949700352.000, wrong >>> almost 10x slower !
Why do all reduce calls produce the wrong sum? Is (double) not sufficient?
Here is my testing code:
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include <sys/time.h>
#include <iomanip>
#include "tbb/tbb.h"
#include <cmath>
class StopWatch {
private:
double elapsedTime;
timeval startTime, endTime;
public:
StopWatch () : elapsedTime(0) {}
void startTimer() {
elapsedTime = 0;
gettimeofday(&startTime, 0);
}
void stopNprintTimer() {
gettimeofday(&endTime, 0);
elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // compute sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // compute us to ms and add
std::cout << " Runtime: " << std::right << std::setw(6) << elapsedTime / 1000 << " sec."; // show in sec
}
};
int main(int argc, const char * argv[]) {
StopWatch watch;
std::cout << std::fixed << std::setprecision(3) << "" << std::endl;
size_t count = std::numeric_limits<int>::max() * 0.9079;
std::vector<float> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1.0f);
watch.startTimer();
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0); // change type of initial value here
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0.0, // change type of initial value here
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
return 0;
}
Answer to my last question: they all produce wrong results because they are not made for integer addition with large numbers. Switching to int solves that:
[...]
std::vector<int> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1);
watch.startTimer();
int ssum = std::accumulate(heights.begin(), heights.end(), (int)0);
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
int psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<int>::iterator>(heights.begin(), heights.end()), (int)0,
[](tbb::blocked_range<std::vector<int>::iterator> const& range, int init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<int>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
[...]
results in:
Vector size: 1949700403
Runtime: 13.041 sec. Heights serial sum: 1949700403, correct
Runtime: 4.728 sec. Heights parallel sum: 1949700403, correct and almost 4x faster

Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation. In order to accumulate over floating point numbers, the accumulator should be a float*.
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0f);
^^^^
* Or any other type that can accumulate float correctly.

To other correct answers for the 'why?' part, I'd also add that TBB provides parallel_deterministic_reduce which guarantees reproducible results between two and more runs on the same data (but it still can differ with std::accumulate). See the blog describing the issue and the deterministic algorithm.
Thus regarding 'Should I report an error to TBB developers?' part, the answer is obviously no (unless you find something insufficient on the TBB side).

This may fix this particular problem for you:
Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation.
BUT floating point addition is NOT an associative operation:
With accumulate: (...((s+a1)+a2)+...)+an
With parralel_reduce: any parenthesis permutation possible.
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Eigen LDLT slower than LLT? - c++

Related

Parallel version of the `std::generate` performs worse than the sequential one

How to get Real value precision in C++ Same as Fortran (Pararel Studio XE Compiler)

Why do I get the wrong answer when using eigen3 to inverse matrix

Floating point error in c++

Results of tbb::parallel_reduce and std::accumulate differ

Categories

Resources