Results of tbb::parallel_reduce and std::accumulate differ - c++

I am learning Intel's TBB library. When summing all values in a std::vector the result of tbb::parallel_reduce differs from std::accumulate in the case of more than 16.777.220 elements in the vector (errors experienced at 16.777.320 elements). Here is my minimum-working-example:
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include "tbb/tbb.h"
int main(int argc, const char * argv[]) {
int count = std::numeric_limits<int>::max() * 0.0079 - 187800; // - 187900 works
std::vector<float> heights(count);
std::fill(heights.begin(), heights.end(), 1.0f);
float ssum = std::accumulate(heights.begin(), heights.end(), 0);
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0,
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
std::cout << std::endl << " Heights serial sum: " << ssum << " parallel sum: " << psum;
return 0;
}
which outputs on my OSX 10.10.3 with XCode 6.3.1 and tbb stable 4.3-20141023 (poured from Brew):
Heights serial sum: 1.67772e+07 parallel sum: 1.67773e+07
Why is that? Should I report an error to TBB developers?
Additional testing, applying your answers:
correct value is: 1949700403
because we add 1.0f to zero 1949700403 times
using (int) init values:
Runtime: 17.407 sec. Heights serial sum: 16777216.000, wrong
Runtime: 8.482 sec. Heights parallel sum: 131127368.000, wrong
using (float) init values:
Runtime: 12.594 sec. Heights serial sum: 16777216.000, wrong
Runtime: 5.044 sec. Heights parallel sum: 303073632.000, wrong
using (double) initial values:
Runtime: 13.671 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 5.343 sec. Heights parallel sum: 263690016.000, wrong
using (double) initial values and tbb::parallel_deterministic_reduce:
Runtime: 13.463 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 99.031 sec. Heights parallel sum: 1949700352.000, wrong >>> almost 10x slower !
Why do all reduce calls produce the wrong sum? Is (double) not sufficient?
Here is my testing code:
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include <sys/time.h>
#include <iomanip>
#include "tbb/tbb.h"
#include <cmath>
class StopWatch {
private:
double elapsedTime;
timeval startTime, endTime;
public:
StopWatch () : elapsedTime(0) {}
void startTimer() {
elapsedTime = 0;
gettimeofday(&startTime, 0);
}
void stopNprintTimer() {
gettimeofday(&endTime, 0);
elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // compute sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // compute us to ms and add
std::cout << " Runtime: " << std::right << std::setw(6) << elapsedTime / 1000 << " sec."; // show in sec
}
};
int main(int argc, const char * argv[]) {
StopWatch watch;
std::cout << std::fixed << std::setprecision(3) << "" << std::endl;
size_t count = std::numeric_limits<int>::max() * 0.9079;
std::vector<float> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1.0f);
watch.startTimer();
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0); // change type of initial value here
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0.0, // change type of initial value here
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
return 0;
}
Answer to my last question: they all produce wrong results because float is not made for exact integer addition with large numbers. Switching to int solves that:
[...]
std::vector<int> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1);
watch.startTimer();
int ssum = std::accumulate(heights.begin(), heights.end(), (int)0);
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
int psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<int>::iterator>(heights.begin(), heights.end()), (int)0,
[](tbb::blocked_range<std::vector<int>::iterator> const& range, int init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<int>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
[...]
results in:
Vector size: 1949700403
Runtime: 13.041 sec. Heights serial sum: 1949700403, correct
Runtime: 4.728 sec. Heights parallel sum: 1949700403, correct and almost 4x faster
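A note on the 1949700352 figure in the (double) runs above: in the testing code the result is still stored in a float variable (float ssum, float psum), so even an exact double sum is rounded to the nearest float on assignment. The sketch below illustrates that rounding; through_float is a hypothetical helper name, and the behaviour assumes IEEE 754 single precision with the default round-to-nearest mode.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the original code): convert an integer
// count to float and back, as happens when an exact sum is stored in a float.
// Assumes IEEE 754 single precision (24-bit significand), round-to-nearest.
std::int64_t through_float(std::int64_t n) {
    return static_cast<std::int64_t>(static_cast<float>(n));
}
```

Under those assumptions, through_float(1949700403) yields 1949700352, because floats between 2^30 and 2^31 are spaced 128 apart. Declaring ssum and psum as double would avoid this final rounding.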

Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation. In order to accumulate over floating point numbers, the accumulator should be a float (or any other type that can accumulate float correctly), so pass 0.0f as the initial value:
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0f);

To the other correct answers for the 'why?' part, I'd add that TBB provides parallel_deterministic_reduce, which guarantees reproducible results across two or more runs on the same data (though the result can still differ from std::accumulate). See the blog describing the issue and the deterministic algorithm.
Thus regarding 'Should I report an error to TBB developers?' part, the answer is obviously no (unless you find something insufficient on the TBB side).

This may fix this particular problem for you:
Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation.
BUT floating point addition is NOT an associative operation:
With accumulate: (...((s+a1)+a2)+...)+an
With parallel_reduce: any parenthesization is possible.
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
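The effect of grouping can be reproduced without TBB. The following is a minimal sketch, not the TBB algorithm: sequential_sum and chunked_sum are hypothetical helper names, and the chunking only imitates how a parallel reduction might group the additions. It assumes IEEE 754 single-precision arithmetic without fast-math style reordering.

```cpp
#include <cstddef>

// Left-to-right sum of n copies of 1.0f, i.e. (...((0+1)+1)+...)+1,
// the order std::accumulate uses with a float accumulator.
float sequential_sum(std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        s += 1.0f;  // once s reaches 2^24, s + 1.0f rounds back to s
    }
    return s;
}

// The same values summed in `chunks` equal parts whose partial sums are then
// combined. Assumes n is divisible by chunks (illustration only).
float chunked_sum(std::size_t n, std::size_t chunks) {
    float total = 0.0f;
    for (std::size_t c = 0; c < chunks; ++c) {
        total += sequential_sum(n / chunks);
    }
    return total;
}
```

With 20,000,000 elements, the sequential version stalls at 16777216 (2^24) while the version with 4 chunks reaches 20000000, since each partial sum stays below 2^24. Neither order is the "right" one in general; they simply round differently.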

Related

Trouble when using Efficient_Ransac in CGAL

I want to use the Efficient Ransac implementation of CGAL, but whenever I try to set my own parameters, the algorithm doesn't detect any shape anymore.
This work is related to the Polyfit implementation in CGAL. I want to fine tune the plane detection to see the influence it has on the algorithm. When I use the standard call to ransac.detect(), it works perfectly. However, when I want to set my own parameters it just doesn't find any plane, even if I set them manually to the default values.
Here is my code, strongly related to this example
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/IO/read_xyz_points.h>
#include <CGAL/IO/Writer_OFF.h>
#include <CGAL/property_map.h>
#include <CGAL/Surface_mesh.h>
#include <CGAL/Shape_detection/Efficient_RANSAC.h>
#include <CGAL/Polygonal_surface_reconstruction.h>
#ifdef CGAL_USE_SCIP
#include <CGAL/SCIP_mixed_integer_program_traits.h>
typedef CGAL::SCIP_mixed_integer_program_traits<double> MIP_Solver;
#elif defined(CGAL_USE_GLPK)
#include <CGAL/GLPK_mixed_integer_program_traits.h>
typedef CGAL::GLPK_mixed_integer_program_traits<double> MIP_Solver;
#endif
#if defined(CGAL_USE_GLPK) || defined(CGAL_USE_SCIP)
#include <CGAL/Timer.h>
#include <fstream>
typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3 Point;
typedef Kernel::Vector_3 Vector;
// Point with normal, and plane index
typedef boost::tuple<Point, Vector, int> PNI;
typedef std::vector<PNI> Point_vector;
typedef CGAL::Nth_of_tuple_property_map<0, PNI> Point_map;
typedef CGAL::Nth_of_tuple_property_map<1, PNI> Normal_map;
typedef CGAL::Nth_of_tuple_property_map<2, PNI> Plane_index_map;
typedef CGAL::Shape_detection::Efficient_RANSAC_traits<Kernel, Point_vector, Point_map, Normal_map> Traits;
typedef CGAL::Shape_detection::Efficient_RANSAC<Traits> Efficient_ransac;
typedef CGAL::Shape_detection::Plane<Traits> Plane;
typedef CGAL::Shape_detection::Point_to_shape_index_map<Traits> Point_to_shape_index_map;
typedef CGAL::Polygonal_surface_reconstruction<Kernel> Polygonal_surface_reconstruction;
typedef CGAL::Surface_mesh<Point> Surface_mesh;
int main(int argc, char ** argv)
{
Point_vector points;
// Loads point set from a file.
const std::string &input_file = argv[1];
//const std::string input_file(input);
std::ifstream input_stream(input_file.c_str());
if (input_stream.fail()) {
std::cerr << "failed open file \'" <<input_file << "\'" << std::endl;
return EXIT_FAILURE;
}
std::cout << "Loading point cloud: " << input_file << "...";
CGAL::Timer t;
t.start();
if (!input_stream ||
!CGAL::read_xyz_points(input_stream,
std::back_inserter(points),
CGAL::parameters::point_map(Point_map()).normal_map(Normal_map())))
{
std::cerr << "Error: cannot read file " << input_file << std::endl;
return EXIT_FAILURE;
}
else
std::cout << " Done. " << points.size() << " points. Time: " << t.time() << " sec." << std::endl;
// Shape detection
Efficient_ransac ransac;
ransac.set_input(points);
ransac.add_shape_factory<Plane>();
std::cout << "Extracting planes...";
t.reset();
// Set parameters for shape detection.
Efficient_ransac::Parameters parameters;
// Set probability to miss the largest primitive at each iteration.
parameters.probability = 0.05;
// Detect shapes with at least 100 points.
parameters.min_points = 100;
// Set maximum Euclidean distance between a point and a shape.
parameters.epsilon = 0.01;
// Set maximum Euclidean distance between points to be clustered.
parameters.cluster_epsilon = 0.01;
// Set maximum normal deviation.
// 0.9 < dot(surface_normal, point_normal);
parameters.normal_threshold = 0.9;
// Detect shapes.
ransac.detect(parameters);
//ransac.detect();
Efficient_ransac::Plane_range planes = ransac.planes();
std::size_t num_planes = planes.size();
std::cout << " Done. " << num_planes << " planes extracted. Time: " << t.time() << " sec." << std::endl;
// Stores the plane index of each point as the third element of the tuple.
Point_to_shape_index_map shape_index_map(points, planes);
for (std::size_t i = 0; i < points.size(); ++i) {
// Uses the get function from the property map that accesses the 3rd element of the tuple.
int plane_index = get(shape_index_map, i);
points[i].get<2>() = plane_index;
}
//////////////////////////////////////////////////////////////////////////
std::cout << "Generating candidate faces...";
t.reset();
Polygonal_surface_reconstruction algo(
points,
Point_map(),
Normal_map(),
Plane_index_map()
);
std::cout << " Done. Time: " << t.time() << " sec." << std::endl;
//////////////////////////////////////////////////////////////////////////
Surface_mesh model;
std::cout << "Reconstructing...";
t.reset();
if (!algo.reconstruct<MIP_Solver>(model)) {
std::cerr << " Failed: " << algo.error_message() << std::endl;
return EXIT_FAILURE;
}
const std::string& output_file(input_file+"_result.off");
std::ofstream output_stream(output_file.c_str());
if (output_stream && CGAL::write_off(output_stream, model))
std::cout << " Done. Saved to " << output_file << ". Time: " << t.time() << " sec." << std::endl;
else {
std::cerr << " Failed saving file." << std::endl;
return EXIT_FAILURE;
}
//////////////////////////////////////////////////////////////////////////
// Also stores the candidate faces as a surface mesh to a file
Surface_mesh candidate_faces;
algo.output_candidate_faces(candidate_faces);
const std::string& candidate_faces_file(input_file+"_candidate_faces.off");
std::ofstream candidate_stream(candidate_faces_file.c_str());
if (candidate_stream && CGAL::write_off(candidate_stream, candidate_faces))
std::cout << "Candidate faces saved to " << candidate_faces_file << "." << std::endl;
return EXIT_SUCCESS;
}
#else
int main(int, char**)
{
std::cerr << "This test requires either GLPK or SCIP.\n";
return EXIT_SUCCESS;
}
#endif // defined(CGAL_USE_GLPK) || defined(CGAL_USE_SCIP)
When launched, I have the following message:
Loading point cloud: Scene1/test.xyz... Done. 169064 points. Time: 0.428 sec.
Extracting planes... Done. 0 planes extracted. Time: 8.328 sec.
Generating candidate faces... Done. Time: 0.028 sec.
Reconstructing... Failed: at least 4 planes required to reconstruct a closed surface mesh (only 1 provided)
whereas I get this when calling the ransac detection function without parameters:
Loading point cloud: Scene1/test.xyz... Done. 169064 points. Time: 0.448 sec.
Extracting planes... Done. 18 planes extracted. Time: 3.088 sec.
Generating candidate faces... Done. Time: 94.536 sec.
Reconstructing... Done. Saved to Scene1/test.xyz_result.off. Time: 30.28 sec.
Can someone help me set my own parameters for the ransac shape detection?
"However, when I want to set my own parameters it just doesn't find any plane, even if I set them manually to the default values."
Just to be sure: "setting them manually to the default values" is not what you are doing in the code you shared.
Default values are documented as:
1% of the total number of points for min_points, which should be around 1700 points in your case, not 100
1% of the bounding box diagonal for epsilon and cluster_epsilon. For that obviously I don't know if that is what you used (0.01) as I don't have access to your point set, but if you want to reproduce default values, you should use the CGAL::Bbox_3 object at some point
If you use these values, there's no reason why it should behave differently than with no parameters given (if it does not work, then please let me know because there may be a bug).
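As a rough illustration of the documented defaults described above, the values can be derived from the point set itself. The sketch below deliberately avoids the CGAL API: Point3, RansacDefaults and documented_defaults are hypothetical names, and Point3 is only a stand-in for the kernel's point type. In real code the bounding box would more naturally come from CGAL::Bbox_3, as mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 { double x, y, z; };  // stand-in for a CGAL point type

// Hypothetical holder for the three size-dependent parameters (not a CGAL type).
struct RansacDefaults {
    std::size_t min_points;   // 1% of the number of points
    double epsilon;           // 1% of the bounding box diagonal
    double cluster_epsilon;   // 1% of the bounding box diagonal
};

// Computes the values from the axis-aligned bounding box of the input.
// Assumes pts is non-empty.
RansacDefaults documented_defaults(const std::vector<Point3>& pts) {
    Point3 lo = pts.front();
    Point3 hi = pts.front();
    for (const Point3& p : pts) {
        lo.x = std::min(lo.x, p.x); hi.x = std::max(hi.x, p.x);
        lo.y = std::min(lo.y, p.y); hi.y = std::max(hi.y, p.y);
        lo.z = std::min(lo.z, p.z); hi.z = std::max(hi.z, p.z);
    }
    const double dx = hi.x - lo.x, dy = hi.y - lo.y, dz = hi.z - lo.z;
    const double diagonal = std::sqrt(dx * dx + dy * dy + dz * dz);
    return RansacDefaults{pts.size() / 100, 0.01 * diagonal, 0.01 * diagonal};
}
```

The resulting numbers would then be copied into parameters.min_points, parameters.epsilon and parameters.cluster_epsilon before calling ransac.detect(parameters). For the 169064-point cloud in the question, min_points comes out at 1690.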

memset is significantly faster than Eigen::Tensor setZero()

I have replaced the Eigen::Tensor setZero() call in my code with a memset call over the tensor data and am observing significantly better performance. Built in VS 2016 (SSE2 support should be enabled by default). Why does this happen? I expected Eigen::Tensor to be highly optimized.
#include <unsupported/Eigen/CXX11/Tensor>
#include <cstring> // memset
#include <ctime>
#include <iostream>
#define MyLayoutType Eigen::RowMajor
#define Tf3 Eigen::Tensor<float, 3, MyLayoutType>
using namespace std;
int main() {
clock_t begin = clock();
Tf3 tensor(1000, 500, 20);
for (size_t i = 0; i < 100; i++)
{
tensor.setRandom();
memset(tensor.data(), 0, tensor.size() * sizeof(float));
// versus:
//tensor.setZero();
}
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
cout << "-----------------------" << endl;
cout << "Total time elapsed: " << elapsed_secs << " secs" << endl;
cout << tensor(0, 0, 0);
return 0;
}
In my environment I get 2.1 on average for memset and 2.3 for setZero, and the setRandom operation is much heavier than memset. If I comment out tensor.setRandom() I get 0.4 for memset and 0.5 for setZero. In real code the difference in performance is bigger.

Proper method of using std::chrono

While I realize this is probably one of many identical questions, I can't seem to figure out how to properly use std::chrono. This is the solution I cobbled together.
#include <stdlib.h>
#include <iostream>
#include <chrono>
typedef std::chrono::high_resolution_clock Time;
typedef std::chrono::milliseconds ms;
float startTime;
float getCurrentTime();
int main () {
startTime = getCurrentTime();
std::cout << "Start Time: " << startTime << "\n";
while(true) {
std::cout << getCurrentTime() - startTime << "\n";
}
return EXIT_SUCCESS;
}
float getCurrentTime() {
auto now = Time::now();
return std::chrono::duration_cast<ms>(now.time_since_epoch()).count() / 1000;
}
For some reason, this only ever returns integer values as the difference, which increments upwards at rate of 1 per second, but starting from an arbitrary, often negative, value.
What am I doing wrong? Is there a better way of doing this?
Don't escape the chrono type system until you absolutely have to. That means don't use .count() except for I/O or interacting with legacy API.
This translates to: Don't use float as time_point.
Don't bother with high_resolution_clock. This is always a typedef to either system_clock or steady_clock. For more portable code, choose one of the latter.
#include <iostream>
#include <chrono>
using Time = std::chrono::steady_clock;
using ms = std::chrono::milliseconds;
To start, you're going to need a duration with a representation of float and the units of seconds. This is how you do that:
using float_sec = std::chrono::duration<float>;
Next you need a time_point which uses Time as the clock, and float_sec as its duration:
using float_time_point = std::chrono::time_point<Time, float_sec>;
Now your getCurrentTime() can just return Time::now(). No fuss, no muss:
float_time_point
getCurrentTime() {
return Time::now();
}
Your main, because it has to do the I/O, is responsible for unpacking the chrono types into scalars so that it can print them:
int main () {
auto startTime = getCurrentTime();
std::cout << "Start Time: " << startTime.time_since_epoch().count() << "\n";
while(true) {
std::cout << (getCurrentTime() - startTime).count() << "\n";
}
}
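The whole-second steps seen in the question are consistent with integer division: count() on a milliseconds duration returns an integer, so dividing it by 1000 truncates before the value ever becomes a float. The sketch below contrasts the two conversions; truncated_seconds and fractional_seconds are hypothetical helper names used only for illustration.

```cpp
#include <chrono>

// Mirrors the question's `count() / 1000`: integer division drops the
// fractional part of the second.
long long truncated_seconds(std::chrono::milliseconds d) {
    return d.count() / 1000;
}

// Converts inside the chrono type system to a floating-point duration in
// seconds, and only then extracts the scalar.
double fractional_seconds(std::chrono::milliseconds d) {
    return std::chrono::duration<double>(d).count();
}
```

For a duration of 1500 ms the first helper gives 1 and the second gives 1.5. The conversion to std::chrono::duration<double> is implicit because the target has a floating-point representation, so no duration_cast is needed.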
This program does a similar thing. Hopefully it shows some of the capabilities (and methodology) of std::chrono:
#include <iostream>
#include <chrono>
#include <thread>
int main()
{
using namespace std::literals;
namespace chrono = std::chrono;
using clock_type = chrono::high_resolution_clock;
auto start = clock_type::now();
for(;;) {
auto first = clock_type::now();
// note use of literal - this is c++14
std::this_thread::sleep_for(500ms);
// c++11 would be this:
// std::this_thread::sleep_for(chrono::milliseconds(500));
auto last = clock_type::now();
auto interval = last - first;
auto total = last - start;
// integer cast
std::cout << "we just slept for " << chrono::duration_cast<chrono::milliseconds>(interval).count() << "ms\n";
// another integer cast
std::cout << "also known as " << chrono::duration_cast<chrono::nanoseconds>(interval).count() << "ns\n";
// floating point cast
using seconds_fp = chrono::duration<double, chrono::seconds::period>;
std::cout << "which is " << chrono::duration_cast<seconds_fp>(interval).count() << " seconds\n";
std::cout << " total time wasted: " << chrono::duration_cast<chrono::milliseconds>(total).count() << "ms\n";
std::cout << " in seconds: " << chrono::duration_cast<seconds_fp>(total).count() << "s\n";
std::cout << std::endl;
}
return 0;
}
example output:
we just slept for 503ms
also known as 503144616ns
which is 0.503145 seconds
total time wasted: 503ms
in seconds: 0.503145s
we just slept for 500ms
also known as 500799185ns
which is 0.500799 seconds
total time wasted: 1004ms
in seconds: 1.00405s
we just slept for 505ms
also known as 505114589ns
which is 0.505115 seconds
total time wasted: 1509ms
in seconds: 1.50923s
we just slept for 502ms
also known as 502478275ns
which is 0.502478 seconds
total time wasted: 2011ms
in seconds: 2.01183s

This predefined function slowing down my program's performance

I am working on a PCL (Point Cloud Library) project. One part of it requires me to clip point clouds, for which I need to know the minimum and maximum coordinates of a given point cloud.
PCL provides a predefined function called getMinMax3D(). I tried it and it works well; the only problem is that it takes a lot of time when I input a large point cloud file. I wrote my own version of getMinMax3D() and it takes less time. I do not understand why the two behave so differently.
I tried with 5 point cloud data files. In all cases, the program that uses the predefined function takes much longer than the program that uses my own definition.
Here is the code:
First implementation - it uses the predefined function getMinMax3D()
#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/common/common.h>
int main (int, char**)
{
pcl::PointCloud<pcl::PointXYZ>::Ptr cloud;
cloud = pcl::PointCloud<pcl::PointXYZ>::Ptr (new pcl::PointCloud<pcl::PointXYZ>);
pcl::io::loadPCDFile<pcl::PointXYZ> ("your_pcd_file.pcd", *cloud);
pcl::PointXYZ minPt, maxPt;
pcl::getMinMax3D (*cloud, minPt, maxPt);
std::cout << "Max x: " << maxPt.x << std::endl;
std::cout << "Max y: " << maxPt.y << std::endl;
std::cout << "Max z: " << maxPt.z << std::endl;
std::cout << "Min x: " << minPt.x << std::endl;
std::cout << "Min y: " << minPt.y << std::endl;
std::cout << "Min z: " << minPt.z << std::endl;
return (0);
}
Second implementation - this source code uses a user-defined loop to replace the functionality of getMinMax3D()
#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/common/time.h>
int main (int argc, char** argv)
{
pcl::PointCloud<pcl::PointXYZ>::Ptr cloud (new pcl::PointCloud<pcl::PointXYZ>);
if (pcl::io::loadPCDFile<pcl::PointXYZ> ("rhino.pcd", *cloud) == -1) //* load the file
{
PCL_ERROR ("Couldn't read file rhino.pcd \n");
return (-1);
}
float min_x = cloud->points[0].x, min_y = cloud->points[0].y, min_z = cloud->points[0].z, max_x = cloud->points[0].x, max_y = cloud->points[0].y, max_z = cloud->points[0].z;
pcl::StopWatch watch;
for (size_t i = 1; i < cloud->points.size (); ++i){
if(cloud->points[i].x <= min_x )
min_x = cloud->points[i].x;
else if(cloud->points[i].y <= min_y )
min_y = cloud->points[i].y;
else if(cloud->points[i].z <= min_z )
min_z = cloud->points[i].z;
else if(cloud->points[i].x >= max_x )
max_x = cloud->points[i].x;
else if(cloud->points[i].y >= max_y )
max_y = cloud->points[i].y;
else if(cloud->points[i].z >= max_z )
max_z = cloud->points[i].z;
}
pcl::console::print_highlight ("Time taken: %f\n", watch.getTimeSeconds());
std::cout << "Min x: " << min_x <<"\t";
std::cout << "Max x: " << max_x << std::endl;
std::cout << "Min y: " << min_y <<"\t";
std::cout << "Max y: " << max_y << std::endl;
std::cout << "Min z: " << min_z <<"\t";
std::cout << "Max z: " << max_z << std::endl;
return (0);
}
I tried both programs on the following 5 point cloud files.
Result obtained:
ttf : Time taken factor
ttf = 15 means the user-defined version is about 15 times faster than the predefined function. The ttf value is measured by taking the average of 10 trials for both implementations.
PCD file      Filetype   File size   ttf
Rhino.pcd     XYZ        2.57 MB     15.260
Bun_zipper    XYZCI      1.75 MB     17.422
Armadillo     XYZ        5.26 MB     15.847
Dragon_vrip   XYZ        14.7 MB     17.013
Happy_vrip    XYZ        18.0 MB     14.981
I am wondering why the predefined function takes more time. I want to reduce the number of lines in my source code. I've always believed that using standard header files and their functions gives the best performance, but in this case that does not seem to hold.
This is where you can find standard definition.
Could anyone please help me find out why the second implementation takes less time (approx. 15 times less), even though the standard definition of getMinMax3D() is similar to mine?
pcl::getMinMax3D has a very inefficient implementation. To search for the minimum and max point it does the following:
Eigen::Array4f min_p, max_p;
min_p.setConstant (FLT_MAX);
max_p.setConstant (-FLT_MAX);
for (size_t i = 0; i < cloud.points.size (); ++i)
{
// ... (check the validity of the point if it is not a dense cloud)
pcl::Array4fMapConst pt = cloud.points[i].getArray4fMap ();
min_p = min_p.min (pt);
max_p = max_p.max (pt);
}
And if you check for the getArray4fMap() function:
typedef Eigen::Map<Eigen::Array4f, Eigen::Aligned> Array4fMap;
inline pcl::Array4fMap getArray4fMap() const {
return (pcl::Array4fMap(data));
}
For each point in the cloud it is constructing an Eigen::Map and then comparing it against the current minimum and maximum points. This is VERY inefficient.
The predefined function pcl::getMinMax3D can be faster when built in Release mode with optimization flags set: if Eigen uses SSE intrinsics, each min/max operation processes four aligned floats at once.
More information at
https://gitter.im/PointCloudLibrary/pcl?at=5e3899d06f9d3d34981c0687
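Separately from the speed question, note that the hand-written loop in the question chains its six comparisons with else if, so at most one bound is updated per point; a point that sets a new minimum x and also a new minimum y or maximum z would only update x. The sketch below shows a loop with independent comparisons. It does not use PCL: Point3, Bounds and bounding_box are hypothetical names, with Point3 standing in for pcl::PointXYZ.

```cpp
#include <algorithm>
#include <vector>

struct Point3 { float x, y, z; };    // stand-in for pcl::PointXYZ
struct Bounds { Point3 min, max; };  // hypothetical result type

// Axis-aligned bounds of a non-empty point set. Every coordinate is compared
// against both its minimum and its maximum on every iteration.
Bounds bounding_box(const std::vector<Point3>& pts) {
    Bounds b{pts.front(), pts.front()};
    for (const Point3& p : pts) {
        b.min.x = std::min(b.min.x, p.x);
        b.min.y = std::min(b.min.y, p.y);
        b.min.z = std::min(b.min.z, p.z);
        b.max.x = std::max(b.max.x, p.x);
        b.max.y = std::max(b.max.y, p.y);
        b.max.z = std::max(b.max.z, p.z);
    }
    return b;
}
```

For the two points (0, 0, 0) and (-1, -2, 5), this returns a minimum of (-1, -2, 0) and a maximum of (0, 0, 5), whereas an else-if chain would stop after updating the minimum x for the second point. Any timing comparison against pcl::getMinMax3D is more meaningful once both versions compute the same result.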

Eigen LDLT slower than LLT?

I'm using the Cholesky module of Eigen 3 for solving a linear equation system. The Eigen documentation states that using LDLT instead of LLT would be faster for this purpose, but my benchmarks show a different result.
I am using the following code for benchmarking:
#include <iostream>
#include <chrono>
#include <Eigen/Core>
#include <Eigen/Cholesky>
using namespace std;
using namespace std::chrono;
using namespace Eigen;
int main()
{
MatrixXf cov = MatrixXf::Random(4200, 4200);
cov = (cov + cov.transpose()) + 1000 * MatrixXf::Identity(4200, 4200);
VectorXf b = VectorXf::Random(4200), r1, r2;
r1 = b;
LLT<MatrixXf> llt;
auto start = high_resolution_clock::now();
llt.compute(cov);
if (llt.info() != Success)
{
cout << "Error on LLT!" << endl;
return 1;
}
auto middle = high_resolution_clock::now();
llt.solveInPlace(r1);
auto stop = high_resolution_clock::now();
cout << "LLT decomposition & solving in " << duration_cast<milliseconds>(middle - start).count()
<< " + " << duration_cast<milliseconds>(stop - middle).count() << " ms." << endl;
r2 = b;
LDLT<MatrixXf> ldlt;
start = high_resolution_clock::now();
ldlt.compute(cov);
if (ldlt.info() != Success)
{
cout << "Error on LDLT!" << endl;
return 1;
}
middle = high_resolution_clock::now();
ldlt.solveInPlace(r2);
stop = high_resolution_clock::now();
cout << "LDLT decomposition & solving in " << duration_cast<milliseconds>(stop - start).count()
<< " + " << duration_cast<milliseconds>(stop - middle).count() << " ms." << endl;
cout << "Total result difference: " << (r2 - r1).cwiseAbs().sum() << endl;
return 0;
}
I've compiled it with g++ -std=c++11 -O2 -o llt.exe llt.cc on Windows and this is what I get:
LLT decomposition & solving in 6515 + 15 ms.
LDLT decomposition & solving in 8562 + 15 ms.
Total result difference: 1.27354e-006
So, why is LDLT slower than LLT? Am I doing something wrong or do I misunderstand the documentation?
This sentence of the documentation is outdated. With a recent version of Eigen, LLT should be much faster than LDLT for quite large matrices, because the LLT implementation leverages cache-friendly matrix-matrix operations, while the LDLT implementation involves pivoting and matrix-vector operations only. With the devel branch your example gives me:
LLT decomposition & solving in 380 + 4 ms.
LDLT decomposition & solving in 2746 + 4 ms.