C++ Eigen execution time difference - c++

So I'm calculating a lot of statistical distances in my application, written in C++ (11/14). I use the Eigen library for linear algebra calculations. My code was initially compiled on macOS, specifically Big Sur. Since I need my results to be reproducible, I tried to get the same results under other operating systems, in particular Fedora 32. However, there are significant differences in the results, which I cannot attribute to anything specific after trying various things.
So I made a sample program:
#include <iostream>
#include <chrono>
#include <cmath>
#include <Eigen/Core>
#include <Eigen/Dense>

using namespace std;
using namespace std::chrono;
using namespace Eigen;

int main()
{
    MatrixXd cov(2,2);
    cov << 1.5, 0.2, 0.2, 1.5;
    VectorXd mean(2), ne(2);
    mean << 10, 10;
    ne << 10.2, 10.2;

    auto start = high_resolution_clock::now();
    for (int i = 0; i < 2000000; i++) {
        MatrixXd icov = cov.inverse();
        VectorXd delta = ne - mean;
        double N0 = delta.transpose() * (icov * delta);
        double res = sqrtf(N0);
    }
    auto stop = high_resolution_clock::now();

    cout << "Mahalanobis calculations in "
         << duration_cast<milliseconds>(stop - start).count()
         << " ms." << endl;
    return 0;
}
which was compiled with
clang++ -std=c++14 -w -O2 -I'....Eigen/include' -DNDEBUG -m64 -o benchmark benchmark.cpp
on both macOS and Fedora 32. Yes, I downloaded and installed clang on Fedora, just to be sure I was using the same compiler. On macOS I have clang version 12.0.0, and on Fedora 10.0.1!
The difference between these two runs is about 2x:
macOS:
Mahalanobis calculations in 2833 ms.
Fedora:
Mahalanobis calculations in 1490 ms.
When it comes to my actual application, the difference is almost 30x, which is quite unusual. In the meantime I have checked the following:
OpenMP support - tried switching it on and off, at compile time and at runtime (setting the number of threads before the test chunk, as in the sketch after this list)
various compiler flags and architectures
adding OpenMP support on macOS
toggling the EIGEN_USE_BLAS, EIGEN_USE_LAPACKE, and EIGEN_DONT_PARALLELIZE flags
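For reference, here is a minimal sketch of how the thread count can be pinned at runtime before the timed chunk; this is plain OpenMP/Eigen usage, not anything specific to my project:

#include <omp.h>
#include <Eigen/Core>

// Pin both the OpenMP runtime and Eigen's internal parallelism to n threads.
void pin_threads(int n)
{
    omp_set_num_threads(n);   // OpenMP runtime thread count
    Eigen::setNbThreads(n);   // threads used by Eigen's parallelized kernels
}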
None of this helped. Any idea where the problem is?
Maybe something to do with memory management?

Finally, to answer the question for all those who encounter the same problem: the issue is in the memory management. As someone pointed out, there is a big difference between dynamically and statically allocated Eigen objects. So
MatrixXd cov(2,2);
tends to be much slower than
Matrix<double,2,2> cov;
since the first approach uses the heap to dynamically allocate the needed memory. At the end of the day, it all comes down to how the OS handles memory allocation, and it seems that Linux does this better than macOS or Windows (no surprise there, actually).
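For illustration, here is a minimal sketch of the same Mahalanobis computation with fixed-size types, so no heap allocation happens inside the loop; the function name is just for the example:

#include <Eigen/Dense>
#include <cmath>

// Same computation as the benchmark loop above, but with fixed-size 2x2/2x1
// types that live on the stack, so no malloc/free per iteration.
double mahalanobis2d(const Eigen::Matrix2d& cov,
                     const Eigen::Vector2d& mean,
                     const Eigen::Vector2d& ne)
{
    Eigen::Matrix2d icov = cov.inverse();
    Eigen::Vector2d delta = ne - mean;
    double N0 = delta.dot(icov * delta);   // quadratic form delta' * icov * delta
    return std::sqrt(N0);
}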
I know that it is not always possible to use Matrix2d instead of the good old MatrixXd. Some developers have even reported that Eigen matrix math tends to be slower than their own home-made simple solutions, but that comes down to the choice between doing everything yourself and taking an all-purpose linear algebra library. It depends on what you are doing...

Related

Thrust is very slow for array reduction

I am trying to use thrust to reduce an array of 1M elements to a single value. My code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M, 5.0);
    int sum;
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
The issue is that thrust::reduce alone takes about 4 ms to run on my RTX 3070 laptop GPU. This is considerably slower than code I can write based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?
EDIT 1:
Changed high_resolution_clock to steady_clock. thrust::reduce now takes 2 ms to run. The updated code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M, 5.0);
    int sum;
    auto start = std::chrono::steady_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double, std::ratio<1, 1000>>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
Additional information:
I am running CUDA C++ on Ubuntu in WSL2
CUDA version - 11.4
I am using the nvcc compiler to compile:
nvcc -o reduction reduction.cu
To run:
./reduction
Am I doing something wrong here?
I would not say you are doing anything wrong here; however, that might be a matter of opinion. Let's unpack it a bit using a profiler. I'm not using the exact same setup as you (I'm using a different GPU, a Tesla V100, on Linux with CUDA 11.4). In my case the measurement spit out by the code is ~0.5 ms, not 2 ms.
The profiler tells me that thrust::reduce is accomplished under the hood via a call to cub::DeviceReduceKernel followed by cub::DeviceReduceSingleTileKernel. This two-kernel approach should make sense to you if you have studied Mark Harris' reduction material. The profiler tells me that together these two calls account for ~40 us of the ~500 us overall time. This is the time most comparable to the measurement you made of your implementation of Mark Harris' reduction code, assuming you are timing the kernels only. If we multiply by 4 to account for the overall performance ratio between our GPUs, it is pretty close to your 150 us measurement.
The profiler tells me that the big contributors to the ~500 us reported time in my case are a call to cudaMalloc (~200 us) and a call to cudaFree (~200 us). This isn't surprising, because if you study the cub::DeviceReduce methodology that thrust evidently uses, it requires an initial call to size a temporary allocation. Since thrust provides a self-contained call for thrust::reduce, it has to perform that sizing call, as well as a cudaMalloc and cudaFree operation for the indicated temporary storage.
So is there anything that can be done?
The thrust designers were aware of this situation. To get a (closer to) apples-to-apples comparison between timing just the kernel duration(s) of a CUDA C++ implementation and using thrust to do the same thing, you could use a profiler to compare measurements, or else take control of the temporary allocations yourself.
One way to do this would be to switch from thrust to cub.
The thrust way to do it is to use a thrust custom allocator.
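As an illustration of the cub route, here is a minimal sketch with the temporary allocation hoisted out of the region you would time (it assumes the CUB version shipped with CUDA 11.x; error checking is omitted):

#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <cstdio>

int main()
{
    const int n = 1000 * 1000;
    thrust::device_vector<float> d_in(n, 5.0f);
    thrust::device_vector<float> d_out(1);
    float* in  = thrust::raw_pointer_cast(d_in.data());
    float* out = thrust::raw_pointer_cast(d_out.data());

    // First call only queries the required temporary storage size.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, in, out, n);
    cudaMalloc(&d_temp, temp_bytes);   // allocate once, outside the timed region

    // This is the call you would time (plus a cudaDeviceSynchronize).
    cub::DeviceReduce::Sum(d_temp, temp_bytes, in, out, n);
    cudaDeviceSynchronize();

    float sum = d_out[0];              // copies the result back to the host
    std::printf("%f\n", sum);
    cudaFree(d_temp);
    return 0;
}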
There may be a few other detail differences in methodology that are impacting your measurement. For example, the thrust call intrinsically copies the reduction result back to host memory. You may or may not be timing that step in your other approach which you haven't shown. But according to my profiler measurement, that only accounts for a few microseconds.

C++ Eigen for solving linear systems fast

So I wanted to test the speed of C++ vs Matlab for solving a linear system of equations. For this purpose I create a random system and measure the time required to solve it using Eigen in Visual Studio:
#include <Eigen/Core>
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

using namespace Eigen;
using namespace std;

int main()
{
    chrono::steady_clock sc;   // create an object of the `steady_clock` class
    int n;
    n = 5000;
    MatrixXf m = MatrixXf::Random(n, n);
    VectorXf b = VectorXf::Random(n);

    auto start = sc.now();     // start timer
    VectorXf x = m.lu().solve(b);
    auto end = sc.now();

    // measure the time span between start & end
    auto time_span = static_cast<chrono::duration<double>>(end - start);
    cout << "Operation took: " << time_span.count() << " seconds !!!";
}
Solving this 5000 x 5000 system takes 6.4 seconds on average. Doing the same in Matlab takes 0.9 seconds. The Matlab code is as follows:
a = rand(5000); b = rand(5000,1);
tic
x = a\b;
toc
According to this flowchart of the backslash operator:
given that a random matrix is not triangular, permuted triangular, Hermitian, or upper Hessenberg, the backslash operator in Matlab uses an LU solver, which I believe is the same solver that I'm using in the C++ code, that is, lu().solve
Probably there is something that I'm missing, because I thought C++ was faster.
I am running it with release mode active in the Configuration Manager
Project Properties - C/C++ - Optimization - /O2 is active
Tried using Enhanced Instructions (SSE and SSE2). SSE actually made it slower and SSE2 barely made any difference.
I am using Community version of Visual Studio, if that makes any difference
First of all, for this kind of operation Eigen is very unlikely to beat Matlab, because the latter will directly call Intel's MKL, which is heavily optimized and multi-threaded. Note that you can also configure Eigen to fall back to MKL, see how. If you do so, you'll end up with similar performance.
Nonetheless, 6.4 s is way too much. Eigen's documentation reports 0.7 s for factorizing a 4k x 4k matrix. Running your example on my computer (Haswell laptop @ 2.6 GHz) I got 1.6 s (clang 7, -O3 -march=native), and 1 s with multithreading enabled (-fopenmp). So make sure you enable all your CPU's features (AVX, FMA) and OpenMP. With OpenMP you might need to explicitly reduce the number of OpenMP threads to the number of physical cores.
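For illustration, here is a minimal sketch of what that setup can look like; it assumes MKL is installed and linked as described in Eigen's "Using Intel MKL from Eigen" documentation, and the thread count of 4 is just an example:

#define EIGEN_USE_MKL_ALL        // must appear before any Eigen header
#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::setNbThreads(4);      // cap the threads Eigen's parallelized kernels may use
    std::cout << "Eigen threads: " << Eigen::nbThreads() << "\n";

    const int n = 5000;
    Eigen::MatrixXf m = Eigen::MatrixXf::Random(n, n);
    Eigen::VectorXf b = Eigen::VectorXf::Random(n);
    Eigen::VectorXf x = m.partialPivLu().solve(b);   // LU with partial pivoting

    std::cout << "residual: " << (m * x - b).norm() << "\n";
    return 0;
}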

Possible compiler bug: Weird results using boost bessel functions with Intel compiler between two machines?

I'm trying to use Boost's Bessel function (cyl_bessel_j) in a project. However, I'm finding that the function returns results with an incorrect sign after around 2000 calls to it.
I've tested this between two machines, one is a CentOS 5.8 (Final) machine, where it oddly enough works, and a RHEL 6.3 (Santiago) machine where it fails.
Both machines are using Boost 1.50.0, and the 13.1.3 20130607 Intel compiler. The CentOS machine is using gcc 4.1.2 20080704, and the RHEL machine is using gcc 4.4.6 20120305.
Here is my code:
template<typename FloatType>
FloatType funcT(FloatType z, FloatType phi, int n, int m, int p)
{
    using namespace boost::math;
    FloatType sqrt2PiZ = sqrt((2 * M_PI) / z);
    FloatType nrmLeg   = normalizedLegendre(n, m, -sin(phi));
    FloatType besselJ  = cyl_bessel_j(p + 0.5, z);
    std::cout << " " << p << "," << z << " besselJ: " << besselJ << std::endl;
    return sqrt2PiZ * nrmLeg * besselJ;
}
Running on the two machines, I found the only term that was coming out different between the two was the besselJ term. For the first 1980 calls to the function, they return identical results, however, on the 1981st call, the RHEL machine suddenly switches sign in its results. The first few failed terms print out as follows (RHEL SIDE):
...
1,7.90559 besselJ: -0.0504874
2,7.90559 besselJ: 0.264237
3,7.90559 besselJ: 0.217608
...
Running a reference test in MATLAB using the besselJ function, I find that for these inputs, the signs should be reversed, and indeed the CentOS machine agrees with MATLAB.
I decided to write a simple hello-world style example with the besselJ function to try and determine the cause of the failure:
#include <boost/math/special_functions/bessel.hpp>
#include <iostream>

int main(int argc, char** argv)
{
    double besselTerm = boost::math::cyl_bessel_j(1.5, 7.90559);
    std::cout << besselTerm << std::endl;
}
This test returns the expected value of 0.0504874 on BOTH machines.
At this point, I'm ripping my hair out trying to determine the cause of the problem. It seems to be some weird compiler bug or stack corruption. But then, how can stack corruption give the exact correct answer except for a single bit (the sign)?
Has anyone run into an issue like this with boost or the Intel compiler in general?
Additional info:
It has been found that the minimal test case breaks on GCC 4.4.6 with the -ffast-math flag. I was also able to get the minimal test case to fail with the Intel compiler by using the -std=c++11 flag (which is what the larger project uses).
A bug was submitted to the boost library trac system.
It was found, however, that the problem is in the standard C++ library and indicates an issue in GCC 4.4.7.
Boost devs created a workaround patch that addresses the issue at:
https://github.com/boostorg/math/commit/9f8ffee4b7a3f82b1c582735d43522d7d0cde746

Linux C++ time measurement library, fast printing library

I just started programming C++ on Linux. Can anyone recommend a good way to measure elapsed time of code, ideally with nanosecond precision, although millisecond precision will do as well?
And also a fast printing method; I am using std::cout at the moment, but I feel it's kind of slow.
Thanks.
You could use gettimeofday, or clock_gettime.
To get a time in nanoseconds, use clock_gettime(). To measure elapsed time taken by a piece of code, the CLOCK_MONOTONIC_RAW clock type should be used. Using other clock types is not really a solution because they are subject to NTP adjustments.
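A minimal sketch of this approach (POSIX clock_gettime on Linux; the timed section is just a placeholder):

#include <time.h>
#include <cstdio>

int main()
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
    // ... code to time goes here ...
    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);

    long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                 + (t1.tv_nsec - t0.tv_nsec);
    std::printf("elapsed: %lld ns\n", ns);
    return 0;
}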
As for the printing part: define slow. General-purpose code that converts built-in data types into ASCII strings is always relatively slow, and there is also buffering going on (which is good in most cases). If you can make some good assumptions about your data, you can always throw in your own conversion to ASCII, which will beat a general-purpose solution and make things faster.
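For example, a minimal sketch of such a hand-rolled conversion, assuming non-negative integers and a caller-provided buffer:

#include <cstdio>

// Write the decimal digits of v backwards into the buffer ending at 'end'
// (which receives the terminating '\0') and return a pointer to the first digit.
static char* u32_to_ascii(unsigned v, char* end)
{
    *end = '\0';
    char* p = end;
    do {
        *--p = static_cast<char>('0' + v % 10);
        v /= 10;
    } while (v != 0);
    return p;
}

int main()
{
    char buf[16];
    std::fputs(u32_to_ascii(123456u, buf + sizeof(buf) - 1), stdout);
    std::fputc('\n', stdout);
    return 0;
}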
EDIT:
See also an example of using the clock_gettime() function and the OS X-specific mach_absolute_time() function here:
stopwatch.h
stopwatch.c
stopwatch_example.c
For timing you can use the <chrono> standard library:
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    using Clock = std::chrono::high_resolution_clock;
    using std::chrono::milliseconds;
    using std::chrono::nanoseconds;
    using std::chrono::duration_cast;

    auto start = Clock::now();
    // code to time
    std::this_thread::sleep_for(milliseconds(500));
    auto end = Clock::now();

    std::cout << duration_cast<nanoseconds>(end - start).count() << " ns\n";
}
The actual clock resolution depends on the implementation, but this will always output the correct units.
The performance of std::cout depends on the implementation as well. In my experience, as long as you don't use std::endl everywhere, its performance compares quite well with printf on Linux or OS X. Microsoft's implementation in VC++ seems to be much slower.
Printing things is normally slow because of the terminal you're watching it in, rather than because of the printing itself. You can redirect output to a file; if you're printing a lot to the console, you might then see a significant speedup.
I think you probably also want to have a look at the time [0] command, which reports the time taken by a specific program to complete execution.
[0] http://linux.about.com/library/cmd/blcmdl1_time.htm
Time measurement:
Boost.Chrono: http://www.boost.org/doc/libs/release/doc/html/chrono.html
// note that if you have a modern C++11 (used to be C++0x) compiler you already have this out of the box, since "Boost.Chrono aims to implement the new time facilities in C++0x, as proposed in N2661 - A Foundation to Sleep On."
Boost.Timer: http://www.boost.org/doc/libs/release/libs/timer/
Posix Time from Boost.Date_Time: http://www.boost.org/doc/libs/release/doc/html/date_time/posix_time.html
Fast printing:
FastFormat: http://www.fastformat.org/
Benchmarks: http://www.fastformat.org/performance.html
Regarding the performance of C++ streams -- remember std::ios_base::sync_with_stdio, see:
http://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio
http://www.cplusplus.com/reference/iostream/ios_base/sync_with_stdio/
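A minimal sketch of the usual stream speed-ups (turning off C-stdio synchronization and avoiding std::endl's per-line flush):

#include <iostream>

int main()
{
    std::ios_base::sync_with_stdio(false);  // drop synchronization with C stdio
    std::cin.tie(nullptr);                  // don't flush cout before every cin read

    for (int i = 0; i < 1000000; ++i)
        std::cout << i << '\n';             // '\n' instead of std::endl: no flush per line
    return 0;
}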

Fortran-style multidimensional arrays in C++

Is there a C++ library which provides Fortran-style multidimensional arrays with support for slicing, passing as a procedural parameter, and decent documentation? I've looked into Blitz++ but it's dead!
I highly recommend Armadillo:
Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use
It is a C++ template library:
A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries; this is automatically accomplished through template meta-programming
A simple example from the web page:
#include <iostream>
#include <armadillo>

int main(int argc, char** argv)
{
    arma::mat A = arma::randu<arma::mat>(4, 5);
    arma::mat B = arma::randu<arma::mat>(4, 5);

    std::cout << A * B.t() << std::endl;

    return 0;
}
If you are running OS X then you can use the vDSP libs for free.
If you want to deploy on Windows targets, then either license the Intel equivalent (MKL), or, I think, the AMD vector math libs (ACML) are free.