C++ Eigen for solving linear systems fast

So I wanted to test the speed of C++ vs Matlab for solving a linear system of equations. For this purpose I create a random system and measure the time required to solve it using Eigen on Visual Studio:
#include <Eigen/Core>
#include <Eigen/Dense>
#include <chrono>
#include <iostream>
using namespace Eigen;
using namespace std;

int main()
{
    chrono::steady_clock sc;   // create an object of `steady_clock` class
    int n = 5000;
    MatrixXf m = MatrixXf::Random(n, n);
    VectorXf b = VectorXf::Random(n);
    auto start = sc.now();     // start timer
    VectorXf x = m.lu().solve(b);
    auto end = sc.now();
    // measure time span between start & end
    auto time_span = static_cast<chrono::duration<double>>(end - start);
    cout << "Operation took: " << time_span.count() << " seconds !!!";
}
Solving this 5000 x 5000 system takes 6.4 seconds on average. Doing the same in Matlab takes 0.9 seconds. The Matlab code is as follows:
a = rand(5000); b = rand(5000,1);
tic
x = a\b;
toc
According to this flowchart of the backslash operator:
given that a random matrix is not triangular, permuted triangular, Hermitian, or upper Hessenberg, the backslash operator in Matlab uses an LU solver, which I believe is the same solver I'm using in the C++ code, that is, lu().solve().
There is probably something I'm missing, because I thought C++ would be faster.
I am running in Release mode in the Configuration Manager.
Project Properties - C/C++ - Optimization - /O2 is active.
I tried the Enhanced Instruction Set options (SSE and SSE2); SSE actually made it slower and SSE2 barely made any difference.
I am using the Community version of Visual Studio, if that makes any difference.

First of all, for this kind of operation Eigen is very unlikely to beat Matlab, because the latter will directly call Intel's MKL, which is heavily optimized and multi-threaded. Note that you can also configure Eigen to fall back to MKL, see how. If you do so, you'll end up with similar performance.
Nonetheless, 6.4 s is way too much. Eigen's documentation reports 0.7 s for factorizing a 4k x 4k matrix. Running your example on my computer (Haswell laptop @2.6 GHz) I get 1.6 s (clang 7, -O3 -march=native), and 1 s with multithreading enabled (-fopenmp). So make sure you enable all of your CPU's features (AVX, FMA) and OpenMP. With OpenMP you might need to explicitly reduce the number of OpenMP threads to the number of physical cores.
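For reference, here is a minimal sketch of that setup with GCC/Clang; the MKL define, the thread count, and the build flags are assumptions to adapt to your machine (on MSVC the rough equivalents are /arch:AVX2 and /openmp, plus Eigen's documented MKL integration):
// Build sketch (GCC/Clang): g++ -O3 -march=native -fopenmp solve.cpp -o solve
// Optionally link MKL and uncomment the define so Eigen dispatches BLAS/LAPACK calls to it.
// #define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <chrono>
#include <iostream>
#include <omp.h>

int main()
{
    // Use one thread per physical core (4 is an assumed example value).
    omp_set_num_threads(4);
    Eigen::setNbThreads(4);

    const int n = 5000;
    Eigen::MatrixXf m = Eigen::MatrixXf::Random(n, n);
    Eigen::VectorXf b = Eigen::VectorXf::Random(n);

    auto start = std::chrono::steady_clock::now();
    Eigen::VectorXf x = m.lu().solve(b);   // partial-pivoting LU, as in the question
    auto end = std::chrono::steady_clock::now();

    std::cout << "Solve took "
              << std::chrono::duration<double>(end - start).count()
              << " seconds\n";
}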

Related

Thrust is very slow for array reduction

I am trying to use thrust to reduce an array of 1M elements to a single value. My code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    int N = 1000;
    int M = 1000;
    thrust::device_vector<float> D(N * M, 5.0);
    float sum;
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
The issue is, thrust::reduce alone takes about 4 ms to run on my RTX 3070 laptop GPU. This is considerably slower than code I can write based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?
EDIT 1:
Changed high_resolution_clock to steady_clock. thrust::reduce now takes 2ms to run. Updated code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    int N = 1000;
    int M = 1000;
    thrust::device_vector<float> D(N * M, 5.0);
    float sum;
    auto start = std::chrono::steady_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double, std::ratio<1, 1000>>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
Additional information :
I am running CUDA C++ on Ubuntu in WSL2
CUDA version - 11.4
I am using the nvcc compiler to compile:
nvcc -o reduction reduction.cu
To run:
./reduction
Am I doing something wrong here?
I would not say you are doing anything wrong here. However that might be a matter of opinion. Let's unpack it a bit, using a profiler. I'm not using the exact same setup as you (I'm using a different GPU - Tesla V100, on Linux, CUDA 11.4). In my case the measurement spit out by the code is ~0.5ms, not 2ms.
The profiler tells me that the thrust::reduce is accomplished under the hood via a call to cub::DeviceReduceKernel followed by cub::DeviceReduceSingleTileKernel. This two-kernel approach should make sense to you if you have studied Mark Harris' reduction material. The profiler tells me that together, these two calls account for ~40us of the ~500us overall time. This is the time that would be most comparable to the measurement you made of your implementation of Mark Harris' reduction code, assuming you are timing the kernels only. If we multiply by 4 to account for the overall perf ratio, it is pretty close to your 150us measurement of that.
The profiler tells me that the big contributors to the ~500us reported time in my case are a call to cudaMalloc (~200us) and a call to cudaFree (~200us). This isn't surprising, because if you study the cub::DeviceReduce methodology that thrust is evidently using, it requires an initial call to size and allocate temporary storage. Since thrust provides a self-contained thrust::reduce call, it has to perform that sizing step, as well as a cudaMalloc and cudaFree operation for the indicated temporary storage, inside each call.
So is there anything that can be done?
The thrust designers were aware of this situation. To get a (closer to) apples-to-apples comparison between just measuring the kernel duration(s) of a CUDA C++ implementation and using thrust to do the same thing, you could use a profiler to compare measurements, or else take control of the temporary allocations yourself.
One way to do this would be to switch from thrust to cub.
The thrust way to do it is to use a thrust custom allocator.
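As a rough sketch of the cub route (not the custom-allocator route), the idea is the standard two-phase cub::DeviceReduce::Sum pattern, with the temporary storage allocated once outside the timed region; sizes and values below mirror the question and are otherwise illustrative:
#include <chrono>
#include <iostream>
#include <cub/cub.cuh>
#include <thrust/device_vector.h>

int main()
{
    const int num_items = 1000 * 1000;
    thrust::device_vector<float> D(num_items, 5.0f);
    thrust::device_vector<float> result(1);
    const float* d_in = thrust::raw_pointer_cast(D.data());
    float* d_out = thrust::raw_pointer_cast(result.data());

    // Phase 1: query the required temporary storage size and allocate it once.
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: the actual reduction; no allocation happens inside the timed region.
    auto start = std::chrono::steady_clock::now();
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaDeviceSynchronize();   // wait for the reduction kernels to finish
    auto end = std::chrono::steady_clock::now();

    float sum = result[0];     // single device-to-host copy of the scalar result
    std::cout << std::chrono::duration<double, std::micro>(end - start).count()
              << " us, sum = " << sum << "\n";
    cudaFree(d_temp_storage);
}
Compile with nvcc as before; the first (sizing) call launches no kernels, so only the second call plus the synchronize is timed.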
There may be a few other detail differences in methodology that are impacting your measurement. For example, the thrust call intrinsically copies the reduction result back to host memory. You may or may not be timing that step in your other approach which you haven't shown. But according to my profiler measurement, that only accounts for a few microseconds.

C++ Eigen execution time difference

So I'm calculating a lot of statistical distances in my application, written in C++ (11/14). I use the Eigen library for linear algebra calculations. My code was initially compiled on macOS, specifically Big Sur. Since I need to make my results reproducible, I was trying to get the same results under another OS, specifically Fedora 32. However, there are significant differences between the two, which I cannot attribute to anything specific after trying various things.
So I made a sample program...
#include <iostream>
#include <chrono>
#include <cmath>
#include <Eigen/Core>
#include <Eigen/Dense>
using namespace std;
using namespace std::chrono;
using namespace Eigen;

int main()
{
    MatrixXd cov(2, 2);
    cov << 1.5, 0.2, 0.2, 1.5;
    VectorXd mean(2), ne(2);
    mean << 10, 10;
    ne << 10.2, 10.2;
    auto start = high_resolution_clock::now();
    for (int i = 0; i < 2000000; i++) {
        MatrixXd icov = cov.inverse();
        VectorXd delta = ne - mean;
        double N0 = delta.transpose() * (icov * delta);
        double res = sqrtf(N0);
    }
    auto stop = high_resolution_clock::now();
    cout << "Mahalanobis calculations in "
         << duration_cast<milliseconds>(stop - start).count()
         << " ms." << endl;
    return 0;
}
which was compiled with
clang++ -std=c++14 -w -O2 -I'....Eigen/include' -DNDEBUG -m64 -o benchmark benchmark.cpp
on both macOS and Fedora 32. Yes, I downloaded and installed clang on Fedora, just to be sure I'm using the same compiler. On macOS I have clang version 12.0.0, and on Fedora, 10.0.1!
The difference between these test cases is 2x
macOS:
Mahalanobis calculations in 2833 ms.
Fedora:
Mahalanobis calculations in 1490 ms.
When it comes to my specific application, the difference is almost 30x, which is quite unusual. In the meantime I checked for the following:
OpenMP support - tried switching on and off, compile time and runtime (setting the number of threads before the test code chunk)
various compiling flags and architectures
adding OpenMP support to macOS
tinkering with the EIGEN_USE_BLAS, EIGEN_USE_LAPACKE, and EIGEN_DONT_PARALLELIZE flags
Nothing helps. Any ideas where the problem is?
Maybe something with memory management?
Finally, to answer the question for all those who encounter the same problem: the issue is in memory management. As someone pointed out, there is a big difference between dynamically-sized and fixed-size Eigen objects. So
MatrixXd cov(2,2);
tends to be much slower than
Matrix<double,2,2> cov;
since the first approach uses the heap to dynamically allocate the needed memory. At the end of the day, it all comes down to the way the OS handles memory, and it seems that Linux does this better than macOS or Windows (no surprise there, actually).
I know that it is not always possible to use Matrix2d instead of the good old MatrixXd. Some developers have even reported that Eigen matrix math tends to be slower than their own home-made simple solutions, but that comes down to the choice between doing everything yourself and using an all-purpose linear algebra library. It depends on what you are doing...
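For reference, a minimal fixed-size version of the benchmark loop above (same numbers as in the question; keeping the last value is just an assumption to stop the compiler from discarding the loop):
#include <iostream>
#include <chrono>
#include <cmath>
#include <Eigen/Dense>

int main()
{
    Eigen::Matrix2d cov;
    cov << 1.5, 0.2, 0.2, 1.5;
    Eigen::Vector2d mean(10, 10), ne(10.2, 10.2);

    auto start = std::chrono::high_resolution_clock::now();
    double last = 0;
    for (int i = 0; i < 2000000; i++) {
        Eigen::Matrix2d icov = cov.inverse();     // fixed-size: stays on the stack
        Eigen::Vector2d delta = ne - mean;
        double N0 = delta.transpose() * (icov * delta);
        last = std::sqrt(N0);                     // keep the result alive
    }
    auto stop = std::chrono::high_resolution_clock::now();

    std::cout << "Mahalanobis calculations in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms (last value " << last << ")." << std::endl;
}
With fixed-size Matrix2d/Vector2d, all temporaries live on the stack, so the per-iteration heap traffic whose cost differs between operating systems disappears.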

Why is there such a high variability in runtime for adding two values?

I wrote a timing function that records the run time of a function and calculates the mean and the standard deviation over multiple runs. I was surprised to find very high standard deviations, even for seemingly simple tasks such as adding two doubles.
I analysed the data in Python (see the plots). The C++ output was 19.6171 ns +/- 21.9653 ns (82799807 runs) when compiled with:
gcc version 8.3.0 (Debian 8.3.0-19)
/usr/bin/c++ -O3 -DNDEBUG -std=gnu++17
The whole test was done on my personal computer, which was not idle but running a DE, a browser, my IDE, and other processes. There was free RAM available during the test, though. My dual-core CPU with HT was idling below 10% usage.
Is a spike from an average value of 20 ns to 50 µs to be expected for this situation?
Plot of run times
This is the content of std::vector<double> run_times. I don't see any pattern.
Histogram of run times
Note log y axis (number of samples in this bin).
timing.h
#include <cstdint>
#include <ostream>
#include <cmath>
#include <algorithm>
#include <utility>
#include <vector>
#include <chrono>
#include <numeric>
#include <fstream>

struct TimingResults {
    // all time results are in nanoseconds
    double mean;
    double standard_deviation;
    uint64_t number_of_runs;
};

std::ostream& operator<<(std::ostream& os, const TimingResults& results);

template <typename InputIterator>
std::pair<typename InputIterator::value_type, typename InputIterator::value_type>
calculate_mean_and_standard_deviation(InputIterator first, InputIterator last) {
    double mean = std::accumulate(first, last, 0.) / std::distance(first, last);
    double sum = 0;
    std::for_each(first, last, [&](double x) { sum += (x - mean) * (x - mean); });
    return {mean, std::sqrt(sum / (std::distance(first, last) - 1))};
}

template <uint64_t RunTimeMilliSeconds = 4000, typename F, typename... Args>
TimingResults measure_runtime(F func, Args&&... args) {
    std::vector<double> runtimes;
    std::chrono::high_resolution_clock::time_point b;
    auto start_time = std::chrono::high_resolution_clock::now();
    do {
        auto a = std::chrono::high_resolution_clock::now();
        func(std::forward<Args>(args)...);
        b = std::chrono::high_resolution_clock::now();
        runtimes.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count());
    } while (std::chrono::duration_cast<std::chrono::milliseconds>(b - start_time).count() <= RunTimeMilliSeconds);
    auto [mean, std_deviation] = calculate_mean_and_standard_deviation(runtimes.begin(), runtimes.end());
    return {mean, std_deviation, runtimes.size()};
}
timing.cpp
#include <iostream>
#include "timing.h"

std::ostream& operator<<(std::ostream& os, const TimingResults& results) {
    return os << results.mean << " ns" << " +/- " << results.standard_deviation << "ns ("
              << results.number_of_runs << " runs)";
}
main.cpp
#include "src/timing/timing.h"
#include <iostream>
int main(){
auto res = measure_runtime([](double x, double y){return x * y;}, 6.9, 9.6);
std::cout << res;
}
Modern CPUs easily perform on the order of several 10^9 FLOPS, i.e. the expected time for one operation is below 1 ns. This, however, refers to peak performance. For most real-world workloads, the performance is going to be much lower, owing to memory and cache effects.
The problem with your benchmark is that you are timing individual operations. The overhead of getting the time points a and b likely exceeds the time you are actually trying to measure. Additionally, even std::chrono::high_resolution_clock is not going to give you picosecond accuracy (though that is in principle implementation and hardware dependent). The obvious fix is to perform the operation N times, time that, and then divide the total time by N. At some point, you'll see that your results become consistent. (Feel free to post your results.)
TL;DR: You are trying to time a lightning bolt with a pocket watch.
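A minimal sketch of that batching idea (N, the values, and the volatile qualifiers are illustrative assumptions; volatile is a blunt way to keep the work in the loop, and the next answer describes more careful techniques):
#include <chrono>
#include <cstdint>
#include <iostream>

int main()
{
    constexpr std::uint64_t N = 100000000;
    volatile double x = 6.9, y = 9.6;   // volatile reads keep the multiply inside the loop
    volatile double sink = 0.0;         // volatile write keeps the result from being discarded

    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < N; ++i)
        sink = x * y;                   // the operation under test (plus load/store overhead)
    auto end = std::chrono::steady_clock::now();

    double total_ns = std::chrono::duration<double, std::nano>(end - start).count();
    std::cout << "per-operation time ~ " << total_ns / N << " ns\n";
}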
TL;DR: Your entire approach is too simplistic to tell you anything useful. Timing overhead would dominate even if your multiply wasn't optimized away.
Microbenchmarking is non-trivial even in hand-written asm. It's impossible in C++ if you don't understand how your C++ compiles to asm for your target platform, for an operation as simple / cheap as x * y.
You aren't using the result, so maybe you were trying to measure throughput (instead of latency). But with only one multiply inside the timed interval, there's no chance for superscalar / pipelined execution to happen.
Even more fundamentally, you don't use the result so there's no need for the compiler to even compute it at all. And even if you did, after inlining from that C++ header the operands are compile-time constants, so the compiler will do it once at compile time instead of with a mulsd instruction at run time. And even if you made the args in main come from atof(argv[1]) or something, the compiler could hoist the computation out of the loop.
Any one of those 3 microbenchmark pitfalls would lead to timing an empty interval with no work between the two functions, other than saving the first now() result to different registers. You have all 3 problems.
You're literally timing an empty interval and still getting this much jitter because of the occasional interrupt, and the relatively high overhead of the library function wrapped around clock_gettime which ultimately runs an rdtsc instruction and scales it using values exported by the kernel. Fortunately it can do this in user-space, without actually using a syscall instruction to enter the kernel. (The Linux kernel exports code + data in the VDSO pages.)
Directly using rdtsc inside a tight loop does give fairly repeatable timings, but still has pretty high overhead relative to mulsd. (How to get the CPU cycle count in x86_64 from C++?).
Your mental model of execution cost is probably wrong at this level of detail. You can't just time individual operations and then add up their costs. Superscalar pipelined out-of-order execution means you have to consider throughput vs. latency, and lengths of dependency chains. (And also front-end bottlenecks vs. the throughput of any one kind of instruction, or execution port).
Modern x86 cost model
How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
And no, disabling optimizations is not useful. That would turn this into a microbenchmark of call/ret through a nest of C++ functions, and maybe store-forwarding latency.
Benchmarking with optimizations disabled is useless. Typically you need to use inline asm to force the compiler to materialize a value in a register repeatedly in a loop, and/or forget what it knows about a variable's value to make it redo a calculation instead of hoisting it. e.g. see "Escape" and "Clobber" equivalent in MSVC (not the MSVC part, just the part in the question showing useful GNU C inline asm).
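For reference, the "Escape" and "Clobber" helpers referred to above are usually written like this in GNU C (GCC/Clang) inline asm; this is the widely circulated pattern from Chandler Carruth's benchmarking talks, sketched here rather than taken from the question's code:
// Empty asm statements the optimizer must treat as black boxes.
static void escape(void* p) {
    // Pretends to read/write *p, so the pointed-to value must really be materialized.
    asm volatile("" : : "g"(p) : "memory");
}

static void clobber() {
    // Pretends to read/write all of memory, so earlier stores cannot be elided.
    asm volatile("" : : : "memory");
}

// Usage sketch inside a timing loop:
//   double r = x * y;
//   escape(&r);   // forces the multiply to happen and its result to exist each iteration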

Eigen: how to speed up a += coeffs * coeffs.transpose()

I need to compute many (about 400k) solutions of small linear least-squares problems. Each problem contains 10-300 equations with only 7 variables.
To solve these problems I use the Eigen library. Solving them directly takes too much time, so I transform each problem into solving a 7x7 system of linear equations by deriving the derivatives by hand.
I get a nice speed-up, but I want to increase performance further.
I used valgrind to profile my program and found that the operation with the highest self cost is operator+= of an Eigen matrix. This operation takes more time than ten calls of a.ldlt().solve(b);
I use this operator to compose the A matrix and b vector of each system of equations:
// I call this code to solve each problem
const int nVars = 7;
// I really need double precision
Eigen::Matrix<double, nVars, nVars> a = Eigen::Matrix<double, nVars, nVars>::Zero();
Eigen::Matrix<double, nVars, 1> b = Eigen::Matrix<double, nVars, 1>::Zero();
Eigen::Matrix<double, nVars, 1> equationCoeffs;
//............................
// Somewhere in a big cycle.
// equationCoeffs and z are updated on each iteration
a += equationCoeffs * equationCoeffs.transpose();
b += equationCoeffs * z;
where z is some scalar.
So my question is: how can I improve the performance of these operations?
PS Sorry for my poor English
Instead of forming the matrix and vector components of the normal equation by hand, one equation at a time, you might try to allocate a large enough matrix once (e.g. 300 x 7) to store all coefficients and then let Eigen's optimized matrix-matrix product kernels do the job for you:
Matrix<double, Dynamic, nbVars> D(300, nbVars);
VectorXd f(300);
for (...)
{
    int nb_equations = ...;
    for (int i = 0; i < nb_equations; ++i)
    {
        D.row(i) = equationCoeffs;
        f(i) = z;
    }
    a = D.topRows(nb_equations).transpose() * D.topRows(nb_equations);
    b = D.topRows(nb_equations).transpose() * f.head(nb_equations);
    // solve ax=b
}
You might bench with both a column-major and row-major storage for the matrix D to see which one is best.
Another possible approach would be to declare a, equationCoeffs, and b as 8x8 or 8x1 matrices or vectors, making sure that equationCoeffs(7)==0. This way you maximize SIMD usage. Then use a.topLeftCorner<7,7>() and b.head<7>() when calling LDLT (see the sketch below). You might even combine this strategy with the previous one.
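Here is a minimal sketch of that padded, fixed-size layout; the inner loop contents and values are placeholders standing in for the question's big cycle:
#include <Eigen/Dense>

int main()
{
    const int nVars = 7;

    // Pad to 8 rows/cols so Eigen can use full-width SIMD on the += updates.
    Eigen::Matrix<double, 8, 8> a = Eigen::Matrix<double, 8, 8>::Zero();
    Eigen::Matrix<double, 8, 1> b = Eigen::Matrix<double, 8, 1>::Zero();
    Eigen::Matrix<double, 8, 1> equationCoeffs = Eigen::Matrix<double, 8, 1>::Zero();

    // ... in the big cycle: accumulate the normal equations,
    // keeping equationCoeffs(7) == 0 so the padding stays zero.
    for (int k = 0; k < 20; ++k) {
        equationCoeffs.head<nVars>().setRandom();   // placeholder coefficients
        double z = equationCoeffs.sum();            // placeholder right-hand side
        a += equationCoeffs * equationCoeffs.transpose();
        b += equationCoeffs * z;
    }

    // Solve only the meaningful 7x7 block.
    Eigen::Matrix<double, nVars, 1> x =
        a.topLeftCorner<nVars, nVars>().ldlt().solve(b.head<nVars>());
    (void)x;
}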
Finally, if your CPU supports AVX or FMA, you might use the devel branch and compile with -mavx or -mfma to get a significant speedup.
If you can use g++5.1, you might want to take a look at OpenMP
( http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf ).
G++ 5.1 (or gcc 5.1 for C) also has some basic support for OpenACC; you can try that as well. There should be more implementations of OpenACC in the future.
Also, if you have access to the Intel compiler (icc, icpc), it sped up my code just by using it.
If you can use Nvidia's nvcc, you might use the Thrust library
( http://docs.nvidia.com/cuda/thrust/#axzz3g8xJPGHe ); there's a lot of sample code on their GitHub as well
( https://github.com/thrust/thrust ). However, using Thrust is not so straightforward and needs some real thinking.
EDIT:
Thrust also requires an Nvidia GPU.
For AMD cards, I believe there is a library called ArrayFire, which looks very similar to Thrust (I have not tried that one yet).
I have a single problem Ax=b with 480k float variables. The matrix A is sparse and solving it with Eigen's BiCGSTAB took 4.8 seconds.
I also worked with ViennaCL before, so I tried to solve the same problem there, and it took only 1.2 seconds. The increase in speed comes from processing on the GPU.

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any command I can use with g++ so that the resulting executable uses multiple cores, i.e. makes the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
#include <math.h>
using namespace std;

int main()
{
    float *bob = new float[50102133];
    float *jim = new float[50102133];
    float *joe = new float[50102133];
    int i, j, k, l;
    //cout << "Starting test...";
    for (i = 0; i < 50102133; i++)
        bob[i] = sin(i);
    for (j = 0; j < 50102133; j++)
        bob[j] = sin(j*j);
    for (k = 0; k < 50102133; k++)
        bob[k] = sin(sqrt(k));
    for (l = 0; l < 50102133; l++)
        bob[l] = cos(l*l);
    cout << "finished test.";
    cout << "the 100120 element is," << bob[1001200];
    return 0;
}
The most obvious choice would be to use OpenMP. Assuming your loop is one where it's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>

static const int size = 1024 * 1024 * 128;

int main() {
    double total = 0;

    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < size; i++)
        total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
    std::cout << total << "\n";
}
The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.
Use threads or processes; you may want to look at OpenMP.
C++11 has support for threading, but C++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex-A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.
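As a rough illustration of the "use threads manually" route mentioned above, here is a minimal std::thread sketch (C++11) that runs two independent loops concurrently; unlike the original code, each loop here writes to its own array so there are no data races, and the sizes mirror the question:
// build sketch: g++ -O3 -std=c++11 -pthread threads.cpp -o threads
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const int n = 50102133;
    std::vector<float> bob(n), jim(n);

    // Each loop touches only its own array, so the two can run on separate cores.
    std::thread t1([&] {
        for (int i = 0; i < n; i++)
            bob[i] = std::sin(static_cast<double>(i));
    });
    std::thread t2([&] {
        for (int j = 0; j < n; j++)
            jim[j] = std::cos(std::sqrt(static_cast<double>(j)));
    });

    t1.join();
    t2.join();

    std::cout << "bob[1001200] = " << bob[1001200]
              << ", jim[1001200] = " << jim[1001200] << "\n";
}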