Slow matrix inversion in C++ - c++

I'm currently converting MATLAB code to C++ using Armadillo, following the Armadillo documentation. However, the performance is disappointing compared to MATLAB.
In MATLAB it takes about 0.1 seconds to invert a matrix A of size 625x625, compared to over 3 seconds in C++.
In C++ I have tried both
solve()
as well as
inv()
I'm aware that inv produces less accurate results, so I would prefer not to use it. However, I really do need the inverse of matrix A, because I use its diagonal elements later in the algorithm.
The code that's producing these results:
Matlab
x=A\b
invA = A\eye(size(A))
C++
arma::mat x = solve(A,b)
arma::mat invA = solve(A,eye(625,625))
The versions I'm using:
C++:
Visual Studio 2013
Armadillo 8.300.1
Intel MKL 2018.1.156
Matlab:
matlab 2016b
version -blas
Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications, CNR branch AVX2
version -lapack
Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications, CNR branch AVX2
Linear Algebra PACKage Version 3.5.0
Does anyone have an idea how to overcome this lack of speed in C++ using armadillo?

Have you tried adding #define ARMA_USE_LAPACK to your code, before including the Armadillo header? This lets Armadillo use the optimized LAPACK routines for functions such as inv(). Also check the documentation: http://arma.sourceforge.net/docs.html

Related

numpy is faster and more efficient than Eigen C++?

Recently, I had a debate with a colleague about comparing python vs C++ in terms of performance. Both of us were using these languages for linear algebra mostly. So I wrote two scripts, one in python3, using numpy, and the other one in C++ using Eigen.
Python3 numpy version matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
start=time.time()
c=a*b
end=time.time()
print(end-start)
If I run this script with
python3 matmul_numpy.py
This will return:
0.07 seconds
The C++ eigen version matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include "time.h"
int main(){
clock_t start,end;
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
start=clock();
Eigen::MatrixXd c=a*b;
end=clock();
std::cout<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
return 0;}
The way I compile it is,
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -std=c++17 -o matmul_eigen
this will return (both c++11 and c++17):
0.35 seconds
This is very odd to me. 1- Why is numpy so much faster than C++ here? Am I missing any other optimization flags?
I thought maybe the Python interpreter is somehow executing the program faster here, so I compiled the code with Cython, following this thread on Stack Overflow.
The compiled Python script was still faster (0.11 seconds). This raises two more questions for me:
2- Why did it take longer? Does the interpreter do any additional optimization?
3- why the binary file of the python script (37 kb) is smaller than the c++(57 kb) one ?
I would appreciate any help,
thanks
The biggest issue is that you are comparing two completely different things:
In Numpy, a*b performs an element-wise multiplication, since a and b are 2D arrays and are not treated as matrices. a@b performs a matrix multiplication.
In Eigen, a*b performs a matrix multiplication, not an element-wise one (see the documentation). This is because a and b are matrices, not just 2D arrays.
The two give completely different results. Moreover, a matrix multiplication runs in O(n**3) time while an element-wise multiplication runs in O(n**2) time. Matrix multiplication kernels are generally highly optimized and compute-bound, and most BLAS libraries parallelize them. Element-wise multiplications are memory-bound (especially here, due to page faults). As a result, it is not surprising that the matrix multiplication is slower than the element-wise multiplication, and the gap is not huge either, because the latter is memory-bound.
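To make the difference between the two operations concrete, here is a minimal sketch in plain C++ with no BLAS and no Eigen. The flat row-major storage and the names Matrix, elementwise_product and matrix_product are assumptions made for this illustration only; real libraries use blocked, vectorized kernels rather than these naive loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector (an assumption of this sketch).
typedef std::vector<double> Matrix;

// Element-wise (Hadamard) product: one multiplication per entry, O(n^2) work.
Matrix elementwise_product(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n);
    for (std::size_t i = 0; i < n * n; ++i)
        c[i] = a[i] * b[i];
    return c;
}

// Matrix product: n multiply-adds for each of the n*n outputs, O(n^3) work.
Matrix matrix_product(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}
```

For a = [[1,2],[3,4]] and b = [[5,6],[7,8]], the element-wise product is [[5,12],[21,32]] while the matrix product is [[19,22],[43,50]], so a benchmark that mixes the two compares different computations.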
On my i5-9600KF processor (with 6 cores), Numpy takes 9 ms to do a*b (sequentially) and 65 ms to do a@b (in parallel, using OpenBLAS).
Note that Numpy element-wise multiplications like this are not parallel (at least, not in the standard default implementation of Numpy). Numpy's matrix multiplication uses a BLAS library, which is generally OpenBLAS by default (this depends on the actual target platform). Eigen should also use a BLAS library, but it might not be the same as the one used by Numpy.
Also note that clock is not a good way to measure parallel code, as it does not measure wall-clock time but CPU time (see this post for more information). std::chrono::steady_clock is generally a better alternative in C++.
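The gap between the two clocks can be seen with a small sketch that times the same callable with both. The names Timing, time_both and sleep_50ms are hypothetical helpers for this illustration, and the behaviour described assumes a POSIX-like system where std::clock reports processor time.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double cpu_seconds;   // from std::clock(): processor time used by the process
    double wall_seconds;  // from std::chrono::steady_clock: elapsed real time
};

// Time a callable with both clocks so the two readings can be compared.
template <typename F>
Timing time_both(F&& work) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    work();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    Timing t;
    t.cpu_seconds = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    return t;
}

// Hypothetical workload: sleeping consumes wall time but almost no CPU time.
inline void sleep_50ms() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

Calling time_both(sleep_50ms) should report at least 0.05 s of wall time but close to zero CPU time. With a multi-threaded matrix product the discrepancy goes the other way: CPU time is accumulated over all threads, so clock can report several times the elapsed time.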
3- why the binary file of the python script (37 kb) is smaller than the c++(57 kb) one ?
Python is generally compiled to bytecode, which is not native machine code. C++ is generally compiled to executable programs that contain assembled binary code, additional information used to run the program, and metadata. Bytecode is generally pretty compact because it is higher-level. Native compilers can perform optimizations that make programs bigger, such as loop unrolling and inlining. Such optimizations are not done by CPython (the default Python interpreter). In fact, CPython performs no (advanced) optimizations on the bytecode. Note that you can tell native compilers like GCC to generate smaller (though generally slower) code using flags like -Os and -s.
So, based on what I learned from @Jérôme Richard's response and the comments from @user17732522, it seems that I made two mistakes in the comparison:
1- I used the wrong multiplication in the Python script. It should be np.matmul(a,b), np.dot(a,b) or a@b, not a*b, which is an element-wise multiplication.
2- I didn't measure the time in the C++ code correctly. clock_t doesn't work right for this calculation; std::chrono::steady_clock works better.
After applying these fixes, the C++ Eigen code is 10 times faster than the Python one.
The updated code for matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include <chrono>
int main(){
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
auto t1=std::chrono::steady_clock::now();
Eigen::MatrixXd c=a*b;
auto t2=std::chrono::steady_clock::now();
std::cout<<(double)std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()/1000000.0f<<std::endl;
return 0;}
When compiling, both the vectorization and the multi-threading flags should be enabled.
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -std=c++17 -march=native -fopenmp -o eigen_matmul
To use the multiple threads for running the code:
OMP_NUM_THREADS=4 ./eigen_matmul
where "4" is the number of CPUs that OpenMP can use. You can see how many you have with:
lscpu | grep "CPU(s):"
This will return 0.104 seconds.
The updated python script matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
a=np.array(a, dtype=np.float64)
b=np.array(b, dtype=np.float64)
start=time.time()
c=np.dot(a,b)
end=time.time()
print(end-start)
To run the code,
python3 matmul_numpy.py
This will return 1.0531 seconds.
As for why this is the case, I think @Jérôme Richard's response is a better reference.

Eigen with EIGEN_USE_MKL_ALL

I compiled my C++ project (using Eigen 3.2.8) with the EIGEN_USE_BLAS option and linked against MKL BLAS. Everything works fine, and it does speed up my program substantially (perhaps because of the many complex-valued matrix-vector multiplications).
Then I also tried EIGEN_USE_MKL_ALL. However, errors like the following pop up:
/eigen3/Eigen/src/QR/ColPivHouseholderQR_MKL.h:94:1 error:
Cannot convert "Eigen::PlainObjectBase<Eigen::matrix<int,-1,1>>::Scalar*
{aka int*}" to "long long int*" in initialization
EIGEN_MKL_OR_COLPIV(...)
Two questions here:
1. EIGEN_USE_BLAS gives a 4x speedup, which is more than I expected. What is the likely reason?
2. EIGEN_USE_MKL_ALL seems to have a type conflict with the LAPACK interface. How can I fix the compile error?
MKL uses the newer AVX/AVX2 instruction sets (8 32-bit float operations per clock, with FMA and 3-operand instructions), while Eigen 3.2.8 only supports up to SSE4 (4 32-bit float operations per clock). As indicated by ggael, you could update to 3.3beta1 to achieve better performance.
You could try Eigen 3.3-beta1. Currently I cannot reproduce your problem. You may want to provide your code sample and compile option. But based on your error message, I guess you are using ILP64 interface, which is not supported by Eigen. You could use LP64 instead.
To complete kangshiyin's answer, Eigen 3.3 supports AVX/FMA and can thus achieve similar performance. You need to compile with AVX and FMA instructions enabled. For instance, with GCC, clang, or ICC: -mavx -mfma.
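A quick way to check whether such flags actually took effect is to look at the macros the compiler predefines. The sketch below assumes GCC or Clang, which define __AVX__, __AVX2__ and __FMA__ when the corresponding instruction sets are enabled; simd_summary is a hypothetical helper name, not part of Eigen.

```cpp
#include <string>

// Report which SIMD instruction sets the compiler was allowed to target.
// Assumption: GCC/Clang-style predefined macros (__AVX__, __AVX2__, __FMA__).
std::string simd_summary() {
    std::string s;
#ifdef __AVX__
    s += "AVX: on; ";
#else
    s += "AVX: off; ";
#endif
#ifdef __AVX2__
    s += "AVX2: on; ";
#else
    s += "AVX2: off; ";
#endif
#ifdef __FMA__
    s += "FMA: on";
#else
    s += "FMA: off";
#endif
    return s;
}
```

Printing simd_summary() from a build compiled with -mavx -mfma should show AVX and FMA as on; if they show as off, the flags did not reach the translation unit that includes Eigen.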

Generate samples from a gamma distribution in c++ without random, boost or chrono

I would like to generate samples from a gamma distribution in c++. I am compiling my code on a supercomputer, which does not have an up-to-date version of g++ (nor can it be updated). I have written code which uses chrono and random to generate samples from a gamma distribution. Unfortunately, the version of g++ does not allow me to compile the code (I also cannot use the flags -std=c++11 or -std=c++0x). I also cannot use boost/TR1 as the compiler does not have it.
Other than going back to basics (generating gamma from a sum of exponentials etc.) is there another way that I can sample from this distribution? I am thinking of pure c++ rather than implementing MCMC.
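For reference, the "back to basics" route mentioned above can be sketched in C++98 using only <cstdlib> and <cmath>. This is a minimal sketch under loud assumptions: it only handles an integer shape parameter, it relies on std::rand (whose statistical quality is implementation-dependent), and the names uniform01, exponential1 and gamma_integer_shape are hypothetical. Non-integer shapes need a different method, for example a rejection sampler.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Uniform draw in the open interval (0, 1), so that log(u) is always finite.
double uniform01() {
    return (std::rand() + 1.0) / (RAND_MAX + 2.0);
}

// Exponential(1) draw by inverse-transform sampling.
double exponential1() {
    return -std::log(uniform01());
}

// Gamma(shape k, scale theta) for INTEGER k >= 1:
// the sum of k independent Exponential(1) draws, multiplied by theta.
double gamma_integer_shape(int k, double theta) {
    assert(k >= 1 && theta > 0.0);
    double sum = 0.0;
    for (int i = 0; i < k; ++i)
        sum += exponential1();
    return theta * sum;
}
```

Seed once with std::srand before sampling. The sample mean of gamma_integer_shape(k, theta) should approach k * theta, which is an easy sanity check on an old compiler.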

How to speed up this C++ program with eigen library against matlab?

I want to use C++ for large linear algebra computations. As a first step, I created these comparison programs in C++ and MATLAB, and the execution times I got are astonishing. Can you suggest a way to beat MATLAB, or at least get comparable performance? I know that C++ uses highly vectorized methods for computations. So in large scientific programs involving linear algebra, should one always go for MATLAB instead of C++? I personally think that MATLAB doesn't give good performance for large computations, and therefore C++ is preferred in such cases. However, my results contradict this belief.
C++ program compiled with gcc:
#include <iostream>
#include <Eigen/Dense> // Eigen library (forward slash is portable across platforms)
using namespace Eigen;
using namespace std;
int main()
{
MatrixXd A;
A.setRandom(1000, 1000);
MatrixXd B;
B.setRandom(1000, 1000);
MatrixXd C;
C=A*B;
}
Execution time: 24.141 s
Here is matlab program:
function [ ] = Trial( )
clear all;
close all;
clc;
tic;
A=rand([1000,1000]);
B=rand([1000,1000]);
C=A*B;
toc
end
Elapsed time is 0.073883 seconds.
It is extremely hard to beat MATLAB, even with all optimizations turned on. To get the most out of Eigen you need to compile with parallel support (-fopenmp in gcc), and turn optimizations on (-O3). Even in this case, MATLAB will be slightly faster, mainly because it is using the Intel MKL proprietary library to get the most out of Intel chips, so unless you buy it I don't think you will be able to beat it. I am currently using Eigen for a project and wasn't able to beat MATLAB (at least not for dense matrix multiplication).
For example, for A*B where A and B are 1000 x 1000 complex matrices, the best average time I can get is:
MATLAB: 0.32 seconds
Eigen: 0.44 seconds
For 2000 x 2000,
MATLAB: 2.80 seconds
Eigen: 3.45 seconds
System: MacbookPro 2013, OS X.
PS: you should make absolutely sure that you turn optimizations on (-O3) and also compile with parallel support, -fopenmp. This is the reason you're probably getting this huge difference in running time. So you should compile your program as:
g++ -O3 -fopenmp <other compiling flags/parameters> main.cpp
To get the best of Eigen, compile with optimizations ON (e.g. -O3 compiler flag), with OpenMP enabled (e.g., -fopenmp), and disable hyper-threading or specify to openmp the true number of physical cores (e.g., export OMP_NUM_THREADS=4 if you have 8 hyper-threaded "cores", but 4 physical cores).
Finally, you might also consider using the devel branch and enable AVX (e.g., -mavx) and FMA if your CPU does support FMA (e.g., -mfma).
Actually, Matlab (if you don't buy the expensive Parallel Computing Toolbox) hardly uses multi-threading itself. It's only used in the libraries called by Matlab, which are probably more efficient than what you're using now.
You can check this link to understand (and check) what libraries your Matlab uses http://undocumentedmatlab.com/blog/math-libraries-version-info-upgrade
It's also possible to use them in your C program (though they might have hidden the headers or something, at least you still have the .dll since they need that to run Matlab)

Matrix exponential with Armadillo

I am currently developing with my own C++/Mex code and Matlab, but my project is getting big and I am considering switching to a proper linear algebra library. I have read very good things about Armadillo, but I can't find a few essential functions I need for my project.
I understand Armadillo links to LAPACK and BLAS libraries, but I couldn't find the matrix exponential function in Armadillo's API, nor in the LAPACK functions.
Can anyone tell me if there is an add-on to compute matrix exponentials with Armadillo? If so, a short example code would be much appreciated.
The matrix exponential is something Matlab has. So Octave implemented it. So other Free Software projects looked at what Octave has and reimplemented it by borrowing this implementation.
I work a lot with R and Armadillo via the RcppArmadillo package (of which I'm a co-author). In one piece of recent work I needed expm() and borrowed it for use with Armadillo from the R package expm.
The code goes like this:
arma::mat expm(arma::mat x) {
arma::mat z(x.n_rows, x.n_cols);
(*expmat)(x.begin(), x.n_rows, z.begin(), Ward_2);
return z;
}
but it hinges, of course, on the fact that I get the function pointer to expmat from the R package expm. The full file is here on Github, which has the enum Ward_2 as well.
This has been added as expmat to the latest release, see
http://arma.sourceforge.net/docs.html#expmat
Here's a port to Armadillo of John Burkardt's C/C++ implementation of one of the "nineteen dubious ways" to compute the exponential of a matrix...
https://gist.github.com/tesch1/0c03e43885cd66eceabe
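For readers who only want to see the idea, one of the simpler approaches to the matrix exponential, scaling and squaring with a truncated Taylor series, can be sketched in plain C++. This is not the algorithm behind Armadillo's expmat nor the code in the gist above; the names Mat, identity, multiply and expm_taylor, the 0.5 norm threshold and the default of 18 terms are all assumptions of this sketch, and the input is assumed to contain only finite values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Mat;  // row-major n x n matrix (assumption of this sketch)

Mat identity(std::size_t n) {
    Mat m(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) m[i * n + i] = 1.0;
    return m;
}

Mat multiply(const Mat& a, const Mat& b, std::size_t n) {
    Mat c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// exp(A) = (exp(A / 2^s))^(2^s): scale A down, sum a Taylor series, square back up.
Mat expm_taylor(const Mat& a, std::size_t n, int terms = 18) {
    assert(a.size() == n * n);
    // Infinity norm (largest absolute row sum) decides how much to scale.
    double norm = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double row = 0.0;
        for (std::size_t j = 0; j < n; ++j) row += std::fabs(a[i * n + j]);
        if (row > norm) norm = row;
    }
    int squarings = 0;
    double scale = 1.0;
    while (norm * scale > 0.5) { scale *= 0.5; ++squarings; }

    Mat s(a);
    for (std::size_t i = 0; i < s.size(); ++i) s[i] *= scale;

    // Taylor series I + S + S^2/2! + ... ; term holds S^k / k!.
    Mat result = identity(n);
    Mat term = identity(n);
    for (int k = 1; k <= terms; ++k) {
        term = multiply(term, s, n);
        for (std::size_t i = 0; i < term.size(); ++i) {
            term[i] /= k;
            result[i] += term[i];
        }
    }
    // Undo the scaling by repeated squaring.
    for (int q = 0; q < squarings; ++q) result = multiply(result, result, n);
    return result;
}
```

As a sanity check, expm_taylor applied to the 2x2 zero matrix gives the identity, and applied to [[0,1],[0,0]] it gives [[1,1],[0,1]]. For production work, a library routine such as Armadillo's expmat is the safer choice.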