How to speed up this C++ program with the Eigen library against MATLAB?

I want to use C++ for large linear algebra computations. As a first step, I wrote the comparison programs below in C++ and matlab, and I am including their astonishing execution times. Can you suggest a way to beat matlab, or at least get comparable performance? I know that C++ uses highly vectorized methods for computations. So for large scientific programs involving linear algebra, should one always choose matlab over C++? I personally think that matlab doesn't perform well for large computations, so C++ should be preferred in such cases. However, my results contradict this belief.
C++ program compiled with gcc:
#include <iostream>
#include <Eigen/Dense> // Eigen library (forward slash works on every platform)
using namespace Eigen;
using namespace std;
int main()
{
MatrixXd A;
A.setRandom(1000, 1000);
MatrixXd B;
B.setRandom(1000, 1000);
MatrixXd C;
C=A*B;
}
Execution time: 24.141 s
Here is matlab program:
function [ ] = Trial( )
clear all;
close all;
clc;
tic;
A=rand([1000,1000]);
B=rand([1000,1000]);
C=A*B;
toc
end
Elapsed time is 0.073883 seconds.

It is extremely hard to beat MATLAB, even with all optimizations turned on. To get the most out of Eigen you need to compile with parallel support (-fopenmp in gcc), and turn optimizations on (-O3). Even in this case, MATLAB will be slightly faster, mainly because it is using the Intel MKL proprietary library to get the most out of Intel chips, so unless you buy it I don't think you will be able to beat it. I am currently using Eigen for a project and wasn't able to beat MATLAB (at least not for dense matrix multiplication).
For example, for A*B where A and B are 1000 x 1000 complex matrices, the best average time I can get is:
MATLAB: 0.32 seconds
Eigen: 0.44 seconds
For 2000 x 2000,
MATLAB: 2.80 seconds
Eigen: 3.45 seconds
System: MacbookPro 2013, OS X.
PS: make absolutely sure that you turn optimizations on (-O3) and also compile with parallel support (-fopenmp). Leaving these out is probably the reason you're getting this huge difference in running time. So you should compile your program as:
g++ -O3 -fopenmp <other compiling flags/parameters> main.cpp

To get the best of Eigen, compile with optimizations ON (e.g. -O3 compiler flag), with OpenMP enabled (e.g., -fopenmp), and disable hyper-threading or specify to openmp the true number of physical cores (e.g., export OMP_NUM_THREADS=4 if you have 8 hyper-threaded "cores", but 4 physical cores).
Finally, you might also consider using the devel branch and enable AVX (e.g., -mavx) and FMA if your CPU does support FMA (e.g., -mfma).

Actually, Matlab (if you don't buy the expensive Parallel Computing Toolbox) hardly uses multi-threading. It's only used in the libraries called by Matlab, which are probably more efficient than what you're using now.
You can check this link to understand (and check) what libraries your Matlab uses http://undocumentedmatlab.com/blog/math-libraries-version-info-upgrade
It's also possible to use them in your C program (though they might have hidden the headers or something, at least you still have the .dll since they need that to run Matlab)

Related

numpy is faster and more efficient than Eigen C++?

Recently, I had a debate with a colleague about comparing python vs C++ in terms of performance. Both of us were using these languages for linear algebra mostly. So I wrote two scripts, one in python3, using numpy, and the other one in C++ using Eigen.
Python3 numpy version matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
start=time.time()
c=a*b
end=time.time()
print(end-start)
If I run this script with
python3 matmul_numpy.py
This will return:
0.07 seconds
The C++ eigen version matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include "time.h"
int main(){
clock_t start,end;
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
start=clock();
Eigen::MatrixXd c=a*b;
end=clock();
std::cout<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
return 0;}
The way I compile it is,
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -std=c++17 -o matmul_eigen
this will return (both c++11 and c++17):
0.35 seconds
This is very odd to me. 1- Why is numpy so much faster than C++ here? Am I missing any other optimization flags?
I thought maybe the python interpreter is what makes the program run faster here. So I compiled the code with cython using this thread in stacks.
The compiled python script was still faster (0.11 seconds). This raises two more questions for me:
2- Why did it take longer? Does the interpreter do any more optimization?
3- why the binary file of the python script (37 kb) is smaller than the c++(57 kb) one ?
I would appreciate any help,
thanks
The biggest issue is that you are comparing two completely different things:
In Numpy, a*b performs an element-wise multiplication since a and b are 2D arrays and are not treated as matrices. a@b performs a matrix multiplication.
In Eigen, a*b performs a matrix multiplication, not an element-wise one (see the documentation). This is because a and b are matrices, not just 2D arrays.
The two give completely different results. Moreover, a matrix multiplication runs in O(n**3) time while an element-wise multiplication runs in O(n**2) time. Matrix multiplication kernels are generally highly optimized and compute-bound, and most BLAS libraries parallelize them. Element-wise multiplications are memory-bound (especially here, due to page faults). As a result, it is not surprising that the matrix multiplication is slower than the element-wise multiplication, and the gap is not huge either because the latter is memory-bound.
On my i5-9600KF processor (with 6 cores), Numpy takes 9 ms to do a*b (sequentially) and 65 ms to do a@b (in parallel, using OpenBLAS).
Note that Numpy element-wise multiplications like this are not parallel (at least, not in the standard default implementation of Numpy). Numpy's matrix multiplication uses a BLAS library, which is generally OpenBLAS by default (this depends on the actual target platform). Eigen should also use a BLAS library, but it might not be the same one as Numpy's.
Also note that clock is not a good way to time parallel code, as it measures CPU time rather than wall-clock time (see this post for more information). std::chrono::steady_clock is generally a better alternative in C++.
3- why the binary file of the python script (37 kb) is smaller than the c++(57 kb) one ?
Python is generally compiled to bytecode, which is not native machine code. C++ is generally compiled to executable programs that contain assembled binary code, additional information used to run the program, and metadata. Bytecode is generally pretty compact because it is higher-level. Native compilers can perform optimizations that make programs bigger, such as loop unrolling and inlining. Such optimizations are not done by CPython (the default Python interpreter). In fact, CPython performs no (advanced) optimizations on the bytecode. Note that you can tell native compilers like GCC to generate smaller (though generally slower) code using flags like -Os and -s.
So, based on what I learned from @Jérôme Richard's response and the comments from @user17732522, it seems that I made two mistakes in the comparison:
1- I used the wrong multiplication in the python script. It should be np.matmul(a,b), np.dot(a,b) or a@b, not a*b, which is an element-wise multiplication.
2- I didn't measure the time in the C++ code correctly. clock_t doesn't work right for this calculation; std::chrono::steady_clock works better.
After applying these corrections, the C++ Eigen code is 10 times faster than the python one.
The updated code for matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include <chrono>
int main(){
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
auto t1=std::chrono::steady_clock::now();
Eigen::MatrixXd c=a*b;
auto t2=std::chrono::steady_clock::now();
std::cout<<(double)std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()/1000000.0f<<std::endl;
return 0;}
To compile, both the vectorization and multi-thread flags should be considered.
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -std=c++17 -march=native -fopenmp -o eigen_matmul
To use the multiple threads for running the code:
OMP_NUM_THREADS=4 ./eigen_matmul
where "4" is the number of CPU(s) that openmp can use, you can see how many you have with:
lscpu | grep "CPU(s):"
This will return 0.104 seconds.
The updated python script matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
a=np.array(a, dtype=np.float64)
b=np.array(b, dtype=np.float64)
start=time.time()
c=np.dot(a,b)
end=time.time()
print(end-start)
To run the code,
python3 matmul_numpy.py
This will return 1.0531 seconds.
As for why it behaves like this, I think @Jérôme Richard's response is a better reference.

Why is Fortran slower than Octave?

Normally, Fortran is leaps and bounds faster than Octave. However, I've noticed that when performing similar matrix manipulations with Fortran's "spread" function, compared to Octave's "repmat" function, Octave runs about twice as fast as my compiled Fortran version of the program. Is anyone able to give an explanation as to why that is? Is there something that I need to be doing in order to increase Fortran's performance?
First, here's my simple fortran program:
program test_program
double precision, parameter, dimension(1000,500) :: A = reshape([ ... ],[1000,500])
logical, dimension(:,:,:), allocatable :: blockL
integer, dimension(2) :: Adim
Adim = shape(A)
blockL = spread(A,3,Adim(1))==spread(transpose(A),1,Adim(1))
end program test_program
Now here's my corresponding program, written in Octave:
A = [ ... ]; % This is the same "A" that was used in Fortran
Adim1 = size(A,1);
blockL = repmat(A,[1 1 Adim1])==repmat(permute(A,[3 2 1]),[Adim1 1 1]);
Once compiled, the Fortran program takes about fifteen seconds to run. The Octave program takes about eight. Shouldn't a compiled program always be faster than an interpreted one? Any ideas on what I may be doing wrong, or how I could speed up my Fortran program?
I'm using the gfortran compiler on a machine that is running Lubuntu 14.04. The following shows exactly how I'm compiling it, when I type my command at the Linux console:
gfortran test_program.f08 -o test_program
I have Octave installed on the same machine, so both programs are using the same resources and hardware when being compared.
Thanks so much for your time and attention. I appreciate any guidance that anyone is able, or willing, to provide.
As many have pointed out, in the comments section, interpreted languages, like Octave, are only slow based on the number of lines, or command calls, that exist in your program. When it comes to intrinsic functions, Octave can be relatively fast.
As for building a Fortran program that is as fast as the Octave version, I ultimately turned to coarrays.
Fortran's coarray feature is an excellent way for programmers to exploit the benefits of multi-processor computers. Octave uses optimized tools, similar to tools like BLAS, in their implementation. Those tools are also capable of exploiting the parallel nature of processors by using SIMD features of modern processors. Although I'm not completely sure that coarrays use that exact implementation, on the processor level (they probably do), they do allow programmers to include parallelization in their programs.
By using coarrays, I was able to write a program that is as fast as the fastest Octave version of my program; maybe even faster.
One commenter suggested compiling my original Fortran program using the "-O3" optimization option with gfortran. In my case, doing so resulted in no speed increase.

Disable the default Armadillo parallelization in C++ when compiled with -fopenmp

In Armadillo C++, is there any way to disable the default parallelization when compiled with -fopenmp? I would like the parallelization to be in other parts of the code.
The function I'm particularly interested in is eig_sym().
Thanks very much,
Yantao
Armadillo isn't parallelized with OpenMP, with slight caveats:
The underlying LAPACK or BLAS implementation may be parallelized. If you are using OpenBLAS, it is.
The Armadillo gmm_diag implementation uses OpenMP.
So the simplest way to go is "don't use OpenBLAS, instead use a singlethreaded BLAS". But that's not the only way to go.
It sounds to me like you want to disable nested parallelism, so that the only parts of the code that are parallelized are at the higher levels of your code and not in eig_sym(). Here's some documentation on OMP_NESTED:
https://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
So you could either set the environment variable OMP_NESTED to false at runtime, or call omp_set_nested() in your code.

Eigen with EIGEN_USE_MKL_ALL

I compiled my C++ project (using Eigen 3.2.8) with the EIGEN_USE_BLAS option and linked against MKL-BLAS. Everything works fine, and that indeed speeds up my program substantially (perhaps due to a lot of complex-valued matrix-vector multiplications).
Then I also tried EIGEN_USE_MKL_ALL; however, errors like the following pop up:
/eigen3/Eigen/src/QR/ColPivHouseholderQR_MKL.h:94:1 error:
Cannot convert "Eigen::PlainObjectBase<Eigen::matrix<int,-1,1>>::Scalar*
{aka int*}" to "long long int*" in initialization
EIGEN_MKL_OR_COLPIV(...)
Two questions here:
EIGEN_USE_BLAS enables a 4x speed up though I didn't expect that much, possible reason?
EIGEN_USE_MKL_ALL seems to have some type conflict with LAPACK stuff, how to fix the compiling error?
MKL utilizes the newer AVX/AVX2 instruction sets (eight 32-bit float operations per clock, with FMA and 3-operand instructions), while Eigen 3.2.8 only supports up to SSE4 (four 32-bit float operations per clock). As indicated by ggael, you could update to 3.3beta1 to achieve better performance.
You could try Eigen 3.3-beta1. Currently I cannot reproduce your problem. You may want to provide your code sample and compile option. But based on your error message, I guess you are using ILP64 interface, which is not supported by Eigen. You could use LP64 instead.
To complete kangshiyin's answer, Eigen 3.3 supports AVX/FMA and can thus achieve similar performance. You need to compile with AVX and FMA instructions enabled. For instance, with GCC, clang, or ICC: -mavx -mfma.

Why use third-party vector libraries at all?

So I'm thinking of using the Eigen matrix library for a project I'm doing (2D space simulator). I just went ahead and profiled some code with Eigen::Vector2d, and with bare arrays. I noticed a 10x improvement in assigning values to elements in the array, and a 40x improvement in calculating the dot products.
Here is my profiling if you want to check it out, basically it's ~4.065s against ~0.110s.
Obviously bare arrays are much more efficient at dot products and assigning stuff. So why use the Eigen library (or any other library, Eigen just seemed the fastest)? Is it stability? Complicated maths that would be hard to code by yourself efficiently?
The real win for these libraries is the built-in SIMD vectorization.
It looks like eigen doesn't enable that by default and you need to enable it with a define / compiler switch. (EDIT: Misread the link, it's enabled if it detects that the compiler supports it, and you need to enable the instructions on some compilers, still, may or may not be on by default on your compiler)
(Not to mention the fact that they are typically more thoroughly tested than a home rolled solution, and enable all sorts of complicated/interesting stuff that's a real bear to code by hand)
There are a number of reasons to opt for standard library code.
Better portability. An individual developer may not have considered (or may not have access to) multiple platforms.
Better reliability. (as mentioned by Donnie) A library is usually more thoroughly tested.
Better developer mobility. It is easier to work on other people's code if they are using standard library components.
Avoids reinventing the wheel. You want to avoid a situation where each developer develops the same component in their own way.
A custom implementation can get stale soon. There's only a limited amount of time during which you would be able to keep updating and supporting your version of the library. The standard library is likely to have more support effort.
Better "external" support. Consider the C++ STL library for instance. You will find plenty of resources from people who are not the original developers. Also, textbooks will cover standard library components, which helps new users and students to learn them without any burden to the developer.
PS/Disclaimer: My apologies, I don't know about the Eigen library. The above points are from a more general perspective regarding standard library.
I just had a look at your benchmarking and get the following result:
g++ -I/usr/include/eigen3/ eigen.cpp -o eigen
g++ -O3 -I/usr/include/eigen3/ eigen.cpp -o eigen_opt
g++ -I/usr/include/eigen3/ matrix.cpp -o matrix
g++ -O3 -I/usr/include/eigen3/ matrix.cpp -o matrix_opt
./eigen 3.10s user 0.00s system 99% cpu 3.112 total
./eigen_opt 0.00s user 0.00s system 0% cpu 0.001 total
./matrix 0.06s user 0.00s system 96% cpu 0.058 total
./matrix_opt 0.00s user 0.00s system 0% cpu 0.001 total
Eigen really is not fast unless you switch on the compiler optimizations. I also suspect that the compiler in the -O3 case does some optimization that works against the benchmarking character. You might want to look into it.
I think this removes one of your points for not using a library: speed. Once that criterion is out of the way, there is NO reason that I can think of not to use an existing library, unless you want to do something for academic purposes or you want to write your own library. Whenever I see a library or other code that implements its own Matrix and Vector classes these days, I try to avoid it if possible. With Eigen around I even have a much lower need of Matlab...