numpy is faster and more efficient than Eigen C++?

Recently, I had a debate with a colleague about comparing Python vs C++ in terms of performance. Both of us use these languages mostly for linear algebra. So I wrote two scripts: one in Python 3 using numpy, and the other in C++ using Eigen.
Python3 numpy version matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
start=time.time()
c=a*b
end=time.time()
print(end-start)
If I run this script with
python3 matmul_numpy.py
This will return:
0.07 seconds
The C++ eigen version matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include <ctime>

int main(){
    clock_t start, end;
    size_t n = 2000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
    start = clock();
    Eigen::MatrixXd c = a * b;
    end = clock();
    std::cout << (double)(end - start) / CLOCKS_PER_SEC << std::endl;
    return 0;
}
The way I compile it is,
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -std=c++17 -o matmul_eigen
This returns the same result with both -std=c++11 and -std=c++17:
0.35 seconds
This is very odd to me. 1- Why is numpy so much faster than C++ here? Am I missing any other optimization flags?
I thought maybe it was the Python interpreter that was somehow executing the program faster here, so I compiled the script with Cython following this thread.
The compiled Python script was still faster (0.11 seconds). This again raises two more questions for me:
2- Why did it get slower when compiled? Does the interpreter do any additional optimization?
3- Why is the binary of the Python script (37 kB) smaller than the C++ one (57 kB)?
I would appreciate any help,
thanks

The biggest issue is that you are comparing two completely different things:
In Numpy, a*b performs an element-wise multiplication, since a and b are 2D arrays and are not treated as matrices. a@b performs a matrix multiplication.
In Eigen, a*b performs a matrix multiplication, not an element-wise one (see the documentation). This is because a and b are matrices, not just 2D arrays.
The two give completely different results. Moreover, a matrix multiplication runs in O(n**3) time while an element-wise multiplication runs in O(n**2) time. Matrix-multiplication kernels are generally highly optimized and compute-bound; they are parallelized by most BLAS libraries. Element-wise multiplications are memory-bound (especially here, due to page faults). As a result, it is not surprising that the matrix multiplication is slower than an element-wise multiplication, and the gap is not huge either, because the latter is memory-bound.
On my i5-9600KF processor (with 6 cores), Numpy takes 9 ms to do a*b (sequentially) and 65 ms to do a@b (in parallel, using OpenBLAS).
Note that Numpy element-wise multiplications like this are not parallel (at least not in the standard default implementation of Numpy). The matrix multiplication in Numpy uses a BLAS library, which is generally OpenBLAS by default (this depends on the actual target platform). Eigen uses its own matrix-multiplication kernels by default, though it can be configured to call a BLAS library, which may not be the same one Numpy uses.
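For reference, here is what the two operations look like on the Eigen side; a minimal sketch (the timing harness and matrix size are illustrative, not from the original benchmark):
#include <iostream>
#include <chrono>
#include <Eigen/Dense>

int main(){
    const int n = 2000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);

    auto t0 = std::chrono::steady_clock::now();
    Eigen::MatrixXd ew = a.cwiseProduct(b); // element-wise product, the analogue of numpy's a*b, O(n^2)
    auto t1 = std::chrono::steady_clock::now();
    Eigen::MatrixXd mm = a * b;             // matrix product, the analogue of numpy's a@b, O(n^3)
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "element-wise:   " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    std::cout << "matrix product: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    return 0;
}
The element-wise product is the one that should be compared against the original Python script.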
Also note that clock is not a good way to measure parallel code, as it measures CPU time rather than wall-clock time (see this post for more information). std::chrono::steady_clock is generally a better alternative in C++.
3- Why is the binary of the Python script (37 kB) smaller than the C++ one (57 kB)?
Python is generally compiled to bytecode, which is not native assembly code. C++ is generally compiled to an executable program that contains assembled binary code, additional information used to run the program, and meta-information. Bytecode is generally quite compact because it is higher-level. Native compilers can perform optimizations that make programs bigger, such as loop unrolling and inlining. Such optimizations are not done by CPython (the default Python interpreter); in fact, CPython performs no (advanced) optimizations on the bytecode. Note that you can tell native compilers like GCC to generate smaller (though generally slower) code using flags like -Os and -s.
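For example, recompiling the Eigen program above for size rather than speed would look like this (the resulting size difference depends on your toolchain, so treat the command as illustrative):
g++ matmul_eigen.cpp -I/usr/include/eigen3 -Os -s -std=c++17 -o matmul_eigen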

So, based on what I learned from @Jérôme Richard's response and the comments of @user17732522, it seems that I made two mistakes in the comparison:
1- I made a mistake defining the multiplication in the Python script. It should be np.matmul(a,b), np.dot(a,b), or a@b, not a*b, which is an element-wise multiplication.
2- I didn't measure the time in the C++ code correctly. clock_t doesn't work right for this measurement; std::chrono::steady_clock works better.
With these fixes applied, the C++ Eigen code is about 10 times faster than the Python one.
The updated code for matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include <chrono>

int main(){
    size_t n = 2000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
    auto t1 = std::chrono::steady_clock::now();
    Eigen::MatrixXd c = a * b;
    auto t2 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    return 0;
}
To compile it, both the vectorization (-march=native) and multi-threading (-fopenmp) flags should be enabled:
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -std=c++17 -march=native -fopenmp -o eigen_matmul
To run the code with multiple threads:
OMP_NUM_THREADS=4 ./eigen_matmul
where "4" is the number of CPU(s) that openmp can use, you can see how many you have with:
lscpu | grep "CPU(s):"
This will return 0.104 seconds.
The updated python script matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
# np.random.rand already returns float64 arrays, so these casts are redundant
a=np.array(a, dtype=np.float64)
b=np.array(b, dtype=np.float64)
start=time.time()
c=np.dot(a,b)
end=time.time()
print(end-start)
To run the code,
python3 matmul_numpy.py
This will return 1.0531 seconds.
About the reason why it is like this, @Jérôme Richard's response is the better reference.

Related

Why is Fortran slower than Octave?

Normally, Fortran is leaps and bounds faster than Octave. However, I've noticed that when performing similar matrix manipulations with Fortran's "spread" function, compared to Octave's "repmat" function, Octave runs about twice as fast as my compiled Fortran version of the program. Is anyone able to give an explanation as to why that is? Is there something that I need to be doing in order to increase Fortran's performance?
First, here's my simple fortran program:
program test_program
double precision, parameter, dimension(1000,500) :: A = reshape([ ... ],[1000,500])
logical, dimension(:,:,:), allocatable :: blockL
integer, dimension(2) :: Adim
Adim = shape(A)
blockL = spread(A,3,Adim(1))==spread(transpose(A),1,Adim(1))
end program test_program
Now here's my corresponding program, written in Octave:
A = [ ... ]; % This is the same "A" that was used in Fortran
Adim1 = size(A,1);
blockL = repmat(A,[1 1 Adim1])==repmat(permute(A,[3 2 1]),[Adim1 1 1]);
Once compiled, the Fortran program takes about fifteen seconds to run. The Octave program takes about eight. Shouldn't a compiled program always be faster than an interpreted one? Any ideas on what I may be doing wrong, or how I could speed up my Fortran program?
I'm using the gfortran compiler on a machine that is running Lubuntu 14.04. The following shows exactly how I'm compiling it, when I type my command at the Linux console:
gfortran test_program.f08 -o test_program
I have Octave installed on the same machine, so both programs are using the same resources and hardware when being compared.
Thanks so much for your time and attention. I appreciate any guidance that anyone is able, or willing, to provide.
As many have pointed out in the comments section, interpreted languages like Octave are only slow in proportion to the number of interpreted lines, or command calls, in your program. When it comes to intrinsic functions, Octave can be relatively fast.
As for building a Fortran program that is as fast as the Octave version, I ultimately turned to coarrays.
Fortran's coarray feature is an excellent way for programmers to exploit the benefits of multi-processor computers. Octave's implementation uses optimized tools, similar to BLAS, and those tools are also capable of exploiting the parallel nature of processors through the SIMD features of modern CPUs. Although I'm not completely sure that coarrays use that exact implementation at the processor level (they probably do), they do allow programmers to build parallelization into their programs.
By using coarrays, I was able to write a program that is as fast as the fastest Octave version of my program; maybe even faster.
One commenter suggested compiling my original Fortran program using the "-O3" optimization option with gfortran. In my case, doing so resulted in no speed increase.

Slow matrix inversion in C++

I'm currently trying to convert Matlab code to C++ using Armadillo. I converted some Matlab code to C++ by following the Armadillo documentation. However, the performance is disappointing compared to Matlab.
In Matlab it takes about 0.1 seconds to invert a matrix A of size 625x625, compared to over 3 seconds in C++.
In C++ I have tried both
solve()
as well as
inv()
I'm aware of the fact that inv produces less accurate results, so I prefer not to use it. But I really do need the inverse of matrix A, as I use its diagonal elements later in the algorithm.
The code that's producing these results:
Matlab
x=A\b
invA = A\eye(size(A))
C++
arma::mat x = arma::solve(A, b);
arma::mat invA = arma::solve(A, arma::eye(625, 625));
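For what it's worth, here is a minimal, self-contained harness to time the approaches; this is a sketch using an artificial, well-conditioned test matrix rather than the original A (arma::wall_clock measures wall time, which matters once MKL runs multi-threaded):
#include <iostream>
#include <armadillo>

int main(){
    const arma::uword n = 625;
    // artificial, well-conditioned test matrix; the real code would use the actual A and b
    arma::mat A = arma::randu<arma::mat>(n, n) + double(n) * arma::eye(n, n);
    arma::vec b = arma::randu<arma::vec>(n);

    arma::wall_clock timer;

    timer.tic();
    arma::vec x = arma::solve(A, b);                  // single right-hand side
    std::cout << "solve(A,b):   " << timer.toc() << " s\n";

    timer.tic();
    arma::mat invA = arma::solve(A, arma::eye(n, n)); // full inverse via solve, as in the question
    std::cout << "solve(A,eye): " << timer.toc() << " s\n";

    timer.tic();
    arma::mat invA2 = arma::inv(A);                   // direct inverse, generally less accurate
    std::cout << "inv(A):       " << timer.toc() << " s\n";
    return 0;
}
How it is compiled matters at least as much as the code: an optimized build linked against a fast LAPACK such as MKL (e.g. g++ bench.cpp -O3 -larmadillo) should handle a 625x625 inverse in milliseconds, so a 3-second timing usually points at a debug build or a slow reference LAPACK.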
The versions I'm using:
C++:
Visual Studio 2013
Armadillo 8.300.1
Intel MKL 2018.1.156
Matlab:
matlab 2016b
version -blas
Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications, CNR branch AVX2
version -lapack
Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications, CNR branch AVX2
Linear Algebra PACKage Version 3.5.0
Does anyone have an idea how to overcome this lack of speed in C++ using armadillo?
Have you tried adding #define ARMA_USE_LAPACK to your code? This lets your code use a more optimized version of the .inv() function. Also check the documentation: http://arma.sourceforge.net/docs.html
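The define has to appear before the Armadillo header to take effect; a minimal sketch:
// configuration macros must precede the header
#define ARMA_USE_LAPACK
#include <armadillo>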

Eigen with EIGEN_USE_MKL_ALL

I compiled my C++ project (using Eigen 3.2.8) with the EIGEN_USE_BLAS option and linked against MKL's BLAS. Everything works fine, and it indeed speeds up my program substantially (perhaps due to a lot of complex-valued matrix-vector multiplications).
Then I also tried EIGEN_USE_MKL_ALL; however, some similar errors pop up:
/eigen3/Eigen/src/QR/ColPivHouseholderQR_MKL.h:94:1 error:
Cannot convert "Eigen::PlainObjectBase<Eigen::matrix<int,-1,1>>::Scalar*
{aka int*}" to "long long int*" in initialization
EIGEN_MKL_OR_COLPIV(...)
Two questions here:
EIGEN_USE_BLAS enables a 4x speed-up, though I didn't expect that much. What is the possible reason?
EIGEN_USE_MKL_ALL seems to have some type conflict with the LAPACK stuff. How can I fix the compile error?
MKL utilizes the newer AVX/AVX2 instruction sets (8 32-bit float operations per clock, with FMA and 3-operand instructions), while Eigen 3.2.8 only supports up to SSE4 (4 32-bit float operations per clock). As indicated by ggael, you could update to 3.3-beta1 to achieve better performance.
You could try Eigen 3.3-beta1. Currently I cannot reproduce your problem; you may want to provide a code sample and your compile options. But based on your error message, I guess you are using the ILP64 interface, which is not supported by Eigen. You could use LP64 instead.
To complete kangshiyin's answer: Eigen 3.3 supports AVX/FMA and can thus achieve similar performance. You need to compile with AVX and FMA instructions enabled, for instance with GCC, clang, or ICC: -mavx -mfma.
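For illustration, a minimal sketch of how the EIGEN_USE_BLAS path is wired up (the macro must be defined before any Eigen header; the MKL link line depends on your install, so it is not shown):
#define EIGEN_USE_BLAS  // routes Eigen's dense products to the external BLAS
#include <Eigen/Dense>
#include <iostream>

int main(){
    const int n = 1000;
    Eigen::MatrixXcf A = Eigen::MatrixXcf::Random(n, n); // complex-valued, as in the question
    Eigen::VectorXcf v = Eigen::VectorXcf::Random(n);
    Eigen::VectorXcf w = A * v;        // dispatched to a BLAS gemv-style kernel instead of Eigen's own
    std::cout << w.norm() << std::endl; // use the result so the product is not optimized away
    return 0;
}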

How to speed up this C++ program with eigen library against matlab?

I want to use C++ for big linear algebra computations. As a starting step, I created these comparison programs in C++ and Matlab, and I am also giving the astonishing execution times here. Can you suggest a way to beat Matlab, or at least get comparable performance? I know that C++ uses highly vectorized methods for computations. So in large scientific programming involving linear algebra, should one always go for Matlab instead of C++? I personally think that Matlab doesn't give good performance for large computations, and that C++ should therefore be preferred in such cases; however, my program's results go contrary to this belief.
C++ program compiled with gcc:
#include <iostream>
#include <Eigen/Dense> //EIGEN library

using namespace Eigen;
using namespace std;

int main()
{
    MatrixXd A;
    A.setRandom(1000, 1000);
    MatrixXd B;
    B.setRandom(1000, 1000);
    MatrixXd C;
    C = A * B;
}
Execution time: 24.141 s
Here is matlab program:
function [ ] = Trial( )
clear all;
close all;
clc;
tic;
A=rand([1000,1000]);
B=rand([1000,1000]);
C=A*B;
toc
end
Elapsed time is 0.073883 seconds.
It is extremely hard to beat MATLAB, even with all optimizations turned on. To get the most out of Eigen you need to compile with parallel support (-fopenmp in gcc) and turn optimizations on (-O3). Even then, MATLAB will be slightly faster, mainly because it uses the proprietary Intel MKL library to get the most out of Intel chips, so unless you buy it I don't think you will be able to beat it. I am currently using Eigen for a project and wasn't able to beat MATLAB (at least not for dense matrix multiplication).
For example, for A*B where A and B are 1000 x 1000 complex matrices, the best average time I can get is:
MATLAB: 0.32 seconds
Eigen: 0.44 seconds
For 2000 x 2000,
MATLAB: 2.80 seconds
Eigen: 3.45 seconds
System: MacbookPro 2013, OS X.
PS: you should make absolutely sure that you turn optimizations on (-O3) and also compile with parallel support (-fopenmp). This is probably the reason you're seeing such a huge difference in running time. So you should compile your program as:
g++ -O3 -fopenmp <other compiling flags/parameters> main.cpp
To get the best out of Eigen, compile with optimizations on (e.g., the -O3 compiler flag) and OpenMP enabled (e.g., -fopenmp), and disable hyper-threading or tell OpenMP the true number of physical cores (e.g., export OMP_NUM_THREADS=4 if you have 8 hyper-threaded "cores" but 4 physical cores).
Finally, you might also consider using the devel branch and enabling AVX (e.g., -mavx) and FMA if your CPU supports FMA (e.g., -mfma).
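As a quick sanity check that the parallel build is actually in effect, Eigen can report how many threads its products will use; a minimal sketch (the explicit thread count of 4 is illustrative):
#include <iostream>
#include <Eigen/Dense>

int main(){
    Eigen::setNbThreads(4); // optional: pin the number of threads Eigen's products use
    // prints 1 when the build has no OpenMP support, i.e. -fopenmp was missing
    std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;

    Eigen::MatrixXd A = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd C = A * B;          // the matrix product is the part that runs multi-threaded
    std::cout << C(0, 0) << std::endl;  // use the result so the product is not optimized away
    return 0;
}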
Actually, Matlab itself (if you don't buy the expensive Parallel Computing Toolbox) hardly uses multi-threading. It is only used in the libraries called by Matlab, which are probably more efficient than what you're using now.
You can check this link to understand (and check) which libraries your Matlab uses: http://undocumentedmatlab.com/blog/math-libraries-version-info-upgrade
It's also possible to use them in your C program (though they might have hidden the headers or something; at least you still have the .dll files, since Matlab needs them to run).

Why does trivial loop in python run so much slower than the same in C++? And how to optimize that? [duplicate]

This question already has answers here:
Why are Python Programs often slower than the Equivalent Program Written in C or C++?
(11 answers)
Closed 9 years ago.
Simply run a near-empty for loop in Python and in C++ (as follows); the speeds are very different: the Python version is more than a hundred times slower.
a = 0
for i in xrange(large_const):
    a += 1
int a = 0;
for (int i = 0; i < large_const; i++)
    a += 1;
Plus, what can I do to optimize the speed of python?
(Addition:
I made a bad example in the first version of this question. I don't really mean that a=1, so that a C/C++ compiler could optimize it away; I mean that the loop itself consumes a lot of resources (maybe I should have used a += 1 as the example). And what I mean by "how to optimize" is: if the body of the for loop is as simple as a += 1, how could it run at a speed similar to C/C++? In my practice I use Numpy, so I can't use pypy anymore (for now); are there general methods for making loops far quicker (such as generators when building lists)?
)
A smart C compiler can probably optimize your loop away by recognizing that, at the end, a will always have the same value. Python can't do that, because when iterating over xrange it needs to call __next__ on the xrange object until it raises StopIteration. Python can't know whether __next__ will have side effects until it calls it, so there is no way to optimize the loop away. The take-away message from this paragraph is that it is MUCH HARDER to optimize a Python "compiler" than a C compiler, because Python is such a dynamic language that the compiler would need to know how an object will behave in certain circumstances. In C, that's much easier, because C knows exactly what type every object is ahead of time.
Of course, compiler aside, Python needs to do a lot more work. In C, you're working with base types using operations supported by hardware instructions. In Python, the interpreter is interpreting the bytecode one instruction at a time in software. Clearly, that is going to take longer than machine-level instructions. And the data model (e.g., calling __next__ over and over again) can also lead to a lot of function calls which C doesn't need to make. Of course, Python does this to make it much more flexible than a compiled language.
The typical way to speed up Python code is to use libraries or intrinsic functions which provide a high-level interface to low-level compiled code. scipy and numpy are excellent examples of this kind of library. Other things you can look into are using pypy, which includes a JIT compiler (you probably won't reach native speeds, but it will probably beat CPython, the most common implementation), or writing extensions in C/Fortran using the CPython API, cython, or f2py for performance-critical sections of code.
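On the C++ side, the point about the compiler optimizing the loop away is easy to demonstrate; a minimal sketch (volatile is used here purely to force the compiler to keep the loop, and the constant is illustrative):
#include <chrono>
#include <iostream>

int main(){
    const long large_const = 100000000L;

    // Without 'volatile', -O2/-O3 typically folds this whole loop into a single constant,
    // and the measured time drops to ~0 regardless of large_const.
    volatile long a = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < large_const; i++)
        a += 1;
    auto t1 = std::chrono::steady_clock::now();

    std::cout << a << " computed in "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    return 0;
}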
Simply because Python is a more high-level language and has to do many more things on every iteration (like acquiring locks, resolving variables, etc.).
"How to optimize" is a very vague question. There is no "general" way to optimize any Python program (everything possible was already done by the developers of Python). Your particular example can be optimized this way:
a = 1
That's what any C compiler will do, by the way.
If your program works with numeric data, then using numpy and its vectorized routines often gives you a great performance boost, as it does everything in pure C (using C loops, not Python ones) and doesn't have to take the interpreter lock and all that.
Python is (usually) an interpreted language, meaning that the script is read at runtime and its instructions compiled into usable bytecode at that point.
C is (usually) a compiled language, so by the time you're running it you're working with pure machine code.
Python will never be as fast as C, for that reason.
Edit: in fact, Python compiles your source into bytecode at runtime and caches it; that's what those .pyc files contain (bytecode, not C code).
As you go more abstract, the speed goes down. The fastest code is assembly code, written directly.
Read this question Why are Python Programs often slower than the Equivalent Program Written in C or C++?