Eigen Matrix Multiplication Speed - C++

I am doing numerical linear algebra in C++. I used Python NumPy for quick prototyping and would now like a C++ linear algebra package for some further speedup. Eigen seems like a good starting point.
I wrote a small performance test using large dense matrix multiplication to measure processing speed. In NumPy I was doing this:
import numpy as np
import time
a = np.random.uniform(size = (5000, 5000))
b = np.random.uniform(size = (5000, 5000))
start = time.time()
c = np.dot(a, b)
print((time.time() - start) * 1000, 'ms')
In C++ Eigen I was doing this:
#include <time.h>
#include <iostream>  // needed for std::cout
#include "Eigen/Dense"
using namespace std;
using namespace Eigen;
int main() {
    MatrixXf a = MatrixXf::Random(5000, 5000);
    MatrixXf b = MatrixXf::Random(5000, 5000);
    clock_t start = clock();
    MatrixXf c = a * b;
    cout << (double)(clock() - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;
    return 0;
}
I have done some searching in the documentation and on Stack Overflow for compiler optimization flags. I compiled the program using this command:
g++ -g test.cpp -o test -Ofast -msse2
The C++ executable compiled with the -Ofast optimization flag runs about 30x or more faster than a build with no optimization, returning the result in roughly 10000ms on my 2015 MacBook Pro.
Meanwhile, NumPy returns the result in about 1800ms.
I was expecting a performance boost from Eigen compared with NumPy, but it fell short of my expectation.
Are there any compile flags I missed that would further boost Eigen's performance here? Or is there a multithreading switch that can be turned on for an extra performance gain? I am just curious about this.
Thank you very much!
Edit on April 17, 2016:
After doing some searching based on @ggael's answer, I have come up with an answer to this question.
The best solution is to compile with Intel MKL linked as the backend for Eigen. For an OS X system, the library can be found here. With MKL installed, I used the Intel MKL Link Line Advisor to enable MKL backend support for Eigen.
I compile like this to enable all of MKL:
g++ -DEIGEN_USE_MKL_ALL -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl -m64 -I${MKLROOT}/include -I. -Ofast -DNDEBUG test.cpp -o test
If you get an environment variable error for MKLROOT, just run the environment setup script provided in the MKL package, which is installed by default at /opt/intel/mkl/bin on my device.
With MKL as the Eigen backend, the multiplication of two 5000x5000 matrices finishes in about 900ms on my 2.5GHz MacBook Pro. This is much faster than Python NumPy on my device.
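For reference, here is a minimal sketch of the same benchmark with the MKL backend switched on from the source instead of the command line. EIGEN_USE_MKL_ALL is the documented Eigen switch; the std::chrono wall-clock timing is my own substitution, since clock() sums CPU time across threads and can overstate multithreaded runtimes. It still needs the MKL include and link flags shown above.
// Minimal sketch: enable the MKL backend for all supported Eigen operations.
#define EIGEN_USE_MKL_ALL
#include <chrono>
#include <iostream>
#include "Eigen/Dense"
int main() {
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(5000, 5000);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(5000, 5000);
    auto start = std::chrono::steady_clock::now();
    Eigen::MatrixXf c = a * b;
    auto stop = std::chrono::steady_clock::now();
    std::cout << c(0, 0) << "\n"; // use the result so the product is not optimized away
    std::cout << std::chrono::duration<double, std::milli>(stop - start).count() << " ms" << std::endl;
    return 0;
}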

To answer on the OS X side: first of all, recall that on OS X g++ is actually an alias for clang++, and the current Apple version of clang does not support OpenMP. Nonetheless, using Eigen 3.3-beta1 and the default clang++, I get on a 2.6GHz MacBook Pro:
$ clang++ -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG && ./a.out
2954.91ms
Then, to get support for multithreading, you need a recent clang or gcc compiler, for instance from Homebrew or MacPorts. Here, using gcc 5 from MacPorts, I get:
$ g++-mp-5 -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG -fopenmp -Wa,-q && ./a.out
804.939ms
and with clang 3.9:
$ clang++-mp-3.9 -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG -fopenmp && ./a.out
806.16ms
Note that gcc on OS X does not know how to properly assemble AVX/FMA instructions, so you need to tell it to use the native assembler with the -Wa,-q flag.
Finally, with the devel branch, you can also tell Eigen to use any BLAS as a backend, for instance the one from Apple's Accelerate framework, as follows:
$ g++ -framework Accelerate -DEIGEN_USE_BLAS -O3 -DNDEBUG so_gemm_perf.cpp -I ../eigen && ./a.out
802.837ms
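To check whether an OpenMP-enabled build really runs the product on several threads, here is a small sketch using Eigen's threading helpers (Eigen::nbThreads() and Eigen::setNbThreads() from Eigen/Core; the 2000x2000 size is just an example):
#include <iostream>
#include "Eigen/Dense"
int main() {
    // With an OpenMP build (-fopenmp), Eigen parallelizes large matrix products.
    std::cout << "Eigen is using " << Eigen::nbThreads() << " threads" << std::endl;
    Eigen::setNbThreads(4); // optionally cap Eigen's own thread count
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(2000, 2000);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(2000, 2000);
    Eigen::MatrixXf c = a * b;
    std::cout << c(0, 0) << std::endl; // keep the product from being optimized away
    return 0;
}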

Compiling your little program with VC2013:
/fp:precise - 10.5s
/fp:strict - 10.4s
/fp:fast - 10.3s
/fp:fast /arch:AVX2 - 6.6s
/fp:fast /arch:AVX2 /openmp - 2.7s
So using AVX/AVX2 and enabling OpenMP is going to help a lot. You can also try linking against MKL (http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).

Related

Performance comparison issue between OpenMPI and Intel MPI

I am working with a C++ MPI code which takes 1 min 12 s when compiled with OpenMPI and 16 s with Intel MPI (I have tested it on other inputs too; the difference is similar, and both compiled codes give the correct answer). I want to understand why there is such a big difference in run time, and what can be done to decrease the run time with OpenMPI (GCC).
I am using CentOS 6 with an Intel Haswell processor.
I am using following flags for compiling.
openMPI (GCC): mpiCC -Wall -O3
I have also tried -march=native -funroll-loops. It does not make a great difference. I have also tried -lm option. I cannot compile for 32 bit.
Intel MPI: mpiicpc -Wall -O3 -xhost
-xhost saves 3 seconds in run time.

Equivalent of mpif90 --showme for Cray Fortran Wrapper ftn

I am currently compiling code on an HPC system that was set up by Cray. To call the Fortran, C, and C++ compilers, it is suggested to use the ftn, cc, and CC compiler wrappers provided by Cray.
Now, I would like to know which options the ftn wrapper adds to the actual compiler call (in my case to ifort, but it should not matter). From working with MPI wrappers I know the option --showme to get this information:
> mpif90 --showme
pgf90 -I/opt/openmpi/pgi/ib/include -fast -I/opt/openmpi/pgi/ib/lib -L/opt/openmpi/pgi/ib/lib -lmpi_f90 -lmpi_f77 -lmpi -libverbs -lrt -lnsl -lutil -ldl -lm -lrt -lnsl -lutil
## example from another HPC system; MPI wrapper around Portland Fortran Group Compiler
I am looking for an option like --OPTION_TO_GET_APPENDED_FLAGS that provides the same information for the ftn wrapper:
> ftn --OPTION_TO_GET_APPENDED_FLAGS
ifort -one_option -O2 -another_option
Because it is Friday afternoon local time, all colleagues with knowledge on this topic have already left for the weekend (as has the cluster support team).
Thanks in advance for the answers.
On the Cray system I am using (Cray Linux Environment (CLE), 27th Apr. 2016), the appropriate option is -craype-verbose:
ftn -craype-verbose
> ifort -xCORE-AVX2 -static -D__CRAYXC [...]
It is written in the man page, which I had only scanned quickly before asking this question:
-craype-verbose
Print the command which is forwarded to compiler invocation.

g++ and cilkscreen for detecting race condition

I'm trying to use cilkscreen to detect some race conditions in a code.
I'm compiling my code using
g++-5 -g foo.cpp -fcilkplus -std=c++14 -lcilkrts -ldl -O2
However, when I launch cilkscreen I get the following error message:
cilkview ./a.out
Cilkview: Generating scalability data
Cilkview Scalability Analyzer V2.0.0, Build 4421
1100189201
Error: No Cilk code found in program
Should I add some more options to g++? Or does cilkscreen only work with code compiled with icc?
FWIW: I'm using
gcc version 5.3.1 20160301 [gcc-5-branch revision 233849] (SUSE Linux)
Cilkscreen/cilkview works only with icc/icpc.
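For context, here is a minimal sketch of the kind of Cilk Plus code cilkscreen instruments when the program is built with icc/icpc; the deliberate race on sum is exactly the sort of thing it reports (the loop bound is arbitrary):
#include <cilk/cilk.h> // Cilk Plus keywords: cilk_for, cilk_spawn, cilk_sync
#include <iostream>
int main() {
    long sum = 0;
    // Deliberate data race: every iteration updates sum without synchronization.
    cilk_for (long i = 0; i < 1000000; ++i) {
        sum += i;
    }
    std::cout << sum << std::endl;
    return 0;
}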

Rcpp with Intel MKL Multithreading

I wrote a C++ shared library that uses Intel MKL for BLAS operations, and it threads beautifully, using all 12 cores of the machine. I am now trying to use Rcpp to call a function from my library, and I am finding that it is single-threaded. That is, for the same data, when the same function is called from C++ it uses all 12 cores very quickly, whereas when Rcpp calls it, it is single-threaded and takes much longer (but the results are consistent).
Intel MKL is dynamically linked to my library as follows:
Makefile:
LIBRARIES=-lpthread -Wl,--no-as-needed -L<directory>bin -liomp5 -L<bin_directory> -lmkl_mc3 -lmkl_intel_lp64 -lmkl_gnu_thread -ldl -lmkl_core -lm -DMKL_ILP64 -fopenmp
LFLAGS=-O3 -I/opt/intel/composer_xe_2015/mkl/include -std=c++0x -m64
#Compiles the shared library
g++ -fPIC -shared <cpp files> -oliblibrary.so $(LIBRARIES) -O3 -I/opt/intel/composer_xe_2015/mkl/include -std=c++0x -m64
#Compile a controller for R, so that it can be loaded as dyn.load()
PKG_LIBS='`Rscript -e "Rcpp:::LdFlags() $(LIBRARIES) $(LFLAGS)"`' \
PKG_CXXFLAGS='`Rscript -e "Rcpp:::CxxFlags()"` $(LIBRARIES) $(LFLAGS) ' \
R CMD SHLIB fastRPCA.cpp -o../bin/RProgram.so -L../bin -llibrary
Then I call it in R:
dyn.load("fastRPCA.so", local=FALSE);
Please note that I would prefer not to set MKL as the BLAS/LAPACK alternative for R, so that when other people use this code they don't have to change it for all of R. As such, I am trying to use it only in the C++ code.
How can I make the program multithreaded under Rcpp, just as it is when run outside of R?
Based on this discussion, I am concerned that this is not possible. However, I wanted to ask, because since Intel MKL uses OpenMP, perhaps there is some way to make it work.
There are basically two rules for working with R code:
Create a package.
Follow rule 1.
You are making your life hard by ignoring these.
Moreover, there are a number of packages on CRAN happily using OpenMP -- study those. You also need to learn about thread settings -- see e.g. the RhpcBLASctl package, which handles this.
Lastly, you can of course connect R directly with the MKL; see the gcbd package and its vignette.
Edit three years later: see this post for details on installing the MKL easily on a .deb system.
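For completeness, one thing worth trying from the C++ side (not something the answer above endorses) is setting the MKL thread count explicitly inside the code that R loads. mkl_set_num_threads() is the documented MKL service call; the exported function name below is just a placeholder:
#include <Rcpp.h>
#include <mkl.h> // declares mkl_set_num_threads()

// [[Rcpp::export]]
void set_mkl_threads(int n) {
    // Ask MKL to use n threads for subsequent BLAS/LAPACK calls
    // made from this shared library.
    mkl_set_num_threads(n);
}
From R one would call set_mkl_threads(12) before invoking the heavy routine; whether this restores full threading under R is exactly the open question above.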

C++ eigen3 linear algebra library, odd performance results

I've been using the eigen3 linear algebra library in C++ for a while, and I've always tried to take advantage of its vectorization performance benefits. Today, I decided to test how much vectorization really speeds up my programs, so I wrote the following test program:
--- eigentest.cpp ---
#include <eigen3/Eigen/Dense>
#include <iostream>
using namespace Eigen;
int main() {
    Matrix4d accumulator = Matrix4d::Zero();
    Matrix4d randMat = Matrix4d::Random();
    Matrix4d constMat = Matrix4d::Constant(2);
    for (int i = 0; i < 1000000; i++) {
        randMat += constMat;
        accumulator += randMat * randMat;
    }
    std::cout << accumulator(0,0) << "\n"; // To avoid optimizing everything away
    return 0;
}
Then I ran this program after compiling it with different compiler options (the results aren't one-off; many runs give similar results):
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native
$ time ./eigentest
5.33334e+18
real 0m4.409s
user 0m4.404s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x
$ time ./eigentest
5.33334e+18
real 0m4.085s
user 0m4.040s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native -O3
$ time ./eigentest
5.33334e+18
real 0m0.147s
user 0m0.136s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -O3
$ time ./eigentest
5.33334e+18
real 0m0.025s
user 0m0.024s
sys 0m0.000s
And here's my relevant cpu information:
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dn
I know that there's no vectorization going on when I don't use the compiler option -march=native, because when I don't use it I never get a segmentation fault or a wrong result due to vectorization, as opposed to the case where I do use it (with -DNDEBUG).
These results lead me to believe that, at least on my CPU, vectorization with eigen3 results in slower execution. Who should I blame? My CPU, eigen3, or gcc?
Edit: To remove any doubts, I have now tried adding the -DEIGEN_DONT_ALIGN compiler option in the cases where I'm trying to measure the no-vectorization performance, and the results are the same. Furthermore, when I add -DEIGEN_DONT_ALIGN along with -march=native, the results become very close to the case without -march=native.
It seems that the compiler is smarter than you think and still optimizes a lot of stuff away.
On my platform, I get about 9ms without -march=native and about 39ms with -march=native. However, if I replace the line above the return by
std::cout<<accumulator<<"\n";
then the timings change to 78ms without -march=native and about 39ms with -march=native.
Thus, it seems that without vectorization, the compiler realizes that you only use the (0,0) element of the matrix and so it only computes that element. However, it can't do that optimization if vectorization is enabled.
If you output the whole matrix, thus forcing the compiler to compute all the entries, then vectorization speeds up the program by a factor of 2, as expected (though I'm surprised to see that it is exactly a factor of 2 in my timings).
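A cheaper way to force the full product to be computed without printing the whole matrix is to fold every entry into the output, e.g. via Eigen's .sum(); here is a sketch of the benchmark with that change (my own variation, not from the answer above):
#include <eigen3/Eigen/Dense>
#include <iostream>
int main() {
    Eigen::Matrix4d accumulator = Eigen::Matrix4d::Zero();
    Eigen::Matrix4d randMat = Eigen::Matrix4d::Random();
    Eigen::Matrix4d constMat = Eigen::Matrix4d::Constant(2);
    for (int i = 0; i < 1000000; i++) {
        randMat += constMat;
        accumulator += randMat * randMat;
    }
    // Summing all 16 entries makes the output depend on the whole matrix,
    // so the compiler cannot skip computing any part of the product.
    std::cout << accumulator.sum() << "\n";
    return 0;
}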