C++ Eigen3 linear algebra library, odd performance results

I've been using the Eigen3 linear algebra library in C++ for a while, and I've always tried to take advantage of its vectorization for performance. Today I decided to test how much vectorization really speeds my programs up, so I wrote the following test program:
--- eigentest.cpp ---
#include <eigen3/Eigen/Dense>
#include <iostream>

using namespace Eigen;

int main() {
    Matrix4d accumulator = Matrix4d::Zero();
    Matrix4d randMat = Matrix4d::Random();
    Matrix4d constMat = Matrix4d::Constant(2);
    for (int i = 0; i < 1000000; i++) {
        randMat += constMat;
        accumulator += randMat * randMat;
    }
    std::cout << accumulator(0,0) << "\n"; // To avoid optimizing everything away
    return 0;
}
Then I ran this program after compiling it with different compiler options (the results are not one-offs; many runs give similar timings):
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native
$ time ./eigentest
5.33334e+18
real 0m4.409s
user 0m4.404s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x
$ time ./eigentest
5.33334e+18
real 0m4.085s
user 0m4.040s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native -O3
$ time ./eigentest
5.33334e+18
real 0m0.147s
user 0m0.136s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -O3
$ time ./eigentest
5.33334e+18
real 0m0.025s
user 0m0.024s
sys 0m0.000s
And here's the relevant CPU information:
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dn
I know that there's no vectorization going on when I don't use the -march=native compiler option, because without it I never get a segmentation fault or a wrong result caused by vectorization, as opposed to the case where I do use it (together with -DNDEBUG).
These results lead me to believe that, at least on my CPU, vectorization with Eigen3 results in slower execution. Who should I blame? My CPU, Eigen3, or gcc?
Edit: To remove any doubts, I have now tried adding the -DEIGEN_DONT_ALIGN compiler option in the cases where I'm measuring the no-vectorization performance, and the results are the same. Furthermore, when I add -DEIGEN_DONT_ALIGN along with -march=native, the results become very close to the case without -march=native.
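For reference, a minimal sketch of what those extra measurements amount to (the exact flag combinations are my reading of the description above, not commands copied from the original runs):
$ g++ eigentest.cpp -o eigentest -DNDEBUG -DEIGEN_DONT_ALIGN -std=c++0x -O3
$ g++ eigentest.cpp -o eigentest -DNDEBUG -DEIGEN_DONT_ALIGN -std=c++0x -O3 -march=native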

It seems that the compiler is smarter than you think and still optimizes a lot of stuff away.
On my platform, I get about 9ms without -march=native and about 39ms with -march=native. However, if I replace the line above the return by
std::cout<<accumulator<<"\n";
then the timings change to 78ms without -march=native and about 39ms with -march=native.
Thus, it seems that without vectorization, the compiler realizes that you only use the (0,0) element of the matrix and so it only computes that element. However, it can't do that optimization if vectorization is enabled.
If you output the whole matrix, thus forcing the compiler to compute all the entries, then vectorization speeds up the program by a factor of 2, as expected (though I'm surprised to see that it is exactly a factor of 2 in my timings).
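For illustration, here is one way the benchmark could be changed so that every entry has to be computed while the output stays small (a sketch based on the explanation above, not code from the original answer):
#include <eigen3/Eigen/Dense>
#include <iostream>

using namespace Eigen;

int main() {
    Matrix4d accumulator = Matrix4d::Zero();
    Matrix4d randMat = Matrix4d::Random();
    Matrix4d constMat = Matrix4d::Constant(2);
    for (int i = 0; i < 1000000; i++) {
        randMat += constMat;
        accumulator += randMat * randMat;
    }
    // Printing a value that depends on every entry prevents the compiler
    // from restricting the computation to accumulator(0,0).
    std::cout << accumulator.sum() << "\n";
    return 0;
}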

Related

Performance comparison issue between OpenMPI and Intel MPI

I am working with a C++ MPI code which, when compiled with OpenMPI, takes 1 min 12 s, versus 16 s with Intel MPI (I have tested it on other inputs too; the difference is similar, and both compiled codes give the correct answer). I want to understand why there is such a big difference in run time, and what can be done to decrease the run time with OpenMPI (GCC).
I am using CentOS 6 with an Intel Haswell processor.
I am using the following flags for compiling.
OpenMPI (GCC): mpiCC -Wall -O3
I have also tried -march=native -funroll-loops; it does not make a great difference. I have also tried the -lm option. I cannot compile for 32-bit.
Intel MPI: mpiicpc -Wall -O3 -xhost
-xhost saves 3 seconds in run time.

How to compile Crypto++ cross platform on osx

My desktop application has a dependency on the Crypto++ library. First I tried to install Crypto++ from Brew and link it with my application. The first error arrived when I tried to run the application on an older Mac (with an older CPU which, I suppose, does not have AES-NI instructions). It crashed with:
Crashed Thread: 56
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Illegal instruction: 4
Termination Reason: Namespace SIGNAL, Code 0x4
Terminating Process: exc handler [0]
After that I compiled Crypto++ on an older Mac. So far all was good, but recently I encountered the same error with an even older CPU.
Basically the question is: is there a way to compile Crypto++ so the deployed lib would be cross platform ?
... the question is: is there a way to compile crypto++ so the deployed lib would be cross platform ?
Yes, but only within the processor family.
The problem is likely the use of a newer instruction, but not AES. There are three reasons I suspect it.
First, the makefile adds -march=native when building. This gets you all the CPU features for the machine you are building on.
Second, the newer instruction could be from SSE4, AVX, or BMI because you compile on a newer Mac, while your older Mac can only handle, say, SSE 4.1 in the case of a Core2 Duo.
Third, AES is guarded at runtime, so those particular machine instructions are not executed if the CPU lacks AESNI. However, other instructions the compiler may emit, like AVX or BMI, are not guarded.
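As an aside, the runtime-guard pattern meant here looks roughly like the following. This is a generic sketch using GCC/Clang's __builtin_cpu_supports, not Crypto++'s actual dispatch code; in a real build the fast path would live in a separate translation unit compiled with the extra -m flags so the rest of the binary stays portable.
#include <cstdio>

static void transform_fast()    { std::puts("AVX path"); }     // hypothetical AVX implementation
static void transform_generic() { std::puts("generic path"); } // portable fallback

int main() {
    // The fast path is only reached if the CPU reports the feature at runtime,
    // so an older machine never executes the unsupported instructions.
    if (__builtin_cpu_supports("avx"))
        transform_fast();
    else
        transform_generic();
    return 0;
}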
Here's my OS X test environment:
MacBook, early 2010
Intel Core2 Duo
OS X 10.9
SSE 4.1
MacBook Pro, late 2012
Intel Core i7
OS X 10.8
SSE 4.1, SSE 4.2, AESNI, RDRAND, AVX
Based on the list above, if I compile on the MacBook Pro (SSE 4.1, SSE 4.2, AESNI, RDRAND, AVX) for the MacBook (SSE 4.1), then I need to limit the target machine to SSE 4.1. Otherwise, Clang is sure to emit instructions the older MacBook cannot handle.
To limit the target machine in Crypto++:
git clone https://github.com/weidai11/cryptopp.git
cd cryptopp
export CXXFLAGS="-DNDEBUG -g2 -O2 -DDISABLE_NATIVE_ARCH=1 -msse2 -msse3 -mssse3 -msse4.1"
make -j 4
-DDISABLE_NATIVE_ARCH is a relatively new addition. I don't believe it's in Crypto++ 5.6.5. You need Master for it, and it will be in the upcoming Crypto++ 6.0.
If you need to remove the makefile code that adds -march=native, then it's not hard to find. Open GNUmakefile and delete this block around line 200:
# BEGIN_NATIVE_ARCH
# Guard use of -march=native (or -m{32|64} on some platforms)
# Don't add anything if -march=XXX or -mtune=XXX is specified
ifeq ($(DISABLE_NATIVE_ARCH),0)
  ifeq ($(findstring -march,$(CXXFLAGS)),)
    ifeq ($(findstring -mtune,$(CXXFLAGS)),)
      ifeq ($(GCC42_OR_LATER)$(IS_NETBSD),10)
        CXXFLAGS += -march=native
      else ifneq ($(CLANG_COMPILER)$(INTEL_COMPILER),00)
        CXXFLAGS += -march=native
      else
        # GCC 3.3 and "unknown option -march="
        # Ubuntu GCC 4.1 compiler crash with -march=native
        # NetBSD GCC 4.8 compiler and "bad value (native) for -march= switch"
        # Sun compiler is handled below
        ifeq ($(SUN_COMPILER)$(IS_X64),01)
          CXXFLAGS += -m64
        else ifeq ($(SUN_COMPILER)$(IS_X86),01)
          CXXFLAGS += -m32
        endif # X86/X32/X64
      endif
    endif # -mtune
  endif # -march
endif # DISABLE_NATIVE_ARCH
# END_NATIVE_ARCH
After that, you should be able to run your binary on both machines.
The GNUmakefile is kind of a monstrosity. There's a lot to it. We documented it at GNUmakefile on the Crypto++ wiki.
You can also limit the machine you are compiling for using -mtune. For example:
$ export CXXFLAGS="-DNDEBUG -g2 -O2 -mtune=core2"
$ make -j 3
g++ -DNDEBUG -g2 -O2 -mtune=core2 -fPIC -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O2 -mtune=core2 -fPIC -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O2 -mtune=core2 -fPIC -pipe -c integer.cpp
...
First I tried to install Crypto++ from Brew and link with my application...
I don't use Brew, so I don't know how to set CXXFLAGS when using it. Hopefully one of the Homebrew folks will provide some information about it.
Maybe "Build and install Brew apps that are x86_64 instead of i386?" and "Using Homebrew with alternate GCC" will help.
It is also possible you are compiling on an x86_64 machine, and then trying to run it on an i386 machine. If that is the case, then it likely won't work.
You may be able to build a fat library with the following, and it may work on both machines. Notice the addition of -arch x86_64 -arch i386.
export CXXFLAGS="-DNDEBUG -g2 -O2 -DDISABLE_NATIVE_ARCH=1 -arch x86_64 -arch i386 -msse2 -msse3 -mssse3 -msse4.1"
make -j 4
You might also be interested in iOS (Command Line) on the Crypto++ wiki. It goes into some detail about fat binaries in the context of iOS. The same concepts apply to OS X.
If you encounter a compile error for -msse4.1 or -msse4.2, then you may need -msse4_1 or -msse4_2. Different compilers accept (or expect) slightly different syntax.
For comparison using Linux, below is the difference in CPU capabilities between a Core2 Duo and a 3rd gen Core i5. Notice the Core i5 has SSE4.2 and AVX, while the Core2 Duo does not. AVX makes a heck of a difference, and compilers aggressively use the instruction set.
On OS X, you want to run sysctl machdep.cpu.features. I showed the one for my old MacBook from early 2010.
Core i5:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
...
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc
rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 cx16 sse4_1
sse4_2 x2apic popcnt aes xsave avx rdrand hypervisor lahf_lm
Core2 Duo:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T6500 @ 2.10GHz
...
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64
monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm
Core2 Duo (MacBook):
$ sudo sysctl machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE
MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64
MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1

Illegal instruction - vcvtsi2sd

I am writing a program to compute Groebner bases using the library FGB. While it has a C interface, I am calling the library from C++ code compiled with g++ on Ubuntu.
Compiling with the option -g and using x/i $pc in gdb, the illegal instruction is as follows.
0x421c39 <FGb_xmalloc_spec+985>: vcvtsi2sd %rbx,%xmm0,%xmm0
As far as I can tell, my processor does not support this instruction, and I am trying to figure out why the program uses it. It looks to me like the instruction comes from the library code. However, the code I am compiling used to work on the desktop on which it is now failing; one day it just started throwing the illegal instruction. I assumed I had screwed up some libraries or something, so I reinstalled Ubuntu 16.04, but I continue to get the illegal instruction. The exact same code does work on another desktop and a Chromebook, running Ubuntu 16.04 and 14.04 respectively.
Technical information:
g++: 5.4.0 20160609
gdb: 7.11.1
Ubuntu: 16.04/14.04 LTS
Processor: x86info output
Found 4 identical CPUs
Extended Family: 0 Extended Model: 1 Family: 6 Model: 23 Stepping: 10
Type: 0 (Original OEM)
CPU Model (x86info's best guess): Core 2 Duo
Processor name string (BIOS programmed): Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
cpu flags
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority dtherm
Compile line
g++ -std=c++11 -g -I src -o bin/main.o -c src/main.cpp
g++ -std=c++11 -g -I src -o bin/Polynomial.o -c src/Polynomial.cpp
g++ -std=c++11 -g -I src -o bin/Util.o -c src/Util.cpp
g++ -std=c++11 -g -I src -o bin/Solve.o -c src/Solve.cpp
g++ -std=c++11 -g -o bin/StartUp bin/main.o bin/Util.o bin/Polynomial.o bin/Solve.o -Llib -lfgb -lfgbexp -lgb -lgbexp -lminpoly -lminpolyvgf -lgmp -lm -fopenmp
At this point, I am not sure what further things I can try to avoid this illegal instruction and welcome any and all suggestions.
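One diagnostic worth trying (my suggestion, not part of the original post): vcvtsi2sd with a VEX encoding is an AVX instruction, so you can check whether it really ships inside the prebuilt FGB libraries rather than in your own objects by disassembling them. The path lib/libfgb.a below is a guess based on the -Llib -lfgb link flags.
$ objdump -d lib/libfgb.a | grep -c vcvtsi2sd    # non-zero means the AVX-encoded convert comes from the library
$ objdump -d bin/main.o | grep -c vcvtsi2sd      # expected 0, since g++ was not asked to emit AVX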

Eigen Matrix Multiplication Speed

I was trying to do numerical linear algebra in C++. I used Python NumPy for quick prototyping, and I would like to find a C++ linear algebra package for some further speed-up. Eigen seems to be a good place to start.
I wrote a small performance test using large dense matrix multiplication to test the processing speed. In Numpy I was doing this:
import numpy as np
import time
a = np.random.uniform(size = (5000, 5000))
b = np.random.uniform(size = (5000, 5000))
start = time.time()
c = np.dot(a, b)
print (time.time() - start) * 1000, 'ms'
In C++ Eigen I was doing this:
#include <time.h>
#include "Eigen/Dense"

using namespace std;
using namespace Eigen;

int main() {
    MatrixXf a = MatrixXf::Random(5000, 5000);
    MatrixXf b = MatrixXf::Random(5000, 5000);
    clock_t start = clock();
    MatrixXf c = a * b;
    cout << (double)(clock() - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;
    return 0;
}
I have done some searching in the documentation and on Stack Overflow regarding compiler optimization flags. I compiled the program using this command:
g++ -g test.cpp -o test -Ofast -msse2
The C++ executable compiled with the -Ofast optimization flag runs about 30x or more faster than a plain unoptimized build; it returns the result in roughly 10000 ms on my 2015 MacBook Pro.
Meanwhile, NumPy returns the result in about 1800 ms.
I was expecting a performance boost from using Eigen compared with NumPy, but this fell short of my expectations.
Are there any compile flags I missed that would further boost Eigen's performance here? Or is there a multithreading switch that can be turned on to give an extra performance gain? I am just curious about this.
Thank you very much!
Edit on April 17, 2016:
After doing some searching based on @ggael's answer, I have come up with the answer to this question.
The best solution is to compile with Intel MKL linked in as the backend for Eigen. For an OS X system the library can be found here. With MKL installed, I used the Intel MKL link line advisor to enable MKL backend support for Eigen.
I compile in this manner to enable all of MKL:
g++ -DEIGEN_USE_MKL_ALL -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl -m64 -I${MKLROOT}/include -I. -Ofast -DNDEBUG test.cpp -o test
If there is an environment variable error for MKLROOT, just run the environment setup script provided in the MKL package, which is installed by default at /opt/intel/mkl/bin on my device.
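For example (the script name mklvars.sh and the intel64 argument are what a typical MKL installation provides; check your own install):
$ source /opt/intel/mkl/bin/mklvars.sh intel64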
With MKL as the Eigen backend, the multiplication of two 5000x5000 matrices finishes in about 900 ms on my 2.5 GHz MacBook Pro. This is much faster than Python NumPy on my device.
To answer on the OS X side: first of all, recall that on OS X g++ is actually an alias for clang++, and Apple's current version of clang does not support OpenMP. Nonetheless, using Eigen 3.3-beta1 and the default clang++, I get on a 2.6 GHz MacBook Pro:
$ clang++ -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG && ./a.out
2954.91ms
Then, to get support for multithreading, you need a recent clang or gcc compiler, for instance from Homebrew or MacPorts. Here, using gcc 5 from MacPorts, I get:
$ g++-mp-5 -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG -fopenmp -Wa,-q && ./a.out
804.939ms
and with clang 3.9:
$ clang++-mp-3.9 -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG -fopenmp && ./a.out
806.16ms
Note that gcc on OS X does not know how to properly assemble AVX/FMA instructions, so you need to tell it to use the native assembler with the -Wa,-q flag.
Finally, with the devel branch, you can also tell Eigen to use any BLAS as a backend, for instance the one from Apple's Accelerate framework, as follows:
$ g++ -framework Accelerate -DEIGEN_USE_BLAS -O3 -DNDEBUG so_gemm_perf.cpp -I ../eigen && ./a.out
802.837ms
Compiling your little program with VC2013:
/fp:precise - 10.5s
/fp:strict - 10.4s
/fp:fast - 10.3s
/fp:fast /arch:AVX2 - 6.6s
/fp:fast /arch:AVX2 /openmp - 2.7s
So using AVX/AVX2 and enabling OpenMP is going to help a lot. You can also try linking against MKL (http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).

Is sse2 enabled by default in g++?

When I run g++ -Q --help=target, I get
-msse2 [disabled].
However, if I create the assembly code with default options as
g++ -g mycode.cpp -o mycode.o; objdump -S mycode.o > default,
and an sse2 version with
g++ -g -msse2 mycode.cpp -o mycode.sse2.o; objdump -S mycode.sse2.o > sse2,
and finally a non-sse2 version with
g++ -g -mno-sse2 mycode.cpp -o mycode.nosse2.o; objdump -S mycode.nosse2.o > nosse2
I see basically no difference between default and sse2, but a big difference between default and nosse2, so this tells me that, by default, g++ is using sse2 instructions, even though I am being told it is disabled ... what is going on here?
I am compiling on a Xeon E5-2680 under Linux with gcc-4.4.7 if it matters.
If you are compiling for 64-bit, then this is totally fine and documented behavior.
As stated in the gcc docs, the SSE instruction set is enabled by default when using an x86-64 compiler:
-mfpmath=unit
Generate floating point arithmetics for selected unit unit. The choices for unit are:
`387'
Use the standard 387 floating point coprocessor present majority of chips and emulated otherwise. Code compiled with this option will run almost everywhere. The temporary results are computed in 80bit precision instead of precision specified by the type resulting in slightly different results compared to most of other chips. See -ffloat-store for more detailed description.
This is the default choice for i386 compiler.
`sse'
Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of SSE instruction set supports only single precision arithmetics, thus the double and extended precision arithmetics is still done using 387. Later version, present only in Pentium4 and the future AMD x86-64 chips supports double precision arithmetics too.
For the i386 compiler, you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80bit.
This is the default choice for the x86-64 compiler.
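A quick way to see this default in action (my own illustration, not from the gcc documentation): compile a trivial double-precision function and look at which multiply instruction is emitted. Assuming a file fp.cpp containing
double scale(double x) { return x * 3.5; }
one would expect an SSE2 instruction such as mulsd in the default x86-64 output, and an x87 instruction such as fmull only when 387 math is forced:
$ g++ -O2 -S fp.cpp -o - | grep -i mul
$ g++ -O2 -mfpmath=387 -S fp.cpp -o - | grep -i mul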