Eigen with EIGEN_USE_MKL_ALL - c++

I compiled my C++ project (using Eigen 3.2.8) with the EIGEN_USE_BLAS option and linked against MKL's BLAS. Everything works fine, and it indeed speeds up my program substantially (perhaps because of a lot of complex-valued matrix-vector multiplications).
Then I also tried EIGEN_USE_MKL_ALL; however, errors like the following come up:
/eigen3/Eigen/src/QR/ColPivHouseholderQR_MKL.h:94:1 error:
cannot convert "Eigen::PlainObjectBase<Eigen::Matrix<int,-1,1>>::Scalar*
{aka int*}" to "long long int*" in initialization
EIGEN_MKL_QR_COLPIV(...)
Two questions here:
1. EIGEN_USE_BLAS gives a 4x speedup, which is more than I expected. What is the likely reason?
2. EIGEN_USE_MKL_ALL seems to have a type conflict with the LAPACK routines. How do I fix the compilation error?

MKL utilizes the newer AVX/AVX2 instruction sets (8 32-bit float operations per clock, with FMA and three-operand instructions), while Eigen 3.2.8 only supports vectorization up to SSE4 (4 32-bit float operations per clock). As ggael indicated, you could update to 3.3-beta1 to achieve better performance.
You could try Eigen 3.3-beta1. I currently cannot reproduce your problem; you may want to provide a code sample and your compile options. But based on your error message, I guess you are using the ILP64 interface, which is not supported by Eigen. You could use LP64 instead.

To complete kangshiyin's answer, Eigen 3.3 supports AVX/FMA and can thus achieve similar performance. You need to compile with AVX and FMA instructions enabled, for instance with GCC, Clang, or ICC: -mavx -mfma.
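For reference, a minimal LP64 build line with GCC might look like the sketch below. This is only an illustration: the exact library names, paths, and threading layer depend on your MKL version, so check Intel's link-line advisor for your setup. The key point is to link against mkl_intel_lp64 (32-bit integer indices) rather than mkl_intel_ilp64, and not to define MKL_ILP64.
g++ -O3 -mavx -mfma -DEIGEN_USE_MKL_ALL main.cpp \
    -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl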

Related

Eigen segfaults with AVX on Ivy Bridge when using `std::vector` of fixed-size matrices

I'm wondering if this is a known issue; if not, has anyone experienced this, and has anyone managed to find a fix?
I'm building a numerical computation code using Eigen 3.3.4 with GCC 6.4 on Fedora 25, on a Core i7-3700. My /proc/cpuinfo says I should have AVX. I've tried two builds. Build 1:
g++ -std=c++14 -O3 -m64 -mavx
and build 2:
g++ -std=c++14 -O3 -m64 -msse4.2
Build 2 runs fine. But when I try build 1, I get segfaults in the Zero() function for a square fixed-size matrix, as well as in the inverse() method. I'd appreciate any pointers as to what might be going on.
EDIT: I forgot one very important detail: I was actually using a std::vector of fixed-size Eigen matrices.
The fact that I was using a std::vector of fixed-size matrices was the key. Thanks very much for the request for a minimal example, @rex. While preparing the example, I found out the following.
For certain large input sizes (of the std::vector containing the matrices), Eigen throws a runtime error, which led me to this site. Following the instructions there fixed the issue.
Essentially, std::vector with its standard allocator seems to break Eigen's alignment requirements for vectorized operations on fixed-size matrices. Using Eigen's provided aligned_allocator fixes the issue.
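For completeness, a minimal sketch of the fix (the matrix type here is just an example):
#include <vector>
#include <Eigen/Dense>
#include <Eigen/StdVector> // Eigen::aligned_allocator and std::vector support

// std::vector's default allocator does not provide the over-alignment that
// Eigen's fixed-size vectorizable types need for AVX loads/stores, so Eigen's
// aligned_allocator has to be passed explicitly as the second template argument.
std::vector<Eigen::Matrix4d, Eigen::aligned_allocator<Eigen::Matrix4d>> mats(1000);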

How to speed up this C++ program with eigen library against matlab?

I want to use C++ for big linear algebra computations. As a starting step, I created these comparison programs in C++ and MATLAB, and I am giving the (astonishing) execution times here. Can you suggest a way to beat MATLAB, or at least get comparable performance? I know that MATLAB uses highly vectorized methods for its computations. So for large scientific programs involving linear algebra, should one always go for MATLAB instead of C++? I personally think that MATLAB doesn't give good performance for large computations and that C++ should therefore be preferred, but my program's results go contrary to this belief.
C++ program compiled with gcc:
#include <iostream>
#include <Eigen/Dense> // Eigen library
using namespace Eigen;
using namespace std;

int main()
{
    MatrixXd A;
    A.setRandom(1000, 1000);
    MatrixXd B;
    B.setRandom(1000, 1000);
    MatrixXd C;
    C = A*B;
}
Execution time: 24.141 s
Here is matlab program:
function [ ] = Trial( )
clear all;
close all;
clc;
tic;
A=rand([1000,1000]);
B=rand([1000,1000]);
C=A*B;
toc
end
Elapsed time is 0.073883 seconds.
It is extremely hard to beat MATLAB, even with all optimizations turned on. To get the most out of Eigen you need to compile with parallel support (-fopenmp in gcc) and turn optimizations on (-O3). Even then, MATLAB will be slightly faster, mainly because it uses the proprietary Intel MKL library to get the most out of Intel chips, so unless you buy it I don't think you will be able to beat it. I am currently using Eigen for a project and wasn't able to beat MATLAB (at least not for dense matrix multiplication).
For example, for A*B where A and B are 1000 x 1000 complex matrices, the best average time I can get is:
MATLAB: 0.32 seconds
Eigen: 0.44 seconds
For 2000 x 2000,
MATLAB: 2.80 seconds
Eigen: 3.45 seconds
System: MacbookPro 2013, OS X.
PS: you should make absolutely sure that you turn optimizations on (-O3) and also compile with parallel support (-fopenmp). This is probably the reason you're seeing such a huge difference in running time. So you should compile your program as:
g++ -O3 -fopenmp <other compiling flags/parameters> main.cpp
To get the best out of Eigen, compile with optimizations on (e.g., the -O3 compiler flag), with OpenMP enabled (e.g., -fopenmp), and either disable hyper-threading or tell OpenMP the true number of physical cores (e.g., export OMP_NUM_THREADS=4 if you have 8 hyper-threaded "cores" but 4 physical cores).
Finally, you might also consider using the devel branch and enabling AVX (e.g., -mavx) and, if your CPU supports it, FMA (e.g., -mfma).
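As a side note, when comparing against MATLAB's tic/toc it also helps to time only the product itself rather than the whole process (which includes allocation and random initialization). A minimal std::chrono sketch, purely for illustration:
#include <chrono>
#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(1000, 1000);

    auto t0 = std::chrono::high_resolution_clock::now();
    Eigen::MatrixXd C = A * B;   // only the multiplication is timed
    auto t1 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    std::cout << C(0, 0) << "\n"; // use the result so it is not optimized away
    return 0;
}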
Actually, MATLAB (if you don't buy the expensive Parallel Computing Toolbox) hardly uses multi-threading itself. It's only used in the libraries called by MATLAB, which are probably more efficient than what you're using now.
You can check this link to understand (and verify) which libraries your MATLAB uses: http://undocumentedmatlab.com/blog/math-libraries-version-info-upgrade
It's also possible to use these libraries in your own C program (though they might have hidden the headers or something; at least you still have the .dll files, since MATLAB needs them to run).

AVX log intrinsics (_mm256_log_ps) missing in g++-4.8?

I am trying to utilise some AVX intrinsics in my code and have run into a brick wall with the logarithm intrinsics.
Using the Intel Intrinsics Guide v3.0.1 for Linux, I see the intrinsic _mm256_log_ps(__m256) listed as being part of "immintrin.h" and also supported on my current arch.
However, trying to compile this simple test case fails with "error: ‘_mm256_log_ps’ was not declared in this scope".
The example was compiled with g++-4.8 -march=native -mavx test.cpp
#include <immintrin.h>

int main()
{
    __m256 i;
    _mm256_log_ps(i);
}
Am I missing something fundamental here? Are certain intrinsics not supported by g++ and only available in icc?
SOLVED: This is not a true intrinsic that maps to a single instruction; it is implemented as part of the Intel SVML library for ICC.
As indicated in the comments to your question, that intrinsic doesn't map to an actual AVX instruction; it is an Intel extension to the intrinsic set. The implementation likely uses many underlying instructions, as a logarithm isn't a trivial operation.
If you'd like to use a non-Intel compiler but want a fast logarithm implementation, you might check out this open-source implementation of sin(), cos(), exp(), and log() functions using AVX. They are based on an earlier SSE2 version of the same functions.
I've posted my implementation of _mm256_log_pd(__m256d) here: https://stackoverflow.com/a/45898937/1915854 . With some effort you should be able to extend it to 8 packed floats instead of 4 doubles, though you need to revise the bit manipulations. Some parts are actually easier, because you don't need to repack odd-/even-numbered 32-bit components of __m256i into __m128i.
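If you only need something that compiles with any compiler and vectorized speed is not yet critical, a purely scalar fallback is always an option. The function name below is made up (it is not part of SVML) and the loop is not vectorized, so treat it as a slow stopgap:
#include <immintrin.h>
#include <cmath>

// Illustrative stand-in for _mm256_log_ps: spill the register to memory,
// take std::log of each lane with scalar code, then reload.
static inline __m256 my_mm256_log_ps(__m256 x)
{
    alignas(32) float v[8];
    _mm256_store_ps(v, x);
    for (int i = 0; i < 8; ++i)
        v[i] = std::log(v[i]);
    return _mm256_load_ps(v);
}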

How can I use SSE instruction?

I have some problems with SSE on an Ubuntu Linux system.
I'm using the example source code from MSDN (SSE4) and want to use the SSE4.1 operations on Linux:
gcc -o test test.c -msse4.1
Then I get this error message:
error: request for member 'm128i_u16' in something not a structure or union
How can I use this example code? Or is there any example code I can use instead?
The title of the code sample is "Microsoft Specific". This means that those functions are specific to the Microsoft implementation of C++ and aren't cross-platform. Here are some Intel-specific guides to SSE instructions. Here is the gcc documentation concerning command-line flags for specific assembly optimizations, including SSE. Good luck; SSE can get a bit hairy.
This is not so much about Microsoft-specific intrinsic functions; it is about the data type. The actual intrinsics are 100% identical in both compilers and are a de facto standard (stemming from Intel).
The problem you are facing is that the __m128i type is, as a convenience feature, a union under MSVC, which includes fields such as m128i_u16. The code sample you linked to assumes this.
Under gcc, __m128i is not a union and therefore, unsurprisingly, does not have these fields. This is not really a downside, because accessing fields in a union like this annihilates any gains you might get from using SSE in the first place, so other than in demo snippets like the one above, you will (almost) never want to do such a thing.
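For example, where the MSDN sample reads v.m128i_u16[i], portable code would either extract a lane with an intrinsic or spill the register to memory; a small sketch that compiles with MSVC, GCC, and Clang:
#include <emmintrin.h> // SSE2
#include <cstdint>
#include <cstdio>

int main()
{
    __m128i v = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);

    // Option 1: extract a single 16-bit lane with an intrinsic.
    int lane0 = _mm_extract_epi16(v, 0); // lowest lane, value 1

    // Option 2: store the whole register and index a plain array.
    alignas(16) uint16_t buf[8];
    _mm_store_si128(reinterpret_cast<__m128i*>(buf), v);

    std::printf("%d %d\n", lane0, (int)buf[3]); // prints "1 4"
    return 0;
}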

Taking advantage of SSE and other CPU extensions

There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it takes a considerable time to process these.
I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried out on the same set of data, so once the data is initially read into the cache there shouldn't be any cache misses to stall it. However, I'm not sure how to go about this.
Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.
I still need to support some CPUs that have either no or limited SSE support (e.g., Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU it is running on when the process is started?
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.
For your second point there are several solutions as long as you can separate out the differences into different functions:
plain old C function pointers
dynamic linking (which generally relies on C function pointers)
if you're using C++, having different classes that represent the support for different architectures and using virtual functions can help immensely with this.
Note that because you'd be relying on indirect function calls, the functions that abstract the different operations generally need to represent somewhat higher level functionality or you may lose whatever gains you get from the optimized instruction in the call overhead (in other words don't abstract the individual SSE operations - abstract the work you're doing).
Here's an example using function pointers:
typedef int (*scale_func_ptr)( int scalar, int* pData, int count);

int non_sse_scale( int scalar, int* pData, int count)
{
    // do whatever work needs done, without SSE so it'll work on older CPUs
    return 0;
}

int sse_scale( int scalar, int* pData, int count)
{
    // equivalent code, but uses SSE
    return 0;
}

// at initialization
scale_func_ptr scale_func = non_sse_scale;
if (useSSE) {
    scale_func = sse_scale;
}

// now, when you want to do the work:
scale_func( 12, theData_ptr, 512); // calls the routine tailored to SSE if the CPU
                                   // supports it, otherwise the non-SSE version
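The useSSE flag above has to come from run-time CPU detection. One way to get it on GCC or Clang (compiler-specific; MSVC would use the __cpuid intrinsic instead) is the built-in feature check:
// GCC/Clang builtin: returns non-zero if the running CPU supports SSE2.
bool useSSE = __builtin_cpu_supports("sse2");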
Good reading on the subject: Stop the instruction set war
Short overview: sorry, it is not possible to solve your problem in a simple and maximally compatible (Intel vs. AMD) way.
The SSE intrinsics work with Visual C++, GCC, and the Intel compiler. There is no problem using them these days.
Note that you should always keep a version of your code that does not use SSE, and constantly check it against your SSE implementation.
This helps not only with debugging; it is also useful if you want to support CPUs or architectures that don't support your required SSE versions.
In answer to your comment:
So effectively, as long as I don't try to actually execute code containing unsupported instructions I'm fine, and I could get away with an "if(see2Supported){...}else{...}" type switch?
Depends. It's fine for SSE instructions to exist in the binary as long as they're not executed. The CPU has no problem with that.
However, if you enable SSE support in the compiler, it will most likely swap a number of "normal" instructions for their SSE equivalents (scalar floating-point ops, for example), so even chunks of your regular non-SSE code will blow up on a CPU that doesn't support it.
So what you'll most likely have to do is compile one or two files separately, with SSE enabled, and let them contain all your SSE routines. Then link that with the rest of the app, which is compiled without SSE support.
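For example (the file names here are made up), the build could look like:
g++ -c -O2 -msse2 sse_kernels.cpp -o sse_kernels.o   # only this file gets SSE code generation
g++ -c -O2 main.cpp -o main.o                        # the rest of the app stays baseline
g++ main.o sse_kernels.o -o app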
Rather than hand-coding an alternative SSE implementation to your scalar code, I strongly suggest you have a look at OpenCL. It is a vendor-neutral, portable, cross-platform system for computationally intensive applications (and is highly buzzword-compliant!). You can write your algorithm in a subset of C99 designed for vectorised operations, which is much easier than hand-coding SSE. And best of all, OpenCL will generate the best implementation at runtime, to execute either on the GPU or on the CPU. So basically you get the SSE code written for you.
There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it takes a considerable time to process these.
Your application sounds like just the kind of problem that OpenCL is designed to address. Writing alternative functions in SSE would certainly improve the execution speed, but it is a great deal of work to write and debug.
Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.
Yes. The SSE intrinsics have been essentially standardised by Intel, so the same functions work the same between Windows, Linux and Mac (specifically with Visual C++ and GNU g++).
I still need to support some CPUs that have either no or limited SSE support (e.g., Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU it is running on when the process is started?
You could do that (e.g., using dlopen()), but it is a very complex solution. Much simpler would be (in C) to define a function interface and call the appropriate version of the optimised function via a function pointer, or (in C++) to use different implementation classes, depending on the CPU detected.
With OpenCL it is not necessary to do this, as the code is generated at runtime for the given architecture.
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.
Within the SSE instruction set, there are many flavours. It can be quite difficult to code the same algorithm in different subsets of SSE when certain instructions are not present. I suggest (at least to begin with) that you choose a minimum supported level, such as SSE2, and fall back to the scalar implementation on older machines.
This is also an ideal situation for unit/regression testing, which is very important for ensuring that your different implementations produce the same results. Have a test suite of input data and known-good output data, and run the same data through both versions of the processing function. You may need a precision tolerance for passing (i.e., the difference between the result and the correct answer must be below some small epsilon, e.g. 1e-6). This will greatly aid in debugging, and if you build high-resolution timing into your testing framework, you can compare the performance improvements at the same time.
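A minimal sketch of such a check (names and tolerance are illustrative only):
#include <cmath>
#include <cstddef>
#include <vector>

// Compare the scalar reference output against the SSE-optimised output,
// element by element, within a small tolerance.
bool results_match(const std::vector<float>& reference,
                   const std::vector<float>& optimised,
                   float eps = 1e-6f)
{
    if (reference.size() != optimised.size())
        return false;
    for (std::size_t i = 0; i < reference.size(); ++i)
        if (std::fabs(reference[i] - optimised[i]) > eps)
            return false;
    return true;
}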