Efficient SSE NxN matrix multiplication - C++

I'm trying to implement an SSE version of large matrix-by-matrix multiplication.
I'm looking for an efficient algorithm based on SIMD implementations.
My desired method looks like:
A(n x m) * B(m x k) = C(n x k)
All matrices are assumed to be 16-byte-aligned float arrays.
I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it to be as efficient as possible, and I don't want to use the Eigen library or similar libraries (only SSE3, to be more specific).
So I'd appreciate it if anyone could point me to articles or resources on how to start implementing this.

The main challenge in implementing arbitrary-size matrix-matrix multiplication is not the use of SIMD but the reuse of cached data. The paper Anatomy of High-Performance Matrix Multiplication by Goto and Van de Geijn is a must-read if you want to implement cache-friendly matrix-matrix multiplication, and it also discusses how to choose kernels that are SIMD-friendly. After reading this paper, expect to reach about 50% of machine peak on matrix-matrix multiplication after two weeks of effort.
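To make the kernel discussion concrete, here is a minimal, untuned sketch of the kind of SSE micro-kernel such a design builds on: it applies a rank-k update to a 4x4 block of C, assuming row-major storage, 16-byte-aligned data, and row strides that are multiples of 4. The function name and parameters are illustrative, not taken from any particular library.

#include <xmmintrin.h>  // SSE intrinsics

// Computes C(4x4) += A(4xk) * B(kx4).
// lda, ldb, ldc are row strides in floats; they are assumed to be multiples
// of 4 so that every _mm_load_ps / _mm_store_ps hits a 16-byte boundary.
static void kernel_4x4(const float *A, const float *B, float *C,
                       int k, int lda, int ldb, int ldc)
{
    __m128 c0 = _mm_load_ps(&C[0 * ldc]);
    __m128 c1 = _mm_load_ps(&C[1 * ldc]);
    __m128 c2 = _mm_load_ps(&C[2 * ldc]);
    __m128 c3 = _mm_load_ps(&C[3 * ldc]);

    for (int p = 0; p < k; ++p) {
        __m128 b = _mm_load_ps(&B[p * ldb]);  // 4 consecutive elements of row p of B
        c0 = _mm_add_ps(c0, _mm_mul_ps(_mm_set1_ps(A[0 * lda + p]), b));
        c1 = _mm_add_ps(c1, _mm_mul_ps(_mm_set1_ps(A[1 * lda + p]), b));
        c2 = _mm_add_ps(c2, _mm_mul_ps(_mm_set1_ps(A[2 * lda + p]), b));
        c3 = _mm_add_ps(c3, _mm_mul_ps(_mm_set1_ps(A[3 * lda + p]), b));
    }

    _mm_store_ps(&C[0 * ldc], c0);
    _mm_store_ps(&C[1 * ldc], c1);
    _mm_store_ps(&C[2 * ldc], c2);
    _mm_store_ps(&C[3 * ldc], c3);
}

In a full GEMM this kernel would be called on every 4x4 tile of a cache-sized block of C, with the corresponding blocks of A and B packed into contiguous buffers so the kernel streams through memory, as described in the paper.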
However, if the purpose of this work is not pure learning, I strongly recommend using a highly optimized library. On x86 your best options are OpenBLAS (BSD-licensed, supports dynamic CPU dispatching), BLIS (BSD-licensed, easily portable to new processors), and Intel MKL (commercial, supports dynamic CPU dispatching on Intel processors). For performance reasons it is better to avoid ATLAS unless you target a very exotic architecture that is not supported by other libraries.

Related

Intel TBB Performance

I took a TBB matrix multiplication example from here.
This example uses the concept of blocked_range for parallel_for loops. I also ran a couple of programs using the Intel MKL and Eigen libraries. When I compare the times taken by these implementations, MKL is the fastest, while TBB is the slowest (10 times slower than Eigen on average) for a variety of matrix sizes (2-4096). Is it normal, or am I doing something wrong? Shouldn't TBB be performing better than Eigen at least?
That looks like a really basic matrix multiplication algorithm, meant as little more than an example of how to use TBB. There are far better algorithms, and I'm fairly certain Intel MKL uses SSE/AVX/FMA instructions too.
To put it another way, there wouldn't be any point to the Intel MKL if you could replicate its performance with 20 lines of code. So yes, what you get seems normal.
At the very least, with large matrices, the algorithm needs to take cache and other details of the memory subsystem into account.
Is it normal
Yes, it is normal for one program to be slower than another by a factor of 10.
Shouldn't TBB be performing better than Eigen at least?
I don't see any reason why a naïve implementation of matrix multiplication using TBB would perform better than, or even close to, a dedicated, optimized library designed for fast linear algebra.
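For concreteness, a minimal sketch of such a naive blocked_range/parallel_for multiplication is shown below. This is my assumption about the structure of the linked example (which isn't reproduced here), not the actual code: the inner loops are the textbook triple loop, with no cache blocking and no SIMD.

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// C = A * B for dense n x n row-major matrices, parallelized over rows only.
void naive_tbb_matmul(const std::vector<float> &A, const std::vector<float> &B,
                      std::vector<float> &C, std::size_t n)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t> &rows) {
            for (std::size_t i = rows.begin(); i != rows.end(); ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    float sum = 0.0f;
                    for (std::size_t k = 0; k < n; ++k)
                        sum += A[i * n + k] * B[k * n + j];  // strided walk over B: cache-unfriendly
                    C[i * n + j] = sum;
                }
        });
}

MKL's GEMM, by contrast, blocks for each cache level, packs its operands, and uses hand-written SIMD kernels, which is where the order-of-magnitude gap comes from.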

Large Matrix Inversion

I am looking at taking the inverse of a large matrix, commonly of size 1000 x 1000, but sometimes exceeding 100000 x 100000 (which currently fails due to time and memory constraints). I know the usual sentiment is 'don't take the inverse, find some other way to do it', but that is not possible at the moment. The reason is that we use existing software that expects to receive the matrix inverse. (Note: I am looking into ways of changing this, but that will take a long time.)
At the moment we are using an LU decomposition method from Numerical Recipes, and I am currently in the process of testing the Eigen library. The Eigen library seems to be more stable and a bit faster, but I am still in the testing phase for accuracy. I have taken a quick look at other libraries such as ATLAS and LAPACK but have not done any substantial testing with these yet.
It seems that the Eigen library does not use concurrent methods to compute the inverse (though it does for the LU factorization part of the inverse), and as far as I can tell ATLAS and LAPACK share this limitation. (I am currently testing the speed difference for Eigen with OpenMP and without.)
First question: can anyone explain how it would be possible to optimize matrix inversion by parallelization? I found an article here that talks about parallel matrix inversion algorithms, but I did not understand it. It seems this article talks about another method? I am also not sure whether ScaLAPACK or PETSc would be useful.
Second question: I read this article about using GPUs to increase performance, but I have never coded for GPUs, so I have no idea what it is trying to convey; still, the charts at the bottom looked rather striking. How is this even possible, and where do I start to go about implementing something like this if it is true?
I also found this article, but have not yet had the time to read through it and understand it; it seems promising though, as memory is a current issue with our software.
Any information about these articles or the problems in general would be of great help. And again I apologize if this question seems vague, I will try to expand more if necessary.
First question: can anyone explain how it would be possible to optimize matrix inversion by parallelization?
I'd hazard a guess that this, and related topics in linear algebra, are among the most studied topics in parallel computing. If you're stuck looking for somewhere to start reading, good old Golub and Van Loan have a chapter on the topic. As to whether ScaLAPACK and PETSc are likely to be useful: certainly the former, probably the latter. Of course, they both depend on MPI, but that's kind of taken for granted in this field.
Second question ...
Use GPUs if you've got them and you can afford to translate your code into the programming model supported by your GPUs. If you've never coded for GPUs and have access to a cluster of commodity-type CPUs you'll get up to speed quicker by using the cluster than by wrestling with a novel technology.
As for the last article you refer to, it's now 10 years old in a field that changes very quickly (try finding a 10-year-old research paper on using GPUs for matrix inversion). I can't comment on its excellence or other attributes, but the problem sizes you mention seem to me to be well within the capabilities of modern clusters for in-core (to use an old term) computation. If your matrices are very big, are they also sparse?
Finally, I strongly support your apparent intention to use existing off-the-shelf codes rather than to try to develop your own.
100000 x 100000 is 80GB at double precision. You need a library that supports memory-mapped matrices on disk. I can't recommend a particular library and I didn't find anything with quick Google searches. But code from Numerical Recipes certainly isn't going to be adequate.
Regarding the first question (how to parallelize computing the inverse):
I assume you are computing the inverse by doing an LU decomposition of your matrix and then using the decomposition to solve A*B = I, where A is your original matrix, B is the matrix you solve for, and I is the identity matrix. Then B is the inverse.
The last step is easy to parallelize. Divide your identity matrix along the columns. If you have p CPUs and your matrix is n-by-n, then every part has n/p columns and n rows. Let's call the parts I1, I2, etc. On every CPU, solve a system of the form A*B1 = I1; this gives you the parts B1, B2, etc., and you can combine them to form B, which is the inverse.
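A minimal sketch of that idea, assuming Eigen and OpenMP are available (as in the question): the LU factorization is done once, then each thread solves for its own columns of the identity. The function name is mine, and it also assumes Eigen's solve on a const decomposition is safe to call concurrently.

#include <Eigen/Dense>

// Invert A by factoring once and then solving A * x = e_j for each column of I.
Eigen::MatrixXd invert_by_columns(const Eigen::MatrixXd &A)
{
    const int n = static_cast<int>(A.rows());
    Eigen::PartialPivLU<Eigen::MatrixXd> lu(A);  // serial LU factorization, done once

    Eigen::MatrixXd B(n, n);
    #pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        // e_j is the j-th identity column; B.col(j) becomes the j-th column of A^-1.
        B.col(j) = lu.solve(Eigen::VectorXd::Unit(n, j));
    }
    return B;
}

Compile with OpenMP enabled (e.g. -fopenmp); in practice you would solve for blocks of columns at a time rather than one column per iteration.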
An LU decomposition on a GPU can be ~10x faster than on a CPU. Although this is now changing, GPUs have traditionally been designed around single-precision arithmetic, and so on older hardware single-precision arithmetic is generally much faster than double-precision arithmetic. Also, storage requirements and performance will be greatly impacted by the structure of your matrices. A sparse 100,000 x 100,000 LU decomposition is a reasonable problem to solve and will not require much memory.
Unless you want to become a specialist and spend a lot of time tuning for hardware updates, I would strongly recommend using a commercial library. I would suggest CULA tools. They have both sparse and dense GPU libraries and in fact their free library offers SGETRF - a single precision (dense) LU decomp routine. You'll have to pay for their double precision libraries.
I know it's an old post, but really: OpenCL (download the version relevant to your graphics card) + OpenMP + vectorization (not in that order) is the way to go.
Anyhow, in my experience the cost of anything matrix-related is mostly the overhead of copying double arrays in and out of the system, and of padding or initializing matrices with zeros before any computation starts, especially when I am creating an .xll for Excel usage.
If I were to reprioritize, from the top:
1. Try to vectorize the code. Visual Studio 2012 and the Intel C++ compiler have auto-vectorization; I'm not sure about MinGW or GCC, but I think there are compiler flags that make the compiler analyse your for loops and generate assembly that uses the processor's vector registers instead of the normal registers to hold your data. I think Excel does this, because when I monitored Excel's threads while running its MINVERSE(), I noticed only one thread was used.
I don't know much assembly language, so I don't know how to vectorize manually (I haven't had time to learn this yet, but I really want to!).
2. Parallelize with OpenMP (omp pragmas), MPI, or a threading library that provides a parallel_for. This is very simple, but here's the catch: if your matrix class is completely single-threaded in the first place, then parallelizing an operation like matrix multiplication or inversion may be pointless, because the gains are eaten up by initializing, copying to, or simply accessing the non-parallelized matrix class.
But where parallelization does help is this: if you are designing your own matrix class and you parallelize its constructor (padding with zeros, etc.), then your LU-based solve of A*B = I will also be faster.
It is also mathematically straightforward to optimize your LU decomposition, and your forward/backward substitution, for the special case of an identity right-hand side. (I.e., don't waste time actually creating an identity matrix; just treat the element as 1 where row == col and 0 everywhere else.)
3. Once it has been parallelized (on the outer layers), the element-by-element matrix operations can be mapped onto the GPU: hundreds of processors computing elements, beat that! There is now sample Monte Carlo code available on ATI's website using ATI's OpenCL; don't worry about porting the code to something that runs on a GeForce, all you have to do is recompile.
For 2 and 3 though, remember that overheads are incurred, so there is no point unless you are handling really huge matrices. But I see 100k x 100k? Wow...
Gene

How can the C++ Eigen library perform better than specialized vendor libraries?

I was looking over the performance benchmarks: http://eigen.tuxfamily.org/index.php?title=Benchmark
I could not help but notice that Eigen appears to consistently outperform all the specialized vendor libraries. The question is: how is that possible? One would assume that MKL/GotoBLAS use processor-specific tuned code, while Eigen is rather generic.
Notice this http://download.tuxfamily.org/eigen/btl-results-110323/aat.pdf, essentially a dgemm. For N=1000 Eigen gets roughly 17 GFLOPS, while MKL gets only 12 GFLOPS.
Eigen has lazy evaluation. From How does Eigen compare to BLAS/LAPACK?:
"For operations involving complex expressions, Eigen is inherently faster than any BLAS implementation because it can handle and optimize a whole operation globally -- while BLAS forces the programmer to split complex operations into small steps that match the BLAS fixed-function API, which incurs inefficiency due to introduction of temporaries. See for instance the benchmark result of a Y = aX + bY operation which involves two calls to BLAS level1 routines while Eigen automatically generates a single vectorized loop."
The second chart in the benchmarks is Y = a*X + b*Y, which Eigen was specially designed to handle. It should be no wonder that a library wins at a benchmark it was created for. You'll notice that the more generic benchmarks, like matrix-matrix multiplication, don't show any advantage for Eigen.
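To illustrate the point, here is a rough sketch of the same Y = a*X + b*Y update written both ways (function names are mine): the Eigen expression compiles to a single fused loop over the data, while the BLAS version needs two level-1 calls and therefore two passes over Y.

#include <Eigen/Dense>
#include <cblas.h>

// One Eigen expression: expression templates fuse this into a single vectorized loop.
void axpby_eigen(double a, const Eigen::VectorXd &X, double b, Eigen::VectorXd &Y)
{
    Y = a * X + b * Y;
}

// The same update through the BLAS level-1 API: two calls, two sweeps over Y.
void axpby_blas(int n, double a, const double *X, double b, double *Y)
{
    cblas_dscal(n, b, Y, 1);        // Y = b * Y
    cblas_daxpy(n, a, X, 1, Y, 1);  // Y = a * X + Y
}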
Benchmarks are designed to be misinterpreted.
Let's look at the matrix * matrix product. The benchmark available on this page from the Eigen website tells you that Eigen (with its own BLAS) gives timings similar to the MKL for large matrices (n = 1000). I've compared Eigen 3.2.6 with MKL 11.3 on my computer (a laptop with a Core i7), and the MKL is 3 times faster than Eigen for such matrices using one thread, and 10 times faster than Eigen using 4 threads. That looks like a completely different conclusion. There are two reasons for this: Eigen 3.2.6 (its internal BLAS) does not use AVX, and it does not seem to make good use of multithreading. The benchmark hides this because it uses a CPU without AVX support and does not enable multithreading.
Usually, those C++ libraries (Eigen, Armadillo, Blaze) bring two things:
Nice operator overloading: you can use + and * with vectors and matrices. To get good performance, they have to use tricky techniques known as expression templates, in order to avoid temporaries when that reduces the timing (such as y = alpha x1 + beta x2 with y, x1, x2 vectors) and to introduce them when they are useful (such as A = B * C with A, B, C matrices). They can also reorder operations to reduce computation; for instance, if A, B, C are matrices, A * B * C can be computed as (A * B) * C or A * (B * C) depending on their sizes (e.g. if A is 1 x n and B, C are n x n, the first ordering costs about 2n^2 multiplications while the second costs about n^3 + n^2).
Internal BLAS: to compute the product of 2 matrices, they can either rely on their internal BLAS or on an externally provided one (MKL, OpenBLAS, ATLAS). On Intel chips with large matrices, the MKL is almost impossible to beat. For small matrices, one can beat the MKL, as it was not designed for that kind of problem.
Usually, when those libraries provide benchmarks against the MKL, they use old hardware and do not turn on multithreading, so they can be on par with the MKL. They might also compare BLAS level 1 operations such as y = alpha x1 + beta x2 implemented with 2 calls to a BLAS level 1 function, which is a wasteful way to do it anyway.
In a nutshell, those libraries are extremely convenient for their overloading of + and *, which is extremely difficult to do without losing performance. They usually do a good job on this. But when they give you a benchmark saying that they can be on par with or beat the MKL with their own BLAS, be careful and do your own benchmark. You'll usually get different results ;-).
About the comparison ATLAS vs. Eigen
Have a look at this thread on the Eigen mailing list starting here:
http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00052.html
It shows for instance that ATLAS outperforms Eigen on the matrix-matrix product by 46%:
http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00062.html
More benchmarks results and details on how the benchmarks were done can be found here:
Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz:
http://www.mathematik.uni-ulm.de/~lehn/bench_FLENS/index.html
http://sourceforge.net/tracker/index.php?func=detail&aid=3540928&group_id=23725&atid=379483
Edit:
For my lecture "Software Basics for High Performance Computing" I created a little framework called ulmBLAS. It contains the ATLAS benchmark suite and students could implement their own matrix-matrix product based on the BLIS papers. You can have a look at the final benchmarks which also measure Eigen:
http://www.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page14/index.html#toc5
You can use the ulmBLAS framework to make your own benchmarks.
Also have a look at
Matrix-Matrix Product Experiments with uBLAS
Matrix-Matrix Product Experiments with BLAZE
Generic code can be fast because Compile Time Function Evaluation (CTFE) allows choosing an optimal register-blocking strategy (small temporary sub-matrices stored in CPU registers).
Mir GLAS and Intel MKL are faster than Eigen and OpenBLAS.
Mir GLAS is more generic than Eigen. See also the benchmark and the reddit thread.
It doesn't seem to consistently outperform other libraries, as can be seen on the graphs further down on that page you linked. So the different libraries are optimized for different use cases, and different libraries are faster for different problems.
This is not surprising, since you usually cannot optimize perfectly for all use cases. Optimizing for one specific operation usually limits the optimization options for other use cases.
I sent the same question to the ATLAS mailing list some time ago:
http://sourceforge.net/mailarchive/message.php?msg_id=28711667
Clint (the ATLAS developer) does not trust these benchmarks. He suggested a more trustworthy benchmark procedure. As soon as I have some free time, I will do this kind of benchmarking.
If the BLAS functionality of Eigen is actually faster than that of GotoBLAS/GotoBLAS2, ATLAS, or MKL, then they should provide a standard BLAS interface anyway. This would allow linking LAPACK against such an Eigen-BLAS. In that case, it would also be an interesting option for Matlab and friends.

LAPACK/BLAS versus simple "for" loops

I want to migrate a piece of code that involves a number of vector and matrix calculations to C or C++, the objective being to speed up the code as much as possible.
Are linear algebra calculations with for loops in C code as fast as calculations using LAPACK/BLAS, or is there some unique speedup from using those libraries?
In other words, could simple C code (using for loops and the like) perform linear algebra calculations as fast as code that utilizes LAPACK/BLAS?
Vendor-provided LAPACK/BLAS libraries (Intel's IPP/MKL have been mentioned, but there's also AMD's ACML, and other CPU vendors like IBM/Power or Oracle/SPARC provide equivalents as well) are often highly optimized for specific CPU capabilities, which significantly boosts performance on large datasets.
Often, though, you've got very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. operations used in 3D geometry processing), and for that sort of thing BLAS/LAPACK are overkill, because of the initial tests these subroutines do to decide which codepaths to choose depending on properties of the data set. In those situations, simple C/C++ source code, maybe using SSE2...4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
That's why, for example, Intel has two libraries - MKL for large linear algebra datasets, and IPP for small (graphics vectors) data sets.
In that sense,
what exactly is your data set ?
What matrix/vector sizes ?
What linear algebra operations ?
Also, regarding "simple for loops": Give the compiler the chance to vectorize for you. I.e. something like:
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
    vecmul[i]   = src1[i]   * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];
might be a better feed to a vectorizing compiler than the plain
for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i]*src2[i];
expression. In some ways, what you mean by calculations with for loops will have a significant impact.
If your vector dimensions are large enough though, the BLAS version,
dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);
will be cleaner code and likely faster.
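Put together, a self-contained comparison might look like the sketch below (assuming a CBLAS implementation is linked in; everything else is illustrative): the plain loop gives the compiler's vectorizer a chance, and the cblas_ddot call hands the same reduction to the tuned library.

#include <cstdio>
#include <vector>
#include <cblas.h>

int main()
{
    const int n = 1 << 20;                          // large enough for BLAS to pay off
    std::vector<double> src1(n, 1.5), src2(n, 2.0);

    // Plain loop: let the compiler auto-vectorize it (build with -O2/-O3).
    double dotprod = 0.0;
    for (int i = 0; i < n; ++i)
        dotprod += src1[i] * src2[i];

    // Same reduction through CBLAS.
    double dotprod_blas = cblas_ddot(n, src1.data(), 1, src2.data(), 1);

    std::printf("loop: %f  blas: %f\n", dotprod, dotprod_blas);
    return 0;
}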
On the reference side, these might be of interest:
Intel Math Kernel Libraries Documentation (LAPACK / BLAS and others optimized for Intel CPUs)
Intel Performance Primitives Documentation (optimized for small vectors / geometry processing)
AMD Core Math Libraries (LAPACK / BLAS and others for AMD CPUs)
Eigen Libraries (a "nicer" linear algebra interface)
Probably not. People have put quite a bit of work into ensuring that LAPACK/BLAS routines are optimized and numerically stable. While the code is often somewhat on the complex side, it's usually that way for a reason.
Depending on your intended target(s), you might want to look at the Intel Math Kernel Library. At least if you're targeting Intel processors, it's probably the fastest you're going to find.
Numerical analysis is hard. At the very least, you need to be intimately aware of the limitations of floating point arithmetic, and know how to sequence operations so that you balance speed with numerical stability. This is non-trivial.
You need to actually have some clue about the balance between speed and stability you actually need. In more general software development, premature optimization is the root of all evil. In numerical analysis, it is the name of the game. If you don't get the balance right the first time, you will have to re-write more-or-less all of it.
And it gets harder when you try to adapt linear algebra proofs into algorithms. You need to actually understand the algebra, so that you can refactor it into a stable (or stable enough) algorithm.
If I were you, I'd target the LAPACK/BLAS API and shop around for the library that works for your data set.
You have plenty of options: LAPACK/BLAS, GSL and other self-optimizing libraries, and vendor libraries.
I don't know these libraries very well, but you should consider that libraries usually run a number of checks on their parameters, have an error-reporting mechanism, and even assign new variables when you call a function... If the calculations are trivial, maybe you can try doing it yourself, adapting it to your needs...