Intel TBB Performance - C++

I took a TBB matrix multiplication example from here.
This example uses the concept of blocked_range for parallel_for loops. I also ran a couple of programs using the Intel MKL and Eigen libraries. When I compare the times taken by these implementations, MKL is the fastest, while TBB is the slowest (10 times slower than Eigen on average) for a variety of matrix sizes (2-4096). Is this normal, or am I doing something wrong? Shouldn't TBB at least perform better than Eigen?

That looks like a really basic matrix multiplication algorithm, meant as little more than an example of how to use TBB. There are far better algorithms, and I'm fairly certain the Intel MKL will be using SSE / AVX / FMA instructions too.
To put it another way, there wouldn't be any point to the Intel MKL if you could replicate its performance with 20 lines of code. So yes, what you get seems normal.
At the very least, with large matrices, the algorithm needs to take cache and other details of the memory subsystem into account.
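To make the cache point concrete, here is a minimal sketch (my own illustrative code, not the linked example) of a cache-blocked multiplication of square row-major matrices driven by tbb::parallel_for. The tile size is a guess to be tuned, and a real kernel would additionally pack tiles and use SIMD; the point is only that the memory access pattern, not the parallel_for, dominates the runtime.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <algorithm>
#include <cstddef>

// C += A * B for n x n row-major matrices; C is assumed zero-initialized.
void blocked_matmul(const float* A, const float* B, float* C, std::size_t n) {
    const std::size_t BS = 64;  // tile size: tune so three tiles fit in cache
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n, BS),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t ii = r.begin(); ii < r.end(); ii += BS)
                for (std::size_t kk = 0; kk < n; kk += BS)
                    for (std::size_t jj = 0; jj < n; jj += BS)
                        // work on one tile while it is hot in cache
                        for (std::size_t i = ii; i < std::min(ii + BS, r.end()); ++i)
                            for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                                const float a = A[i * n + k];
                                for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                                    C[i * n + j] += a * B[k * n + j];
                            }
        });
}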

Is it normal
Yes, it is normal for one program to be slower than another by a factor of 10.
Shouldn't TBB perform better than Eigen at least?
I don't see any reason why a naïve implementation of matrix multiplication using TBB would perform better than, or even come close to, a dedicated, optimized library designed for fast linear algebra.

Related

using SSE in C: how to load distributed values in register [duplicate]


Efficient SSE NxN matrix multiplication

I'm trying to implement SSE version of large matrix by matrix multiplication.
I'm looking for an efficient algorithm based on SIMD implementations.
My desired method looks like:
A(n x m) * B(m x k) = C(n x k)
And all matrices are considered to be 16-byte aligned float array.
I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it to be as efficient as possible, and I don't want to use the Eigen library or similar libraries (only SSE3, to be more specific).
So I'd appreciate it if anyone could point me to some articles or resources on how to start implementing this.
The main challenge in implementing arbitrary-size matrix-matrix multiplication is not the use of SIMD, but the reuse of cached data. The paper Anatomy of High-Performance Matrix Multiplication by Goto and Van de Geijn is a must-read if you want to implement cache-friendly matrix-matrix multiplication, and it also discusses how to choose SIMD-friendly kernels. After reading this paper, expect to achieve about 50% of machine peak on matrix-matrix multiplication after two weeks of effort.
However, if the purpose of this work is not pure learning, I strongly recommend using a highly optimized library. On x86 your best options are OpenBLAS (BSD-licensed, supports dynamic CPU dispatching), BLIS (BSD-licensed, easily portable to new processors), and Intel MKL (commercial, supports dynamic CPU dispatching on Intel processors). For performance reasons it is better to avoid ATLAS unless you target a very exotic architecture that is not supported by the other libraries.
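As a starting point, here is a sketch of the kind of register-blocked micro-kernel those papers build on (my own illustrative code): it computes a 4x4 block of C += A*B with plain SSE, assuming row-major storage, 16-byte-aligned pointers and row strides that are multiples of 4. A real GEMM wraps such a kernel in packing and cache-blocking loops.

#include <xmmintrin.h>  // SSE intrinsics

// C(4x4) += A(4xK) * B(Kx4); lda, ldb, ldc are row strides in floats.
void kernel_4x4(const float* A, const float* B, float* C,
                int K, int lda, int ldb, int ldc) {
    __m128 c0 = _mm_load_ps(C + 0 * ldc);
    __m128 c1 = _mm_load_ps(C + 1 * ldc);
    __m128 c2 = _mm_load_ps(C + 2 * ldc);
    __m128 c3 = _mm_load_ps(C + 3 * ldc);
    for (int k = 0; k < K; ++k) {
        __m128 b = _mm_load_ps(B + k * ldb);   // one row of B
        c0 = _mm_add_ps(c0, _mm_mul_ps(_mm_set1_ps(A[0 * lda + k]), b));
        c1 = _mm_add_ps(c1, _mm_mul_ps(_mm_set1_ps(A[1 * lda + k]), b));
        c2 = _mm_add_ps(c2, _mm_mul_ps(_mm_set1_ps(A[2 * lda + k]), b));
        c3 = _mm_add_ps(c3, _mm_mul_ps(_mm_set1_ps(A[3 * lda + k]), b));
    }
    _mm_store_ps(C + 0 * ldc, c0);
    _mm_store_ps(C + 1 * ldc, c1);
    _mm_store_ps(C + 2 * ldc, c2);
    _mm_store_ps(C + 3 * ldc, c3);
}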

How can the C++ Eigen library perform better than specialized vendor libraries?

I was looking over the performance benchmarks: http://eigen.tuxfamily.org/index.php?title=Benchmark
I could not help but notice that Eigen appears to consistently outperform all the specialized vendor libraries. The question is: how is this possible? One would assume that MKL/GotoBLAS use processor-specific tuned code, while Eigen is rather generic.
Notice this benchmark (http://download.tuxfamily.org/eigen/btl-results-110323/aat.pdf), essentially a dgemm: for N=1000, Eigen gets roughly 17 GFlops, MKL only 12 GFlops.
Eigen has lazy evaluation. From How does Eigen compare to BLAS/LAPACK?:
For operations involving complex expressions, Eigen is inherently faster than any BLAS implementation because it can handle and optimize a whole operation globally -- while BLAS forces the programmer to split complex operations into small steps that match the BLAS fixed-function API, which incurs inefficiency due to introduction of temporaries. See for instance the benchmark result of a Y = aX + bY operation which involves two calls to BLAS level 1 routines while Eigen automatically generates a single vectorized loop.
The second chart in the benchmarks is Y = a*X + b*Y, which Eigen was specially designed to handle. It should be no wonder that a library wins at a benchmark it was created for. You'll notice that the more generic benchmarks, like matrix-matrix multiplication, don't show any advantage for Eigen.
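To make that concrete, here is a small sketch (the wrapper function names are mine) contrasting the fused Eigen expression with the two BLAS level-1 calls the quote refers to:

#include <Eigen/Dense>
#include <cblas.h>

// y = a*x + b*y with Eigen: expression templates fuse it into one loop, no temporary.
void axpby_eigen(double a, const Eigen::VectorXd& x, double b, Eigen::VectorXd& y) {
    y = a * x + b * y;
}

// The same update through the fixed-function BLAS API: two passes over y.
void axpby_cblas(int n, double a, const double* x, double b, double* y) {
    cblas_dscal(n, b, y, 1);        // first pass:  y = b*y
    cblas_daxpy(n, a, x, 1, y, 1);  // second pass: y = a*x + y
}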
Benchmarks are designed to be misinterpreted.
Let's look at the matrix * matrix product. The benchmark available on this page of the Eigen website tells you that Eigen (with its own BLAS) gives timings similar to the MKL for large matrices (n = 1000). I've compared Eigen 3.2.6 with MKL 11.3 on my computer (a laptop with a Core i7): the MKL is 3 times faster than Eigen for such matrices using one thread, and 10 times faster using 4 threads. That is a completely different conclusion, for two reasons: Eigen 3.2.6 (its internal BLAS) does not use AVX, and it does not seem to make good use of multithreading. The published benchmark hides this because it was run on a CPU without AVX support and with multithreading turned off.
Usually, those C++ libraries (Eigen, Armadillo, Blaze) bring two things:
Nice operator overloading: You can use + and * with vectors and matrices. To get good performance, they have to use the tricky technique known as expression templates to avoid temporaries when that reduces the runtime (such as y = alpha x1 + beta x2 with y, x1, x2 vectors) and to introduce them when they are useful (such as A = B * C with A, B, C matrices). They can also reorder operations to do less computation: for instance, if A, B, C are matrices, A * B * C can be computed as (A * B) * C or A * (B * C) depending on their sizes (see the flop-count sketch after this list).
Internal BLAS: To compute the product of 2 matrices, they can either rely on their internal BLAS or on one provided externally (MKL, OpenBLAS, ATLAS). On Intel chips with large matrices, the MKL is almost impossible to beat. For small matrices, one can beat the MKL, as it was not designed for that kind of problem.
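Regarding the reordering point above, a quick back-of-the-envelope sketch (the dimensions are made up for illustration) of why the parenthesization of A * B * C matters:

#include <cstdio>

int main() {
    // A is n x m, B is m x k, C is k x p; a naive product of an (r x s)
    // matrix with an (s x t) matrix costs r*s*t multiply-adds.
    long n = 1000, m = 1000, k = 1000, p = 1;
    long left  = n * m * k + n * k * p;  // (A*B)*C : ~1.0e9 + 1.0e6 multiply-adds
    long right = m * k * p + n * m * p;  // A*(B*C) : ~1.0e6 + 1.0e6 multiply-adds
    std::printf("(A*B)*C: %ld, A*(B*C): %ld multiply-adds\n", left, right);
    return 0;
}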
Usually, when those libraries provide benchmarks against the MKL, they use old hardware and do not turn on multithreading, so they can be on par with the MKL. They might also compare BLAS level 1 operations such as y = alpha x1 + beta x2 with 2 calls to a BLAS level 1 function, which is a stupid thing to do anyway.
In a nutshell, those libraries are extremely convenient for their overloading of + and *, which is extremely difficult to do without losing performance, and they usually do a good job of it. But when they give you benchmarks saying that they can be on par with or beat the MKL with their own BLAS, be careful and do your own benchmark. You'll usually get different results ;-).
About the comparison ATLAS vs. Eigen
Have a look at this thread on the Eigen mailing list starting here:
http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00052.html
It shows for instance that ATLAS outperforms Eigen on the matrix-matrix product by 46%:
http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00062.html
More benchmarks results and details on how the benchmarks were done can be found here:
Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz:
http://www.mathematik.uni-ulm.de/~lehn/bench_FLENS/index.html
http://sourceforge.net/tracker/index.php?func=detail&aid=3540928&group_id=23725&atid=379483
Edit:
For my lecture "Software Basics for High Performance Computing" I created a little framework called ulmBLAS. It contains the ATLAS benchmark suite and students could implement their own matrix-matrix product based on the BLIS papers. You can have a look at the final benchmarks which also measure Eigen:
http://www.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page14/index.html#toc5
You can use the ulmBLAS framework to make your own benchmarks.
Also have a look at
Matrix-Matrix Product Experiments with uBLAS
Matrix-Matrix Product Experiments with BLAZE
Generic code can be fast because compile-time function evaluation (CTFE) allows choosing an optimal register-blocking strategy (small temporary sub-matrices stored in CPU registers).
Mir GLAS and Intel MKL are faster than Eigen and OpenBLAS.
Mir GLAS is more generic compared to Eigen. See also the benchmark and reddit thread.
It doesn't seem to consistently outperform other libraries, as can be seen on the graphs further down on that page you linked. So the different libraries are optimized for different use cases, and different libraries are faster for different problems.
This is not surprising, since you usually cannot optimize perfectly for all use cases. Optimizing for one specific operation usually limits the optimization options for other use cases.
I sent the same question to the ATLAS mailing list some time ago:
http://sourceforge.net/mailarchive/message.php?msg_id=28711667
Clint (the ATLAS developer) does not trust these benchmarks. He suggested some trustworthy benchmark procedure. As soon as I have some free time I will do this kind of benchmarking.
If the BLAS functionality of Eigen is actually faster than that of GotoBLAS, ATLAS, and MKL, then they should provide a standard BLAS interface anyway. This would allow linking LAPACK against such an Eigen-BLAS. In that case, it would also be an interesting option for Matlab and friends.

Library for solving system of linear equations with tridiagonal matrix?

I am modelling a physical system with heat conduction, and to do numerical calculations I need to solve a system of linear equations with a tridiagonal matrix. I am using this algorithm to get results: http://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm But I am afraid that my method is straightforward and not optimal. What C++ library should be used to solve that system in the fastest way? I should also mention that the matrix does not change often (only the right-hand side of the equation changes). Thanks!
Check out Eigen.
The performance of this algorithm is likely dominated by floating-point division. Use SSE2 to perform two divisions (of ci and di) at once, and you will get close to optimal performance.
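For reference, a minimal sketch of the Thomas algorithm from the linked Wikipedia page (variable names are mine). Since the matrix rarely changes, note that the modified super-diagonal and the denominators depend only on the matrix, so they could be precomputed once and reused for every new right-hand side.

#include <vector>
#include <cstddef>

// Solve a tridiagonal system: a = sub-diagonal (a[0] unused), b = diagonal,
// c = super-diagonal (the value of c[n-1] is irrelevant), d = right-hand side,
// overwritten with the solution. No pivoting: assumes diagonal dominance,
// which is typical for heat-conduction discretizations.
void thomas_solve(const std::vector<double>& a, const std::vector<double>& b,
                  const std::vector<double>& c, std::vector<double>& d) {
    const std::size_t n = d.size();
    std::vector<double> cp(n);  // modified super-diagonal c'
    cp[0] = c[0] / b[0];
    d[0]  = d[0] / b[0];
    for (std::size_t i = 1; i < n; ++i) {           // forward sweep
        double denom = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / denom;                       // the two divisions per row
        d[i]  = (d[i] - a[i] * d[i - 1]) / denom;   // mentioned above
    }
    for (std::size_t i = n - 1; i-- > 0; )          // back substitution
        d[i] -= cp[i] * d[i + 1];
}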
It is worth looking at the LAPACK and BLAS interfaces, of which there are several implementation libraries: originally netlib, which is open source, and then others such as MKL that you have to pay for. The function dgtsv does what you are looking for. The open-source netlib versions don't use any explicit SIMD instructions, but MKL does and will perform best on Intel chips.
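As a sketch of what calling dgtsv looks like through the LAPACKE C interface (the wrapper function is mine; note that dgtsv overwrites the diagonals, so copies are passed by value here):

#include <lapacke.h>
#include <vector>

// dl/du: sub/super-diagonals (length n-1), d: diagonal (length n),
// b: right-hand side, overwritten with the solution. Returns the LAPACK info code.
int solve_tridiag(std::vector<double> dl, std::vector<double> d,
                  std::vector<double> du, std::vector<double>& b) {
    const lapack_int n = static_cast<lapack_int>(d.size());
    return LAPACKE_dgtsv(LAPACK_COL_MAJOR, n, /*nrhs=*/1,
                         dl.data(), d.data(), du.data(), b.data(), /*ldb=*/n);
}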

LAPACK/BLAS versus simple "for" loops

I want to migrate a piece of code that involves a number of vector and matrix calculations to C or C++, the objective being to speed up the code as much as possible.
Are linear algebra calculations with for loops in C code as fast as calculations using LAPACK/BLAS, or is there some unique speedup from using those libraries?
In other words, could simple C code (using for loops and the like) perform linear algebra calculations as fast as code that utilizes LAPACK/BLAS?
Vendor-provided LAPACK / BLAS libraries (Intel's IPP/MKL have been mentioned, but there's also AMD's ACML, and other CPU vendors like IBM/Power or Oracle/SPARC provide equivalents as well) are often highly optimized for specific CPU abilities that'll significantly boost performance on large datasets.
Often, though, you've got very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. operations used in 3D geometry processing), and for those sorts of things BLAS/LAPACK are overkill, because of the initial tests those routines perform to choose code paths depending on properties of the data set. In those situations, simple C/C++ source code, maybe using SSE2...4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
That's why, for example, Intel has two libraries - MKL for large linear algebra datasets, and IPP for small (graphics vectors) data sets.
In that sense:
What exactly is your data set?
What matrix/vector sizes?
What linear algebra operations?
Also, regarding "simple for loops": Give the compiler the chance to vectorize for you. I.e. something like:
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {   /* assumes DIM_OF_MY_VECTOR is a multiple of 4 */
    vecmul[i]   = src1[i]   * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];
might be a better feed to a vectorizing compiler than the plain
for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i]*src2[i];
expression. In some ways, what you mean by calculations with for loops will have a significant impact.
If your vector dimensions are large enough though, the BLAS version,
dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);
will be cleaner code and likely faster.
On the reference side, these might be of interest:
Intel Math Kernel Libraries Documentation (LAPACK / BLAS and others optimized for Intel CPUs)
Intel Performance Primitives Documentation (optimized for small vectors / geometry processing)
AMD Core Math Libraries (LAPACK / BLAS and others for AMD CPUs)
Eigen Libraries (a "nicer" linear algebra interface)
Probably not. People have put quite a bit of work into ensuring that LAPACK/BLAS routines are optimized and numerically stable. While the code is often somewhat on the complex side, it's usually that way for a reason.
Depending on your intended target(s), you might want to look at the Intel Math Kernel Library. At least if you're targeting Intel processors, it's probably the fastest you're going to find.
Numerical analysis is hard. At the very least, you need to be intimately aware of the limitations of floating point arithmetic, and know how to sequence operations so that you balance speed with numerical stability. This is non-trivial.
You need to actually have some clue about the balance between speed and stability you actually need. In more general software development, premature optimization is the root of all evil. In numerical analysis, it is the name of the game. If you don't get the balance right the first time, you will have to re-write more-or-less all of it.
And it gets harder when you try to adapt linear algebra proofs into algorithms. You need to actually understand the algebra, so that you can refactor it into a stable (or stable enough) algorithm.
If I were you, I'd target the LAPACK/BLAS API and shop around for the library that works for your data set.
You have plenty of options: LAPACK/BLAS, GSL and other self-optimizing libraries, and vendor libraries.
I don't know these libraries very well, but you should consider that library routines usually do a number of checks on their parameters, have an error-reporting mechanism, and incur some overhead for argument passing on every call. If the calculations are trivial, maybe you can do them yourself, adapted to your needs.