I recently write a C++ program using eigen. But I found that my poor CPI cannot handle the large computation load. I estimate it demands at least a day of computation and I am short of time, especially since I may need debugging time too. I'm considering using GPU acceleration, but I don't know how to adapt my code for CUDA (or some other platforms, but I prefer CUDA).
Are there any convenient methods to apply GPU acceleration on eigen program?
cuSolver is a linar algebra library by NVIDIA. It provides fast and efficient eigen solver among tons of other linear algebra facilities.
You might also want to take a look at the SDK examples (7_Libraries).
Here's the link to the documentation.
Other useful libraries are
MAGMA - https://developer.nvidia.com/magma
CULA - http://www.culatools.com/
I took a TBB matrix multiplication from here
This example uses the concept of blocked_range for parallel_for loops. I also ran a couple of programs using Intel MKL and eigen libraries. When I compare the times taken by these implementations, MKL is the fastest, while TBB is the slowest (10 times slower than eigen on an average) for a variety of matrix sizes (2-4096). Is it normal or am I doing something wrong ? Shouldn't TBB performing better than eigen at least ?
That looks like a really basic matrix multiplication algorithm, meant as little more than an example on how to use TBB. There are far better ones and I'm fairly certain the intel MKL will be using SSE / AVX / FMA instructions too.
To put it another way, there wouldn't be any point to the Intel MKL if you could replicate its performance with 20 lines of code. So yes, what you get seems normal.
At the very least, with large matrices, the algorithm needs to take cache and other details of the memory subsystem into account.
Is it normal
Yes, it is normal for one program to be slower than another by a factor of 10.
Shouldn't TBB performing better than eigen at least ?
I don't see any reason why a naïve implementation of matrix multiplication using TBB would perform better, or even close to the performance of a dedicated, optimized library designed for fast linear algebra.
I'm trying to implement SSE version of large matrix by matrix multiplication.
I'm looking for an efficient algorithm based on SIMD implementations.
My desired method looks like:
A(n x m) * B(m x k) = C(n x k)
And all matrices are considered to be 16-byte aligned float array.
I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it as efficient as possible and I don't want to use Eigen library or similar libraries. (Only SSE3 to be more specific).
So I'd appreciate if anyone can help me find some articles or resources on how to start implementing this.
The main challenge in implementation of arbitrary-size matrix-matrix multiplication is not the use of SIMD, but reuse of cached data. The paper Anatomy of High-Performance Matrix Multiplication by Goto and Van de Geijn is a must-read if you want to implement cache-friendly matrix-matrix multiplication, and it also discusses the choice of kernels to be SIMD-friendly. After reading this paper expect to achieve 50% of machine peak on matrix-matrix multiplication after two weeks of efforts.
However, if the purpose of this work is not pure learning, I strongly recommend to use a highly optimized library. On x86 your best options are OpenBLAS (BSD-licensed, supports dynamic CPU dispatching), BLIS (BSD-licensed, easily portable to new processors), and Intel MKL (commercial, supports dynamic CPU dispatching on Intel processors). For performance reasons it is better to avoid ATLAS unless you target a very exotic architecture which is not supported by other libraries.
I wrote a code which solves large systems of PDEs using some discretization methods which basically involves solving large, sparse systems Ax=b many times at each time steps.
I currently use the PARDISO solver (from the intel MKL library) which is a direct LU factorization of A to solve the system. I would like to compare this method with the use of iterative solvers (which, with the use of preconditioners, might perform better since I could use the same preconditioner over many time steps if my Jacobian matrix does not vary too much).
My question is then, what library do you suggest for sparse iterative solvers in fortran ? I found one (SLATEC) which was written in 1993 so I wonder if there is something more performant which was written more recently ?
Thanks :)
I would also add:
LIS
AGMG
Oh, well ... the complete list of Linear Algebra Solvers
Thanks for the comments, PETSc seems to be exactly what I was looking for, just need to learn how to link C calls into fortran now :)
I'm writing a software for hyperbolic partial differential equations in c++. Almost all notations are vector and matrix ones. On top of that, I need the linear algebra solver. And yes, the vector's and matrix's sizes can vary considerably (from say 1000 to sizes that can be solved only by distributed memory computing, eg. clusters or similar architecture). If I had lived in utopia, I'd had had linear solver which scales great for clusters, GPUs and multicores.
When thinking about the data structure that should represent the variables, I came accros the boost.ublas and MTL4.
Both libraries are blas level 3 compatible, MTL4 implements sparse solver and is much faster than ublas. They both don't have implemented support for multicore processors, not to mention parallelization for distributed memory computations. On the other hand, the development of MTL4 depends on sole effort of 2 developers (at least as I understood), and I'm sure there is a reason that the ublas is in the boost library. Furthermore, intel's mkl library includes the example for binding their structure with ublas.
I'd like to bind my data and software to the data structure that will be rock solid, developed and maintained for long period of time.
Finally, the question. What is your experience with the use of ublas and/or mtl4, and what would you recommend?
thanx,
mightydodol
With your requirements, I would probably go for BOOST::uBLAS. Indeed, a good deployment of uBLAS should be roughly on par with MTL4 regarding speed.
The reason is that there exist bindings for ATLAS (hence shared-memory parallelization that you can efficiently optimize for your computer), and also vendor-tuned implementations like the Intel Math Kernel Library or HP MLIB.
With these bindings, uBLAS with a well-tuned ATLAS / BLAS library doing the math should be fast enough. If you link against a given BLAS / ATLAS, you should be roughly on par with MTL4 linked against the same BLAS / ATLAS using the compiler flag -DMTL_HAS_BLAS, and most likely faster than the MTL4 without BLAS according to their own observation (example see here, where GotoBLAS outperforms MTL4).
To sum up, speed should not be your decisive factor as long as you are willing to use some BLAS library. Usability and support is more important. You have to decide, whether MTL or uBLAS is better suited for you. I tend towards uBLAS given that it is part of BOOST, and MTL4 currently only supports BLAS selectively. You might also find this slightly dated comparison of scientific C++ packages interesting.
One big BUT: for your requirements (extremely big matrices), I would probably skip the "syntactic sugar" uBLAS or MTL, and call the "metal" C interface of BLAS / LAPACK directly. But that's just me... Another advantage is that it should be easier than to switch to ScaLAPACK (distributed memory LAPACK, have never used it) for bigger problems. Just to be clear: for house-hold problems, I would not suggest calling a BLAS library directly.
If you're programming vectors, matrices, and linear algebra in C++, I'd look at Eigen:
http://eigen.tuxfamily.org/
It's faster than uBLAS (not sure about MTL4) and much cleaner syntax.
For new projects, it's probably best to stay away from Boost's uBlas. The uBlas FAQ even has this warning since late 2012:
Q: Should I use uBLAS for new projects?
... the last major improvement of uBLAS was in 2008 and no significant change was committed since 2009. ... Performance? There are faster alternatives. Cutting edge? uBLAS is more than 10 years old and missed all new stuff from C++11.
There is one C++ library missing in this list: FLENS
http://flens.sf.net
Disclaimer: Yes, this is my baby
It is header only
Comes with a simple, non-performant, generic (i.e. templated) C++ reference implemenation of BLAS.
If available you can use an optimized BLAS implementation as backend. In this case its like using BLAS directly (some Benchmark I should update).
You can use overloaded operators instead of calling BLAS functions.
It comes with its own, stand-alone, generic re-implemenation of a bunch of LAPACK functions. We call this port FLENS-LAPACK.
FLENS-LAPACK has exactly the same accuracy and performance as Netlib's LAPACK. And in my experience (FLENS-)LAPACK+ATLAS or (FLENS-)LAPACK+OpenBLAS gives you the same performance as ACML or MKL.
FLENS has a different policy regarding the creation of temporary vector/matrices in the evaluation of linear algebra expressions. The FLENS policy is: Never create them!!!. However, in a special debug-mode we allow the creation of temporaries "when necessary". This "when necessary" policy thing is the default in other libraries like Eigen or Armadillo or in Matlab.
You can see the performance differences directly here:
http://www.osl.iu.edu/research/mtl/mtl4/doc/performance.php3
Both are reasonable libraries to use in terms of their interfaces, I don't think that because uBLAS got through the BOOST review process it's necessarily way more robust. I've had my share of nightmares with unobvious side effects and unintended consequences from uBLAS implementations.
That's not to say uBLAS is bad, it's really good, but I think given the dramatic performances differences for MTL these days, it's worth using it instead of uBLAS even though it's arguably a bit more risky becuase of it's "only 2 developer" support group.
At the end of the day, it's about speed with a matrix library, go with MTL4.
From my own experience, MTL4 is much faster than uBLAS and it is also faster than Eigen.
There is a parallel version of MTL4. Just have a look at simunova