Armadillo + OpenBLAS multi-threading - c++

I have successfully used Armadillo coupled with OpenBLAS in my master's thesis on Ubuntu 14.04 64-bit (both with Armadillo installed and without installation). The performance was very impressive. My code consisted mainly of basic matrix operations, and all of them were carried out using all available threads.
Now I am trying to use Armadillo with OpenBLAS on a Windows 7 64-bit machine in Visual Studio 2013. I found some help online and successfully added the PThread library. The code itself works, but the performance is poor. I tested three basic operations on 1000x1000 matrices: addition, multiplication and element-wise multiplication. Of these three, only classical multiplication uses all the CPU power. The other two use 25% CPU, which suggests they run on a single thread.
I have not encountered this behavior on Ubuntu. Does anyone have any suggestions? I haven't found any link where someone had a similar issue.

Are you sure that OpenBLAS is using multiple threads on Ubuntu for addition and element-wise multiplication? Intuitively I'd expect those operations to be BW-limited rather than FPU-limited, so I'd guess multithreading wouldn't help that much?
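To put rough numbers on the bandwidth argument, here is a minimal sketch that estimates flops per byte of memory traffic for the two kinds of operation. The function names are hypothetical, and the model is deliberately crude: it assumes n x n matrices of doubles and that each matrix crosses the memory bus exactly once, which is a best case for multiplication.

```cpp
#include <cassert>
#include <cstddef>

// Flops per byte for C = A + B (or element-wise A % B) on n x n doubles:
// n*n arithmetic operations; two matrices are read and one is written.
double intensity_elementwise(std::size_t n) {
    assert(n > 0);
    const double flops = static_cast<double>(n) * n;
    const double bytes = 3.0 * n * n * sizeof(double);
    return flops / bytes;
}

// Flops per byte for C = A * B in the best case: 2*n^3 flops
// (one multiply and one add per term), each matrix touched once.
double intensity_matmul(std::size_t n) {
    assert(n > 0);
    const double flops = 2.0 * n * n * n;
    const double bytes = 3.0 * n * n * sizeof(double);
    return flops / bytes;
}
```

With 8-byte doubles and n = 1000, the element-wise case does about 0.04 flops per byte while multiplication can reach roughly 83, so under this model extra threads mostly end up waiting on memory for addition and element-wise multiplication.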

Related

Parallelisation in Armadillo

The Armadillo C++ linear algebra library documentation states one of the reasons for developing the library in C++ to be "ease of parallelisation via OpenMP present in modern C++ compilers", but the Armadillo code does not use OpenMP. How can I gain the benefits of parallelisation with Armadillo? Is this achieved by using one of the high-speed LAPACK and BLAS replacements? My platform is Linux, Intel processor but I suspect there is a generic answer to this question.
Okay, so it appears that parallelisation is indeed achieved by using the high-speed LAPACK and BLAS replacements. On Ubuntu 12.04 I installed OpenBLAS using the package manager and built the Armadillo library from source. The examples in the examples folder build and run, and I can control the number of cores using the OPENBLAS_NUM_THREADS environment variable.
I created a small project openblas-benchmark which measures the performance increase of Armadillo when computing a matrix product C=AxB for various size matrices but I could only test it on a 2-core machine so far.
The performance plot shows nearly 50% reduction in execution time for matrices larger than 512x512. Note that both axes are logarithmic; each grid line on the y axis represents a doubling in execution time.

Parallelize SVD computations c++

I would like to do an SVD factorization of a large matrix (1000-25000 x 4096) in C++. I have tried LAPACKE dgesdd, Armadillo svd/svd_econ and Eigen, but all of them seem to be single-threaded and quite slow. I'm also currently trying to implement a solution based on redsvd.
Do you have any suggestions on how to implement a fast SVD factorization, preferably using multi-threading? I've noticed that Matlab uses a multi-threaded SVD, so it should be possible.
Also, I'm running g++ on a 64-bit Linux machine if that would be of any importance.
Thank you in advance.
Intel's Math Kernel Library (MKL) offers a parallel implementation of LAPACK, including the LAPACKE interface. It is available for Linux as well.

Fast LAPACK/BLAS for matrix multiplication

I'm exploring the Armadillo C++ library for linear algebra at the moment. As far as I understand, it uses a LAPACK/BLAS library for basic matrix operations (e.g. matrix multiplication). As a Windows user I downloaded LAPACK/BLAS from here: http://icl.cs.utk.edu/lapack-for-windows/lapack/#running. The problem is that matrix multiplications are very slow compared to Matlab or even R. For example, Matlab multiplies two 1000x1000 matrices in ~0.15 seconds on my computer, R needs ~1 second, while C++/Armadillo/LAPACK/BLAS needs more than 10 seconds for that.
So, Matlab is based on highly optimized libraries for linear algebra. My question is whether there is a faster LAPACK/BLAS library to use from Armadillo. Alternatively, is there a way to extract the Matlab linear algebra libraries somehow and use them in C++?
LAPACK doesn't do matrix multiplication. It's BLAS that provides matrix multiplication.
If you have a 64 bit operating system, I recommend to first try a 64 bit version of BLAS. This will get you an immediate doubling of performance.
Secondly, have a look at a high-performance implementation of BLAS, such as OpenBLAS. OpenBLAS uses both vectorisation and parallelisation (ie. multi-core). It is a free (no cost) open source project.
Matlab internally uses the Intel MKL library, which you can also use with the Armadillo library. Intel MKL is closed source, but is free for non-commercial use. Note that OpenBLAS can obtain matrix multiplication performance that is on par with or better than Intel MKL.
Note that high performance linear algebra is generally easier to accomplish on Linux and Mac OS X than on Windows.
Adding to what has already been said, you should also use a high level of optimization:
Be sure to use either the O2 or the O3 compiler flag.
Link to the above-mentioned high-performance (and possibly multi-threaded) BLAS libraries. AFAIK MKL is only freely available for Unix platforms, though if you're using a Linux-like environment such as cygwin inside Windows, this should be OK, I guess. OpenBLAS is also multi-threaded.
In many libraries, setting the symbol NDEBUG (e.g. passing the compiler flag -DNDEBUG) turns off costly range checking and assertions. Armadillo has its own symbol, called ARMA_NO_DEBUG, which you can either set manually, or you can edit the config.hpp header file (located in the armadillo include directory) and uncomment the corresponding line. I am guessing since you were able to turn on external BLAS usage in armadillo, you should be familiar with this config file anyways...
I did a quick comparison between armadillo/MKL_BLAS and Matlab on my intel core-i7 workstation. For the C++ exe I used -O3, MKL BLAS and had ARMA_NO_DEBUG defined. I multiplied 1000x1000 random matrices 100 times and averaged the multiplication times.
The C++ implementation was roughly 1.5 times faster than matlab.
Hope this helps
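The measurement described above (repeat the operation many times and average) can be sketched with a small timing helper. This is an illustrative helper, not the code used for the comparison: the name mean_seconds is hypothetical, and it relies only on the standard library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `work` `reps` times and returns the mean wall-clock time per run,
// in seconds. steady_clock is used because it never jumps backwards.
template <typename F>
double mean_seconds(F&& work, std::size_t reps) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < reps; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(reps);
}
```

With Armadillo, the callable would be something like a lambda that evaluates C = A * B on two pre-generated arma::mat objects, called with reps = 100. Generating the random matrices outside the timed lambda keeps their construction cost out of the average.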
is there a way to extract Matlab linear algebra libraries somehow and use them in C++? Yes. To call a Matlab function from C++, refer to this link: How to Call Matlab Functions from C++
Several C++ libraries for linear algebra provide an easy way to link with highly optimized libraries.
Look at http://software.intel.com/en-us/articles/intelr-mkl-and-c-template-libraries
You should be able to link Armadillo to the MKL for more performance, but it's a commercial package.

Libraries for parallel distributed cholesky decomposition in c/c++ in mpi environment?

What libraries are available for parallel distributed cholesky decomposition of dense matrices in C/C++ in mpi environment?
I've found the ScaLAPACK library, and this might be the solution I'm looking for. It seems that it's a bit fiddly to call though, lots of Fortran <-> C conversions to do, which makes me think that maybe it is not widely used, and therefore maybe there are some other libraries that are used instead?
Alternatively, are there some wrappers for ScaLAPACK that make it relatively not too painful to use in a C or C++ environment, when one is already using MPI, and MPI has already been initialized in the program?
Are these dense or sparse matrices?
Trilinos is a huge library for parallel scientific computation. The sub-package Amesos can link to Scalapack for parallel, direct solution of dense systems and to UMFPACK, SuperLU or MUMPS for sparse systems. Trilinos is mostly in C++, but there are Python bindings if that's your taste. It might be overkill, but it'll get the job done.
Intel MKL might also be a choice, since it calls ScaLAPACK on the inside. Note that Intel supports student use of this library, but in this case you have to use an open source MPI version. Also the Intel forum is very helpful.
Elemental is also an option. It is written in C++, which is surely a big advantage when you want to integrate it with your C/C++ application, and the project leader, Jack Poulson, is very friendly and helps a lot.
OpenBLAS, SuperLU and PETSc are also interesting and you may want to read more in my answer.

What's a good C++ library for matrix operations

I need to do multiplication on matrices. I'm looking for a library that can do it fast. I'm using the Visual C++ 2008 compiler and I have a core i7 860 so if the library is optimized for my configuration it's perfect.
FWIW, Eigen 3 uses threads (OpenMP) for matrix products (in reply to above statement about Eigen not using threads).
BLAS is a de facto Fortran standard for all basic linear algebra operations (essentially multiplications of matrices and vectors). There are numerous implementations available. For instance:
ATLAS is free and supposedly self-optimizing. You need to compile it yourself though.
Goto BLAS is maintained by Kazushige Goto at TACC. He is very good at getting the last performance bit out of modern processors. It is only for academic use though.
Intel MKL provides optimised BLAS for Intel processors. It is not free, even for academic use.
Then, you may want to use a C++ wrapper, for instance boost::ublas.
If you program on distributed systems, there are PBLAS and ScaLAPACK which enable the use of message passing for distributed linear algebra operations. On a multicore machine, usually implementations of BLAS (at least Intel MKL) use threads for large enough matrices.
If you want more advanced linear algebra routines (eigenvalues, linear systems, least square, ...), then there is the other de facto Fortran standard LAPACK. To my knowledge, there is nothing to integrate it elegantly with C++ other than calling the bare Fortran routines. You have to write some wrappers to hide the Fortran calls and provide a sound type-checking implementation.
Look into Eigen. It should have all you need.
I have had good experience with Boost's uBLAS. It's a nice option if you're already using Boost.
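You can use the GNU Scientific Library (GSL).
Here's a page describing the matrix operations available in the library, including element-wise multiplication (gsl_matrix_mul_elements()). Note that this function multiplies the matrices element by element; for a true matrix product, GSL provides gsl_blas_dgemm() in its BLAS interface: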
http://www.gnu.org/software/gsl/manual/html_node/Matrix-operations.html
And here are some links to get you started with using GSL with visual studio:
http://gladman.plushost.co.uk/oldsite/computing/gnu_scientific_library.php
http://www.quantcode.com/modules/smartfaq/faq.php?faqid=33
GDI+ can't compete with scientific libraries (its Matrix class only represents a 3x3 affine transformation), but with Visual C++ it is at hand:
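#include <windows.h>
#include <gdiplus.h>
#pragma comment (lib,"Gdiplus.lib")
using namespace Gdiplus;

int main()
{
    // Start GDI+. The startup input is a named object because taking
    // the address of a temporary is not standard C++.
    ULONG_PTR gpToken = 0;
    GdiplusStartupInput gpInput;
    GdiplusStartup(&gpToken, &gpInput, NULL);
    {
        // GDI+ objects must be destroyed before GdiplusShutdown,
        // so they live in their own scope.
        Matrix A;
        A.Translate(10, 20);
        Matrix B;
        B.Rotate(35.0f);
        A.Multiply(&B);
        if (A.IsInvertible())
            A.Invert();
        if (!A.IsIdentity())
            A.RotateAt(120.0f, PointF(10, 10));
        // Read back the six affine coefficients.
        REAL elements[6];
        A.GetElements(elements);
    }
    // Shut GDI+ down.
    GdiplusShutdown(gpToken);
    return 0;
}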
So with this you can easily clear the matrix multiplication hurdle (on Windows).
GdiPlus Matrix Documentation
For more recent versions of Visual Studio, you can use ScaLAPACK + MKL.
Sample code is provided here, with a tutorial on how to make it run.
http://code.msdn.microsoft.com/Using-ScaLAPACK-on-Windows-d16a5e76#content
There's an option to implement this yourself, perhaps using std::valarray because that may be parallelised using OpenMP: gcc certainly has such a version, MSVC++ probably does too.
Otherwise, the following tricks: one of the matrices should be transposed. Then you have:
AB[i,j] = Sum(k) A[i,k] B^t [j,k]
where you're scanning contiguous memory in both operands. If you have 8 cores you can fairly easily divide the set of [i,j] indices into 8, and give each core 1/8 of the total job. To make it even faster you can use vector multiply instructions; most compilers provide special functions (intrinsics) for this. The result won't be as fast as a tuned library but it should be OK.
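A minimal sketch of that trick, using only the standard library. It assumes square row-major matrices stored in flat vectors and that the caller has already transposed B. The function name and the row-band split are illustrative choices, and no vector instructions are used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Computes C = A * B for row-major n x n matrices stored in flat vectors.
// `bt` holds B transposed, i.e. bt[j*n + k] == B[k][j], so the inner loop
// walks both operands through contiguous memory.
std::vector<double> multiply_transposed(const std::vector<double>& a,
                                        const std::vector<double>& bt,
                                        std::size_t n,
                                        unsigned num_threads) {
    assert(a.size() == n * n && bt.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    num_threads = std::max(1u, num_threads);

    // Each worker fills a contiguous band of rows of C; bands never
    // overlap, so no locking is needed.
    auto worker = [&](std::size_t row_begin, std::size_t row_end) {
        for (std::size_t i = row_begin; i < row_end; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k) {
                    sum += a[i * n + k] * bt[j * n + k];
                }
                c[i * n + j] = sum;
            }
        }
    };

    // Split the rows as evenly as possible (rounding up per thread).
    const std::size_t rows_per_thread = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < n; begin += rows_per_thread) {
        pool.emplace_back(worker, begin, std::min(n, begin + rows_per_thread));
    }
    for (auto& t : pool) {
        t.join();
    }
    return c;
}
```

Splitting by rows of C rather than by arbitrary [i,j] pairs keeps each thread's writes contiguous. With GCC or Clang on Linux the program may need to be built with -pthread.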
If you're doing longer calculations such as polynomial evaluation, a threading evaluator which also has thread support (gak, two kinds of threads) will do a good job even though it won't do low-level tuning. If you really want to do stuff fast, you have to use a properly tuned library like ATLAS, but then, you probably wouldn't be running Windows if you were serious about HPC.