I know that Blitz++ gets its performance gains from extensive use of expression templates and template metaprograms. But at some point you can't get more out of your code with these techniques: you have to multiply and sum some floats. At that point you can get a final performance boost by using the highly optimized (especially for specific architectures) BLAS routines. Does the current implementation of Blitz++ use BLAS routines whenever possible?
Only for benchmarks. You must specify it when you configure Blitz++:
./configure --with-blas=...
Blitz++ does not use BLAS routines.
I am developing a linear algebra tool in C++, which relies heavily on matrix multiplication and decompositions (like LU, SVD), and is meant to be applied to large matrices. I developed it using Intel MKL for peak performance, but I don't want to release an Intel MKL only version, as I assume it will not work for people without Intel or who don't want to install MKL. Instead, I should release a more general code that is not Intel MKL-specific, but rather allows the user to specify which implementation of BLAS and LAPACK they would like to use (e.g. OpenBLAS, or ATLAS).
Although the function prototypes seem to be the same across implementations, there are several (helper?) functions and types that are specific to Intel MKL. For example, there is the MKL_INT type that I use, and also the mkl_malloc. This article suggests using macros to redefine the types, which was also my first thought. I assume I would also then have macros for the headers as well.
I believe it is standard for code to be written such that it is agnostic to the BLAS/LAPACK implementation, and I wanted to know if there was a cleaner way than relying on macros--particularly since the latter would require recompiling the code to switch, which does not seem to be necessary for other tools I have used.
Most scientific codes that rely on BLAS/LAPACK calls are implementation-agnostic. They usually require that the library is just linked as appropriate.
You've commented that the function prototypes are the same across implementations. This allows you to just have the prototypes in some myblas.h and mylapack.h headers then link whichever library you'd like to use.
It sounds like your primary concern is the implementation-specific stuff that you've utilized for MKL. The solution is to just not use this stuff. For example, the MKL types like MKL_INT are not special. They are C datatypes that have been defined so that code can generalize across the LP32/LP64/ILP64 libraries which MKL provides. See this table.
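As a rough illustration of replacing MKL_INT, here is a minimal sketch of a project-owned integer alias. The names `blas_int`, `to_blas_int` and the build switch `MYLIB_BLAS_ILP64` are hypothetical, made up for this example, and do not come from MKL or any other library.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical project-wide integer type for BLAS/LAPACK dimension arguments.
#if defined(MYLIB_BLAS_ILP64)      // hypothetical build switch
using blas_int = std::int64_t;     // library built with 64-bit integers
#else
using blas_int = std::int32_t;     // library built with 32-bit integers
#endif

// Convert a container size to blas_int, reporting overflow instead of
// silently truncating.
inline blas_int to_blas_int(std::size_t n) {
    const auto limit =
        static_cast<std::uintmax_t>(std::numeric_limits<blas_int>::max());
    if (static_cast<std::uintmax_t>(n) > limit)
        throw std::overflow_error("dimension does not fit in blas_int");
    return static_cast<blas_int>(n);
}
```

The alias has to match the integer width of whichever library you link, so the switch would normally be set by the build system rather than by hand.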
Also, stuff like mkl_malloc isn't special. It was introduced before the C standard had a thread-safe aligned alloc. In fact, that is all mkl_malloc is. So instead, just use aligned_alloc, or if you don't want to commit to C11 use _mm_malloc, memalign, etc...
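For instance, a portable stand-in for mkl_malloc could look roughly like the sketch below. It assumes a C++17 standard library that provides `std::aligned_alloc` (MSVC does not; there `_aligned_malloc` and `_aligned_free` play that role). The names `FreeDeleter`, `AlignedDoubles` and `make_aligned_doubles` are hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Deleter so std::unique_ptr releases the buffer with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using AlignedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Allocate `count` doubles aligned to `alignment` bytes (a power of two).
inline AlignedDoubles make_aligned_doubles(std::size_t count,
                                           std::size_t alignment = 64) {
    // aligned_alloc expects the size to be a multiple of the alignment.
    std::size_t bytes = count * sizeof(double);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    if (bytes == 0) bytes = alignment;
    void* p = std::aligned_alloc(alignment, bytes);
    if (!p) throw std::bad_alloc();
    return AlignedDoubles(static_cast<double*>(p));
}
```

Wrapping the pointer in a `std::unique_ptr` means the buffer is freed automatically, which removes the need for a matching mkl_free-style call.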
On the other hand, MKL does provide some useful extensions to BLAS/LAPACK which aren't standardized (like transpositions, for example). However, this type of stuff is usually easy to implement with a special case BLAS/LAPACK call or easy enough to implement by yourself. MKL also has internal threading if you choose to use it, however, many BLAS/LAPACK libraries offer this.
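To give an idea of how little is needed to implement such an extension yourself, here is a minimal sketch of an out-of-place transpose for column-major storage (the layout BLAS/LAPACK expect). The function name `transpose` is just an illustration, and the loop is not cache-blocked or otherwise tuned.

```cpp
#include <cstddef>
#include <vector>

// Out-of-place transpose of a column-major rows x cols matrix.
// Returns a column-major cols x rows matrix.
inline std::vector<double> transpose(const std::vector<double>& a,
                                     std::size_t rows, std::size_t cols) {
    std::vector<double> t(a.size());
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            t[j + i * cols] = a[i + j * rows];  // t(j, i) = a(i, j)
    return t;
}
```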
I'm exploring the Armadillo C++ library for linear algebra at the moment. As far as I understood it uses LAPACK/BLAS library for basic matrix operations (e.g. matrix multiplication). As a Windows user I downloaded LAPACK/BLAS from here: http://icl.cs.utk.edu/lapack-for-windows/lapack/#running. The problem is that matrix multiplications are very slow comparing to Matlab or even R. For example, Matlab multiplies two 1000x1000 matrices in ~0.15 seconds on my computer, R needs ~1 second, while C++/Armadillo/LAPACK/BLAS needs more than 10 seconds for that.
So, Matlab is based on highly optimized libraries for linear algebra. My question is whether there exists a faster LAPACK/BLAS library to use from Armadillo. Alternatively, is there a way to extract Matlab linear algebra libraries somehow and use them in C++?
LAPACK doesn't do matrix multiplication. It's BLAS that provides matrix multiplication.
If you have a 64 bit operating system, I recommend to first try a 64 bit version of BLAS. This will get you an immediate doubling of performance.
Secondly, have a look at a high-performance implementation of BLAS, such as OpenBLAS. OpenBLAS uses both vectorisation and parallelisation (ie. multi-core). It is a free (no cost) open source project.
Matlab internally uses the Intel MKL library, which you can also use with the Armadillo library. Intel MKL is closed source, but is free for non-commercial use. Note that OpenBLAS can obtain matrix multiplication performance that is on par with or better than Intel MKL.
Note that high performance linear algebra is generally easier to accomplish on Linux and Mac OS X than on Windows.
Adding to what has already been said, you should also use a high level of optimization:
Be sure to use either the O2 or the O3 compiler flag.
Link to the above-mentioned high-performance (and possibly multi-threaded) BLAS libraries. AFAIK MKL is only freely available for Unix platforms, though; if you're using a Linux-like environment such as Cygwin inside Windows, this should be OK, I guess. OpenBLAS is also multi-threaded.
In many libraries, setting the symbol NDEBUG (e.g. passing the compiler flag -DNDEBUG) turns off costly range checking and assertions. Armadillo has its own symbol, called ARMA_NO_DEBUG, which you can either set manually, or you can edit the config.hpp header file (located in the armadillo include directory) and uncomment the corresponding line. I am guessing since you were able to turn on external BLAS usage in armadillo, you should be familiar with this config file anyways...
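To show what such a switch removes, here is a generic sketch of an accessor whose range check disappears when NDEBUG is defined. `CheckedVector` and `sum_all` are hypothetical names, and this is not how Armadillo implements its own ARMA_NO_DEBUG checks; it only illustrates the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector wrapper with a debug-only bounds check.
class CheckedVector {
public:
    explicit CheckedVector(std::size_t n) : data_(n, 0.0) {}

    // The assert is compiled out entirely when NDEBUG is defined,
    // e.g. by passing -DNDEBUG to the compiler.
    double& operator()(std::size_t i) {
        assert(i < data_.size() && "index out of range");
        return data_[i];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<double> data_;
};

// Sum all elements; in a debug build every access pays for the check.
inline double sum_all(CheckedVector& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v(i);
    return total;
}
```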
I did a quick comparison between armadillo/MKL_BLAS and Matlab on my intel core-i7 workstation. For the C++ exe I used -O3, MKL BLAS and had ARMA_NO_DEBUG defined. I multiplied 1000x1000 random matrices 100 times and averaged the multiplication times.
The C++ implementation was roughly 1.5 times faster than matlab.
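A timing harness along these lines can be sketched as follows. This is not the code used for the comparison above: the naive `multiply` is only a stand-in for the Armadillo/MKL call, and all function names are made up for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Fill an n x n row-major matrix with uniform random values in [0, 1).
inline std::vector<double> random_matrix(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> m(n * n);
    for (double& x : m) x = dist(gen);
    return m;
}

// Naive triple loop; a stand-in for the library call being measured.
inline void multiply(const std::vector<double>& a, const std::vector<double>& b,
                     std::vector<double>& c, std::size_t n) {
    std::fill(c.begin(), c.end(), 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Run `run` `repeats` times and return the mean wall-clock seconds per run.
template <class F>
double average_seconds(F&& run, int repeats) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r) run();
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

Usage would be something like `average_seconds([&] { multiply(a, b, c, n); }, 100)` with n = 1000, replacing the lambda body with the library call you actually want to time.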
Hope this helps
is there a way to extract Matlab linear algebra libraries somehow and use them in C++?
Yes. To call Matlab functions from C++, refer to this link: How to Call Matlab Functions from C++
Several C++ libraries for linear algebra provide an easy way to link with highly optimized libraries.
Look at http://software.intel.com/en-us/articles/intelr-mkl-and-c-template-libraries
You should be able to link Armadillo to MKL for more performance, but it's a commercial package.
Most of the BLAS Level 1 API can be written in a straightforward way using Fortran 9x+ vectorized assignments and intrinsic procedures.
Assuming you are using a modern optimizing compiler, like Intel Fortran, and correct target-specific compiler optimization options, are there any performance benefits from using BLAS Level 1 procedures instead, say from Intel MKL or other fast BLAS implementations?
If there are, what is a typical vector size when these benefits appear?
It depends. We've tested this before with the Intel compiler and run into surprising results. For example, DOT_PRODUCT from Fortran vs. the BLAS implementation gave different trends based on the problem size. As the number of elements in the arrays got larger, BLAS became better than the intrinsic. But for small problem sizes, the intrinsic was much faster.
We actually measured for our use cases what the cut-off size that's required to make one better than the other and actually use if-statements to decide which to call. I can't share those results, but I encourage you to test it out yourself. There is still benefit from using BLAS.
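The if-statement dispatch described above might look roughly like this, sketched in C++ rather than Fortran. The cut-off value is a placeholder, not a measured result, and `dot_large` merely stands in for a tuned library call such as BLAS ddot so that the example is self-contained.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Plain loop; stands in for the compiler-generated (intrinsic) path.
inline double dot_small(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// Stand-in for a tuned library routine; in real code this would call BLAS.
inline double dot_large(const double* x, const double* y, std::size_t n) {
    return std::inner_product(x, x + n, y, 0.0);
}

// Hypothetical cut-off: measure on your own machine and compiler.
constexpr std::size_t kDotCutoff = 4096;

// Pick the implementation based on the problem size.
inline double dot(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    return n < kDotCutoff ? dot_small(x.data(), y.data(), n)
                          : dot_large(x.data(), y.data(), n);
}
```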
Are there free C/C++ libraries that do the types of functions that Matlab does? I mean something complicated, like a discrete Laplacian, etc. Is the best option to create some kind of interface in Matlab and build my own library?
Thanks
Have you looked at Boost.Math?
http://www.boost.org/doc/libs/1_46_1/libs/math/doc/html/index.html
If you are on windows, there is a very easy to use installer by BoostPro:
http://www.boostpro.com/download/
If you want something that was a matlab clone but free, you could use Octave http://www.gnu.org/software/octave/
I haven't used it in a C++ program, but it apparently has a C++ API:
http://octave.sourceforge.net/doxygen/html/index.html
Depending on what you want to do there are various packages available.
Arbitrary Precision
mostly integers: GMP, MPIR (similar codebases, MPIR has VC builds)
floats: MPFR
complex: MPC
Specialist:
Number Theory: Flint
Linear Algebra: Boost Numeric uBLAS
PDEs: libMesh
Computational Fluid Dynamics: OpenFoam
Graph Theory: Boost Graph
General:
TNT (was LAPACK++; TNT = do everything, LAPACK++ = linear algebra)
SciMath (Commercial)
GNU Scientific Library
and that's just a few. I haven't repeated ones others have listed like libpari.
Just in case you're wondering, Maple, Mathematica, Matlab etc all use the GNU MP for their arbitrary precision calculations.
PARI could be a good choice, although I am not familiar with using it:
Official Site for PARI
PARI is a C library, and if you want an independent software, they have PARI-GP there.
Below is the description of PARI on the website above:
PARI/GP is a widely used computer algebra system designed for fast computations in number theory (factorizations, algebraic number theory, elliptic curves...), but also contains a large number of other useful functions to compute with mathematical entities such as matrices, polynomials, power series, algebraic numbers etc., and a lot of transcendental functions. PARI is also available as a C library to allow for faster computations.
Hope this could be useful!
P.S. It is said that Octave functions could be called from C++, and that could be an excellent substitution for MATLAB.
Have a look at armadillo for simplifying your handling of matrices. Then for solving PDEs you'll have to do the job yourself, ie. construct explicitly your Laplacian matrix, and solve it the way you want.
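As a minimal sketch of "doing the job yourself", the code below builds the 1D discrete Laplacian with Dirichlet boundaries as three diagonals and solves it with the Thomas algorithm. It uses plain std::vector rather than Armadillo, and the names `Tridiagonal`, `laplacian_1d` and `solve_tridiagonal` are invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Tridiagonal matrix stored by diagonals: row i reads
// lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1].
// lower[0] and upper[n-1] are unused.
struct Tridiagonal {
    std::vector<double> lower, diag, upper;
};

// Discretization of -d^2/dx^2 on n interior grid points with spacing h
// and zero (Dirichlet) boundary values: stencil (-1, 2, -1) / h^2.
inline Tridiagonal laplacian_1d(std::size_t n, double h) {
    const double s = 1.0 / (h * h);
    return {std::vector<double>(n, -s), std::vector<double>(n, 2.0 * s),
            std::vector<double>(n, -s)};
}

// Thomas algorithm: forward elimination, then back substitution.
inline std::vector<double> solve_tridiagonal(Tridiagonal t,
                                             std::vector<double> rhs) {
    const std::size_t n = rhs.size();
    std::vector<double> x(n);
    if (n == 0) return x;
    for (std::size_t i = 1; i < n; ++i) {
        const double w = t.lower[i] / t.diag[i - 1];
        t.diag[i] -= w * t.upper[i - 1];
        rhs[i] -= w * rhs[i - 1];
    }
    x[n - 1] = rhs[n - 1] / t.diag[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)
        x[i] = (rhs[i] - t.upper[i] * x[i + 1]) / t.diag[i];
    return x;
}
```

For example, with n = 99, h = 0.01 and a right-hand side of all ones, this approximates the solution of -u'' = 1 on (0, 1) with u(0) = u(1) = 0. In 2D or 3D the matrix is no longer tridiagonal and you would want a sparse solver instead.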
There is Intel MKL too (not free though) which adds some value: iterative solvers (GMRES, BCG) and some black-boxes for solving the Laplacian / Poisson equation on simple domains (cubes and spheres).
I use OpenCV for a lot of image processing and matrix manipulation, which is generally what I use matlab for.
http://opencv.willowgarage.com/wiki/
It may be overkill depending on what kind of math you're trying to do, but it's great for computer vision.
The GNU Scientific Library is a free numerical library for C and C++ programmers.
With the Coder toolbox (requires MATLAB R2011a), you can also turn your MATLAB code into C or C++.
you can use octave runtime:
http://en.wikipedia.org/wiki/GNU_Octave#C.2B.2B_Integration
I need to do multiplication on matrices. I'm looking for a library that can do it fast. I'm using the Visual C++ 2008 compiler and I have a core i7 860 so if the library is optimized for my configuration it's perfect.
FWIW, Eigen 3 uses threads (OpenMP) for matrix products (in reply to above statement about Eigen not using threads).
BLAS is a de facto Fortran standard for all basic linear algebra operations (essentially multiplications of matrices and vectors). There are numerous implementations available. For instance:
ATLAS is free and supposedly self-optimizing. You need to compile it yourself though.
Goto BLAS is maintained by Kazushige Goto at TACC. He is very good at getting the last performance bit out of modern processors. It is only for academic use though.
Intel MKL provides optimised BLAS for Intel processors. It is not free, even for academic use.
Then, you may want to use a C++ wrapper, for instance boost::ublas.
If you program on distributed systems, there are PBLAS and ScaLAPACK which enable the use of message passing for distributed linear algebra operations. On a multicore machine, usually implementations of BLAS (at least Intel MKL) use threads for large enough matrices.
If you want more advanced linear algebra routines (eigenvalues, linear systems, least square, ...), then there is the other de facto Fortran standard LAPACK. To my knowledge, there is nothing to integrate it elegantly with C++ other than calling the bare Fortran routines. You have to write some wrappers to hide the Fortran calls and provide a sound type-checking implementation.
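A wrapper of the kind described could be sketched as below for the BLAS dgemm routine. The `dgemm_fn` signature follows the usual Fortran-to-C convention (all arguments by pointer, column-major storage); the trailing-underscore symbol name `dgemm_` mentioned afterwards is an assumption that depends on your compiler and library. `DenseMatrix`, `gemm_product` and `reference_dgemm` are hypothetical names, and the reference backend only exists so that the sketch runs without a BLAS library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Calling convention commonly used for Fortran BLAS dgemm from C/C++.
using dgemm_fn = void (*)(const char* transa, const char* transb,
                          const int* m, const int* n, const int* k,
                          const double* alpha, const double* a, const int* lda,
                          const double* b, const int* ldb,
                          const double* beta, double* c, const int* ldc);

// Small column-major matrix type that carries its own dimensions.
struct DenseMatrix {
    int rows, cols;
    std::vector<double> data;
    DenseMatrix(int r, int c)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, 0.0) {}
    double& operator()(int i, int j) {
        return data[i + static_cast<std::size_t>(j) * rows];
    }
    double operator()(int i, int j) const {
        return data[i + static_cast<std::size_t>(j) * rows];
    }
};

// Type-checked wrapper: returns A * B and hides the pointer-heavy call.
inline DenseMatrix gemm_product(dgemm_fn gemm, const DenseMatrix& a,
                                const DenseMatrix& b) {
    if (a.cols != b.rows) throw std::invalid_argument("inner dimensions differ");
    DenseMatrix c(a.rows, b.cols);
    const char no = 'N';
    const double one = 1.0, zero = 0.0;
    gemm(&no, &no, &a.rows, &b.cols, &a.cols, &one, a.data.data(), &a.rows,
         b.data.data(), &b.rows, &zero, c.data.data(), &c.rows);
    return c;
}

// Reference backend with the same signature (no-transpose case only).
inline void reference_dgemm(const char* /*transa*/, const char* /*transb*/,
                            const int* m, const int* n, const int* k,
                            const double* alpha, const double* a, const int* lda,
                            const double* b, const int* ldb,
                            const double* beta, double* c, const int* ldc) {
    for (int j = 0; j < *n; ++j)
        for (int i = 0; i < *m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < *k; ++p) sum += a[i + p * *lda] * b[p + j * *ldb];
            c[i + j * *ldc] = *alpha * sum + *beta * c[i + j * *ldc];
        }
}
```

With a real library you would declare the routine as `extern "C"` with this signature and pass it in place of `reference_dgemm`; the calling code stays the same whichever BLAS you link.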
Look into Eigen. It should have all you need.
I have had good experience with Boost's uBLAS. It's a nice option if you're already using Boost.
You can use the GNU Scientific Library(GSL).
Here's a page describing the matrix operations available in the library. Note that gsl_matrix_mul_elements() multiplies element by element; the true matrix product is provided through GSL's BLAS interface, gsl_blas_dgemm():
http://www.gnu.org/software/gsl/manual/html_node/Matrix-operations.html
And here are some links to get you started with using GSL with visual studio:
http://gladman.plushost.co.uk/oldsite/computing/gnu_scientific_library.php
http://www.quantcode.com/modules/smartfaq/faq.php?faqid=33
GDI+ can't compete with scientific libraries, but with Visual C++ it is readily at hand:
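#include <windows.h>
#include <gdiplus.h>
#pragma comment (lib,"Gdiplus.lib")
using namespace Gdiplus;

int main()
{
    // Start GDI+; the startup input must be a named object, not a temporary.
    ULONG_PTR gpToken = 0;
    GdiplusStartupInput gpInput;
    if (GdiplusStartup(&gpToken, &gpInput, NULL) != Ok)
        return 1;
    {
        // Inner scope so both matrices are destroyed before GdiplusShutdown.
        Matrix A;
        A.Translate(10.0f, 20.0f);
        Matrix B;
        B.Rotate(35.0f);
        A.Multiply(&B);   // A becomes the product of A and B
        if (A.IsInvertible())
            A.Invert();
        if (!A.IsIdentity())
            A.RotateAt(120.0f, PointF(10.0f, 10.0f));
        // Read back the six coefficients of the affine matrix.
        REAL elements[6];
        A.GetElements(elements);
    }
    // Stop GDI+.
    GdiplusShutdown(gpToken);
    return 0;
}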
So with this you can easily clear the matrix multiplication hurdle (on Windows).
GdiPlus Matrix Documentation
For more recent versions of Visual Studio, you can use ScaLAPACK + MKL.
A code sample is provided here, with a tutorial on how to make it run:
http://code.msdn.microsoft.com/Using-ScaLAPACK-on-Windows-d16a5e76#content
There's an option to implement this yourself, perhaps using std::valarray because that may be parallelised using OpenMP: gcc certainly has such a version, MSVC++ probably does too.
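A do-it-yourself version with std::valarray might look like the sketch below, which takes rows and columns as slices. Whether any of it actually runs in parallel depends entirely on your standard library implementation; the sketch itself makes no such claim, and `valarray_multiply` is a made-up name.

```cpp
#include <cstddef>
#include <valarray>

// C = A * B for square n x n row-major matrices stored in std::valarray.
inline std::valarray<double> valarray_multiply(const std::valarray<double>& a,
                                               const std::valarray<double>& b,
                                               std::size_t n) {
    std::valarray<double> c(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        // Row i of A: n contiguous elements starting at i * n.
        const std::valarray<double> row = a[std::slice(i * n, n, 1)];
        for (std::size_t j = 0; j < n; ++j) {
            // Column j of B: n elements starting at j with stride n.
            const std::valarray<double> col = b[std::slice(j, n, n)];
            c[i * n + j] = (row * col).sum();
        }
    }
    return c;
}
```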
Otherwise, the following tricks: one of the matrices should be transposed. Then you have:
AB[i,j] = Sum(k) A[i,k] B^t [j,k]
where you're scanning contiguous memory. If you have 8 cores you can fairly easily divide the set of [i,j] indices into 8, and give each core 1/8 of the total job. To make it even faster you can use vector multiply instructions, most compilers will provide a special function for this. The result won't be as fast as a tuned library but it should be OK.
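The transposition trick and the split across cores can be sketched like this. The OpenMP pragma is only active if the compiler is told to enable OpenMP (for example -fopenmp with gcc or /openmp with MSVC); otherwise it is ignored and the loop runs serially. `multiply_transposed` is a made-up name, and no explicit vector instructions are used here.

```cpp
#include <cstddef>
#include <vector>

// C = A * B for n x n row-major matrices, using a transposed copy of B so
// that the inner loop walks both operands through contiguous memory.
inline std::vector<double> multiply_transposed(const std::vector<double>& a,
                                               const std::vector<double>& b,
                                               int n) {
    const std::size_t N = static_cast<std::size_t>(n);
    std::vector<double> bt(N * N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t j = 0; j < N; ++j)
            bt[j * N + k] = b[k * N + j];  // Bt[j][k] = B[k][j]

    std::vector<double> c(N * N);
    // Rows of C are independent, so they can be divided among the cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const double* a_row = a.data() + static_cast<std::size_t>(i) * N;
        for (std::size_t j = 0; j < N; ++j) {
            const double* bt_row = bt.data() + j * N;
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k) sum += a_row[k] * bt_row[k];
            c[static_cast<std::size_t>(i) * N + j] = sum;
        }
    }
    return c;
}
```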
If you're doing longer calculations such as polynomial evaluation, a threading evaluator which also has thread support (gak, two kinds of threads) will do a good job even though it won't do low-level tuning. If you really want to do stuff fast, you have to use a properly tuned library like ATLAS, but then, you probably wouldn't be running Windows if you were serious about HPC.