C++ container and OpenCL

I am looking into some C++ code which extensively uses boost::multi_array<double>.
The next step is to port the code to use OpenCL. As I'm quite new to OpenCL, I don't know what I should do with the multi_array: should I rewrite it as a nested OpenCL vector or a nested C array?
What would you do?

There are Boost-like libraries that already exist for OpenCL; you may want to look at the following libraries from GPU vendors.
Thrust from NVIDIA: Thrust is a powerful library of parallel algorithms and data structures. Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity. Using Thrust, C++ developers can write just a few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations orders of magnitude faster than the latest multi-core CPUs. For example, the thrust::sort algorithm delivers 5x to 100x faster sorting performance than STL and TBB (see the sketch after this list).
AMD offers the Accelerated Parallel Processing Math Libraries (APPML): software libraries containing FFT and BLAS functions written in OpenCL and designed to run on AMD GPUs. For more info, look here:
http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx
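
As a minimal illustration of the kind of code the Thrust blurb above describes (a hedged sketch, not from the original answer; it assumes a CUDA toolchain and is compiled with nvcc):

    // Sort a million random doubles on the GPU with thrust::sort.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
        thrust::host_vector<double> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = std::rand() / double(RAND_MAX);

        thrust::device_vector<double> d = h;         // copy host -> device
        thrust::sort(d.begin(), d.end());            // GPU-accelerated sort
        thrust::copy(d.begin(), d.end(), h.begin()); // copy device -> host
        return 0;
    }

Note that Thrust targets CUDA rather than OpenCL, so this route only applies if an NVIDIA-only path is acceptable.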

Related

Using MPI with Armadillo C++ for parallel diagonalization

There has been a post regarding usage of MPI with Armadillo in C++: here
My question is whether LAPACK and OpenBLAS implement MPI support.
I could not find anything about this in their documentation so far.
There seems to be a different library called ScaLAPACK, which uses MPI. Is this library compatible with Armadillo? Is it any different from LAPACK in use?
I need to diagonalise extremely large matrices (over 1TB of memory), therefore I need to spread the memory over multiple nodes on a cluster using MPI, but I don't know how to deal with that using Armadillo.
Do you have any useful tips/references on how to do that?
Any BLAS implementation is single-process; some implementations do multi-threading. So MPI has nothing to do with this: in an MPI run, each process calls a non-distributed BLAS routine.
ScaLAPACK is distributed-memory, based on MPI. It is very different from LAPACK: matrix handling is considerably more complicated. Some applications/libraries are able to use ScaLAPACK, but you cannot simply swap LAPACK out for ScaLAPACK; support for ScaLAPACK needs to be added explicitly.
Armadillo mentions threading support through OpenMP, with no mention of MPI. Therefore, you cannot use Armadillo across multiple nodes.
If you want to do distributed eigenvalue calculations, take a look at the PETSc library and the SLEPc package on top of it. Those are written in C, so they can easily (though not entirely idiomatically) be used from C++.
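
To give a feel for what that looks like in practice, here is a hedged sketch of a distributed eigensolve with SLEPc (error checking is omitted, and the tridiagonal test matrix is an assumption standing in for your real operator):

    // Solve A x = lambda x for a distributed matrix with SLEPc.
    #include <slepceps.h>

    int main(int argc, char **argv)
    {
        SlepcInitialize(&argc, &argv, NULL, NULL); // also sets up PETSc and MPI

        // Assemble a distributed tridiagonal test matrix (1-D Laplacian).
        Mat A;
        PetscInt n = 100, Istart, Iend;
        MatCreate(PETSC_COMM_WORLD, &A);
        MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
        MatSetFromOptions(A);
        MatSetUp(A);
        MatGetOwnershipRange(A, &Istart, &Iend); // this rank's row range
        for (PetscInt i = Istart; i < Iend; i++) {
            if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
            if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
            MatSetValue(A, i, i, 2.0, INSERT_VALUES);
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        // Set up and run the Hermitian eigensolver.
        EPS eps;
        EPSCreate(PETSC_COMM_WORLD, &eps);
        EPSSetOperators(eps, A, NULL);
        EPSSetProblemType(eps, EPS_HEP);
        EPSSetFromOptions(eps); // solver type, tolerances etc. from the command line
        EPSSolve(eps);

        PetscInt nconv;
        EPSGetConverged(eps, &nconv);
        PetscPrintf(PETSC_COMM_WORLD, "converged eigenpairs: %d\n", (int)nconv);

        EPSDestroy(&eps);
        MatDestroy(&A);
        SlepcFinalize();
        return 0;
    }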

Using libraries like Boost in CUDA device code

I am learning CUDA at the moment and I am wondering if it is possible to use functions from different libraries and APIs, like Boost, in CUDA device code.
Note: I tried using std::cout and that did not work. I got printf working after changing the code generation to compute_20,sm_20. I'm using Visual Studio 2010, CUDA 5.0, and an NVIDIA GTX 570; Nsight is installed.
Here's an answer, and here's the CUDA documentation about language support. Boost won't work in device code, for sure.
Since the purpose of using CUDA is to speed up kernels in your code, you'll typically want to limit the language complexities used, because they add overhead. This means you'll typically stay very close to plain C, with just a few sprinkles of C++ where that's really handy.
Constructs in for example Boost can result in large amounts of assembly code (and C++ in general has been criticised for this and is a reason to not use certain constructs in real-time software). This is all fine for most applications, but not for kernels you'll want to run on a GPU, where every instruction counts.
For CUDA (or OpenCL), people typically write intense algorithms that work on data in arrays, for example specialized image processing. You only use these techniques for the calculation-intensive tasks of your application. Then you have a 'regular' program that interacts with the user/network/database, creates these CUDA tasks (i.e. selects the data and parameters), and starts them. Here are the CUDA samples.
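
To illustrate the printf point from the question, here is a minimal sketch (it assumes compute capability 2.0 or higher, i.e. compile with something like nvcc -arch=sm_20):

    // Device-side printf works on sm_20+; std::cout does not work in device code.
    #include <cstdio>

    __global__ void hello()
    {
        printf("hello from thread %d\n", threadIdx.x);
    }

    int main()
    {
        hello<<<1, 4>>>();       // launch 4 threads
        cudaDeviceSynchronize(); // wait and flush device printf output
        return 0;
    }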
Boost uses the expression template technique to avoid losing performance while enabling a simpler syntax.
BlueBird and Newton are libraries that use metaprogramming, similarly to Boost, to enable CUDA computations.
ArrayFire is another library, using just-in-time compilation and exploiting the CUDA language underneath.
Finally, Thrust, as suggested by Njuffa, is a template library enabling CUDA calculations (but not using metaprogramming, see Evaluating expressions consisting of elementwise matrix operations in Thrust).

Libraries for parallel distributed Cholesky decomposition in C/C++ in an MPI environment?

What libraries are available for parallel, distributed Cholesky decomposition of dense matrices in C/C++ in an MPI environment?
I've found the ScaLAPACK library, and this might be the solution I'm looking for. It seems a bit fiddly to call though, with lots of Fortran <-> C conversions to do, which makes me think that maybe it is not widely used, and that there are perhaps other libraries used instead?
Alternatively, are there some wrappers for ScaLAPACK that make it relatively not too painful to use in a C or C++ environment, when one is already using MPI, and MPI has already been initialized in the program?
Are these dense or sparse matrices?
Trilinos is a huge library for parallel scientific computation. The sub-package Amesos can link to ScaLAPACK for parallel, direct solution of dense systems, and to UMFPACK, SuperLU or MUMPS for sparse systems. Trilinos is mostly in C++, but there are Python bindings if that's to your taste. It might be overkill, but it'll get the job done.
Intel MKL might also be a choice, since it calls ScaLAPACK on the inside. Note that Intel supports student use of this library, but in that case you have to use an open-source MPI version. Also, the Intel forum is very helpful.
Elemental is also an option, written in C++, which is surely a big advantage when you want to integrate it with your C/C++ application, and the project leader, Jack Poulson, is very friendly and helps a lot.
OpenBLAS, SuperLU and PETSc are also interesting and you may want to read more in my answer.
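
To make the "fiddly to call" point above concrete, here is a hedged sketch of calling ScaLAPACK's pdpotrf from C++. The routine names and argument lists are the standard ScaLAPACK/BLACS interfaces, but the trailing-underscore name mangling and the hidden Fortran string-length arguments vary by platform, and MPI is assumed to be initialized already:

    // Cholesky factorization (A = L * L^T) of a block-cyclically
    // distributed matrix. Grid cleanup and error handling are omitted.
    #include <vector>

    extern "C" {
        // Fortran symbols: all arguments are passed by pointer.
        void blacs_get_(int* ictxt, int* what, int* val);
        void blacs_gridinit_(int* ictxt, const char* order, int* nprow, int* npcol);
        void blacs_gridinfo_(int* ictxt, int* nprow, int* npcol, int* myrow, int* mycol);
        int  numroc_(int* n, int* nb, int* iproc, int* isrcproc, int* nprocs);
        void descinit_(int* desc, int* m, int* n, int* mb, int* nb, int* irsrc,
                       int* icsrc, int* ictxt, int* lld, int* info);
        void pdpotrf_(const char* uplo, int* n, double* a, int* ia, int* ja,
                      int* desca, int* info);
    }

    void distributed_cholesky(int n, int nb, int nprow, int npcol)
    {
        int negone = -1, zero = 0, one = 1, ictxt, myrow, mycol, info;
        blacs_get_(&negone, &zero, &ictxt);             // default BLACS context
        blacs_gridinit_(&ictxt, "Row", &nprow, &npcol); // nprow x npcol process grid
        blacs_gridinfo_(&ictxt, &nprow, &npcol, &myrow, &mycol);

        // Local dimensions of this rank's piece of the block-cyclic layout.
        int mloc = numroc_(&n, &nb, &myrow, &zero, &nprow);
        int nloc = numroc_(&n, &nb, &mycol, &zero, &npcol);
        std::vector<double> a(static_cast<size_t>(mloc) * nloc);
        // ... fill `a` with this rank's part of an SPD matrix here ...

        int desca[9], lld = (mloc > 1) ? mloc : 1;
        descinit_(desca, &n, &n, &nb, &nb, &zero, &zero, &ictxt, &lld, &info);

        pdpotrf_("L", &n, a.data(), &one, &one, desca, &info); // info == 0 on success
    }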

What's a good C++ library for matrix operations

I need to do multiplication on matrices. I'm looking for a library that can do it fast. I'm using the Visual C++ 2008 compiler and I have a Core i7 860, so if the library is optimized for my configuration, that's perfect.
FWIW, Eigen 3 uses threads (OpenMP) for matrix products (in reply to the statement above about Eigen not using threads).
BLAS is a de facto Fortran standard for all basic linear algebra operations (essentially multiplications of matrices and vectors). There are numerous implementations available. For instance:
ATLAS is free and supposedly self-optimizing. You need to compile it yourself though.
Goto BLAS is maintained by Kazushige Goto at TACC. He is very good at getting the last performance bit out of modern processors. It is only for academic use though.
Intel MKL provides optimised BLAS for Intel processors. It is not free, even for academic use.
Then, you may want to use a C++ wrapper, for instance boost::ublas.
If you program on distributed systems, there are PBLAS and ScaLAPACK which enable the use of message passing for distributed linear algebra operations. On a multicore machine, usually implementations of BLAS (at least Intel MKL) use threads for large enough matrices.
If you want more advanced linear algebra routines (eigenvalues, linear systems, least squares, ...), then there is the other de facto Fortran standard, LAPACK. To my knowledge, there is nothing that integrates it elegantly with C++ other than calling the bare Fortran routines. You have to write some wrappers to hide the Fortran calls and provide a sound type-checked implementation.
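
As a sketch of what such a wrapper looks like (assuming the common trailing-underscore name mangling and column-major storage; both are platform conventions, not universals):

    // Thin C++ wrapper around the Fortran BLAS dgemm symbol.
    #include <vector>

    extern "C" {
        // Fortran dgemm: C = alpha*op(A)*op(B) + beta*C, column-major.
        void dgemm_(const char* transa, const char* transb,
                    const int* m, const int* n, const int* k,
                    const double* alpha, const double* a, const int* lda,
                    const double* b, const int* ldb,
                    const double* beta, double* c, const int* ldc);
    }

    // C = A * B for square n x n matrices stored column-major.
    void multiply(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n)
    {
        const double alpha = 1.0, beta = 0.0;
        dgemm_("N", "N", &n, &n, &n, &alpha, A.data(), &n,
               B.data(), &n, &beta, C.data(), &n);
    }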
Look into Eigen. It should have all you need.
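
A minimal Eigen 3 sketch (header-only; the sizes here are arbitrary):

    // Dense matrix product with Eigen. Compile with optimizations,
    // e.g. g++ -O2 -fopenmp for threaded products.
    #include <Eigen/Dense>
    #include <iostream>

    int main()
    {
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(512, 512);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(512, 512);
        Eigen::MatrixXd C = A * B; // vectorized (and, in Eigen 3, threaded) product
        std::cout << C(0, 0) << "\n";
        return 0;
    }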
I have had good experience with Boost's uBLAS. It's a nice option if you're already using Boost.
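
For completeness, a minimal uBLAS sketch (note that uBLAS uses prod() rather than operator* for matrix products):

    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/io.hpp>
    #include <iostream>

    int main()
    {
        using namespace boost::numeric::ublas;
        matrix<double> A(3, 3), B(3, 3);
        for (unsigned i = 0; i < A.size1(); ++i)
            for (unsigned j = 0; j < A.size2(); ++j) {
                A(i, j) = 1.0;
                B(i, j) = 2.0;
            }
        matrix<double> C = prod(A, B); // matrix product
        std::cout << C << std::endl;   // io.hpp provides operator<<
        return 0;
    }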
You can use the GNU Scientific Library (GSL).
Here's a page describing the matrix operations available in the library, including element-wise multiplication (gsl_matrix_mul_elements()); for a true matrix product, use the BLAS interface (gsl_blas_dgemm()):
http://www.gnu.org/software/gsl/manual/html_node/Matrix-operations.html
And here are some links to get you started on using GSL with Visual Studio:
http://gladman.plushost.co.uk/oldsite/computing/gnu_scientific_library.php
http://www.quantcode.com/modules/smartfaq/faq.php?faqid=33
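
A minimal sketch of a matrix product in GSL via its BLAS interface (the sizes and values here are arbitrary):

    // C = A * B with gsl_blas_dgemm.
    #include <gsl/gsl_matrix.h>
    #include <gsl/gsl_blas.h>

    int main()
    {
        gsl_matrix *A = gsl_matrix_alloc(2, 2);
        gsl_matrix *B = gsl_matrix_alloc(2, 2);
        gsl_matrix *C = gsl_matrix_calloc(2, 2); // zero-initialized
        gsl_matrix_set_all(A, 1.0);
        gsl_matrix_set_all(B, 2.0);

        // C = 1.0 * A * B + 0.0 * C
        gsl_blas_dgemm(CblasNoTrans, CblasNoTrans, 1.0, A, B, 0.0, C);

        gsl_matrix_free(A);
        gsl_matrix_free(B);
        gsl_matrix_free(C);
        return 0;
    }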
It can't compete with scientific libraries, but with Visual C++ it is right at hand:
#include <windows.h>
#include <gdiplus.h>
#pragma comment (lib,"Gdiplus.lib")
using namespace Gdiplus;

int main()
{
    // Initialize GDI+.
    ULONG_PTR gpToken = 0;
    GdiplusStartupInput gpInput; // don't take the address of a temporary
    GdiplusStartup(&gpToken, &gpInput, NULL);

    // Build and combine affine transformation matrices.
    Matrix A;
    A.Translate(10, 20);
    Matrix B;
    B.Rotate(35.0);
    A.Multiply(&B); // A = A * B
    if (A.IsInvertible())
        A.Invert();
    if (!A.IsIdentity())
        A.RotateAt(120.0, PointF(10, 10)); // rotate about the point (10, 10)

    // Read back the six matrix elements.
    REAL elements[6];
    A.GetElements(elements);

    // Shut down GDI+.
    GdiplusShutdown(gpToken);
    return 0;
}
So with this you can easily clear the matrix multiplication hurdle (on Windows).
GdiPlus Matrix Documentation
For more recent versions of Visual Studio, you can use ScaLAPACK + MKL. A code sample, with a tutorial on how to make it run, is provided here:
http://code.msdn.microsoft.com/Using-ScaLAPACK-on-Windows-d16a5e76#content
There's also the option of implementing this yourself, perhaps using std::valarray, because that may be parallelised using OpenMP: gcc certainly has such a version, and MSVC++ probably does too.
Otherwise, there's the following trick: one of the matrices should be transposed. Then you have:
AB[i,j] = Sum(k) A[i,k] B^t [j,k]
where you're scanning contiguous memory. If you have 8 cores, you can fairly easily divide the set of [i,j] indices into 8 parts and give each core 1/8 of the total job. To make it even faster, you can use vector multiply instructions; most compilers will provide a special function for this. The result won't be as fast as a tuned library, but it should be OK.
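
A hedged sketch of this trick, using OpenMP to split the rows across cores (the function and names are illustrative, not from the original answer):

    // C = A * B, where Bt holds B transposed so that the inner loop scans
    // contiguous memory in both operands. Matrices are n x n, row-major,
    // and C must be pre-sized to n*n. Compile with -fopenmp (or /openmp);
    // without the flag, the pragma is ignored and the code runs serially.
    #include <vector>

    void matmul(const std::vector<double>& A, const std::vector<double>& Bt,
                std::vector<double>& C, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * Bt[j * n + k]; // contiguous scans
                C[i * n + j] = sum;
            }
    }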
If you're doing longer calculations such as polynomial evaluation, a threading evaluator which also has thread support (gak, two kinds of threads) will do a good job even though it won't do low-level tuning. If you really want to do stuff fast, you have to use a properly tuned library like ATLAS, but then you probably wouldn't be running Windows if you were serious about HPC.

Package for distributing calculations

Do you know of any package for distributing calculations over several computers and/or several cores on each computer? The calculation code is in C++; the package needs to be able to cope with data >2GB and work on a Windows x64 machine. Shareware would be nice, but isn't a requirement.
A suitable solution would depend on the type of calculation and data you wish to process, the granularity of parallelism you wish to achieve, and how much effort you are willing to invest in it.
The simplest would be to just use a suitable solver/library that supports parallelism (e.g. ScaLAPACK). Or, if you wish to roll your own solvers, you can squeeze some parallelisation out of your current code using OpenMP or compilers that provide automatic parallelisation (e.g. the Intel C/C++ compiler). All of these will give you a reasonable performance boost without requiring massive restructuring of your code.
At the other end of the spectrum, you have the MPI option. It can afford you the most performance boost if your algorithm parallelises well. It will however require a fair bit of reengineering.
Another alternative would be to go down the threading route. There are libraries and tools out there that will make this less of a nightmare. These are worth a look: the Boost C++ parallel programming libraries and Threading Building Blocks.
You may want to look at OpenMP.
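
For example, a minimal OpenMP sketch (compile with /openmp in MSVC or -fopenmp in gcc; the function here is illustrative):

    // Sum of squares, split across all cores with an OpenMP reduction.
    // Without the compiler flag, the pragma is ignored and this runs serially.
    #include <vector>

    double sum_of_squares(const std::vector<double>& x)
    {
        double total = 0.0;
        const int n = static_cast<int>(x.size()); // MSVC's OpenMP wants a signed loop index
        #pragma omp parallel for reduction(+ : total)
        for (int i = 0; i < n; ++i)
            total += x[i] * x[i];
        return total;
    }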
There's the MPI library, and the DVM system working on top of MPI. These are generic tools, widely used for parallelizing a variety of tasks.
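
A minimal MPI sketch of distributing a calculation across processes (illustrative only):

    // Each rank computes a partial sum; rank 0 collects the total.
    // Run with e.g.: mpiexec -n 4 ./a.out
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each process sums a strided share of 0 .. N-1.
        const long N = 1000000;
        double local = 0.0;
        for (long i = rank; i < N; i += size)
            local += (double)i;

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }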