uBlas is extremely slow [duplicate] - c++

Is there a way to improve the boost ublas product performance?
I have two matrices A,B which i want to mulitply/add/sub/...
In MATLAB vs. C++ i get the following times [s] for a 2000x2000 matrix Operations
OPERATION | MATLAB | C++ (MSVC10)
A + B | 0.04 | 0.04
A - B | 0.04 | 0.04
AB | 1.0 | 62.66
A'B' | 1.0 | 54.35
Why there is such a huge performance loss here?
The matrices are only real doubles.
But i also need positive definites,symmetric,rectangular products.
EDIT:
The code is trivial
matrix<double> A( 2000 , 2000 );
// Fill Matrix A
matrix<double> B = A;
C = A + B;
D = A - B;
E = prod(A,B);
F = prod(trans(A),trans(B));
EDIT 2:
The results are mean values of 10 trys. The stddev was less than 0.005
I would expect an factor 2-3 maybe to but not 50 (!)
EDIT 3:
Everything was benched in Release ( NDEBUG/MOVE_SEMANTICS/.. ) mode.
EDIT 4:
Preallocated Matrices for the product results did not affect the runtime.

Post your C+ code for advice on any possible optimizations.
You should be aware however that Matlab is highly specialized for its designed task, and you are unlikely to be able to match it using Boost. On the other hand - Boost is free, while Matlab decidedly not.
I believe that best Boost performance can be had by binding the uBlas code to an underlying LAPACK implementation.

You should use noalias in the left hand side of matrix multiplications in order to get rid of unnecessary copies.
Instead of E = prod(A,B); use noalias(E) = prod(A,b);
From documentation:
If you know for sure that the left hand expression and the right hand
expression have no common storage, then assignment has no aliasing. A
more efficient assignment can be specified in this case: noalias(C) =
prod(A, B); This avoids the creation of a temporary matrix that is
required in a normal assignment. 'noalias' assignment requires that
the left and right hand side be size conformant.

There are many efficient BLAS implementation, like ATLAS, gotoBLAS, MKL, use them instead.
I don't pick at the code, but guess the ublas::prod(A, B) using three-loops, no blocks and not cache friendly. If that's true, prod(A, B.trans()) will be much faster then others.
If cblas is avaiable, using cblas_dgemm to do the calculation. If not, you can simply rearrange the data, means, prod(A, B.trans()) instead.

You don't know what role memory-management is playing here. prod is having to allocate a 32mb matrix, and so is trans, twice, and then you're doing all that 10 times. Take a few stackhots and see what it's really doing. My dumb guess is if you pre-allocate the matrices you get a better result.
Other ways matrix-multiplication could be speeded up are
pre-transposing the left-hand matrix, to be cache-friendly, and
skipping over zeros. Only if A(i,k) and B(k,j) are both non-zero is any value contributed.
Whether this is done in uBlas is anybody's guess.

Related

How to extract matrixL() and matrixU() when using Eigen::CholmodSupernodalLLT?

I'm trying to use Eigen::CholmodSupernodalLLT for Cholesky decomposition, however, it seems that I could not get matrixL() and matrixU(). How can I extract matrixL() and matrixU() from Eigen::CholmodSupernodalLLT for future use?
A partial answer to integrate what others have said.
Consider Y ~ MultivariateNormal(0, A). One may want to (1) evaluate the (log-)likelihood (a multivariate normal density), (2) sample from such density.
For (1), it is necessary to solve Ax = b where A is symmetric positive-definite, and compute its log-determinant. (2) requires L such that A = L * L.transpose() since Y ~ MultivariateNormal(0, A) can be found as Y = L u where u ~ MultivariateNormal(0, I).
A Cholesky LLT or LDLT decomposition is useful because chol(A) can be used for both purposes. Solving Ax=b is easy given the decomposition, andthe (log)determinant can be easily derived from the (sum)product of the (log-)components of D or the diagonal of L. By definition L can then be used for sampling.
So, in Eigen one can use:
Eigen::SimplicialLDLT solver(A) (or Eigen::SimplicialLLT), when solver.solve(b) and calculate the determinant using solver.vectorD().diag(). Useful because if A is a covariance matrix, then solver can be used for likelihood evaluations, and matrixL() for sampling.
Eigen::CholmodDecomposition does not give access to matrixL() or vectorD() but exposes .logDeterminant() to achieve the (1) goal but not (2).
Eigen::PardisoLDLT does not give access to matrixL() or vectorD() and does not expose a way to get the determinant.
In some applications, step (2) - sampling - can be done at a later stage so Eigen::CholmodDecomposition is enough. At least in my configuration, Eigen::CholmodDecomposition works 2 to 5 times faster than Eigen::SimplicialLDLT (I guess because of the permutations done under the hood to facilitate parallelization)
Example: in Bayesian spatial Gaussian process regression, the spatial random effects can be integrated out and do not need to be sampled. So MCMC can proceed swiftly with Eigen::CholmodDecomposition to achieve convergence for the uknown parameters. The spatial random effects can then be recovered in parallel using Eigen::SimplicialLDLT. Typically this is only a small part of the computations but having matrixL() directly from CholmodDecomposition would simplify them a bit.
You cannot do this using the given class. The class you are referencing is equotation solver (which indeed uses cholesky decomposition). To decompose your matrix you should rather use Eigen::LLT. Code example from their website:
MatrixXd A(3,3);
A << 4,-1,2, -1,6,0, 2,0,5;
LLT<MatrixXd> lltOfA(A);
MatrixXd L = lltOfA.matrixL();
MatrixXd U = lltOfA.matrixU();
As reported somewhere else, e.g., it cannot be done easily.
I am copying a possible recommendation (answered by Gael Guennebaud himself), even if somewhat old:
If you really need access to the factor to do your own cooking, then
better use the built-in SimplicialL{D}LT<> class. Extracting the
factors from the supernodal internal represations of Cholmod/Pardiso
is indeed not straightforward and very rarely needed. We have to
check, but if Cholmod/Pardiso provide routines to manipulate the
factors, like applying it to a vector, then we could let
matrix{L,U}() return a pseudo expression wrapping these routines.
Developing code for extracting this is likely beyond SO, and probably a topic for a feature request.
Of course, the solution with LLT is at hand (but not the topic of the OP).

Eigen: how to speed up a += coeffs * coeffs.transpose()

I need to compute many (about 400k) solutions of small linear least square problems. Each problem contains 10-300 equations with only 7 variables.
To solve these problems i use eigen library. Straight solving takes too much time and i transform each problem to solving 7x7 system of linear equations by deriving derivatives by my hand.
I recieve nice speed-up but i want to increase performance again.
I use vagrind to profile my program and i found that operation with highest self cost is operator += of eigen matrix. This operation takes more than ten calls of a.ldlt().solve(b);
I use this operator to compose A matrix and B vector of each system of equations
//I cal these code to solve each problem
const int nVars = 7;
//i really need double precision
Eigen::Matrix<double, nVars, nVars> a = Eigen::Matrix<double, nVars, nVars>::Zero();
Eigen::Matrix<double, nVars, 1> b = Eigen::Matrix<double, nVars, 1>::Zero();
Eigen::Matrix<double, nVars, 1> equationCoeffs;
//............................
//Somewhere in big cycle.
//equationCoeffs and z are updated on each iteration
a += equationCoeffs * equationCoeffs.transpose();
b += equationCoeffs * z;
Where z is some scalar
So my question is: How can i improve performance of these operations?
PS Sorry for my poor English
Instead of forming the matrix and vector components of the normal equation by hand, one equation at a time, you might try to allocate a large enough matrix once (e.g. 300 x 7) to store all coefficients and then let Eigen's optimized matrix-matrix product kernels do the job for you:
Matrix<double,Dynamic,nbVars> D(300,nbVars);
VectorXd f(300);
for(...)
{
int nb_equations = ...;
for(i=0..nb_equations-1)
{
D.row(i) = equationCoeffs;
f(i) = z;
}
a = D.topRows(nb_equations).transpose() * D.topRows(nb_equations);
b = D.topRows(nb_equations).transpose() * f.head(nb_equations);
// solve ax=b
}
You might bench with both a column-major and row-major storage for the matrix D to see which one is best.
Another possible approach would be to declare a, equationCoeffs, and b as 8x8 or 8x1 matrix or vectors making sure that equationCoeffs(7)==0. This way you maximize SIMD usage. Then use a.topLeftCorners<7,7>(), b.head<7>() when calling LDLT. You might even combine this strategy with the previous one.
Finally, if your CPU support AVX or FMA, you might use the devel branch and compile with -mavx or -mfma to get a significant speedup.
If you can use g++5.1, you might want to take a look at OpenMP
( http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf ).
G++5.1 (or gcc5.1 for C) also has some basic support for OpenACC, you can try that as well. There should be more implementation of OpenACC in the future.
Also if you have access to intel compiler (icc, icpc) it speeded up my code even just by using it.
If you can use nvidia's nvcc, you might use the thrust library
( http://docs.nvidia.com/cuda/thrust/#axzz3g8xJPGHe ), there's a lot of sample code on their github as well
( https://github.com/thrust/thrust ). However, using thrust is not so straight forward and needs some real thinking.
EDIT:
Thrust also requires Nvidia GPU.
For AMD cards I believe there is a library called ArrayFire, which looks very similar to Thrust (I have not tried that one, yet)
I have a single problem Ax=b with 480k float variables. The matrix A is sparse and solving it with Eigen BiCGSTAB took 4.8 seconds.
I also worked with ViennaCL before, so I tried to solve the same problem there, and it took only 1.2 seconds. The increase in spead is realised
by the processing on the GPU.

Multiple matrix-vector calls with CUBLAS

I currently have to perform 128 independent sequential matrix-vector CUBLAS operations. All the matrices and vectors are different. Each independent matrix is stored right after the next in memory and the vectors are likewise stored contiguously in memory (all in row-major form).
A bit more context:
The matrices are (2048 X 8) and the vector is length 2048. The outputs are all independent. Because I have super matrices, I have the following:
matrix[(2048*128)x8]
vector[(2048*128)x1]
output[(8*128)x1]
With cublasSgemv I'm doing a transpose on the each mini matrix first and then adding (rather than replacing) the result into memory with:
cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, scale1, d_matrix + offset1, Bdim, d_vector + offset2, 1, scale2, out + offset3, 1);
I am making 128 such calls which I would like to do in one.
The profiler shows significant performance degradation from making these multiple calls. What is the best way to do multiple matrix-vector operations? Is there a way to batch them together into one fast call?
Are streams the best way to go or is there some way to make a call with relevant offsets (to index into my array of matrices and vectors)? The only other efficient option seemed to be to use a CUSPASE call and stick all the matrices on the diagonal.
NOTE: I'm not interested in getting the transposes or row/column major ordering in the gemv call correct for this particular question.
Updated
In fact you have to pay special attention to the r/c major ordering if your want to speed up your code in this case.
As shown in your revised question, you use row-major matrices. then you have a super-matrix A[(2048*128)x8] and a super vector V[(2048*128)x1]. And here I assume that you want a col-major matrix output[8x128] (can be seen as a super-vector [(8*128)x1]), where each col is the result of transpose( miniA[2048x8] ) * miniV[2048x1].
On the other hand, CUBLAS assumes that matrices are stored in column-major. So it may need some extra matrix transpose routines to change the ordering.
Since you need 128 independent [8x1] results, it should be able to calculate the result in 4 cuda API calls, which should be more efficient than your original 128 calls.
1. Row-major A[(2048*128)x8] can be seen as colum-major AA[8x(2048*128)]
B[8x(2048*128)] = AA[8x(2048*128)] * diag( V[[(2048*128)x1]] ) by 1 dgmm()
2. C[(2048*128)x8] = transpose( B[8x(2048*128)] ) by 1 geam()
3. Col-major C[(2048*128)x8] can be seen as col-major CC[2048x(8*128)]
O[1x(8*128)] = ones[1x2048] * CC[2048x(8*128)] by 1 gemv()
4. Row vector O[1x(8*128)] can be seen as col-major matrix OO[128x8]
output[8x128] = transpose( OO[128x8] ) by 1 geam()
This col-major output[8x128] is what you want.
Since you need adding rather then replacing, you may need one more call to add the orginal values to output
I have done a very quick launch of the batchCUBLAS SDK example. I have considered 128 independent runs for matrices of size 2048x8 and 8x1. Here are the results on an NVIDIA GeForce GT 540M (compute capability 2.1) and on a Kepler K20c (compute capability 3.5).
For the NVIDIA GeForce GT 540M case, there is no relevant improvement for the "streamed" and "batched" versions against the "non-streamed" cuBLAS execution.
For the NVIDIA Kepler K20c, I have obtained
sgemm 1.87 GFlops (non-streamed); 3.08 GFlops (streamed); 6.58 GFlops (batched);
dgemm 1.00 GFlops (non-streamed); 1.43 GFlops (streamed); 6.67 GFlops (batched);
Streamed and batched cases seem to relevantly improve the non-streamed case for single precision.
Disclaimers
I'm not accounting for transposition, as you do;
The SDK example considers matrix-matrix multiplications, whereas you are needing matrix-vector multiplications; streaming is possible for gemv, but not batching.
I hope that those partial results could provide you with some useful information.

Worse performance using Eigen than using my own class

A couple of weeks ago I asked a question about the performance of matrix multiplication.
I was told that in order to enhance the performance of my program I should use some specialised matrix classes rather than my own class.
StackOverflow users recommended:
uBLAS
EIGEN
BLAS
At first I wanted to use uBLAS however reading documentation it turned out that this library doesn't support matrix-matrix multiplication.
After all I decided to use EIGEN library. So I exchanged my matrix class to Eigen::MatrixXd - however it turned out that now my application works even slower than before.
Time before using EIGEN was 68 seconds and after exchanging my matrix class to EIGEN matrix program runs for 87 seconds.
Parts of program which take the most time looks like that
TemplateClusterBase* TemplateClusterBase::TransformTemplateOne( vector<Eigen::MatrixXd*>& pointVector, Eigen::MatrixXd& rotation ,Eigen::MatrixXd& scale,Eigen::MatrixXd& translation )
{
for (int i=0;i<pointVector.size();i++ )
{
//Eigen::MatrixXd outcome =
Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
//delete prototypePointVector[i]; // ((rotation*scale)* (*prototypePointVector[i]) + translation).ConvertToPoint();
MatrixHelper::SetX(*prototypePointVector[i],MatrixHelper::GetX(outcome));
MatrixHelper::SetY(*prototypePointVector[i],MatrixHelper::GetY(outcome));
//assosiatedPointIndexVector[i] = prototypePointVector[i]->associatedTemplateIndex = i;
}
return this;
}
and
Eigen::MatrixXd AlgorithmPointBased::UpdateTranslationMatrix( int clusterIndex )
{
double membershipSum = 0,outcome = 0;
double currentPower = 0;
Eigen::MatrixXd outcomePoint = Eigen::MatrixXd(2,1);
outcomePoint << 0,0;
Eigen::MatrixXd templatePoint;
for (int i=0;i< imageDataVector.size();i++)
{
currentPower =0;
membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);
outcomePoint.noalias() += (*imageDataVector[i] - (prototypeVector[clusterIndex]->rotationMatrix*prototypeVector[clusterIndex]->scalingMatrix* ( *templateCluster->templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]) ))*currentPower ;
}
outcomePoint.noalias() = outcomePoint/=membershipSum;
return outcomePoint; //.ConvertToMatrix();
}
As You can see, these functions performs a lot of matrix operations. That is why I thought using Eigen would speed up my application. Unfortunately (as I mentioned above), the program works slower.
Is there any way to speed up these functions?
Maybe if I used DirectX matrix operations I would get better performance ?? (however I have a laptop with integrated graphic card).
If you're using Eigen's MatrixXd types, those are dynamically sized. You should get much better results from using the fixed size types e.g Matrix4d, Vector4d.
Also, make sure you're compiling such that the code can get vectorized; see the relevant Eigen documentation.
Re your thought on using the Direct3D extensions library stuff (D3DXMATRIX etc): it's OK (if a bit old fashioned) for graphics geometry (4x4 transforms etc), but it's certainly not GPU accelerated (just good old SSE, I think). Also, note that it's floating point precision only (you seem to be set on using doubles). Personally I'd much prefer to use Eigen unless I was actually coding a Direct3D app.
Make sure to have compiler optimization switched on (e.g. at least -O2 on gcc). Eigen is heavily templated and will not perform very well if you don't turn on optimization.
Which version of Eigen are you using? They recently released 3.0.1, which is supposed to be faster than 2.x. Also, make sure you play a bit with the compiler options. For example, make sure SSE is being used in Visual Studio:
C/C++ --> Code Generation --> Enable Enhanced Instruction Set
You should profile and then optimize first the algorithm, then the implementation. In particular, the posted code is quite innefficient:
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
I don't know the library, so I won't even try to guess the number of unnecessary temporaries that you are creating, but a simple refactor:
Eigen::MatrixXd tmp = rotation*scale;
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = tmp*(*pointVector[i]) + translation;
Can save you a good amount of expensive multiplications (and again, probably new temporary matrices that get discarded right away.
A couple of points.
Why are you multiplying rotation*scale inside of the loop when that product will have the same value each iteration? That is a lot of wasted effort.
You are using dynamically sized matrices rather than fixed sized matrices. Someone else mentioned this already, and you said you shaved off 2 sec.
You are passing arguments as a vector of pointers to matrices. This adds an extra pointer indirection and destroys any guarantee of data locality, which will give poor cache performance.
I hope this isn't insulting, but are you compiling in Release or Debug? Eigen is very slow in debug builds, because it uses lots of trivial templated functions that are optimized out of release but remain in debug.
Looking at your code, I am hesitant to blame Eigen for performance problems. However, most linear algebra libraries (including Eigen) are not really designed for your use case of lots of tiny matrices. In general, Eigen will be better optimized for 100x100 or larger matrices. You very well may be better off using your own matrix class or the DirectX math helper classes. The DirectX math classes are completely independent from your video card.
Looking back at your previous post and the code in there, my suggestion would be to use your old code, but improve its efficiency by moving things around. I'm posting on that previous question to keep the answers separate.

CUBLAS - matrix addition.. how?

I am trying to use CUBLAS to sum two big matrices of unknown size. I need a fully optimized code (if possible) so I chose not to rewrite the matrix addition code (simple) but using CUBLAS, in particular the cublasSgemm function which allows to sum A and C (if B is a unit matrix): *C = alpha*op(A)*op(B)+beta*c*
The problem is: C and C++ store the matrices in row-major format, cublasSgemm is intended (for fortran compatibility) to work in column-major format. You can specify whether A and B are to be transposed first, but you can NOT indicate to transpose C. So I'm unable to complete my matrix addition..
I can't transpose the C matrix by myself because the matrix is something like 20000x20000 maximum size.
Any idea on how to solve please?
cublasgeam has been added to CUBLAS5.0.
It computes the weighted sum of 2 optionally transposed matrices
If you're just adding the matrices, it doesn't actually matter. You give it alpha, Aij, beta, and Cij. It thinks you're giving it alpha, Aji, beta, and Cji, and gives you what it thinks is Cji = beta Cji + alpha Aji. But that's the correct Cij as far as you're concerned. My worry is when you start going to things which do matter -- like matrix products. There, there's likely no working around it.
But more to the point, you don't want to be using GEMM to do matrix addition -- you're doing a completely pointless matrix multiplication (which takes takes ~20,0003 operations and many passes through memory) for an operatinon which should only require ~20,0002 operations and a single pass! Treat the matricies as 20,000^2-long vectors and use saxpy.
Matrix multiplication is memory-bandwidth intensive, so there is a huge (factors of 10x or 100x) difference in performance between coding it yourself and a tuned version. Ideally, you'd change structures in your code to match the library. If you can't, in this case you can manage just by using linear algebra identities. The C-vs-Fortran ordering means that when you pass in A, CUBLAS "sees" AT (A transpose). Which is fine, we can work around it. If what you want is C=A.B, pass in the matricies in the opposite order, B.A . Then the library sees (BT . AT), and calculates CT = (A.B)T; and then when it passes back CT, you get (in your ordering) C. Test it and see.