CUBLAS - matrix addition.. how? - c++

I am trying to use CUBLAS to sum two big matrices of unknown size. I need a fully optimized code (if possible) so I chose not to rewrite the matrix addition code (simple) but using CUBLAS, in particular the cublasSgemm function which allows to sum A and C (if B is a unit matrix): *C = alpha*op(A)*op(B)+beta*c*
The problem is: C and C++ store the matrices in row-major format, cublasSgemm is intended (for fortran compatibility) to work in column-major format. You can specify whether A and B are to be transposed first, but you can NOT indicate to transpose C. So I'm unable to complete my matrix addition..
I can't transpose the C matrix by myself because the matrix is something like 20000x20000 maximum size.
Any idea on how to solve please?

cublasgeam has been added to CUBLAS5.0.
It computes the weighted sum of 2 optionally transposed matrices

If you're just adding the matrices, it doesn't actually matter. You give it alpha, Aij, beta, and Cij. It thinks you're giving it alpha, Aji, beta, and Cji, and gives you what it thinks is Cji = beta Cji + alpha Aji. But that's the correct Cij as far as you're concerned. My worry is when you start going to things which do matter -- like matrix products. There, there's likely no working around it.
But more to the point, you don't want to be using GEMM to do matrix addition -- you're doing a completely pointless matrix multiplication (which takes takes ~20,0003 operations and many passes through memory) for an operatinon which should only require ~20,0002 operations and a single pass! Treat the matricies as 20,000^2-long vectors and use saxpy.
Matrix multiplication is memory-bandwidth intensive, so there is a huge (factors of 10x or 100x) difference in performance between coding it yourself and a tuned version. Ideally, you'd change structures in your code to match the library. If you can't, in this case you can manage just by using linear algebra identities. The C-vs-Fortran ordering means that when you pass in A, CUBLAS "sees" AT (A transpose). Which is fine, we can work around it. If what you want is C=A.B, pass in the matricies in the opposite order, B.A . Then the library sees (BT . AT), and calculates CT = (A.B)T; and then when it passes back CT, you get (in your ordering) C. Test it and see.

Related

How to extract matrixL() and matrixU() when using Eigen::CholmodSupernodalLLT?

I'm trying to use Eigen::CholmodSupernodalLLT for Cholesky decomposition, however, it seems that I could not get matrixL() and matrixU(). How can I extract matrixL() and matrixU() from Eigen::CholmodSupernodalLLT for future use?
A partial answer to integrate what others have said.
Consider Y ~ MultivariateNormal(0, A). One may want to (1) evaluate the (log-)likelihood (a multivariate normal density), (2) sample from such density.
For (1), it is necessary to solve Ax = b where A is symmetric positive-definite, and compute its log-determinant. (2) requires L such that A = L * L.transpose() since Y ~ MultivariateNormal(0, A) can be found as Y = L u where u ~ MultivariateNormal(0, I).
A Cholesky LLT or LDLT decomposition is useful because chol(A) can be used for both purposes. Solving Ax=b is easy given the decomposition, andthe (log)determinant can be easily derived from the (sum)product of the (log-)components of D or the diagonal of L. By definition L can then be used for sampling.
So, in Eigen one can use:
Eigen::SimplicialLDLT solver(A) (or Eigen::SimplicialLLT), when solver.solve(b) and calculate the determinant using solver.vectorD().diag(). Useful because if A is a covariance matrix, then solver can be used for likelihood evaluations, and matrixL() for sampling.
Eigen::CholmodDecomposition does not give access to matrixL() or vectorD() but exposes .logDeterminant() to achieve the (1) goal but not (2).
Eigen::PardisoLDLT does not give access to matrixL() or vectorD() and does not expose a way to get the determinant.
In some applications, step (2) - sampling - can be done at a later stage so Eigen::CholmodDecomposition is enough. At least in my configuration, Eigen::CholmodDecomposition works 2 to 5 times faster than Eigen::SimplicialLDLT (I guess because of the permutations done under the hood to facilitate parallelization)
Example: in Bayesian spatial Gaussian process regression, the spatial random effects can be integrated out and do not need to be sampled. So MCMC can proceed swiftly with Eigen::CholmodDecomposition to achieve convergence for the uknown parameters. The spatial random effects can then be recovered in parallel using Eigen::SimplicialLDLT. Typically this is only a small part of the computations but having matrixL() directly from CholmodDecomposition would simplify them a bit.
You cannot do this using the given class. The class you are referencing is equotation solver (which indeed uses cholesky decomposition). To decompose your matrix you should rather use Eigen::LLT. Code example from their website:
MatrixXd A(3,3);
A << 4,-1,2, -1,6,0, 2,0,5;
LLT<MatrixXd> lltOfA(A);
MatrixXd L = lltOfA.matrixL();
MatrixXd U = lltOfA.matrixU();
As reported somewhere else, e.g., it cannot be done easily.
I am copying a possible recommendation (answered by Gael Guennebaud himself), even if somewhat old:
If you really need access to the factor to do your own cooking, then
better use the built-in SimplicialL{D}LT<> class. Extracting the
factors from the supernodal internal represations of Cholmod/Pardiso
is indeed not straightforward and very rarely needed. We have to
check, but if Cholmod/Pardiso provide routines to manipulate the
factors, like applying it to a vector, then we could let
matrix{L,U}() return a pseudo expression wrapping these routines.
Developing code for extracting this is likely beyond SO, and probably a topic for a feature request.
Of course, the solution with LLT is at hand (but not the topic of the OP).

Eigen equivalent to Octave/MATLAB mldivide for rectangular matrices

I'm using Eigen v3.2.7.
I have a medium-sized rectangular matrix X (170x17) and row vector Y (170x1) and I'm trying to solve them using Eigen. Octave solves this problem fine using X\Y, but Eigen is returning incorrect values for these matrices (but not smaller ones) - however I suspect that it's how I'm using Eigen, rather than Eigen itself.
auto X = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>{170, 17};
auto Y = Eigen::Matrix<T, Eigen::Dynamic, 1>{170};
// Assign their values...
const auto theta = X.colPivHouseholderQr().solve(Y).eval(); // Wrong!
According to the Eigen documentation, the ColPivHouseholderQR solver is for general matrices and pretty robust, but to make sure I've also tried the FullPivHouseholderQR. The results were identical.
Is there some special magic that Octave's mldivide does that I need to implement manually for Eigen?
Update
This spreadsheet has the two input matrices, plus Octave's and my result matrices.
Replacing auto doesn't make a difference, nor would I expect it to because construction cannot be a lazy operation, and I have to call .eval() on the solve result because the next thing I do with the result matrix is get at the raw data (using .data()) on tail and head operations. The expression template versions of the result of those block operations do not have a .data() member, so I have to force evaluation beforehand - in other words theta is the concrete type already, not an expression template.
The result for (X*theta-Y).norm()/Y.norm() is:
2.5365e-007
And the result for (X.transpose()*X*theta-X.transpose()*Y).norm() / (X.transpose()*Y).norm() is:
2.80096e-007
As I'm currently using single precision float for my basic numerical type, that's pretty much zero for both.
According to your verifications, the solution you get is perfectly fine. If you want more accuracy, then use double floating point numbers. Note that MatLab/Octave use double precision by default.
Moreover, it might also likely be that your problem is not full rank, in which case your problem admit an infinite number of solution. ColPivHouseholderQR picks one, somehow arbitrarily. On the other hand, mldivide will pick the minimal norm one that you can also obtain with Eigen::BDCSVD (Eigen 3.3), or the slower Eigen::JacobiSVD.

Element wise multiplication between matrices in BLAS?

Im starting to use BLAS functions in c++ (specifically intel MKL) to create faster versions of some of my old Matlab code.
Its been working out well so far, but I cant figure out how to perform elementwise multiplication on 2 matrices (A .* B in Matlab).
I know gemv does something similar between a matrix and a vector, so should I just break one of my matrices into vectprs and call gemv repeatedly? I think this would work, but I feel like there should be aomething built in for this operation.
Use the Hadamard product. In MKL it's v?MUL. E.g. for doubles:
vdMul( n, a, b, y );
in Matlab notation it performs:
y[1:n] = a[1:n] .* b[1:n]
In your case you can treat matrices as vectors.

FFTW: Only interested in real result

I am using FFTW to compute the inverse DFT of 2-dimensional complex data. The output of the default-setup (complex-to-complex) is complex, imaginary parts are not zero. However, I am only interested in the real-part of the result, not in the complex part. The interleaved-real-complex output of FFTW is not ideal for me since I want to postprocess the (real) output via SSE. Is there a way to get an only-real array from FFTW? The Complex-To-Real plans don't seem to work since the output isn't real.
Real data in [time|freq] domain implies conjugate symmetry about zero in the other domain.
By enforcing conjugate symmetry (adding conjugate flipped version of itself), you can efficiently discard the imaginary part in the other domain. This should allow you to use the real ifft in FFTW, getting roughly 2x speedup. Note you only use nfft/2+1 bins for the FFTW real ifft.
Here's a 1D example to illustrate the point:
X = randn(8,1)+j*randn(8,1);
Xsym = .5*(X + conj(X([1 8:-1:2]'))); % force the symmetric condition
err = real(ifft(X)) - ifft(Xsym);
For a 2D IFFT, it may be best to perform the 2d ifft with 2 passes of 1d ifft as described in another answer

Compute rank of Matrix

I need to calculate rank of 4096x4096 sparse matrix, and I use C/C++ code.
I found some libraries (like Armadillo) that do it but they're too slow (almost 5 minutes).
I've also tried two Open Source version of Matlab (Freemat and Octave) but both crashed when I tried to make a test with a script.
5 minutes isn't so much but I must get rank from something like a million of matrix so the faster the better.
Someone knows a fast library for rank computation?
The Eigen library supports sparse matrices, try it out.
Computing the algebraic rank is O(n^3), where n is the matrix size, so it's inherently slow. You need eg. to perform pivoting, and this is slow and inaccurate if your matrix is not well conditioned (for n = 4096, a typical matrix is very ill conditioned).
Now, what is the rank ? It is the dimension of the image. It is very difficult to compute when n is large and it'll be spoiled by any small numerical inaccuracy of the input. For n = 4096, unless you happen to have particularly well conditioned matrices, this will prevent you from doing anything useful with a pivoting algorithm.
The best way is in fact to fix a cutoff epsilon, compute the singular values s_1 > ... > s_n and take as the rank the lowest integer r such that sum(s_i^2, i > r) < epsilon^2 * sum(s_i^2).
You thus need a sparse SVD routine, eg. from there.
This may not be faster, but to the very least it will be correct.
You can ask for less singular values that you need to speed up things. This is a tough problem, and with no info on the background and how you got these matrices, there is nothing more we can do.
Try the following code (the documentation is here).
It is an example for calculating the rank of the matrix A with Eigen library:
MatrixXd A(2,2);
A << 1 , 0, 1, 0;
FullPivLU<MatrixXd> luA(A);
int rank = luA.rank();