Eigen equivalent to Octave/MATLAB mldivide for rectangular matrices - c++

I'm using Eigen v3.2.7.
I have a medium-sized rectangular matrix X (170x17) and a column vector Y (170x1), and I'm trying to solve the system with Eigen. Octave solves this problem fine using X\Y, but Eigen returns incorrect values for these matrices (though not for smaller ones). However, I suspect it's how I'm using Eigen, rather than Eigen itself.
auto X = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>{170, 17};
auto Y = Eigen::Matrix<T, Eigen::Dynamic, 1>{170};
// Assign their values...
const auto theta = X.colPivHouseholderQr().solve(Y).eval(); // Wrong!
According to the Eigen documentation, the ColPivHouseholderQR solver works for general matrices and is pretty robust, but to make sure I've also tried FullPivHouseholderQR. The results were identical.
Is there some special magic that Octave's mldivide does that I need to implement manually for Eigen?
Update
This spreadsheet has the two input matrices, plus Octave's and my result matrices.
Replacing auto doesn't make a difference, nor would I expect it to, because construction cannot be a lazy operation. I have to call .eval() on the solve result because the next thing I do with the result matrix is access its raw data (using .data()) in tail and head operations; the expression-template versions of the results of those block operations do not have a .data() member, so I have to force evaluation beforehand. In other words, theta is already the concrete type, not an expression template.
The result for (X*theta-Y).norm()/Y.norm() is:
2.5365e-007
And the result for (X.transpose()*X*theta-X.transpose()*Y).norm() / (X.transpose()*Y).norm() is:
2.80096e-007
As I'm currently using single precision float for my basic numerical type, that's pretty much zero for both.

According to your verifications, the solution you get is perfectly fine. If you want more accuracy, then use double-precision floating point numbers. Note that MATLAB/Octave use double precision by default.
Moreover, it may well be that your problem is not full rank, in which case it admits an infinite number of solutions. ColPivHouseholderQR picks one of them, somewhat arbitrarily. mldivide, on the other hand, picks the minimal-norm one, which you can also obtain with Eigen::BDCSVD (Eigen 3.3) or the slower Eigen::JacobiSVD.
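For reference, a minimal sketch of that minimal-norm solve (JacobiSVD also exists in Eigen 3.2; double precision as recommended above):

#include <Eigen/Dense>

// Minimal-norm least-squares solution of X * theta = Y (sketch).
Eigen::VectorXd solveMinNorm(const Eigen::MatrixXd& X, const Eigen::VectorXd& Y)
{
    // JacobiSVD picks the minimal-norm solution when X is rank-deficient;
    // on Eigen 3.3+, Eigen::BDCSVD is the faster choice for larger matrices.
    return X.jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(Y);
}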

Related

How to extract matrixL() and matrixU() when using Eigen::CholmodSupernodalLLT?

I'm trying to use Eigen::CholmodSupernodalLLT for Cholesky decomposition; however, it seems that I cannot get matrixL() and matrixU(). How can I extract matrixL() and matrixU() from Eigen::CholmodSupernodalLLT for future use?
A partial answer to integrate what others have said.
Consider Y ~ MultivariateNormal(0, A). One may want to (1) evaluate the (log-)likelihood (a multivariate normal density), (2) sample from such density.
For (1), it is necessary to solve Ax = b where A is symmetric positive-definite, and to compute its log-determinant. (2) requires an L such that A = L * L.transpose(), since Y ~ MultivariateNormal(0, A) can be generated as Y = L u where u ~ MultivariateNormal(0, I).
A Cholesky LLT or LDLT decomposition is useful because chol(A) can be used for both purposes. Solving Ax = b is easy given the decomposition, and the determinant is the product of the entries of D (or of the squared diagonal entries of L), so the log-determinant is the corresponding sum of logs. By definition, L can then be used for sampling.
So, in Eigen one can use:
Eigen::SimplicialLDLT solver(A) (or Eigen::SimplicialLLT): then solver.solve(b) solves the system, and the determinant can be computed from the product of the entries of solver.vectorD(). This is useful because if A is a covariance matrix, then solver can be used for likelihood evaluations and matrixL() for sampling.
Eigen::CholmodDecomposition does not give access to matrixL() or vectorD(), but exposes .logDeterminant() to achieve goal (1), not (2).
Eigen::PardisoLDLT does not give access to matrixL() or vectorD() and does not expose a way to get the determinant.
In some applications, step (2) - sampling - can be done at a later stage, so Eigen::CholmodDecomposition is enough. At least in my configuration, Eigen::CholmodDecomposition works 2 to 5 times faster than Eigen::SimplicialLDLT (I guess because of the permutations done under the hood to facilitate parallelization).
Example: in Bayesian spatial Gaussian process regression, the spatial random effects can be integrated out and do not need to be sampled, so MCMC can proceed swiftly with Eigen::CholmodDecomposition to achieve convergence for the unknown parameters. The spatial random effects can then be recovered in parallel using Eigen::SimplicialLDLT; a sketch of both uses follows. Typically this is only a small part of the computations, but having matrixL() directly from CholmodDecomposition would simplify it a bit.
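A minimal sketch of both uses with Eigen::SimplicialLDLT, assuming a sparse symmetric positive-definite A (the factorization is P*A*P' = L*D*L', so sampling has to undo the fill-reducing permutation):

#include <Eigen/Sparse>
#include <Eigen/SparseCholesky>

// (1) solve A x = b and get log|A|; (2) sample Y ~ N(0, A) from iid normals u.
void ldltExample(const Eigen::SparseMatrix<double>& A,
                 const Eigen::VectorXd& b, const Eigen::VectorXd& u)
{
    Eigen::SimplicialLDLT<Eigen::SparseMatrix<double>> solver(A);
    Eigen::VectorXd x = solver.solve(b);                   // (1) solve A x = b
    double logdet = solver.vectorD().array().log().sum();  // (1) log-determinant
    // (2) A = P^-1 L D L^T P, so Y = P^-1 L sqrt(D) u has covariance A.
    Eigen::VectorXd w = (solver.vectorD().array().sqrt() * u.array()).matrix();
    Eigen::VectorXd y = solver.permutationPinv() * (solver.matrixL() * w);
}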
You cannot do this using the given class. The class you are referencing is an equation solver (which indeed uses a Cholesky decomposition internally). To decompose your matrix you should instead use Eigen::LLT. Code example from their website:
#include <Eigen/Dense>
using namespace Eigen;

MatrixXd A(3,3);
A << 4,-1,2, -1,6,0, 2,0,5;
LLT<MatrixXd> lltOfA(A);        // compute the Cholesky decomposition of A
MatrixXd L = lltOfA.matrixL();  // lower-triangular factor, A = L * L^T
MatrixXd U = lltOfA.matrixU();  // upper-triangular factor, U = L^T
As reported elsewhere, this cannot be done easily.
I am copying a possible recommendation (answered by Gael Guennebaud himself), even if somewhat old:
If you really need access to the factor to do your own cooking, then better use the built-in SimplicialL{D}LT<> class. Extracting the factors from the supernodal internal representations of Cholmod/Pardiso is indeed not straightforward and very rarely needed. We have to check, but if Cholmod/Pardiso provide routines to manipulate the factors, like applying them to a vector, then we could let matrix{L,U}() return a pseudo expression wrapping these routines.
Developing code for extracting this is likely beyond SO, and probably a topic for a feature request.
Of course, the solution with LLT is at hand (but not the topic of the OP).

Eigen: how to speed up a += coeffs * coeffs.transpose()

I need to compute many (about 400k) solutions of small linear least-squares problems. Each problem contains 10-300 equations with only 7 variables.
To solve these problems I use the Eigen library. Solving them directly takes too much time, so I transform each problem into a 7x7 system of linear equations by deriving the derivatives by hand.
I get a nice speed-up, but I want to increase performance further.
I used valgrind to profile my program, and I found that the operation with the highest self cost is operator+= of an Eigen matrix. This operation takes more time than ten calls of a.ldlt().solve(b);
I use this operator to compose the A matrix and b vector of each system of equations:
// I call this code to solve each problem
const int nVars = 7;
// I really need double precision
Eigen::Matrix<double, nVars, nVars> a = Eigen::Matrix<double, nVars, nVars>::Zero();
Eigen::Matrix<double, nVars, 1> b = Eigen::Matrix<double, nVars, 1>::Zero();
Eigen::Matrix<double, nVars, 1> equationCoeffs;
//............................
// Somewhere in a big loop:
// equationCoeffs and z are updated on each iteration
a += equationCoeffs * equationCoeffs.transpose();
b += equationCoeffs * z;
where z is some scalar.
So my question is: how can I improve the performance of these operations?
PS: Sorry for my poor English.
Instead of forming the matrix and vector components of the normal equations by hand, one equation at a time, you might try allocating a large enough matrix once (e.g., 300 x 7) to store all the coefficients and then let Eigen's optimized matrix-matrix product kernels do the job for you:
const int nbVars = 7;
Matrix<double,Dynamic,nbVars> D(300,nbVars);
VectorXd f(300);
for(...)
{
  int nb_equations = ...;
  for(int i = 0; i < nb_equations; ++i)
  {
    D.row(i) = equationCoeffs;   // one equation per row
    f(i) = z;
  }
  // One big product instead of many rank-1 updates:
  a = D.topRows(nb_equations).transpose() * D.topRows(nb_equations);
  b = D.topRows(nb_equations).transpose() * f.head(nb_equations);
  // solve a x = b
}
You might bench with both a column-major and row-major storage for the matrix D to see which one is best.
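For instance, the row-major variant of D can be declared like this (sketch):

Matrix<double, Dynamic, nbVars, RowMajor> D(300, nbVars);  // one equation per row, stored contiguously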
Another possible approach is to declare a, equationCoeffs, and b as 8x8 matrices or 8x1 vectors, making sure that equationCoeffs(7)==0. This way you maximize SIMD usage. Then use a.topLeftCorner<7,7>() and b.head<7>() when calling LDLT. You might even combine this strategy with the previous one.
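A sketch of that padding idea (names mirror the question's code; the eighth coefficient must stay zero so the extra row/column of a and entry of b remain zero):

#include <Eigen/Dense>

using Vec8 = Eigen::Matrix<double, 8, 1>;
using Mat8 = Eigen::Matrix<double, 8, 8>;

// Accumulate with full 8-wide operations to maximize SIMD usage.
void accumulate(Mat8& a, Vec8& b, const Vec8& coeffs, double z)
{
    a += coeffs * coeffs.transpose();  // coeffs(7) == 0 by construction
    b += coeffs * z;
}

Eigen::Matrix<double, 7, 1> solve(const Mat8& a, const Vec8& b)
{
    // Solve only the meaningful 7x7 block, as suggested above.
    return a.topLeftCorner<7, 7>().ldlt().solve(b.head<7>());
}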
Finally, if your CPU supports AVX or FMA, you might use the devel branch and compile with -mavx or -mfma to get a significant speedup.
If you can use g++ 5.1, you might want to take a look at OpenMP
( http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf ).
G++ 5.1 (or gcc 5.1 for C) also has some basic support for OpenACC; you can try that as well. There should be more implementations of OpenACC in the future.
Also, if you have access to the Intel compiler (icc, icpc), it sped my code up just by using it.
If you can use NVIDIA's nvcc, you might use the Thrust library
( http://docs.nvidia.com/cuda/thrust/#axzz3g8xJPGHe ); there's a lot of sample code on their GitHub as well
( https://github.com/thrust/thrust ). However, using Thrust is not so straightforward and needs some real thinking.
EDIT:
Thrust also requires an NVIDIA GPU.
For AMD cards I believe there is a library called ArrayFire, which looks very similar to Thrust (I have not tried that one yet).
I have a single problem Ax=b with 480k float variables. The matrix A is sparse, and solving it with Eigen's BiCGSTAB took 4.8 seconds.
I have also worked with ViennaCL before, so I tried to solve the same problem there, and it took only 1.2 seconds. The increase in speed is realised by the processing on the GPU.

FFTW: Only interested in real result

I am using FFTW to compute the inverse DFT of 2-dimensional complex data. The output of the default setup (complex-to-complex) is complex, and the imaginary parts are not zero. However, I am only interested in the real part of the result, not the imaginary part. The interleaved real-complex output of FFTW is not ideal for me, since I want to postprocess the (real) output via SSE. Is there a way to get a real-only array from FFTW? The complex-to-real plans don't seem to work, since the output isn't real.
Real data in the [time|freq] domain implies conjugate symmetry about zero in the other domain.
By enforcing conjugate symmetry (adding the conjugate-flipped version of the spectrum to itself), you can efficiently discard the imaginary part in the other domain. This should allow you to use the real ifft in FFTW, getting roughly a 2x speedup. Note that you only pass nfft/2+1 bins to FFTW's real ifft.
Here's a 1D example to illustrate the point:
X = randn(8,1) + j*randn(8,1);           % arbitrary complex spectrum
Xsym = .5*(X + conj(X([1 8:-1:2]')));    % force the symmetric condition
err = real(ifft(X)) - ifft(Xsym);        % zero up to roundoff
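The same 1D idea in C++ with FFTW might look like this (a sketch; names are illustrative, and note FFTW's c2r transform is unnormalized, so the output is n*real(ifft(X)) in the MATLAB sense):

#include <fftw3.h>
#include <complex>
#include <vector>

int main()
{
    const int n = 8;
    std::vector<std::complex<double>> X(n);         // full complex spectrum (fill it)
    std::vector<std::complex<double>> Xsym(n/2 + 1);
    // Force conjugate symmetry, keeping only the n/2+1 bins the real transform needs:
    // Xsym[k] = 0.5 * (X[k] + conj(X[(n-k) % n]))
    for (int k = 0; k <= n/2; ++k)
        Xsym[k] = 0.5 * (X[k] + std::conj(X[(n - k) % n]));
    std::vector<double> y(n);                       // purely real output
    fftw_plan p = fftw_plan_dft_c2r_1d(
        n, reinterpret_cast<fftw_complex*>(Xsym.data()), y.data(), FFTW_ESTIMATE);
    fftw_execute(p);                                // y = n * real(ifft(X))
    fftw_destroy_plan(p);
    return 0;
}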
For a 2D IFFT, it may be best to perform the 2D ifft as two passes of 1D iffts, as described in another answer.

Alglib: solving A * x = b in a least squares sense

I have a somewhat complicated algorithm that requires the fitting of a quadric to a set of points. This quadric is given by its parametrization (u, v, f(u,v)), where f(u,v) = au^2+bv^2+cuv+du+ev+f.
The coefficients of the f(u,v) function need to be found, since I have a set of exactly 6 constraints this function should obey. The problem is that this set of constraints, although yielding a problem of the form A*x = b, is not well behaved enough to guarantee a unique solution.
Thus, to cut it short, I'd like to use alglib's facilities to somehow either determine A's pseudoinverse or directly find the best fit for the x vector.
Apart from computing the SVD, is there a more direct algorithm implemented in this library that can solve a system in a least-squares sense (again, apart from the SVD or from using the naive inv(transpose(A)*A)*transpose(A)*b formula for general least-squares problems where A is not a square matrix)?
Found the answer through some careful documentation browsing:
rmatrixsolvels( A, noRows, noCols, b, singularValueThreshold, info, solverReport, x)
The documentation states that the singular value threshold is a clamping threshold: any singular value from the SVD's S matrix below it is set to 0. Thus it should be a scalar between 0 and 1.
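A hedged sketch of the call with the ALGLIB C++ edition (check your version's headers for the exact signature; the toy system below is illustrative):

#include "solvers.h"  // ALGLIB dense solvers
using namespace alglib;

int main()
{
    // Overdetermined 3x2 example system.
    real_2d_array A = "[[1,0],[0,1],[1,1]]";
    real_1d_array b = "[1,2,3]";
    ae_int_t info;
    densesolverlsreport rep;
    real_1d_array x;
    // Threshold 0 keeps all singular values; a small positive value clamps
    // near-zero ones, which is what handles rank-deficient problems.
    rmatrixsolvels(A, 3, 2, b, 0.0, info, rep, x);
    // On success info > 0 and x holds the least-squares solution.
    return 0;
}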
Hopefully, it will help someone else too.

CUBLAS - matrix addition.. how?

I am trying to use CUBLAS to sum two big matrices of unknown size. I need fully optimized code (if possible), so I chose not to rewrite the matrix-addition code (which is simple) but to use CUBLAS, in particular the cublasSgemm function, which allows summing A and C (if B is a unit matrix): C = alpha*op(A)*op(B) + beta*C.
The problem is: C and C++ store matrices in row-major format, while cublasSgemm is intended (for Fortran compatibility) to work in column-major format. You can specify whether A and B are to be transposed first, but you can NOT indicate to transpose C. So I'm unable to complete my matrix addition.
I can't transpose the C matrix myself, because it is something like 20000x20000 at maximum size.
Any idea on how to solve this, please?
cublas<t>geam (e.g. cublasSgeam) was added in CUBLAS 5.0.
It computes the weighted sum of two optionally transposed matrices.
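A minimal sketch, assuming the CUBLAS handle and device buffers are already set up:

#include <cublas_v2.h>

// C = alpha*A + beta*B for n x n matrices already on the device. No
// transposition is needed just for an addition, so the row-major vs
// column-major question disappears.
void addMatrices(cublasHandle_t handle, int n,
                 const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 1.0f;
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &alpha, dA, n, &beta, dB, n, dC, n);
}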
If you're just adding the matrices, it doesn't actually matter. You give it alpha, Aij, beta, and Cij. It thinks you're giving it alpha, Aji, beta, and Cji, and gives you what it thinks is Cji = beta Cji + alpha Aji. But that's the correct Cij as far as you're concerned. My worry is when you start going to things which do matter -- like matrix products. There, there's likely no working around it.
But more to the point, you don't want to be using GEMM to do matrix addition: you'd be doing a completely pointless matrix multiplication (which takes ~20,000^3 operations and many passes through memory) for an operation which should only require ~20,000^2 operations and a single pass! Treat the matrices as 20,000^2-long vectors and use saxpy.
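With CUBLAS, that one-pass addition could look like this (a sketch; C is accumulated in place):

#include <cublas_v2.h>

// C += A, treating both n x n device matrices as n*n-long vectors (single pass).
void addWithSaxpy(cublasHandle_t handle, int n, const float* dA, float* dC)
{
    const float alpha = 1.0f;
    cublasSaxpy(handle, n * n, &alpha, dA, 1, dC, 1);
}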
Matrix multiplication is memory-bandwidth intensive, so there is a huge (factors of 10x or 100x) difference in performance between coding it yourself and using a tuned version. Ideally, you'd change the structures in your code to match the library. If you can't, in this case you can manage just by using linear-algebra identities. The C-vs-Fortran ordering means that when you pass in A, CUBLAS "sees" A^T (A transpose). Which is fine, and we can work around it: if what you want is C = A.B, pass in the matrices in the opposite order, B.A. Then the library sees (B^T . A^T) and calculates C^T = (A.B)^T; and when it passes back C^T, you get (in your ordering) C. Test it and see.
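Concretely, the swap looks like this (a sketch for row-major inputs; the leading dimensions follow the row-major layout):

#include <cublas_v2.h>

// Row-major C = A*B via column-major CUBLAS: ask for B*A, which CUBLAS
// stores as C^T in column-major order -- exactly C in row-major order.
void gemmRowMajor(cublasHandle_t handle, int m, int n, int k,
                  const float* dA,  // m x k, row-major
                  const float* dB,  // k x n, row-major
                  float* dC)        // m x n, row-major
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k,
                &alpha, dB, n, dA, k, &beta, dC, n);
}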