Eigen efficient inverse of symmetric positive definite matrix - c++

In Eigen, if we have symmetric positive definite matrix A then we can calculate the inverse of A by
A.inverse();
or
A.llt().solve(I);
where I is an identity matrix of the same size as A. But is there a more efficient way to calculate the inverse of symmetric positive definite matrix?
For example if we write the Cholesky decomposition of A as A = LL^{T}, then L^{-T} L^{-1} is an inverse of A since A L^{-T} L^{-1} = LL^{T} L^{-T} L^{-1} = I (and where L^{-T} denotes the inverse of the transpose of L).
So we could obtain the Cholesky decomposition of A, calculate its inverse, and then obtain the cross-product of that inverse to find the inverse of A. But my instinct is that calculating these explicit steps will be slower than using A.llt().solve(I) as above.
And before anybody asks, I do indeed need an explicit inverse - it is a calculation for part of a Gibbs sampler.

With A.llt().solve(I), you assume A to be an SPD matrix and apply a Cholesky decomposition to solve the equation AX = I. The mathematical procedure for solving the equation is exactly the same as in your explicit way, so the performance should be the same if you carry out every step correctly.
On the other hand, with A.inverse() you are doing a general matrix inversion, which uses an LU decomposition for large matrices. Thus the performance should be lower than that of A.llt().solve(I);.
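For reference, here is a minimal sketch (my own illustration, assuming Eigen 3 and dynamic-size double matrices) of the three routes discussed above; the explicit route merely spells out the triangular solves that A.llt().solve(I) already performs internally:
#include <Eigen/Dense>
#include <iostream>

int main()
{
    const int n = 5;
    // Build a random SPD matrix: B*B^T plus a diagonal shift is safely positive definite.
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd A = B * B.transpose() + n * Eigen::MatrixXd::Identity(n, n);
    Eigen::MatrixXd I = Eigen::MatrixXd::Identity(n, n);

    // 1) General inverse (LU-based for dynamic sizes).
    Eigen::MatrixXd inv1 = A.inverse();

    // 2) Cholesky-based solve against the identity.
    Eigen::MatrixXd inv2 = A.llt().solve(I);

    // 3) Explicit route: form L, invert it by a triangular solve, then take the cross-product.
    Eigen::MatrixXd L    = A.llt().matrixL();
    Eigen::MatrixXd Linv = L.triangularView<Eigen::Lower>().solve(I);
    Eigen::MatrixXd inv3 = Linv.transpose() * Linv;   // L^{-T} L^{-1}

    std::cout << (inv1 - inv2).norm() << " " << (inv1 - inv3).norm() << std::endl;
    return 0;
}
If you only ever need products of the form A^{-1}b, keeping the LLT object around and calling solve(b) avoids forming the inverse at all; but as noted in the question, a Gibbs sampler may genuinely need the explicit inverse.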

You should profile the code for your specific problem to get the best answer. I benchmarked both approaches while evaluating their viability, using the googletest library and this repo:
#include <gtest/gtest.h>
#define private public
#define protected public
#include <kalman/Matrix.hpp>
#include <Eigen/Cholesky>
#include <chrono>
#include <iostream>
using namespace Kalman;
using namespace std::chrono;
typedef float T;
typedef high_resolution_clock Clock;
TEST(Cholesky, inverseTiming) {
Matrix<T, Dynamic, Dynamic> L;
Matrix<T, Dynamic, Dynamic> S;
Matrix<T, Dynamic, Dynamic> Sinv_method1;
Matrix<T, Dynamic, Dynamic> Sinv_method2;
int Nmin = 2;
int Nmax = 128;
int N(Nmin);
while (N <= Nmax) {
L.resize(N, N);
L.setRandom();
S.resize(N, N);
// create a random NxN SPD matrix
S = L*L.transpose();
std::cout << "\n";
std::cout << "+++++++++++++++++++++++++ N = " << N << " +++++++++++++++++++++++++++++++++++++++" << std::endl;
auto t1 = Clock::now();
Sinv_method1.resize(N, N);
Sinv_method1 = S.inverse();
auto dt1 = Clock::now() - t1;
std::cout << "Method 1 took " << duration_cast<microseconds>(dt1).count() << " usec" << std::endl;
auto t2 = Clock::now();
Sinv_method2.resize(N, N);
Sinv_method2 = S.llt().solve(Matrix<T, Dynamic, Dynamic>::Identity(N, N));
auto dt2 = Clock::now() - t2;
std::cout << "Method 2 took " << duration_cast<microseconds>(dt2).count() << " usec" << std::endl;
for(int i = 0; i < N; i++)
{
for(int j = 0; j < N; j++)
{
EXPECT_NEAR( Sinv_method1(i, j), Sinv_method2(i, j), 1e-3 );
}
}
N *= 2;
std::cout << "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" << std::endl;
std::cout << "\n";
}
}
What the above example showed me was that, for my problem size, the speedup from method 2 was negligible, whereas its loss of accuracy (taking the .inverse() call as the reference) was noticeable.

Related

Matrix multiplication using three different methods gives different results, depending on amount of values

I would like to multiply two matrices A and B, and wanted to compare three different methods. The first simply iterates over the columns of B and multiplies them by the matrix A, the second uses the function each_col() from armadillo and applies a lambda, and the third is simply the multiplication A * B. The resulting code is shown below:
#include <complex>
#include <iostream>
#include <chrono>
#include <armadillo>
constexpr int num_values = 2048;
constexpr int num_rows = 128;
constexpr int num_cols = num_values / num_rows;
constexpr int bench_rounds = 100;
void test_multiply_loop(const arma::mat &in_mat,
const arma::mat &init_mat,
arma::mat &out_mat) {
for(size_t i = 0; i < in_mat.n_cols; ++i) {
out_mat.col(i) = init_mat * in_mat.col(i);
}
}
void test_multiply_matrix(const arma::mat &in_mat,
const arma::mat &init_mat,
arma::mat &out_mat) {
out_mat = init_mat * in_mat;
}
void test_multiply_lambda(const arma::mat &in_mat,
const arma::mat &init_mat,
arma::mat &out_mat) {
out_mat = in_mat;
out_mat.each_col([init_mat](arma::colvec &a) {
a = init_mat * a;
});
}
int main()
{
std::cout << "Hello World" << "\n";
//Create matrix
arma::colvec test_vec = arma::linspace(1, num_values, num_values);
arma::mat init_mat = arma::reshape(test_vec, num_rows, num_cols);
arma::mat out_mat_loop = arma::zeros(num_rows, num_cols),
out_mat_lambda = arma::zeros(num_rows, num_cols),
out_mat_matrix = arma::zeros(num_rows, num_cols);
arma::mat test_mat = arma::eye(num_rows, num_rows);
for(size_t i = 0; i < num_rows; ++i)
for(size_t j = 0; j < num_rows; ++j)
test_mat(i, j) *= (i + 1);
auto t1 = std::chrono::high_resolution_clock::now();
for(size_t i = 0; i < bench_rounds; ++i)
test_multiply_loop(init_mat, test_mat, out_mat_loop);
auto t2 = std::chrono::high_resolution_clock::now();
auto t3 = std::chrono::high_resolution_clock::now();
for(size_t i = 0; i < bench_rounds; ++i)
test_multiply_lambda(init_mat, test_mat, out_mat_lambda);
auto t4 = std::chrono::high_resolution_clock::now();
auto t5 = std::chrono::high_resolution_clock::now();
for(size_t i = 0; i < bench_rounds; ++i)
test_multiply_matrix(init_mat, test_mat, out_mat_matrix);
auto t6 = std::chrono::high_resolution_clock::now();
std::cout << "Multiplication by loop:\t\t" << std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count() << '\n';
std::cout << "Multiplication by lambda:\t" << std::chrono::duration_cast<std::chrono::microseconds>( t4 - t3 ).count() << '\n';
std::cout << "Multiplication by internal:\t" << std::chrono::duration_cast<std::chrono::microseconds>( t6 - t5 ).count() << '\n';
std::cout << "Loop and matrix are equal:\t" << arma::approx_equal(out_mat_loop, out_mat_matrix, "reldiff", 0.1) << '\n';
std::cout << "Loop and lambda are equal:\t" << arma::approx_equal(out_mat_loop, out_mat_lambda, "reldiff", 0.1) << '\n';
std::cout << "Matrix and lambda are equal:\t" << arma::approx_equal(out_mat_matrix, out_mat_lambda, "reldiff", 0.1) << '\n';
return 0;
}
Now, for num_rows = 128 my output is
Multiplication by loop: 124525
Multiplication by lambda: 46690
Multiplication by internal: 1270
Loop and matrix are equal: 0
Loop and lambda are equal: 0
Matrix and lambda are equal: 0
but for num_rows = 64 my output is
Multiplication by loop: 32305
Multiplication by lambda: 6517
Multiplication by internal: 56344
Loop and matrix are equal: 1
Loop and lambda are equal: 1
Matrix and lambda are equal: 1
Why is the output so different when increasing the number of columns? And why does the timing of the functions change so much?
The three functions are indeed doing the same thing and the result should be the same, except for precision differences which should not matter since you compare the results with arma::approx_equal.
On my machine the output was correct for both sizes you mention, and for other higher values that I have tried. I could not reproduce the problem.
For reference, I tried with armadillo 9.870.2, linked with openblas and lapack.
How did you install armadillo?
Armadillo uses blas and lapack for much of its functionality. For matrix multiplication it uses some blas implementation. There are several implementations of blas, such as openblas, mkl, and even cublas (for running on the gpu), etc.
Armadillo can work without a blas implementation, in which case it uses its own (slower) implementation for matrix multiplication. I haven't tried using its own implementation without linking with blas.
Another point that might be related is that, depending on the blas implementation, the matrix multiplication might use multiple threads, but usually only for large matrices, since using multiple threads for small matrices would hurt performance. That is, the code path used to perform the multiplication could differ depending on the matrix size (but of course it would be a bug if the two code paths did not produce the same answer).
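If you want to rule the threading point in or out, one quick experiment (a sketch of my own, assuming Armadillo is linked against OpenBLAS, which exports openblas_set_num_threads; setting the OPENBLAS_NUM_THREADS environment variable before running achieves the same thing) is to pin BLAS to a single thread and re-run the comparison:
#include <armadillo>
#include <iostream>

// Assumed to be provided by OpenBLAS when Armadillo is linked against it.
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    openblas_set_num_threads(1);   // force the single-threaded BLAS code path

    arma::mat A = arma::randu<arma::mat>(128, 128);
    arma::mat B = arma::randu<arma::mat>(128, 16);

    arma::mat C1 = A * B;                        // internal (BLAS) multiplication
    arma::mat C2(128, 16);
    for (arma::uword i = 0; i < B.n_cols; ++i)   // column-by-column multiplication
        C2.col(i) = A * B.col(i);

    std::cout << "equal: " << arma::approx_equal(C1, C2, "reldiff", 1e-10) << std::endl;
    return 0;
}
If the discrepancy disappears with a single thread, that points at the multithreaded code path of the BLAS implementation in use.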

A better way to make matrix - log operations in Eigen?

I am playing around with Eigen, doing some calculations with matrices and logs/exp, but I find the expressions I end up with a bit clumsy (and also possibly slower?). Is there a better way to write calculations like this?
MatrixXd m = MatrixXd::Random(3,3);
m = m * (m.array().log()).matrix();
That is, without having to convert to arrays and then back to a matrix?
If you are mixing array and matrix operations you can't really avoid the conversions, except for the functions that have a cwise variant working directly on matrices (e.g., cwiseSqrt(), cwiseAbs()).
However, neither .array() nor .matrix() will have an impact on runtime when compiled with optimization (on any reasonable compiler).
If you consider that more readable, you can work with unaryExpr().
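For illustration, here is a small sketch (my own, not tied to the question's data) of the different spellings side by side: a cwise function, unaryExpr with a lambda, and the explicit .array()/.matrix() round trip:
#include <Eigen/Dense>
#include <cmath>

int main()
{
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(3, 3).cwiseAbs();

    // Element-wise operations that exist directly on matrices:
    Eigen::MatrixXd s = m.cwiseSqrt();           // same as m.array().sqrt().matrix()

    // Element-wise log via unaryExpr, still yielding a matrix expression:
    Eigen::MatrixXd l = m.unaryExpr([](double x) { return std::log(x); });

    // Mixing worlds: matrix product of m with its element-wise log.
    Eigen::MatrixXd r = m * (m.array().log()).matrix();

    return 0;
}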
I agree fully with chtz's answer, and reiterate that there is no runtime cost to the "casts." You can confirm using the following toy program:
#include "Eigen/Core"
#include <iostream>
#include <chrono>
using namespace Eigen;
int main()
{
typedef MatrixXd matType;
//typedef MatrixXf matType;
volatile int vN = 1024 * 4;
int N = vN;
auto startAlloc = std::chrono::system_clock::now();
matType m = matType::Random(N, N).array().abs();
matType r1 = matType::Zero(N, N);
matType r2 = matType::Zero(N, N);
auto finishAlloc = std::chrono::system_clock::now();
r1 = m * (m.array().log()).matrix();
auto finishLog = std::chrono::system_clock::now();
r2 = m * m.unaryExpr<double(*)(double)>(&std::log);
auto finishUnary = std::chrono::system_clock::now();
std::cout << (r1 - r2).array().abs().maxCoeff() << '\n';
std::cout << "Allocation\t" << std::chrono::duration<double>(finishAlloc - startAlloc).count() << '\n';
std::cout << "Log\t\t" << std::chrono::duration<double>(finishLog - finishAlloc).count() << '\n';
std::cout << "unaryExpr\t" << std::chrono::duration<double>(finishUnary - finishLog).count() << '\n';
return 0;
}
On my computer there is a slight advantage (~4%) to the first form, which probably has to do with the way the memory is loaded (unchecked). Beyond that, the reason for "casting" the type is to remove any ambiguities. For a clear example, consider operator *. In the matrix form it should be read as matrix multiplication, whereas in the array form it should be read as coefficient-wise multiplication. The ambiguity in the case of exp and log is with the matrix exponential and matrix logarithm, respectively. Presumably you want the element-wise exp and log, and therefore the cast is necessary.
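A tiny sketch of that operator * ambiguity (my own example):
#include <Eigen/Dense>

int main()
{
    Eigen::Matrix2d a, b;
    a << 1, 2, 3, 4;
    b << 5, 6, 7, 8;

    Eigen::Matrix2d mat_prod    = a * b;                              // matrix multiplication
    Eigen::Matrix2d cwise_prod  = (a.array() * b.array()).matrix();   // coefficient-wise product
    Eigen::Matrix2d cwise_prod2 = a.cwiseProduct(b);                  // same thing, without the casts

    return 0;
}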

Eigen: Slow access to columns of Matrix 4

I am using Eigen for operations similar to a Cholesky update, implying a lot of AXPY operations (adding a scalar multiple of one vector to another) on the columns of a fixed-size matrix, typically a Matrix4d. In brief, it is 3 times more expensive to access the columns of a Matrix 4 than those of a Vector 4.
Typically, the code below:
for(int i=0;i<4;++i ) L.col(0) += x*y[i];
is 3 times less efficient than the code below:
for(int i=0;i<4;++i ) l4 += x*y[i];
where L is typically a matrix of size 4, x, y and l4 are vectors of size 4.
Moreover, the time spent in the first line of code does not depend on the matrix storage organization (either RowMajor or ColMajor).
On an Intel i7 (2.5GHz), it takes about 0.007us for the vector operation and 0.02us for the matrix operation (timings are obtained by repeating the same operation 100000 times). My application would need thousands of such operations, with total timings hopefully far below a millisecond.
Question: am I doing something improperly when accessing columns of my 4x4 matrix? Is there something I can do to make the first line of code more efficient?
Full code used for timings is below:
#include <iostream>
#include <Eigen/Core>
#include <vector>
#include <sys/time.h>
typedef Eigen::Matrix<double,4,1,Eigen::ColMajor> Vector4;
//typedef Eigen::Matrix<double,4,4,Eigen::RowMajor,4,4> Matrix4;
typedef Eigen::Matrix<double,4,4,Eigen::ColMajor,4,4> Matrix4;
inline double operator- ( const struct timeval & t1,const struct timeval & t0)
{
/* TODO: double check the double conversion from long (on 64x). */
return double(t1.tv_sec - t0.tv_sec)+1e-6*double(t1.tv_usec - t0.tv_usec);
}
void sumCols( Matrix4 & L,
Vector4 & x4,
Vector4 & y)
{
for(int i=0;i<4;++i )
{
L.col(0) += x4*y[i];
}
}
void sumVec( Vector4 & L,
Vector4 & x4,
Vector4 & y)
{
for(int i=0;i<4;++i )
{
//L.tail(4-i) += x4.tail(4-i)*y[i];
L += x4 *y[i];
}
}
int main()
{
using namespace Eigen;
const int NBT = 1000000;
struct timeval t0,t1;
std::vector< Vector4> x4s(NBT);
std::vector< Vector4> y4s(NBT);
std::vector< Vector4> z4s(NBT);
std::vector< Matrix4> L4s(NBT);
for(int i=0;i<NBT;++i)
{
x4s[i] = Vector4::Random();
y4s[i] = Vector4::Random();
L4s[i] = Matrix4::Random();
}
int sample = int(z4s[55][2]/10*NBT);
std::cout << "*** SAMPLE = " << sample << std::endl;
gettimeofday(&t0,NULL);
for(int i=0;i<NBT;++i)
{
sumCols(L4s[i], x4s[i], y4s[i]);
}
gettimeofday(&t1,NULL);
std::cout << (t1-t0) << std::endl;
std::cout << "\t\t\t\t\t\t\tForce check" << L4s[sample](1,0) << std::endl;
gettimeofday(&t0,NULL);
for(int i=0;i<NBT;++i)
{
sumVec(z4s[i], x4s[i], y4s[i]);
}
gettimeofday(&t1,NULL);
std::cout << (t1-t0) << std::endl;
std::cout << "\t\t\t\t\t\t\tForce check" << z4s[sample][2] << std::endl;
return -1;
}
As I said in a comment, the generated assembly is exactly the same for both functions.
The problem is that your benchmark is biased in the sense that L4s is 4 times bigger than z4s, and you thus get more cache misses in the matrix case than in the vector case.
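One way to test that explanation (a sketch of my own, reusing Vector4, sumVec, and the timing helpers from the listing above) is to give the vector loop the same memory footprint as the matrix loop, i.e. sweep over four times as many Vector4 objects, and check whether the timings then converge:
// A Matrix4 is 128 bytes and a Vector4 is 32 bytes, so 4*NBT vectors
// occupy the same amount of memory as NBT matrices.
std::vector<Vector4> z4s_big(4 * NBT);
gettimeofday(&t0, NULL);
for (int i = 0; i < 4 * NBT; ++i)
{
    sumVec(z4s_big[i], x4s[i % NBT], y4s[i % NBT]);
}
gettimeofday(&t1, NULL);
std::cout << "vector loop, same footprint: " << (t1 - t0) << std::endl;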

Eigen3 matrix multiplication performance

Note: I've posted this also on Eigen forum here
I want to premultiply 3xN matrices by a 3x3 matrix, i.e., to transform 3D points, like
p_dest = T * p_source
after initializing the matrices:
Eigen::Matrix<double, 3, Eigen::Dynamic> points = Eigen::Matrix<double, 3, Eigen::Dynamic>::Random(3, NUMCOLS);
Eigen::Matrix<double, 3, Eigen::Dynamic> dest = Eigen::Matrix<double, 3, Eigen::Dynamic>(3, NUMCOLS);
int NT = 100;
I have evaluated these two versions
// eigen direct multiplication
for (int i = 0; i < NT; i++){
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
dest.noalias() = T * points;
}
and
// col multiplication
for (int i = 0; i < NT; i++){
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
for (int c = 0; c < points.cols(); c++){
dest.col(c) = T * points.col(c);
}
}
The NT repetitions are done just to compute an average time.
I am surprised that the column-by-column multiplication is about 4-5 times faster than the direct multiplication
(and the direct multiplication is even slower if I do not use .noalias(), but this is fine since it is then doing a temporary copy).
I've tried to change NUMCOLS from 0 to 1000000 and the relation is linear.
I'm using Visual Studio 2013 and compiling in release
The next figure shows, on X, the number of columns of the matrix and, on Y, the average time for a single operation; in blue the col-by-col multiplication, in red the matrix multiplication.
Any suggestion why this happens?
Short answer
You're timing the lazy (and therefore lack of) evaluation in the col multiplication version, vs. the lazy (but evaluated) evaluation in the direct version.
Long answer
Instead of code snippets, let's look at a full MCVE. First, "your" version:
#include <Eigen/Core>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>
using Eigen::Matrix3Xd;
void ColMult(Matrix3Xd& dest, Matrix3Xd& points)
{
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
for (int c = 0; c < points.cols(); c++){
dest.col(c) = T * points.col(c);
}
}
void EigenDirect(Matrix3Xd& dest, Matrix3Xd& points)
{
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
dest.noalias() = T * points;
}
int main(int argc, char *argv[])
{
srand(time(NULL));
int NUMCOLS = 100000 + rand();
Matrix3Xd points = Matrix3Xd::Random(3, NUMCOLS);
Matrix3Xd dest = Matrix3Xd(3, NUMCOLS);
Matrix3Xd dest2 = Matrix3Xd(3, NUMCOLS);
int NT = 200;
// eigen direct multiplication
auto beg1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < NT; i++)
{
EigenDirect(dest, points);
}
auto end1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds = end1-beg1;
// col multiplication
auto beg2 = std::chrono::high_resolution_clock::now();
for(int i = 0; i < NT; i++)
{
ColMult(dest2, points);
}
auto end2 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds2 = end2-beg2;
std::cout << "Direct time: " << elapsed_seconds.count() << "\n";
std::cout << "Col time: " << elapsed_seconds2.count() << "\n";
std::cout << "Eigen speedup: " << elapsed_seconds2.count() / elapsed_seconds.count() << "\n\n";
return 0;
}
With this code (and SSE turned on), I get:
Direct time: 0.449301
Col time: 0.10107
Eigen speedup: 0.224949
The same 4-5x slowdown you complained of. Why?!?! Before we get to the answer, let's modify the code a bit so that the dest matrix is sent to an ostream. Add std::ostream outPut(0); to the beginning of main(), and before stopping the timers add outPut << dest << "\n\n"; and outPut << dest2 << "\n\n";. The std::ostream outPut(0); doesn't output anything (I'm pretty sure the badbit is set), but it does cause Eigen's operator<< to be called, which forces the evaluation of the matrices.
NOTE: if we used outPut << dest(1,1) then dest would be evaluated only enough to output the single element in the col multiplication method.
We then get
Direct time: 0.447298
Col time: 0.681456
Eigen speedup: 1.52349
as a result, as expected. Note that the Eigen direct method took the exact(ish) same time (meaning the evaluation took place even without the added ostream), whereas the col method all of a sudden took much longer.
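For completeness, this is roughly what that modification looks like inside main() (a fragment rather than a standalone program; the null ostream swallows the output but still forces Eigen to evaluate the expressions before the timers stop):
std::ostream outPut(0);   // null streambuf: nothing is printed, badbit is set

// ... direct-multiplication timing loop ...
outPut << dest << "\n\n";     // force evaluation before stopping the first timer
auto end1 = std::chrono::high_resolution_clock::now();

// ... column-by-column timing loop ...
outPut << dest2 << "\n\n";    // force evaluation before stopping the second timer
auto end2 = std::chrono::high_resolution_clock::now();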

Why does MATLAB/Octave wipe the floor with C++ in Eigenvalue Problems?

I'm hoping that the answer to the question in the title is that I'm doing something stupid!
Here is the problem. I want to compute all the eigenvalues and eigenvectors of a real, symmetric matrix. I have implemented code in MATLAB (actually, I run it using Octave), and C++, using the GNU Scientific Library. I am providing my full code below for both implementations.
As far as I can understand, GSL comes with its own implementation of the BLAS API (hereafter I refer to this as GSLCBLAS), and to use this library I compile using:
g++ -O3 -lgsl -lgslcblas
GSL suggests here to use an alternative BLAS library, such as the self-optimizing ATLAS library, for improved performance. I am running Ubuntu 12.04, and have installed the ATLAS packages from the Ubuntu repository. In this case, I compile using:
g++ -O3 -lgsl -lcblas -latlas -lm
For all three cases, I have performed experiments with randomly-generated matrices of sizes 100 to 1000 in steps of 100. For each size, I perform 10 eigendecompositions with different matrices, and average the time taken. The results are these:
The difference in performance is ridiculous. For a matrix of size 1000, Octave performs the decomposition in under a second; GSLCBLAS and ATLAS take around 25 seconds.
I suspect that I may be using the ATLAS library incorrectly. Any explanations are welcome; thanks in advance.
Some notes on the code:
In the C++ implementation, there is no need to make the matrix symmetric, because the function only uses the lower triangular part of it.
In Octave, the line triu(A) + triu(A, 1)' enforces the matrix to be symmetric.
If you wish to compile the C++ code on your own Linux machine, you also need to add the flag -lrt, because of the clock_gettime function.
Unfortunately I don't think clock_gettime exists on other platforms. Consider changing it to gettimeofday, or to the portable std::chrono alternative sketched right after these notes.
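For what it's worth, a portable replacement for the clock_gettime block in the C++ code below (my own sketch; std::chrono::steady_clock is available from C++11 onwards on all major platforms):
#include <chrono>   // instead of <time.h> and -lrt

// ... inside the k-loop, replacing the clock_gettime calls:
auto start = std::chrono::steady_clock::now();
gsl_eigen_symmv(A, Eigenvalues, Eigenvectors, EigendecompositionWorkspace);
auto end = std::chrono::steady_clock::now();
double TimeElapsed = std::chrono::duration<double>(end - start).count();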
Octave Code
K = 10;
fileID = fopen('octave_out.txt','w');
for N = 100:100:1000
AverageTime = 0.0;
for k = 1:K
A = randn(N, N);
A = triu(A) + triu(A, 1)';
tic;
eig(A);
AverageTime = AverageTime + toc/K;
end
disp([num2str(N), " ", num2str(AverageTime), "\n"]);
fprintf(fileID, '%d %f\n', N, AverageTime);
end
fclose(fileID);
C++ Code
#include <iostream>
#include <fstream>
#include <time.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_eigen.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
int main()
{
const int K = 10;
gsl_rng * RandomNumberGenerator = gsl_rng_alloc(gsl_rng_default);
gsl_rng_set(RandomNumberGenerator, 0);
std::ofstream OutputFile("atlas.txt", std::ios::trunc);
for (int N = 100; N <= 1000; N += 100)
{
gsl_matrix* A = gsl_matrix_alloc(N, N);
gsl_eigen_symmv_workspace* EigendecompositionWorkspace = gsl_eigen_symmv_alloc(N);
gsl_vector* Eigenvalues = gsl_vector_alloc(N);
gsl_matrix* Eigenvectors = gsl_matrix_alloc(N, N);
double AverageTime = 0.0;
for (int k = 0; k < K; k++)
{
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
gsl_matrix_set(A, i, j, gsl_ran_gaussian(RandomNumberGenerator, 1.0));
}
}
timespec start, end;
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
gsl_eigen_symmv(A, Eigenvalues, Eigenvectors, EigendecompositionWorkspace);
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
double TimeElapsed = (double) ((1e9*end.tv_sec + end.tv_nsec) - (1e9*start.tv_sec + start.tv_nsec))/1.0e9;
AverageTime += TimeElapsed/K;
std::cout << "N = " << N << ", k = " << k << ", Time = " << TimeElapsed << std::endl;
}
OutputFile << N << " " << AverageTime << std::endl;
gsl_matrix_free(A);
gsl_eigen_symmv_free(EigendecompositionWorkspace);
gsl_vector_free(Eigenvalues);
gsl_matrix_free(Eigenvectors);
}
return 0;
}
I disagree with the previous post. This is not a threading issue; this is an algorithm issue. The reason matlab, R, and octave wipe the floor with C++ libraries is that the libraries they call use more complex, better algorithms. If you read the octave page you can find out what they do[1]:
Eigenvalues are computed in a several step process which begins with a Hessenberg decomposition, followed by a Schur decomposition, from which the eigenvalues are apparent. The eigenvectors, when desired, are computed by further manipulations of the Schur decomposition.
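In symbols, for a real symmetric A that chain of orthogonal factorizations reads (my own summary of the quoted description; Q and Z orthogonal, H the Hessenberg form, which is tridiagonal in the symmetric case, and T the Schur form, which is then diagonal):

A = Q H Q^{T}, \qquad H = Z T Z^{T} \quad\Longrightarrow\quad A = (QZ)\, T \,(QZ)^{T},

so the eigenvalues can be read off the diagonal of T, and the columns of QZ are the corresponding eigenvectors.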
Solving eigenvalue/eigenvector problems is non-trivial. In fact, it's one of the few things "Numerical Recipes in C" recommends you don't implement yourself (p. 461). GSL is often slow, which was my initial response. ALGLIB is also slow for its standard implementation (I'm getting about 12 seconds!):
#include <iostream>
#include <iomanip>
#include <ctime>
#include <linalg.h>
using std::cout;
using std::setw;
using std::endl;
const int VERBOSE = false;
int main(int argc, char** argv)
{
int size = 0;
if(argc != 2) {
cout << "Please provide a size of input" << endl;
return -1;
} else {
size = atoi(argv[1]);
cout << "Array Size: " << size << endl;
}
alglib::real_2d_array mat;
alglib::hqrndstate state;
alglib::hqrndrandomize(state);
mat.setlength(size, size);
for(int rr = 0 ; rr < mat.rows(); rr++) {
for(int cc = 0 ; cc < mat.cols(); cc++) {
mat[rr][cc] = mat[cc][rr] = alglib::hqrndnormal(state);
}
}
if(VERBOSE) {
cout << "Matrix: " << endl;
for(int rr = 0 ; rr < mat.rows(); rr++) {
for(int cc = 0 ; cc < mat.cols(); cc++) {
cout << setw(10) << mat[rr][cc];
}
cout << endl;
}
cout << endl;
}
alglib::real_1d_array d;
alglib::real_2d_array z;
auto t = clock();
alglib::smatrixevd(mat, mat.rows(), 1, 0, d, z);
t = clock() - t;
cout << (double)t/CLOCKS_PER_SEC << "s" << endl;
if(VERBOSE) {
for(int cc = 0 ; cc < mat.cols(); cc++) {
cout << "lambda: " << d[cc] << endl;
cout << "V: ";
for(int rr = 0 ; rr < mat.rows(); rr++) {
cout << setw(10) << z[rr][cc];
}
cout << endl;
}
}
}
If you really need a fast library, you'll probably need to do some real hunting.
[1] http://www.gnu.org/software/octave/doc/interpreter/Basic-Matrix-Functions.html
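One candidate worth including in that hunt is Eigen's SelfAdjointEigenSolver for the dense symmetric case. A minimal sketch (compiler flags such as -O3 and vectorization matter a lot here; the timing is only indicative):
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main()
{
    const int N = 1000;
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(N, N);
    Eigen::MatrixXd A = B + B.transpose();   // make it symmetric

    auto start = std::chrono::steady_clock::now();
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> solver(A);   // eigenvalues and eigenvectors
    auto end = std::chrono::steady_clock::now();

    std::cout << "N = " << N
              << ", time = " << std::chrono::duration<double>(end - start).count() << " s"
              << ", lambda_max = " << solver.eigenvalues().maxCoeff() << std::endl;
    return 0;
}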
I have also encountered this problem. The real cause is that eig() in matlab doesn't calculate the eigenvectors, but the C version of the code above does. The difference in time spent can be larger than one order of magnitude, as shown in the figure below. So the comparison is not fair.
In Matlab, the actual function called differs depending on the requested return values. To force the calculation of the eigenvectors, [V,D] = eig(A) should be used (see the code below).
The actual time to solve an eigenvalue problem depends heavily on the matrix properties and the desired results, such as:
Real or complex
Hermitian/Symmetric or not
Dense or sparse
Eigenvalues only, Eigenvectors, Maximum eigenvalue only, etc
Serial or parallel
There are algorithms optimized for each of the above cases. In gsl, these algorithms are picked manually, so a wrong selection will decrease performance significantly. Some C++ wrapper classes, and languages such as matlab and mathematica, choose the optimized version automatically.
Also, Matlab and Mathematica use parallelization, which further broadens the gap you see by a few times, depending on the machine. It is reasonable to say that, without parallelization, calculating the eigenvalues of a general complex 1000x1000 matrix takes about a second, and about ten seconds when the eigenvectors are also computed.
Fig. Comparison of Matlab and C. The "+ vec" means the code included the calculation of the eigenvectors. The CPU% is a very rough observation of CPU usage at N=1000; it is upper bounded by 800%, though the programs are supposed to fully use all 8 cores. The gap between Matlab and C is smaller than 8 times.
Fig. Comparison of different matrix types in Mathematica. Algorithms are automatically picked by the program.
Matlab (WITH the calculation of eigenvectors)
K = 10;
fileID = fopen('octave_out.txt','w');
for N = 100:100:1000
AverageTime = 0.0;
for k = 1:K
A = randn(N, N);
A = triu(A) + triu(A, 1)';
tic;
[V,D] = eig(A);
AverageTime = AverageTime + toc/K;
end
disp([num2str(N), ' ', num2str(AverageTime), '\n']);
fprintf(fileID, '%d %f\n', N, AverageTime);
end
fclose(fileID);
C++ (WITHOUT the calculation of eigenvectors)
#include <iostream>
#include <fstream>
#include <time.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_eigen.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
int main()
{
const int K = 10;
gsl_rng * RandomNumberGenerator = gsl_rng_alloc(gsl_rng_default);
gsl_rng_set(RandomNumberGenerator, 0);
std::ofstream OutputFile("atlas.txt", std::ios::trunc);
for (int N = 100; N <= 1000; N += 100)
{
gsl_matrix* A = gsl_matrix_alloc(N, N);
gsl_eigen_symm_workspace* EigendecompositionWorkspace = gsl_eigen_symm_alloc(N);
gsl_vector* Eigenvalues = gsl_vector_alloc(N);
double AverageTime = 0.0;
for (int k = 0; k < K; k++)
{
for (int i = 0; i < N; i++)
{
for (int j = i; j < N; j++)
{
double rn = gsl_ran_gaussian(RandomNumberGenerator, 1.0);
gsl_matrix_set(A, i, j, rn);
gsl_matrix_set(A, j, i, rn);
}
}
timespec start, end;
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
gsl_eigen_symm(A, Eigenvalues, EigendecompositionWorkspace);
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
double TimeElapsed = (double) ((1e9*end.tv_sec + end.tv_nsec) - (1e9*start.tv_sec + start.tv_nsec))/1.0e9;
AverageTime += TimeElapsed/K;
std::cout << "N = " << N << ", k = " << k << ", Time = " << TimeElapsed << std::endl;
}
OutputFile << N << " " << AverageTime << std::endl;
gsl_matrix_free(A);
gsl_eigen_symm_free(EigendecompositionWorkspace);
gsl_vector_free(Eigenvalues);
}
return 0;
}
Mathematica
(* Symmetric real matrix + eigenvectors *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
AbsoluteTiming[Eigensystem[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Symmetric real matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Asymmetric real matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Hermitian matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[] + I Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Random complex matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[] + I Random[], {i, NN}, {j, NN}];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
In the C++ implementation, there is no need to make the matrix symmetric, because the function only uses the lower triangular part of it.
This may not be the case. In the reference, it is stated that:
int gsl_eigen_symmv(gsl_matrix *A, gsl_vector *eval, gsl_matrix *evec, gsl_eigen_symmv_workspace *w)
This function computes the eigenvalues and eigenvectors of the real symmetric matrix A. Additional workspace of the appropriate size must be provided in w. The diagonal and lower triangular part of A are destroyed during the computation, but the strict upper triangular part is not referenced. The eigenvalues are stored in the vector eval and are unordered. The corresponding eigenvectors are stored in the columns of the matrix evec. For example, the eigenvector in the first column corresponds to the first eigenvalue. The eigenvectors are guaranteed to be mutually orthogonal and normalised to unit magnitude.
It seems that you also need to apply a similar symmetrization operation in C++ in order to get correct results, although doing so will not change the performance.
On the MATLAB side, eigenvalue decomposition may be faster due to its multi-threaded execution, as stated in this reference:
Built-in Multithreading
Linear algebra and numerical functions such as fft, \ (mldivide), eig, svd, and sort are multithreaded in MATLAB. Multithreaded computations have been on by default in MATLAB since Release 2008a. These functions automatically execute on multiple computational threads in a single MATLAB session, allowing them to execute faster on multicore-enabled machines. Additionally, many functions in Image Processing Toolbox™ are multithreaded.
In order to test MATLAB's single-core performance, you can disable multithreading via
File > Preferences > General > Multithreading
in R2007a or newer, as stated here.