I want to do a singular value decomposition for large matrices containing a lot of zeros. In particular I need U and S, obtained from the diagonalization of a symmetric matrix A. This means that A = U * S * transpose(U^*), where S is a diagonal matrix and U contains all eigenvectors as columns.
I searched the web for c++ librarys that combine SVD and sparse matrices, but could only find libraries that find a few, but not all eigenvectors. Does anyone know if there is such a library?
Also after obtaining U and S I need to multiply them to some dense vector.
For this problem, I am using a combination of different techniques:
Arpack can compute a set of eigenvalues and associated eigenvectors, unfortunately it is fast only for high frequencies and slow for low frequencies
but since the eigenvectors of the inverse of a matrix are the same as the eigenvectors of a matrix, one can factor the matrix (using a sparse matrix factorization routine, such as SuperLU, or Choldmod if the matrix is symmetric). The "communication protocol" with Arpack only expects you to compute a matrix-vector product, so if you do a linear system solve using the factored matrix instead, then this makes Arpack fast for the low frequencies of the spectrum (do not forget then to replace the eigenvalue lambda by 1/lambda !)
This trick can be used to explore the entire spectrum, with a generalized transform (the transform in the previous point is refered as "invert" transform). There is also a "shift-invert" transform that allows one to explore an arbitrary portion of the spectrum and have fast convergence of Arpack. Then you compute (1/lambda + sigma) instead of lambda, when sigma is a "shift" (the transform is slightly more complicated than the "invert" transform, see the references below for a full explanation).
ARPACK: http://www.caam.rice.edu/software/ARPACK/
SUPERLU: http://crd-legacy.lbl.gov/~xiaoye/SuperLU/
The numerical algorithm is explained in my article that can be downloaded here:
http://alice.loria.fr/index.php/publications.html?redirect=0&Paper=ManifoldHarmonics#2008
Sourcecode is available there:
https://gforge.inria.fr/frs/download.php/file/27277/manifold_harmonics-a4-src.tar.gz
See also my answer to this question:
https://scicomp.stackexchange.com/questions/20243/sparse-generalized-eigensolver-using-opencl/20255#20255
Related
I am trying to understand the different methods for dimensionality reduction in data analysis. In particular I am interested in Singular Value Decomposition (SVD) and Principle Component Analysis (PCA).
Can anyone please explain there terms to a layperson? - I understand the general premis of dimensionality reduction as bringing data to a lower dimension - But
a) how do SVD and PCA do this, and
b) how do they differ in their approach
OR maybe if you can explain what the results of each technique is telling me, so for
a) SVD - what are singular values
b) PCA - "proportion of variance"
Any example would be brilliant. I am not very good at maths!!
Thanks
You probably already figured this out, but I'll post a short description anyway.
First, let me describe the two techniques speaking generally.
PCA basically takes a dataset and figures out how to "transform" it (i.e. project it into a new space, usually of lower dimension). It essentially gives you a new representation of the same data. This new representation has some useful properties. For instance, each dimension of the new space is associated with the amount of variance it explains, i.e. you can essentially order the variables output by PCA by how important they are in terms of the original representation. Another property is the fact that linear correlation is removed from the PCA representation.
SVD is a way to factorize a matrix. Given a matrix M (e.g. for data, it could be an n by m matrix, for n datapoints, each of dimension m), you get U,S,V = SVD(M) where:M=USV^T, S is a diagonal matrix, and both U and V are orthogonal matrices (meaning the columns & rows are orthonormal; or equivalently UU^T=I & VV^T=I).
The entries of S are called the singular values of M. You can think of SVD as dimensionality reduction for matrices, since you can cut off the lower singular values (i.e. set them to zero), destroying the "lower parts" of the matrices upon multiplying them, and get an approximation to M. In other words, just keep the top k singular values (and the top k vectors in U and V), and you have a "dimensionally reduced" version (representation) of the matrix.
Mathematically, this gives you the best rank k approximation to M, essentially like a reduction to k dimensions. (see this answer for more).
So Question 1
I understand the general premis of dimensionality reduction as bringing data to a lower dimension - But
a) how do SVD and PCA do this, and b) how do they differ in their approach
The answer is that they are the same.
To see this, I suggest reading the following posts on the CV and math stack exchange sites:
What is the intuitive relationship between SVD and PCA?
Relationship between SVD and PCA. How to use SVD to perform PCA?
How to use SVD for dimensionality reduction to reduce the number of columns (features) of the data matrix?
How to use SVD for dimensionality reduction (in R)
Let me summarize the answer:
essentially, SVD can be used to compute PCA.
PCA is closely related to the eigenvectors and eigenvalues of the covariance matrix of the data. Essentially, by taking the data matrix, computing its SVD, and then squaring the singular values (and doing a little scaling), you end up getting the eigendecomposition of the covariance matrix of the data.
Question 2
maybe if you can explain what the results of each technique is telling me, so for a) SVD - what are singular values b) PCA - "proportion of variance"
These eigenvectors (the singular vectors of the SVD, or the principal components of the PCA) form the axes of the news space into which one transforms the data.
The eigenvalues (closely related to the squares of the data matrix SVD singular values) hold the variance explained by each component. Often, people want to retain say 95% of the variance of the original data, so if they originally had n-dimensional data, they reduce it to d-dimensional data that keeps that much of the original variance, by choosing the largest d-eigenvalues such that 95% of the variance is kept. This keeps as much information as possible, while retaining as few useless dimensions as possible.
In other words, these values (variance explained) essentially tell us the importance of each principal component (PC), in terms of their usefulness reconstructing the original (high-dimensional) data. Since each PC forms an axis in the new space (constructed via linear combinations of the old axes in the original space), it tells us the relative importance of each of the new dimensions.
For bonus, note that SVD can also be used to compute eigendecompositions, so it can also be used to compute PCA in a different way, namely by decomposing the covariance matrix directly. See this post for details.
According to your question,I only understood the topic of Principal component analysis.so that I share few below points about PCA i hope you definitely understand.
PCA:
1.PCA is a linear transformation dimensionality reduction technique.
2.It is used for operations such as noise filtering,feature extraction and data visualization.
3.The goal of PCA is to identify patterns and detecting the correlations between variables.
4.If there is a strong correlation,then we could reduce the dimensionality which PCA is intended for.
5.Eigenvector is to make linear transformation without changing the directions.
this is the sample url to understand the PCA:https://www.solver.com/xlminer/help/principal-components-analysis-example
I am using the Armadillo library to manually port a piece of Matlab code. The matlab code uses the eigs() function to find a small number (~3) of eigen vectors of a relative large(200x200) covariance matrix R. The code looks like this:
[E,D] = eigs(R,3,"lm");
In Armadillo there are two functions eigs_sym() and eigs_gen() however the former only support real symmetric matrix and the latter requires ARPACK (I'm building the code for Android). Is there a reason eigs_sym doesn't support complex matrices? Is there any other way to find the eigenvectors of a complex symmetric matrix?
The eigs_sym() and eigs_gen() functions (where the s in eigs stands for sparse) in Armadillo are for large sparse matrices. A "large" size in this context is roughly 5000x5000 or larger.
Your R matrix has a size of 200x200. This is very small by current standards. It would be much faster to simply use the dense eigendecomposition eig_sym() or eig_gen() functions to get all the eigenvalues / eigenvectors, followed by extracting a subset of them using submatrix operations like .tail_cols()
Have you tested constructing a 400x400 real symmetric matrix by replacing each complex value, a+bi, by a 2x2 matrix [a,b;-b,a] (alternatively using a block variant of this)?
This should construct a real symmetric matrix that in some way correspond to the complex one.
There will be a slow-down due to the larger size, and all eigenvalues will be duplicated (which may slow down the algorithm), but it seems straightforward to test.
I am applying Kernel PCA for a feature extraction task in a computer vision problem, which involves solving the eigen value problem for a very large symmetrix matrix, like 6400x6400 in size. I am using OpenCV in my implementation and I use the cv::eigen method for the purpose of EigenDecomposition. This method calculates all the eigenvalues and eigenvectors of the given matrix, which becomes easily intractable in case of very large and dense matrices, like in my case, since the problem has O(N^3) complexity as far as I know. But in fact, I only need a small subset of the eigenvectors, which correspond to n largest eigenvalues of the matrix, which is n < N. Is there any method available in OpenCV for this purpose, which only calculates some of the largest eigenvalues and their corresponding eigenvectors? I failed to locate such a method in OpenCV documentation. Any method from any other library is welcome, as well.
I'm writing a program with Armadillo C++ (4.400.1)
I have a matrix that has to be sparse and complex, and I want to calculate the inverse of such matrix. Since it is sparse it could be the pseudoinverse, but I can guarantee that the matrix has the full diagonal.
In the API documentation of Armadillo, it mentions the method .i() to calculate the inverse of any matrix, but sp_cx_mat members do not contain such method, and the inv() or pinv() functions cannot handle the sp_cx_mat type apparently.
sp_cx_mat Y;
/*Fill Y ensuring that the diagonal is full*/
sp_cx_mat Z = Y.i();
or
sp_cx_mat Z = inv(Y);
None of them work.
I would like to know how to compute the inverse of matrices of sp_cx_mat type.
Sparse matrix support in Armadillo is not complete and many of the factorizations/complex operations that are available for dense matrices are not available for sparse matrices. There are a number of reasons for this, the largest being that efficient complex operations such as factorizations for sparse matrices is still very much an open research field. So, there is no .i() function available for cx_sp_mat or other sp_mat types. Another reason for this is lack of time on the part of the sparse matrix developers (...which includes me).
Given that the inverse of a sparse matrix is generally going to be dense, then you may simply be better off turning your cx_sp_mat into a cx_mat and then using the same inversion techniques that you normally would for dense matrices. Since you are planning to represent this as a dense matrix anyway, then it's a fair assumption that you have enough RAM to do that.
What is the easiest and fastest way (with some library, of course) to compute k largest eigenvalues and eigenvectors for a large dense matrix in C++? I'm looking for an equivalent of MATLAB's eigs function; I've looked through Armadillo and Eigen but couldn't find one, and computing all eigenvalues takes forever in my case (I need top 10 eigenvectors for an approx. 30000x30000 dense non-symmetric real matrix).
Desperate, I've even tried to implement power iterations by myself with Armadillo's QR decomposition but ran into complex pairs of eigenvalues and gave up. :)
Did you tried https://github.com/yixuan/spectra ?
It similar to ARPACK but with nice Eigen-like interface (it compatible with Eigen!)
I used it for 30kx30k matrices (PCA) and it was quite ok
AFAIK the problem of finding the first k eigenvalues of a generic matrix has no easy solution. The Matlab function eigs you mentioned is supposed to work with sparse matrices.
Matlab probably uses Arnoldi/Lanczos, you might try if it works decently in your case even if your matrix is not sparse. The reference package for Arnlodi is ARPACK which has a C++ interface.
Here is how I get the k largest eigenvectors of a NxN real-valued (float), dense, symmetric matrix A in C++ Eigen:
#include <Eigen/Dense>
...
Eigen::MatrixXf A(N,N);
...
Eigen::SelfAdjointEigenSolver<Eigen::MatrixXf> solver(N);
solver.compute(A);
Eigen::VectorXf lambda = solver.eigenvalues().reverse();
Eigen::MatrixXf X = solver.eigenvectors().block(0,N-k,N,k).rowwise().reverse();
Note that the eigenvalues and associated eigenvectors are returned in ascending order so I reverse them to get the largest values first.
If you want eigenvalues and eigenvectors for other (non-symmetric) matrices they will, in general, be complex and you will need to use the Eigen::EigenSolver class instead.
Eigen has an EigenValues module that works pretty well.. But, I've never used it on anything quite that large.