I'm working with large sparse matrices that are not exactly very sparse, and I keep wondering how much sparsity is required for storing a matrix as sparse to be beneficial. We know that a sparse representation of a reasonably dense matrix can be larger than the original one. So is there a threshold for the density of a matrix below which it is better to store it as sparse? I know that the answer usually depends on the structure of the sparsity, etc., but are there any general guidelines? For example, I have a very large matrix with density around 42%; should I store this matrix as dense or sparse?
The scipy.sparse.coo_matrix format stores the matrix as 3 np.arrays. row and col are integer indices, while data has the same data type as the equivalent dense matrix. So it should be straightforward to calculate the memory it will take as a function of overall shape and sparsity (as well as the data type).
csr_matrix may be more compact. data and indices are the same as with coo, but indptr has one entry per row, plus one. I was thinking that indptr would be shorter than the others, but I just constructed a small matrix where it was longer. An empty row, for example, still requires a value in indptr, but none in data or indices. The emphasis with this format is computational efficiency.
csc is similar, but works with columns. Again, you should be able to do the math to calculate this size.
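To put numbers on it, here is a small sketch (assuming float64 data; the index dtype is whatever scipy picks, typically 32-bit for matrices of this size) that compares the memory of dense, COO and CSR representations for a given shape and density:

import numpy as np
from scipy import sparse

n, m, density = 2000, 2000, 0.42                 # the 42% case from the question
A = sparse.random(n, m, density=density, format='coo', dtype=np.float64)

dense_bytes = n * m * 8                          # 8 bytes per float64 entry
coo_bytes = A.data.nbytes + A.row.nbytes + A.col.nbytes
csr = A.tocsr()
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(dense_bytes, coo_bytes, csr_bytes)

With float64 values and 32-bit indices, CSR costs roughly 12 bytes per stored value versus 8 bytes per entry for dense, so purely on memory the break-even density is around 2/3. At 42% density the sparse formats are still smaller, but dense arithmetic is often faster at that density, so memory is only part of the trade-off.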
Brief discussion of memory advantages from MATLAB (using similar storage options)
http://www.mathworks.com/help/matlab/math/computational-advantages.html#brbrfxy
background paper from MATLAB designers
http://www.mathworks.com/help/pdf_doc/otherdocs/simax.pdf
SPARSE MATRICES IN MATLAB: DESIGN AND IMPLEMENTATION
Related
I am using the Eigen library in C++ to solve sparse linear systems Ax = b with an ILU-preconditioned BiCGSTAB solver, where A is a square sparse matrix and b is a dense vector. I am initializing the matrix A using the setFromTriplets function. The linear system comes from the discretization of partial differential equations in space and time.
My application changes the matrix slightly at every time step: I want to modify a small number of rows (around 1%) of the matrix at the beginning of each time step. I am storing the matrix in the row-major format so that I can access each row directly. I don't want to re-assemble the entire matrix from triplets since only around 1% of the rows are modified. Moreover, the modification keeps the number of non-zeros in each affected row exactly the same; I only want to change the column indices and values, so I do not need to allocate extra memory for the row. After going through the Eigen documentation as well as the forum, I found the functions coeffRef and insert. Both of them will allocate extra memory if the element does not exist, which I would like to avoid since the number of non-zeros is not changing.
Any help is appreciated.
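For comparison, the kind of in-place update I have in mind is easy to express on top of scipy's CSR arrays (shown below only to illustrate the intent; my code is C++/Eigen). I suspect the raw buffers of a row-major Eigen::SparseMatrix (valuePtr(), innerIndexPtr(), outerIndexPtr()) would allow the same trick in compressed mode, but I have not verified this.

import numpy as np
from scipy import sparse

A = sparse.random(1000, 1000, density=0.01, format='csr', dtype=np.float64)

row = 42                                         # row to modify (arbitrary example)
start, end = A.indptr[row], A.indptr[row + 1]
nnz_row = end - start                            # stored entries in this row

# same number of entries as before, column indices kept sorted
new_cols = np.sort(np.random.choice(1000, size=nnz_row, replace=False))
new_vals = np.random.rand(nnz_row)

A.indices[start:end] = new_cols                  # overwrite column indices in place
A.data[start:end] = new_vals                     # overwrite values in place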
I am trying to understand the different methods for dimensionality reduction in data analysis. In particular I am interested in Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).
Can anyone please explain these terms to a layperson? - I understand the general premise of dimensionality reduction as bringing data to a lower dimension - But
a) how do SVD and PCA do this, and
b) how do they differ in their approach
OR maybe if you can explain what the results of each technique are telling me, so for
a) SVD - what are singular values
b) PCA - "proportion of variance"
Any example would be brilliant. I am not very good at maths!!
Thanks
You probably already figured this out, but I'll post a short description anyway.
First, let me describe the two techniques speaking generally.
PCA basically takes a dataset and figures out how to "transform" it (i.e. project it into a new space, usually of lower dimension). It essentially gives you a new representation of the same data. This new representation has some useful properties. For instance, each dimension of the new space is associated with the amount of variance it explains, i.e. you can essentially order the variables output by PCA by how important they are in terms of the original representation. Another property is the fact that linear correlation is removed from the PCA representation.
SVD is a way to factorize a matrix. Given a matrix M (e.g. for data, it could be an n by m matrix, for n datapoints, each of dimension m), you get U, S, V = SVD(M) where M = U S V^T, S is a diagonal matrix, and both U and V are orthogonal matrices (meaning the columns & rows are orthonormal; or equivalently U U^T = I and V V^T = I).
The entries of S are called the singular values of M. You can think of SVD as dimensionality reduction for matrices, since you can cut off the lower singular values (i.e. set them to zero), destroying the "lower parts" of the matrices upon multiplying them, and get an approximation to M. In other words, just keep the top k singular values (and the top k vectors in U and V), and you have a "dimensionally reduced" version (representation) of the matrix.
Mathematically, this gives you the best rank k approximation to M, essentially like a reduction to k dimensions. (see this answer for more).
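A small numpy illustration of that truncation (keeping only the top k singular values/vectors to get the best rank-k approximation):

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((100, 20))               # n=100 datapoints, m=20 features

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 5
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k approximation of M

# the Frobenius error equals the size of the discarded singular values
print(np.linalg.norm(M - M_k), np.sqrt(np.sum(s[k:] ** 2)))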
So Question 1
I understand the general premise of dimensionality reduction as bringing data to a lower dimension - But
a) how do SVD and PCA do this, and b) how do they differ in their approach
The answer is that they are the same.
To see this, I suggest reading the following posts on the CV and math stack exchange sites:
What is the intuitive relationship between SVD and PCA?
Relationship between SVD and PCA. How to use SVD to perform PCA?
How to use SVD for dimensionality reduction to reduce the number of columns (features) of the data matrix?
How to use SVD for dimensionality reduction (in R)
Let me summarize the answer:
essentially, SVD can be used to compute PCA.
PCA is closely related to the eigenvectors and eigenvalues of the covariance matrix of the data. Essentially, by taking the data matrix, computing its SVD, and then squaring the singular values (and doing a little scaling), you end up getting the eigendecomposition of the covariance matrix of the data.
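Concretely, in numpy (assuming the data matrix is mean-centered first, which is part of PCA):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
Xc = X - X.mean(axis=0)                          # center the data

# PCA via the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# PCA via the SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# squared singular values / (n-1) match the covariance eigenvalues,
# and the rows of Vt are the principal axes (up to sign)
print(np.allclose(np.sort(s**2 / (len(X) - 1)), np.sort(evals)))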
Question 2
maybe if you can explain what the results of each technique are telling me, so for a) SVD - what are singular values b) PCA - "proportion of variance"
These eigenvectors (the singular vectors of the SVD, or the principal components of the PCA) form the axes of the new space into which one transforms the data.
The eigenvalues (closely related to the squares of the singular values of the data matrix's SVD) hold the variance explained by each component. Often, people want to retain, say, 95% of the variance of the original data: if they originally had n-dimensional data, they reduce it to d-dimensional data by choosing the d largest eigenvalues such that 95% of the variance is kept. This keeps as much information as possible while retaining as few uninformative dimensions as possible.
In other words, these values (variance explained) essentially tell us the importance of each principal component (PC) in terms of its usefulness in reconstructing the original (high-dimensional) data. Since each PC forms an axis in the new space (constructed via linear combinations of the old axes in the original space), this tells us the relative importance of each of the new dimensions.
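In code, picking d to keep (say) 95% of the variance looks like this (numpy sketch; the 95% threshold is just an example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) * np.arange(1, 11)   # features with very different variances
Xc = X - X.mean(axis=0)

s = np.linalg.svd(Xc, compute_uv=False)
var = s**2 / (len(X) - 1)                        # variance explained by each component
ratio = var / var.sum()                          # "proportion of variance"

d = np.searchsorted(np.cumsum(ratio), 0.95) + 1  # smallest d reaching 95%
print(ratio.round(3), d)                         # keep the first d principal components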
For bonus, note that SVD can also be used to compute eigendecompositions, so it can also be used to compute PCA in a different way, namely by decomposing the covariance matrix directly. See this post for details.
From your question I only understood the topic of Principal Component Analysis, so I will share a few points about PCA below; I hope they help you understand.
PCA:
1. PCA is a linear-transformation-based dimensionality reduction technique.
2. It is used for operations such as noise filtering, feature extraction and data visualization.
3. The goal of PCA is to identify patterns and detect the correlations between variables.
4. If there is a strong correlation, then we can reduce the dimensionality, which is what PCA is intended for.
5. An eigenvector gives a direction that the linear transformation does not change (it is only scaled).
Here is a sample URL to understand PCA: https://www.solver.com/xlminer/help/principal-components-analysis-example
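To make points 3 and 4 concrete, here is a small numpy sketch (illustrative only) showing that after projecting onto the principal components the correlation between variables disappears:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
# two strongly correlated features plus one independent noise feature
X = np.column_stack([x, 2 * x + 0.1 * rng.standard_normal(1000), rng.standard_normal(1000)])
Xc = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ evecs                                   # data expressed in the principal axes

print(np.round(np.corrcoef(Xc, rowvar=False), 2))   # strong off-diagonal correlation
print(np.round(np.corrcoef(Z, rowvar=False), 2))    # (near) identity: decorrelated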
I am using the Armadillo library to manually port a piece of Matlab code. The Matlab code uses the eigs() function to find a small number (~3) of eigenvectors of a relatively large (200x200) covariance matrix R. The code looks like this:
[E,D] = eigs(R,3,"lm");
In Armadillo there are two functions, eigs_sym() and eigs_gen(); however, the former only supports real symmetric matrices and the latter requires ARPACK (I'm building the code for Android). Is there a reason eigs_sym doesn't support complex matrices? Is there any other way to find the eigenvectors of a complex symmetric matrix?
The eigs_sym() and eigs_gen() functions (where the s in eigs stands for sparse) in Armadillo are for large sparse matrices. A "large" size in this context is roughly 5000x5000 or larger.
Your R matrix has a size of 200x200. This is very small by current standards. It would be much faster to simply use the dense eigendecomposition functions eig_sym() or eig_gen() to get all the eigenvalues / eigenvectors, followed by extracting a subset of them using submatrix operations like .tail_cols().
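Not Armadillo, but as a quick sanity check of the approach in numpy: take the full Hermitian eigendecomposition (eigenvalues in ascending order, as eig_sym() does) and keep the trailing columns, which is what the .tail_cols() suggestion amounts to:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200)) + 1j * rng.standard_normal((200, 200))
R = B @ B.conj().T                               # a 200x200 Hermitian "covariance" matrix

evals, evecs = np.linalg.eigh(R)                 # all eigenpairs, eigenvalues ascending
E = evecs[:, -3:]                                # 3 eigenvectors of the largest eigenvalues
D = evals[-3:]                                   # roughly what eigs(R,3,'lm') returns in MATLAB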
Have you tested constructing a 400x400 real symmetric matrix by replacing each complex value, a+bi, with a 2x2 matrix [a,b;-b,a] (alternatively using a block variant of this)?
This should construct a real symmetric matrix that in some way corresponds to the complex one.
There will be a slow-down due to the larger size, and all eigenvalues will be duplicated (which may slow down the algorithm), but it seems straightforward to test.
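A quick numpy check of the embedding, for a Hermitian example (where the embedded real matrix is symmetric and each eigenvalue indeed shows up twice):

import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((50, 50)) + 1j * rng.standard_normal((50, 50))
C = G + G.conj().T                               # complex Hermitian test matrix
A, B = C.real, C.imag

# a+bi -> [a, b; -b, a], applied blockwise
M = np.block([[A, B], [-B, A]])                  # real and symmetric when C is Hermitian

e_complex = np.sort(np.linalg.eigvalsh(C))
e_real = np.sort(np.linalg.eigvalsh(M))
print(np.allclose(e_real, np.repeat(e_complex, 2)))   # every eigenvalue duplicated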
I wrote a derived data type to store banded matrices in Compressed Diagonal Storage format; in particular I store each diagonal of the banded matrix in a column of the 2D array cds(1:N,-L:U), where N is the number of rows of the full matrix and L and U are the number of lower and upper diagonals (this question includes the definition of the type).
I also wrote a function to perform the product between a matrix in this CDS format and a full vector. To obtain each element of the product vector, the elements of the corresponding row of cds are used, which are not contiguous in memory since Fortran stores arrays in column-major order. Because of this I was wondering if a better solution would be to store the diagonals in the rows of a 2D array cds2(-L:U,1:N), which seems pretty reasonable to me.
On the contrary, here I read:
we can allocate for the matrix A an array val(1:n,-p:q). The declaration with reversed dimensions (-p:q,n) corresponds to the LINPACK band format [132], which, unlike compressed diagonal storage (CDS), does not allow for an efficiently vectorizable matrix-vector multiplication if p + q is small.
Which is just what seems appropriate for a row-major language like C, in my opinion, not for Fortran. What am I missing?
EDIT
The core of the routine performing matrix-vector products is the following:
DO i = A%lb(1), A%ub(1)
   CDS_mat_x_full_vec(i) = DOT_PRODUCT(A%matrix(i, max(-lband,lv-i):min(uband,uv-i)), &
                                     & v(max(i-lband,lv):min(i+uband,uv)))
END DO
(Where lv and uv are used to take into account the case of a vector indexed from an index other than 1.)
The matrix A is then accessed by rows.
I implemented the derived type which stores the diagonals in an array val(-p:q,1:n) and it is faster, as I supposed. So I think that the link I referenced refers to a row-major storage language such as C and not a column-major one such as Fortran. (Or it implements the matrix product in a way I can't even imagine.)
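For what it's worth, scipy's dia_matrix uses the same diagonals-as-rows layout (a 2D data array with one diagonal per row plus an offsets vector), and its matrix-vector product is a loop over diagonals that is vectorized along each diagonal, which is presumably the "efficiently vectorizable" product the quote refers to. A small sketch of that diagonal-wise product (Python/numpy, only to illustrate the access pattern; my code is Fortran):

import numpy as np
from scipy import sparse

N, L, U = 8, 1, 2                                # matrix size and lower/upper bandwidths
rng = np.random.default_rng(0)

offsets = np.arange(-L, U + 1)
diags = rng.standard_normal((L + U + 1, N))      # diags[k + L, :] holds diagonal k

A = sparse.dia_matrix((diags, offsets), shape=(N, N))
x = rng.standard_normal(N)

# matrix-vector product as a loop over diagonals, vectorized along each diagonal
y = np.zeros(N)
for d, k in zip(diags, offsets):
    if k >= 0:
        y[:N - k] += d[k:] * x[k:]
    else:
        y[-k:] += d[:k] * x[:k]

print(np.allclose(y, A @ x))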
Introduction
I am developing a code in Fortran solving an MHD problem with preconditioning of a linear operator. The sparse matrix to be inverted can be considered as a matrix with the following hierarchical structure. The original matrix (say, A_1) is a band matrix of blocks. Each block of A_1 is a sparse matrix (say, A_2) of the same structure (i.e. a block banded matrix). Each block of A_2 is again a block banded matrix of the same sparsity structure, A_3. Each block of A_3 is, finally, a dense 5-by-5 matrix, A_4. I find this hierarchical representation very convenient for initializing the elements of the matrix.
Question
I wonder if there exists a library (in Fortran) that can handle such a structure and convert it into one of the standard sparse matrix formats (CSR, CSC, BSR, ...), since Sparse BLAS or MKL Pardiso will be used to invert it. Let me stress that my intention is to use the hierarchical structure only to initialize the elements of the matrix. Of course, the hierarchical structure could be disregarded and the matrix hard-coded in the CSR format, but I find that too time-consuming to implement and test.
Comments
I don't expect a linear solver to use the hierarchical structure, although in S. Pissanetsky, "Sparse Matrix Technology", 1984, Academic Press, page 27 (available online here), such storage schemes are mentioned, namely the "hypermatrix" and "supersparse" storage schemes, which were used in Gauss elimination. I have not found available implementations of these schemes yet.
The block compressed sparse row (BSR) format (supported by MKL) can be used to handle two levels of the matrix, A_3 (sparse) + A_4 (dense), but not more.
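To make the intended assembly concrete, here is the idea sketched with scipy's bmat (Python, purely as an illustration of the nested structure; the actual code is Fortran and would use Sparse BLAS / MKL Pardiso): each level is a block-banded matrix whose non-empty blocks are matrices of the level below, and the result is flattened to CSR.

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

def block_banded(make_block, nblocks=3, bandwidth=1):
    # block-banded matrix: blocks inside the band are built, the rest are empty (None)
    blocks = [[make_block() if abs(i - j) <= bandwidth else None
               for j in range(nblocks)] for i in range(nblocks)]
    return sparse.bmat(blocks, format='csr')

A4 = lambda: sparse.coo_matrix(rng.standard_normal((5, 5)))   # dense 5x5 leaf blocks
A3 = lambda: block_banded(A4)                                 # band matrix of A4 blocks
A2 = lambda: block_banded(A3)                                 # band matrix of A3 blocks
A1 = block_banded(A2)                                         # the full matrix, already in CSR

print(A1.shape, A1.nnz)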