multi-dimensional Sparse Matrix Compression - c++

Can anybody suggest a good C++ library for storing a multi-dimensional sparse matrix that focuses on compressing the data in the matrix? The number of dimensions will be huge (say, 80 dimensions). Any help is most welcome :).
EDIT:
The matrix is highly sparse, with a density on the order of 0.0000001 (or) 1x10^-6.

In C# I have used key-value pairs, or "dictionaries", to store sparsely populated arrays. I think for 80 dimensions you would have to construct a string-based key. Use a single function to create the key so that it all remains consistent: simply concatenate the dimension indices into a comma-separated list. Unfortunately, I'm not aware of a good key-value dictionary library for C++. Possibly the STL, if you have used it before, but I would not recommend it otherwise.
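The string-key idea above can be sketched in C++ roughly as follows. This is only a minimal illustration, assuming the standard library's std::unordered_map serves as the dictionary; the names make_key and SparseNd are hypothetical and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: joins the coordinates into a comma-separated key,
// e.g. {3, 0, 7} becomes "3,0,7". Every access goes through this one
// function so that keys stay consistent.
std::string make_key(const std::vector<int>& coords) {
    assert(!coords.empty());
    std::string key;
    for (std::size_t i = 0; i < coords.size(); ++i) {
        if (i != 0) key += ',';
        key += std::to_string(coords[i]);
    }
    return key;
}

// Hypothetical container: only non-zero cells are stored in the dictionary.
class SparseNd {
public:
    void set(const std::vector<int>& coords, double value) {
        const std::string key = make_key(coords);
        if (value == 0.0) {
            cells_.erase(key);  // zeros stay implicit
        } else {
            cells_[key] = value;
        }
    }

    double get(const std::vector<int>& coords) const {
        const auto it = cells_.find(make_key(coords));
        return it == cells_.end() ? 0.0 : it->second;
    }

    std::size_t stored() const { return cells_.size(); }

private:
    std::unordered_map<std::string, double> cells_;
};
```

With 80 dimensions each key is a fairly long string, so memory per stored cell is dominated by the key rather than the value; a std::map keyed directly by std::vector<int> would be another option under the same assumptions.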

Related

Eigen::Map<Sparse> for COO-SpMV

Here is my question in short:
What is the correct code for generating a map for an (unsorted) coo-matrix in Eigen (C++)?
The following code succeeds at generating a map A_map for a compressed row storage (crs/csr) format sparse matrix, that is stored in a crs_structure A1. (I use metis notation. m=rows, n=cols, nnz=#nonzeros.)
Eigen::Map<Eigen::SparseMatrix< double,Eigen::RowMajor,myInt>> A_map(A1.m,A1.n,A1.nnz,A1.adj,A1.adjncy,A1.values,NULL );
I use the following code in attempting to generate a map A_map for a coordinate storage (coo) format sparse matrix, that is stored in a coo_structure A2. ptrI,ptrJ,ptrV are int64*,int64*,double*, giving row-,col-coordinates of values in ptrV.
Eigen::Map<Eigen::SparseMatrix< double,Eigen::RowMajor,myInt>> A_map(A2.m,A2.n,A2.nnz,A2.ptrI,A2.ptrJ,A2.ptrV,innerNonZerosPtr);
I need the map because I want to benchmark Eigen's sparse matrix vector product (matvec) against mine.
In general, none of the indices of A are sorted.
Otherwise, the csr format could be created from the coo format in $\mathcal{O}(nnz)$, circumventing the issue.
That is not an option here because sorting the indices consumes far more time than computing the matvec.
Side note:
I have not understood what "innerNonZerosPtr" means; I could not find an actual explanation of it in the Eigen documentation.
Possibly, understanding its intention and purposeful use in my scenario could solve my problem.
Cheers, and many thanks in advance for any help.
There is an example in the Eigen sparse matrix tutorial (sparseTutorial).
For the row-major case, the innerNnz vector stores the number of non-zero elements in each row.
If the matrix is in compressed form, innerNnz is not required.
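To make the role of innerNnz concrete, here is a small sketch using plain arrays rather than Eigen itself. The struct RowMajorSparse and the function matvec are hypothetical names, and the layout is an assumption based on the description above: in compressed form row r occupies the slots from outer[r] to outer[r + 1], while in uncompressed form it uses only the first inner_nnz[r] slots of that range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical struct mirroring the row-major layout described above.
struct RowMajorSparse {
    std::size_t rows = 0;
    std::vector<int> outer;      // start of each row's slot range, size rows + 1
    std::vector<int> inner;      // column index stored in each slot
    std::vector<double> values;  // value stored in each slot
    std::vector<int> inner_nnz;  // used slots per row; empty means compressed
};

// y = A * x, honouring inner_nnz when the matrix is not compressed.
std::vector<double> matvec(const RowMajorSparse& a, const std::vector<double>& x) {
    assert(a.outer.size() == a.rows + 1);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        const int begin = a.outer[r];
        const int end = a.inner_nnz.empty() ? a.outer[r + 1]
                                            : begin + a.inner_nnz[r];
        for (int k = begin; k < end; ++k) {
            y[r] += a.values[k] * x[a.inner[k]];
        }
    }
    return y;
}
```

Note that neither form is a coordinate (coo) layout: both still need one row-start entry per row rather than one row index per value.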

What is a fast matrix or two-dimensional array to store an adjacency matrix in C++

I'm trying to infer a Markov chain of a process I can only simulate. The number of states/vertices that the final graph will contain is very large, but I do not know the number of vertices in advance.
Right now I have the following:
My simulation outputs a boost::dynamic_bitset containing 112 bits every timestep.
I use the bitset as a key in a Google Sparse Hash to map to an integer value that can be used as an index to the adjacency matrix I want to construct.
Now I need a good/fast matrix or two-dimensional array to store integers. It should:
Use the integer values I stored in the Google Sparse Hash as row/column numbers. (E.g., I want to access/change a stored integer by doing something like matrix(3,4) = 3.)
I do not know the number of rows or columns I will need in advance, so it should be able to add rows and columns on the fly.
Most values will be 0, so it should probably be a sparse implementation of something.
The number of rows and columns will be very large, so it should be very fast.
Simple to use. I don't need a lot of mathematical operations; it should just be a fast and simple way to store and access integers.
I hope I have made my question clear enough.
I'd recommend http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/matrix_sparse.htm -- boost UBLAS sparse matrices. There are several different implementations of sparse matrix storages, so reading the documentation can help you choose a type that's right for your purpose. (TLDR: sparse matrices have either fast retrieval or fast insertion.)
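If a library is more than is needed, the requirements in the question (matrix(3,4) = 3 syntax, no fixed dimensions, mostly zeros) can also be approximated with a hash map keyed by the packed (row, column) pair. The sketch below is an alternative to uBLAS, not its API; the class name SparseIntMatrix and its members are hypothetical, and it assumes row and column numbers fit in 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_map>

// Hypothetical growable sparse matrix of ints; absent cells read as 0.
class SparseIntMatrix {
public:
    // Write access: creates the cell (initialised to 0) if it is missing.
    int& operator()(std::uint32_t row, std::uint32_t col) {
        return cells_[pack(row, col)];
    }

    // Read access that never inserts.
    int get(std::uint32_t row, std::uint32_t col) const {
        const auto it = cells_.find(pack(row, col));
        return it == cells_.end() ? 0 : it->second;
    }

    // Convenience for counting observed transitions row -> col.
    void increment(std::uint32_t row, std::uint32_t col) {
        int& count = cells_[pack(row, col)];
        assert(count < std::numeric_limits<int>::max());
        ++count;
    }

    std::size_t stored() const { return cells_.size(); }

private:
    // Packs both 32-bit indices into one 64-bit key.
    static std::uint64_t pack(std::uint32_t row, std::uint32_t col) {
        return (static_cast<std::uint64_t>(row) << 32) | col;
    }

    std::unordered_map<std::uint64_t, int> cells_;
};
```

This favours fast insertion and point lookup; iterating over one row in order would need a different layout, which is the trade-off the answer above alludes to.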

Storing Matrix information in STL vector. Which is better vector or vector of vectors?

I've created my own Matrix class where, inside the class, the information regarding the Matrix is stored in an STL vector. I've noticed while searching the web that some people work with a vector of vectors to represent the Matrix information. My best guess tells me that so long as the matrix is small or skinny (row_num >> column_num) the difference should be small, but what about if the matrix is square or fat (row_num << column_num)? If I were to create a very large matrix, would I see a difference at run time? Are there other factors that need to be considered?
Thanks
Have you considered using an off-the-shelf matrix representation such as boost's instead of reinventing the wheel?
If you have a lot of empty rows for example, using the nested representation could save a lot of space. Unless you have specific information in actual use cases showing one way isn't meeting your requirements, code the way that's easiest to maintain and implement properly.
There are too many variables to answer your question.
Create an abstraction so that your code does not care how the matrix is represented. Then write your code using any implementation. Then profile it.
If your matrix is dense, the "vector of vectors" is very unlikely to be faster than a single big memory block and could be slower. (Chasing two pointers for random access + worse locality.)
If your matrices are large and sparse, the right answer to your question is probably "neither".
So create an abstract interface, code something up, and profile it. (And as @Mark says, there are lots of third-party libraries you should probably consider.)
If you store everything in a single vector, an iterator will traverse the entire matrix. If you use a vector of vectors, an iterator will only traverse a single dimension.
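As an illustration of the "single big memory block" option combined with the advice to hide the representation behind an interface, here is a minimal sketch. The class name FlatMatrix is hypothetical, and row-major order is an assumption; callers only see operator()(row, col), so the storage could later be swapped for a vector of vectors and profiled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix stored row-major in one contiguous vector.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Element (r, c) lives at offset r * cols_ + c.
    double& operator()(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    double operator()(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Exposes the flat storage, e.g. to iterate over the whole matrix.
    const std::vector<double>& data() const { return data_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};
```

Random access here costs one multiplication and one pointer dereference, whereas a vector of vectors needs two dereferences and places each row in a separate allocation.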

What are the differences between the various boost ublas sparse vectors?

In boost::numeric::ublas, there are three sparse vector types.
I can see that the mapped_vector is essentially a std::map from index to value, which considers all not-found values to be 0 (or whatever is the common value).
But the documentation is sparse (ha ha) on information about compressed_vector and coordinate_vector.
Is anyone able to clarify? I'm trying to figure out the algorithmic complexity of adding items to the various vectors, and also of dot products between two such vectors.
A very helpful answer offered that compressed_vector is very similar to compressed_matrix. But it seems that, for example, compressed row storage is only for storing matrices -- not just vectors.
I see that unbounded_array is the storage type, but I'm not quite sure what the specification is for that, either. If I create a compressed_vector with size 200,000,000, but with only 5 non-zero locations, is this less efficient in any way than creating a compressed_vector with size 10 and 5 non-zero locations?
Many thanks!
Replace "matrix" with "vector" and you have the answers:
http://www.guwi17.de/ublas/matrix_sparse_usage.html
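For intuition about the complexity questions, here is a sketch of a compressed sparse vector written with the standard library only. It is not uBLAS's implementation; it assumes the layout is a sorted index array with a parallel value array, and the names SparseVec, insert and dot are hypothetical. Under that assumption, storage depends on the number of non-zeros rather than on the logical size, inserting at an arbitrary position is linear in the number of stored entries, and a dot product is a linear merge of the two index arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical compressed sparse vector: sorted indices plus parallel values.
struct SparseVec {
    std::size_t size = 0;            // logical length, costs no storage
    std::vector<std::size_t> index;  // sorted positions of stored entries
    std::vector<double> value;       // value[k] belongs to index[k]
};

// Binary search for the slot, then shift the tail: O(log nnz) + O(nnz).
void insert(SparseVec& v, std::size_t i, double x) {
    assert(i < v.size);
    const auto it = std::lower_bound(v.index.begin(), v.index.end(), i);
    const auto pos = it - v.index.begin();
    if (it != v.index.end() && *it == i) {
        v.value[pos] = x;  // overwrite an existing entry
        return;
    }
    v.index.insert(it, i);
    v.value.insert(v.value.begin() + pos, x);
}

// Merge-style dot product: O(nnz(a) + nnz(b)).
double dot(const SparseVec& a, const SparseVec& b) {
    double sum = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.index.size() && j < b.index.size()) {
        if (a.index[i] < b.index[j]) {
            ++i;
        } else if (b.index[j] < a.index[i]) {
            ++j;
        } else {
            sum += a.value[i++] * b.value[j++];
        }
    }
    return sum;
}
```

In this sketch a vector of logical size 200,000,000 with 5 stored entries takes the same space as one of size 10 with 5 stored entries.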

Is there a Boost (or other common lib) type for matrices with string keys?

I have a dense matrix where the indices correspond to genes. While gene identifiers are often integers, they are not contiguous integers. They could be strings instead, too.
I suppose I could use a boost sparse matrix of some sort with integer keys, and it wouldn't matter if they're contiguous. Or would this still occupy a great deal of space, particularly if some genes have identifiers that are nine digits?
Further, I am concerned that sparse storage is not appropriate, since this is an all-by-all matrix (there will be a distance in each and every cell, provided the gene exists).
I'm unlikely to need to perform any matrix operations (e.g., matrix multiplication). I will need to pull vectors out of the matrix (slices).
It seems like the best type of matrix would be keyed by a Boost unordered_map (a hash map), or perhaps even simply an STL map.
Am I looking at this the wrong way? Do I really need to roll my own? I thought I saw such a class somewhere before.
Thanks!
You could use a std::map to map the gene identifiers to unique, consecutively assigned integers (every time you add a new gene identifier to the map, you can give it the map's size as its identifier, assuming you never remove genes from the map).
If you want to be able to search for the identifier of a gene based on its unique integer, you can use a second map or you could use a boost::bimap, which provides a bidirectional mapping of elements.
As for which matrix container to use, you might consider boost::ublas::matrix; it provides vector-like access to rows and columns of the matrix.
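The identifier mapping described above might look like the following sketch. The class name GeneIndex is hypothetical; it pairs a std::map (identifier to consecutive integer) with a vector for the reverse lookup in place of a second map or boost::bimap, and like the answer it assumes genes are never removed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical registry assigning consecutive integers to gene identifiers.
class GeneIndex {
public:
    // Returns the existing index, or assigns the next free one
    // (the current number of known genes).
    std::size_t id_of(const std::string& gene) {
        const auto it = ids_.find(gene);
        if (it != ids_.end()) return it->second;
        const std::size_t id = names_.size();
        ids_.emplace(gene, id);
        names_.push_back(gene);
        return id;
    }

    // Reverse lookup from index back to the identifier.
    const std::string& name_of(std::size_t id) const {
        assert(id < names_.size());
        return names_[id];
    }

    std::size_t size() const { return names_.size(); }

private:
    std::map<std::string, std::size_t> ids_;
    std::vector<std::string> names_;
};
```

The resulting indices are contiguous, so they can address the rows and columns of an ordinary dense matrix regardless of how large or sparse the original identifiers are.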
If you don't need matrix operations, you don't need a matrix. A 2D map with string keys can be done with map<string, map<string, T> > in plain C++ (T being the value type), or with the corresponding hash map from Boost.
There is Boost.MultiArray, which will allow you to manage non-contiguous indexes.
If you want an efficient implementation for matrices of static size, there is also Boost.LA, which is now on the review schedule.
And lastly, there is NT2, which should be submitted to Boost soon.