What are the differences between the various boost ublas sparse vectors? - c++

In boost::numeric::ublas, there are three sparse vector types.
I can see that mapped_vector is essentially a std::map from index to value, which treats all absent entries as 0 (or whatever the default value is).
But the documentation is sparse (ha ha) on information about compressed_vector and coordinate_vector.
Is anyone able to clarify? I'm trying to figure out the algorithmic complexity of adding items to the various vectors, and also of dot products between two such vectors.
A very helpful answer offered that compressed_vector is very similar to compressed_matrix. But it seems that, for example, compressed row storage applies only to matrices, not to vectors.
I see that unbounded_array is the storage type, but I'm not quite sure what the specification is for that, either. If I create a compressed_vector with size 200,000,000, but with only 5 non-zero locations, is this less efficient in any way than creating a compressed_vector with size 10 and 5 non-zero locations?
Many thanks!

Replace matrix with vector and you have the answers:
http://www.guwi17.de/ublas/matrix_sparse_usage.html
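To make the complexity question concrete, here is a sketch in plain C++ of the storage strategies those pages describe, assuming the usual layouts: mapped_vector is a std::map (O(log n) insert/lookup); compressed_vector keeps parallel sorted index/value arrays (O(log n) lookup, O(n) insert into the middle); coordinate_vector appends unsorted (index, value) pairs in O(1) and sorts lazily. This is an illustration of the idea, not ublas's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of compressed_vector-style storage: two parallel arrays, with the
// index array kept sorted. Lookup is a binary search; inserting a new
// non-zero in the middle shifts everything after it (O(nnz)).
struct CompressedLike {
    std::vector<std::size_t> idx;   // sorted non-zero indices
    std::vector<double> val;        // values, parallel to idx

    void insert(std::size_t i, double v) {
        auto it = std::lower_bound(idx.begin(), idx.end(), i);
        std::size_t pos = it - idx.begin();
        if (it != idx.end() && *it == i) { val[pos] = v; return; }
        idx.insert(it, i);                    // O(nnz) element shift
        val.insert(val.begin() + pos, v);
    }
    double get(std::size_t i) const {
        auto it = std::lower_bound(idx.begin(), idx.end(), i);
        return (it != idx.end() && *it == i) ? val[it - idx.begin()] : 0.0;
    }
};

// Dot product of two compressed vectors: a linear merge over the two sorted
// index arrays, O(nnz_a + nnz_b) regardless of the logical vector size.
double dot(const CompressedLike& a, const CompressedLike& b) {
    double sum = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.idx.size() && j < b.idx.size()) {
        if (a.idx[i] < b.idx[j]) ++i;
        else if (a.idx[i] > b.idx[j]) ++j;
        else { sum += a.val[i] * b.val[j]; ++i; ++j; }
    }
    return sum;
}
```

Note that the logical size (200,000,000 vs. 10) never appears in any of these costs; only the number of stored non-zeros does, so the huge-but-nearly-empty vector should not be meaningfully less efficient.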

Related

Do iterators of SpMat<Type> in Armadillo only visit non-zero entries?

I was wondering how to loop through all the non-zero entries of a sp_umat (i.e., SpMat&lt;unsigned int&gt;) in Armadillo, and came across this related question ( link ). That post suggests using a const_iterator to retrieve the non-zero locations and values of a sp_mat. Can one assume that all iterators of sp_mat (and related sparse matrix types in Armadillo; sp_umat in my case) visit only the non-zero entries? I was not able to sort this out from the documentation. A related question also comes to mind: in general, does Armadillo support visiting other locations in a sparse matrix by any other means? Thanks very much for the help!
1) Yes, all iterators of sparse objects iterate only over non-zero locations. I'm sorry that isn't clear in the documentation; I'll see if that can be improved.
2) Yes, you can access any location in a sparse matrix with matrix(i, j) just like dense matrices. So in that sense the sparse and dense matrices are somewhat interchangeable.
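The two access patterns from this answer can be sketched with a toy map-backed sparse matrix in plain C++ (this is only an illustration of the semantics, not Armadillo's actual internals): iterating the underlying container visits only the stored non-zero entries, while random access to any (i, j) returns an implicit 0 for locations that are not stored.

```cpp
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical map-backed sparse matrix, used only to illustrate the two
// access patterns: iteration over stored (non-zero) entries, vs. random
// access to any (i, j), which yields 0 for locations that are not stored.
struct ToySparseMat {
    std::map<std::pair<std::size_t, std::size_t>, double> data;

    double operator()(std::size_t i, std::size_t j) const {
        auto it = data.find({i, j});
        return it == data.end() ? 0.0 : it->second;  // implicit zero
    }
    // Iterating `data` plays the role of sp_mat::const_iterator:
    // it only ever sees the non-zero entries.
};
```

In Armadillo itself, sp_mat's begin()/end() iterators behave like iterating the map above (non-zeros only), and X(i, j) behaves like the accessor.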

Reduce the length of a feature vector for comparison

I have a problem, where several different objects are each described by a vector of real numbers, between 0 and 100, and a length (dimension) of 1000 elements.
Then I want to compare a new vector with the same characteristics against the set of vectors above, using the Mahalanobis distance, to find the most similar one.
My question is:
How can I reduce the length of the vectors to the N most relevant elements (say, 100 of the 1000) without affecting the quality of the results too much, i.e., so that the distances do not vary too much?
Remember that each vector is a description of a different object, unrelated to others.
I thought about using PCA, but after studying it, I saw that I needed at least two samples per object, or so I understood.
Any ideas? In case it matters for the code, I'm using C++ and OpenCV.
Thanks in advance.
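One simple alternative to PCA, sketched here in plain C++ rather than OpenCV (the function name and approach are illustrative, not a definitive solution): rank the 1000 dimensions by their variance across the stored set and keep the N highest-variance ones. Dimensions that barely vary between objects contribute little to distinguishing them, so dropping them tends to perturb distance comparisons the least.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the indices of the n dimensions with the highest variance across
// a set of equal-length feature vectors. Low-variance dimensions barely
// change between objects, so they carry little discriminative information.
std::vector<std::size_t> top_variance_dims(
        const std::vector<std::vector<double>>& set, std::size_t n) {
    const std::size_t dim = set.front().size();
    std::vector<double> variance(dim, 0.0);
    for (std::size_t d = 0; d < dim; ++d) {
        double mean = 0.0;
        for (const auto& v : set) mean += v[d];
        mean /= set.size();
        for (const auto& v : set)
            variance[d] += (v[d] - mean) * (v[d] - mean);
    }
    // Sort dimension indices by descending variance; keep the first n.
    std::vector<std::size_t> order(dim);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + n, order.end(),
        [&](std::size_t a, std::size_t b) { return variance[a] > variance[b]; });
    order.resize(n);
    return order;
}
```

The same selected index set would then be applied to both the stored vectors and each new query vector before computing the Mahalanobis distance.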

multi-dimensional Sparse Matrix Compression

Can anybody suggest a good C++ library for storing a multi-dimensional sparse matrix, one that focuses on compressing the data in the matrix? The number of dimensions of the matrix will be huge (say, 80 dimensions). Any help is most welcome :).
EDIT:
The matrix is highly sparse: the density is on the order of 0.0000001 (i.e., 1×10^-7).
In C# I have used key-value pairs ("dictionaries") to store sparsely populated arrays. I think for 80 dimensions you would have to construct a string-based key. Use a single function to create the key so it all remains consistent: simply concatenate a comma-separated list of the dimension indices. Unfortunately I'm not aware of a good key/value dictionary library for C++. Possibly the STL, if you have used it before, but I would not recommend it otherwise.
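For what it's worth, the STL's std::map can serve as the dictionary here. A minimal sketch of the comma-separated-key approach described above (the type names are illustrative):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Build the string key from the coordinate list. Using one function for
// every lookup keeps the encoding consistent, as suggested above.
std::string make_key(const std::vector<std::size_t>& coords) {
    std::string key;
    for (std::size_t c : coords) {
        if (!key.empty()) key += ',';
        key += std::to_string(c);
    }
    return key;
}

// Sparse N-dimensional array as a dictionary from key to value; any
// coordinate that was never set reads back as the implicit zero.
struct SparseND {
    std::map<std::string, double> data;

    void set(const std::vector<std::size_t>& c, double v) {
        data[make_key(c)] = v;
    }
    double get(const std::vector<std::size_t>& c) const {
        auto it = data.find(make_key(c));
        return it == data.end() ? 0.0 : it->second;
    }
};
```

A variant that avoids building strings entirely is to key the map on the coordinate vector itself (std::map&lt;std::vector&lt;std::size_t&gt;, double&gt;), since std::vector already provides lexicographic ordering.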

Storing Matrix information in STL vector. Which is better vector or vector of vectors?

I've created my own Matrix class where, inside the class, the matrix data is stored in an STL vector. I've noticed that, while searching the web, some people work with a vector of vectors to represent the matrix data. My best guess tells me that as long as the matrix is small or skinny (row_num >> column_num) the difference should be small, but what about when the matrix is square or fat (row_num << column_num)? If I were to create a very large matrix, would I see a difference at run time? Are there other factors that need to be considered?
Thanks
Have you considered using an off-the-shelf matrix representation such as boost's instead of reinventing the wheel?
If you have a lot of empty rows, for example, the nested representation could save a lot of space. Unless you have specific information from actual use cases showing one way isn't meeting your requirements, code whichever way is easiest to maintain and implement properly.
There are too many variables to answer your question.
Create an abstraction so that your code does not care how the matrix is represented. Then write your code using any implementation. Then profile it.
If your matrix is dense, the "vector of vectors" is very unlikely to be faster than a single big memory block and could be slower. (Chasing two pointers for random access + worse locality.)
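The single big memory block mentioned here can be sketched in a few lines, assuming row-major order (the class name is illustrative): element (i, j) lives at offset i * cols + j, so access is one multiply-add, there is a single allocation, and rows are adjacent in memory.

```cpp
#include <cstddef>
#include <vector>

// Dense row-major matrix in one contiguous vector. Element (i, j) lives at
// offset i * cols + j: a single allocation, one index computation per
// access, and good cache locality when scanning along rows.
struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<double> data;

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    double& operator()(std::size_t i, std::size_t j) {
        return data[i * cols + j];
    }
    double operator()(std::size_t i, std::size_t j) const {
        return data[i * cols + j];
    }
};
```

By contrast, a vector of vectors allocates each row separately, so random access chases two pointers and rows can land anywhere on the heap.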
If your matrices are large and sparse, the right answer to your question is probably "neither".
So create an abstract interface, code something up, and profile it. (And as @Mark says, there are lots of third-party libraries you should probably consider.)
If you store everything in a single vector, an iterator will traverse the entire matrix. If you use a vector of vectors, an iterator will only traverse a single dimension.

Is there a Boost (or other common lib) type for matrices with string keys?

I have a dense matrix where the indices correspond to genes. While gene identifiers are often integers, they are not contiguous integers. They could be strings instead, too.
I suppose I could use a boost sparse matrix of some sort with integer keys, and it wouldn't matter if they're contiguous. Or would this still occupy a great deal of space, particularly if some genes have identifiers that are nine digits?
Further, I am concerned that sparse storage is not appropriate, since this is an all-by-all matrix (there will be a distance in each and every cell, provided the gene exists).
I'm unlikely to need to perform any matrix operations (e.g., matrix multiplication). I will need to pull vectors out of the matrix (slices).
It seems like the best type of matrix would be keyed by a Boost unordered_map (a hash map), or perhaps even simply an STL map.
Am I looking at this the wrong way? Do I really need to roll my own? I thought I saw such a class somewhere before.
Thanks!
You could use a std::map to map the gene identifiers to unique, consecutively assigned integers (every time you add a new gene identifier to the map, you can give it the map's size as its identifier, assuming you never remove genes from the map).
If you want to be able to search for the identifier of a gene based on its unique integer, you can use a second map or you could use a boost::bimap, which provides a bidirectional mapping of elements.
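The map trick described above can be sketched as follows (the struct and gene names are illustrative): the first time an identifier is seen it gets the map's current size as its index; repeated lookups return the same index. This assumes identifiers are never removed, so the indices stay dense in [0, size).

```cpp
#include <cstddef>
#include <map>
#include <string>

// Map arbitrary gene identifiers to unique, consecutively assigned
// integers. A new identifier receives the map's current size as its index;
// an existing identifier keeps the index it was first given.
struct GeneIndex {
    std::map<std::string, std::size_t> ids;

    std::size_t index_of(const std::string& gene) {
        auto it = ids.find(gene);
        if (it != ids.end()) return it->second;
        std::size_t next = ids.size();   // dense: 0, 1, 2, ...
        ids.emplace(gene, next);
        return next;
    }
};
```

The resulting integers can then index directly into a dense matrix such as boost::ublas::matrix.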
As for which matrix container to use, you might consider boost::ublas::matrix; it provides vector-like access to rows and columns of the matrix.
If you don't need matrix operations, you don't need a matrix. A 2D map with string keys can be done with std::map&lt;std::string, std::map&lt;std::string, T&gt;&gt; in plain C++, or with the corresponding hash map (boost::unordered_map) from Boost.
There is Boost.MultiArray, which will allow you to work with non-contiguous indexes.
If you want an efficient implementation for matrices with static sizes, there is also Boost.LA, which is now on the review schedule.
And lastly there is also NT2, which should be submitted to Boost soon.