Data structure for representing a sparse tensor? - c++

What is an appropriate data structure to represent a sparse tensor in C++?
The first option that comes to mind is a boost::unordered_map, since it allows fast setting and retrieval of an element, like below:
A(i,j,k,l) = 5
However, I would also like to be able to do contractions over a single index, which would involve summation over one of the indices:
C(i,j,k,m) = A(i,j,k,l)*B(l,m)
How easy would it be to implement this operator with a boost::unordered_map? Is there a more appropriate data structure?

There are tensor libraries available, like:
http://www.codeproject.com/KB/recipes/tensor.aspx
and
http://cadadr.org/fm/package/ftensor.html
Any issues with those? You'd get more tensor operations that way than with a plain map.
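To make the map idea concrete, here is a minimal sketch of a sparse tensor backed by a hash map keyed on an index array, together with a naive contraction over the last index. The names (`Sparse`, `contract`, the hash mixer) are my own, and the contraction simply scans all stored entries of both operands, which is O(nnz(A)·nnz(B)); a serious implementation would index B by its first subscript.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hash for an index tuple, mixing in boost::hash_combine style.
template <std::size_t N>
struct KeyHash {
    std::size_t operator()(const std::array<int, N>& k) const {
        std::size_t h = 0;
        for (int v : k)
            h ^= std::hash<int>{}(v) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// Sparse<N>: only nonzero entries are stored.
template <std::size_t N>
using Sparse = std::unordered_map<std::array<int, N>, double, KeyHash<N>>;

// C(i,j,k,m) = sum over l of A(i,j,k,l) * B(l,m).
// Iterates the stored entries of A and B and matches the shared index l.
template <std::size_t N = 4>
Sparse<4> contract(const Sparse<4>& A, const Sparse<2>& B) {
    Sparse<4> C;
    for (const auto& [ka, va] : A)
        for (const auto& [kb, vb] : B)
            if (ka[3] == kb[0])  // A's last index == B's first index (l)
                C[{ka[0], ka[1], ka[2], kb[1]}] += va * vb;
    return C;
}
```

Setting an element is then just `A[{i,j,k,l}] = 5;`, matching the `A(i,j,k,l) = 5` notation from the question.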

Related

Bitwise operation on a dynamic data structure

I am implementing a simple document indexer for information retrieval. Now I need to implement an incidence matrix that can be extended dynamically (a static array will not suffice).
To make boolean search possible, I have to be able to perform bitwise operations on the rows of the matrix, but I have not come up with a fast solution. The question is what data structure to use for each row of the matrix.
If it were just a std::vector<bool>, would it be possible to do FAST bitwise operations on it? Or is there another data structure, like BitArray from C#, applicable in this situation?
If FAST is your goal, look into using the largest integer type available on your system (likely uint64_t) and do simple bitwise operations on that. If your matrix is wider than 64 columns, use a std::array of those. Then check whether your compiler generates SIMD instructions from your code. If not, consider using intrinsics: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#
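As a minimal sketch of that suggestion (names `BitRow` and `bit_and` are my own): one row of the incidence matrix is a fixed array of 64-bit words, and a boolean AND over two rows is one machine instruction per 64 columns, a loop the compiler can readily vectorize.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One matrix row as a fixed-width bitset backed by 64-bit words.
template <std::size_t Words>
using BitRow = std::array<std::uint64_t, Words>;

// Bitwise AND of two rows: one word-sized operation per 64 columns.
template <std::size_t Words>
BitRow<Words> bit_and(const BitRow<Words>& a, const BitRow<Words>& b) {
    BitRow<Words> out{};
    for (std::size_t i = 0; i < Words; ++i)
        out[i] = a[i] & b[i];
    return out;
}
```

If the row width is fixed at compile time, std::bitset<N> gives you the same word-packed layout with &, |, and ^ already overloaded.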

Multi-dimensional datasets in C++: cleanest approach to go from a std::vector of 2D data, to a 2D grid of std::vectors?

Context:
I've been processing scientific satellite images, currently keeping the individual end results at each timestamp as cv::Mat_<double>, which can for instance be stored in a standard container of images, such as a std::vector<cv::Mat_<double>>.
The issue:
I would now like to study the physical properties of each individual pixel over time. For that, it would be far preferable if I could look at the data along the time dimension and work with a 2D table of vectors instead. In other words: to have a std::vector<double> associated to each pixel on the 2D grid that is common to all images.
A reason for that is that the type of calculations (computing percentiles, curve fitting, etc.) will rely on standard algorithms and libraries that expect to be fed std::vectors and the like. For a given pixel, though, the data is definitely not contiguous in memory along the time dimension.
Can/Should I really avoid copying the data in such a case? If so, what would be the best approach? By best I mean efficient yet as 'clean'/'clear' as possible.
I thought of std::reference_wrapper to store the addresses in a std::vector; it's simple and works but each entry takes as much memory as if I had simply duplicated the data in a std::vector<double>. Each data point is just a double after all.
NB:
I've stumbled upon Boost MultiArray, but I'd like to avoid having to add a Boost dependency.
Many thanks in advance for your time/input.
You could try something like std::views::transform (or its precursors, range-v3 and Boost range adaptors), with function objects to look up each pixel:
[x, y](cv::Mat_<double> & mat) -> double & { return mat[y][x]; }
However, you should definitely profile whether that is worthwhile vs copying, as I expect the cache locality to be poor.

Armadillo C++: Efficient access of columns in a cube structure

Using the Armadillo matrix library, I am aware that the efficient way of accessing a column in a 2D matrix is via a simple call to .col(i).
I am wondering whether there is an efficient way of extracting a column stored in a "cube", without first having to call the slice command.
I need the most efficient possible way of accessing the data stored in, for instance (using MATLAB notation), A(:,i,j). I will be doing this millions of times on a very large dataset, so speed and efficiency are a high priority.
I think you want
B = A.subcube( span::all, span(i), span(j) );
or equivalently
B = A.subcube( span(), span(i), span(j) );
where B will be a row or column vector of the same type as A (e.g. containing double by default, or a number of other available types).
.slice() should be pretty quick. It simply provides a reference to the underlying Mat class. You could try something along these lines:
cube C(4,3,2);
double* mem = C.slice(1).colptr(2);
Also, bear in mind that Armadillo has range checks enabled by default. If you want to avoid the range checks, use the .at() element accessors:
cube C(4,3,2);
C.at(3,2,1) = 456;
Alternatively, you can store your matrices in a field class:
field<mat> F(100);
F(0).ones(12,34);
Corresponding element access:
F(0)(1,2); // with range checks
F.at(0).at(1,2); // without range checks
You can also compile your code with ARMA_NO_DEBUG defined, which will remove all run-time debugging (such as range checks). This will give you a speedup, but it is only recommended once you have debugged all your code (i.e., verified that your algorithm is working correctly). The debugging checks are very useful in picking up mistakes.

C++: Implementing Row Iterator for Table

I have a generic table class implemented in C++ that uses a shared_ptr< ptr_vector< vector<T> > > as its backing, where T is an arbitrary type; the ptr_vector contains pointers to the vectors corresponding to the columns of the table. I decided to wrap the ptr_vector in a shared_ptr since the tables may contain many millions of rows, and to hold the per-column data vectors by pointer in the ptr_vector for the same reason. (Please tell me if this can be improved.)
Implementing column-wise operations on this table is trivial, since I have access to the native iterator supplied by the vector. However, I also need the table to support row-wise operations: relatively mundane operations such as adding and removing rows should be supported, as well as the ability to use the STL algorithms with the table. Now, I have run across some design issues that I need some help to address:
It seems that implementing a custom iterator to conduct row-wise operations is necessary to accomplish what is described above. Would boost::iterator_adaptor be the right way to go about doing this?
When the user adds rows to the table, I do not wish to impose a specific data structure upon the user -- how would I go about doing this? I am thinking of accepting iterators as parameters to the add_row() method.
If you think that I should be implementing this table structure differently, I am open to any suggestions that you may have for me. It was originally designed with the intent to store strings read from tab-delimited files containing hundreds of thousands of row entries.
Thank you very much for your help!
The Boost library has a container called multi_array which provides an n-dimensional dynamic array that can be accessed and manipulated along each dimension. This seems very similar to what you are trying to build; perhaps you could use it instead?
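For the row-wise access asked about in the question, here is a minimal sketch using plain std::vector columns instead of the shared_ptr/ptr_vector backing, and a materialized row instead of a full boost::iterator_adaptor (the names `ColumnTable`, `add_row`, and `row` are my own). It also illustrates the suggestion of accepting iterators in add_row so no particular data structure is imposed on the caller.

```cpp
#include <cstddef>
#include <vector>

// Column-major table: one vector per column, all of equal length.
template <typename T>
class ColumnTable {
public:
    explicit ColumnTable(std::size_t ncols) : cols_(ncols) {}

    // Append a row from any iterator range supplying one value per column.
    template <typename InputIt>
    void add_row(InputIt first) {
        for (auto& col : cols_) col.push_back(*first++);
    }

    // Gather element r of each column into a contiguous row on demand.
    std::vector<T> row(std::size_t r) const {
        std::vector<T> out;
        out.reserve(cols_.size());
        for (const auto& col : cols_) out.push_back(col[r]);
        return out;
    }

    std::size_t rows() const { return cols_.empty() ? 0 : cols_[0].size(); }

private:
    std::vector<std::vector<T>> cols_;
};
```

A real row iterator for use with STL algorithms would yield a proxy referring back to the columns rather than copying; boost::iterator_adaptor (or a hand-written iterator with the same proxy idea) is the usual route for that.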

Is there a Boost (or other common lib) type for matrices with string keys?

I have a dense matrix where the indices correspond to genes. While gene identifiers are often integers, they are not contiguous integers. They could be strings instead, too.
I suppose I could use a boost sparse matrix of some sort with integer keys, and it wouldn't matter if they're contiguous. Or would this still occupy a great deal of space, particularly if some genes have identifiers that are nine digits?
Further, I am concerned that sparse storage is not appropriate, since this is an all-by-all matrix (there will be a distance in each and every cell, provided the gene exists).
I'm unlikely to need to perform any matrix operations (e.g., matrix multiplication). I will need to pull vectors out of the matrix (slices).
It seems like the best type of matrix would be keyed by a Boost unordered_map (a hash map), or perhaps even simply an STL map.
Am I looking at this the wrong way? Do I really need to roll my own? I thought I saw such a class somewhere before.
Thanks!
You could use a std::map to map the gene identifiers to unique, consecutively assigned integers (every time you add a new gene identifier to the map, you can give it the map's size as its identifier, assuming you never remove genes from the map).
If you want to be able to search for the identifier of a gene based on its unique integer, you can use a second map or you could use a boost::bimap, which provides a bidirectional mapping of elements.
As for which matrix container to use, you might consider boost::ublas::matrix; it provides vector-like access to rows and columns of the matrix.
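A minimal sketch of the identifier-to-index mapping described above (the class name `GeneIndex` and its methods are my own; the reverse lookup uses a plain vector rather than boost::bimap): each previously unseen gene gets the map's current size as its integer, so the integers are dense and can index directly into a matrix.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Maps gene identifiers to consecutive integers 0, 1, 2, ...
// assuming genes are never removed once added.
class GeneIndex {
public:
    // Returns the existing id, or assigns the next free one.
    std::size_t id_for(const std::string& gene) {
        auto [it, inserted] = to_id_.try_emplace(gene, to_id_.size());
        if (inserted) names_.push_back(gene);
        return it->second;
    }

    // Reverse lookup: integer id back to the gene identifier.
    const std::string& name_for(std::size_t id) const { return names_[id]; }

private:
    std::map<std::string, std::size_t> to_id_;
    std::vector<std::string> names_;
};
```

The dense integer from id_for can then index rows and columns of an ordinary contiguous matrix (boost::ublas::matrix or a flat vector), which for an all-by-all distance table is far more compact than keying a map on string pairs.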
If you don't need matrix operations, you don't need a matrix. A 2D map with string keys can be done with a std::map<std::string, std::map<std::string, double>> in plain C++, or with a hash map from Boost accordingly.
There is Boost.MultiArray, which will allow you to work with non-contiguous indexes.
If you want an efficient implementation working with matrices of static size, there is also Boost.LA, which is now on the review schedule.
And last, there is also NT2, which should be submitted to Boost soon.