Read from file a sparse-matrix - c++

I'm using a the Yale representation of sparse-matrix in power iteration algorithm, everything goes well and fast.
But, now I have a problem, my professor will send the sparse-matrix in a data file unordered, and since the matrix is symmetric only one pair of index will be there.
The problem is, in my implementation I need to insert the elements in order.
I tried somethings to read and after that insert into my sparse-matrix:
1) Using a dense matrix.
2) Using another sparse-matrix implementation, I tried with std::map.
3) Priority queue, I made a array of priority_queues. I insert the element i,j in the priority_queue[i], so when I pop the priority_queue[i] I take the lowest j-index of the row i.
But I need something really fast and memory efficient, because the largest matrix I'll use will be like 100k x 100k, and the tries I made was so slow, almost 200 times slower than the power iteration itself.
Any suggestions? Sorry for the poor english :(

The way many sparse loaders work is that you use an intermediate pure triples structure. I.e. whatever the file looks like, you load it into something like vector< tuple< row, column, value> >.
You then build the sparse structure from that. The reason is precisely what you're running into. Your sparse matrix structure likely has constraints, like you need to know the number of elements in each row/column, or the input needs to be sorted, etc. You can massage your triples array into whatever you need (i.e. by sorting it).
This also makes it trivial to solve your symmetry dilemma. For every triple in the source file, you insert both (row, column, value) and (column, row, value) into your intermediate structure.
Another option is to simply write a script that will sort your professor's file.
FYI, in the sparse world the number of elements (nonzeros) is what matters, not the dimensions of the matrix. 100k-by-100k is a meaningless piece of information. That entire matrix could be totally empty, for example.

Related

Get row vector of 2d vector in C++

I have a vector of vector in C++ defined with: vector < vector<double> > A;
Let's suppose that A has been filled with some values. Is there a quick way to extract a row vector from A ?
For instance, A[0] will give me the first column vector, but how can I get quicky the first row vector?
There is no "quick" way with that data structure, you have to iterate each column vector and get the value for desired row and add it to temporary row vector. Wether this is fast enough for you or not depends on what you need. To make it as fast as possible, be sure to allocate right amount of space in the target row vector, so it doesn't need to be resized while you add the values to it.
Simple solution to performance problem is to use some existing matrix library, such as Eigen suggested in comments.
If you need to do this yourself (because it is assignment, or because of licensing issues, or whatever), you should probably create your own "Matrix 2D" class, and hide implementation details in it. Then depending on what exactly you need, you can employ tricks like:
have a "cache" for rows, so if same row is fetched many times, it can be fetched from the cache and a new vector does not need to be created
store data both as vector of row vectors, and vector of column vectors, so you can get either rows or columns at constant time, at the cost of using more memory and making changes twice as expensive due to duplication of data
dynamically change the internal representation according to current needs, so you get the fixed memory usage, but need to pay the processing cost when you need to change the internal representation
store data in flat vector with size of rows*columns, and calculate the correct offset in your own code from row and column
But it bears repeating: someone has already done this for you, so try to use an existing library, if you can...
There is no really fast way to do that. Also as pointed out, I would say that the convention is the other way around, meaning that A[0] is actually the first row, rather than the first column. However even trying to get a column is not really trivial, since
{0, 1, 2, 3, 4}
{0}
{0, 1, 2}
is a very possible vector<vector<double>> A, but there is no real column 1, 2, 3 or 4. If you wish to enforce behavior like same length columns, creating a Matrix class may be a good idea (or using a library).
You could write a function that would return a vector<double> by iterating over the rows, storing the appropriate column value. But you would have to be careful about whether you want to copy or point to the matrix values (vector<double> / vector<double *>). This is not very fast as the values are not next to each other in memory.
The answer is: in your case there is no corresponding simple option as for columns. And one of the reasons is that vector> is a particular poorly suited container for multi-dimensional data.
In multi-dimension it is one of the important design decisions: which dimension you want to access most efficiently, and based on the answer you can define your container (which might be very specialized and complex).
For example in your case: it is entirely up to you to call A[0] a 'column' or a 'row'. You only need to do it consistently (best define a small interface around which makes this explicit). But STOP, don't do that:
This brings you to the next level: for multi-dimensional data you would typically not use vector> (but this is a different issue). Look at smart and efficient solutions already existing e.g. in ublas https://www.boost.org/doc/libs/1_65_1/libs/numeric/ublas/doc/index.html or eigen3 https://eigen.tuxfamily.org/dox/
You will never be able to beat these highly optimized libraries.

How to efficiently add and remove vector values C++

I am trying to efficiently calculate the averages from a vector.
I have a matrix (vector of vectors) where:
row: the days I am going back (250)
column: the types of things I am calculating the average of (10,000 different things)
Currently I am using .push_back() which essentially iterates through each row in each column and then I use erase() in order to remove the last value. As this method goes through all the values, my code is very slow.
I am thinking of a method linked to substitution, however I have a hard time implementing the idea, as all the values have an order (i.e. I need to remove the old value and the value I add / substitute will be the newest).
Below is my code so far.
Any ideas for a solution or guides for the right direction will be much appreciated.
//declaration
vector <vector<float> > vectorOne;
//initialization
vectorOne(250, vector<float>(10000, 0)),
//This is the slow method
vectorOne[column].push_back(1);//add newest value
vectorOne[column].erase(vectorOne[column].begin() + 0); //remove latest value
You probably need a different data structure.
The problem sounds like a queue. You add to the end and take from the front. With real queues, everyone then shuffles up a step. With computer queues, we can use a circular buffer (you do need to be able to get a reasonable bound on maximum queue length).
I suggest implementing your own on top of a plain C array first, then using the STL version when you've understood the principle.

Representation of a symmetric diagonal matrix

Lets assume we have a huge symmetric diagonal matrix. What is the efficient way to implement this?
The only way that i could think of is that by using the symmetric property where Xij = Xji, we can reduce the size of this matrix by half. But then representing this matrix using a 2D array would be inefficient, since we cant reduce the matrix size by using arrays.
Another thing representing this matrix using adjacency list also would be inefficient, because relating this matrix to a graph. It would be a density graph. And the operation of adj list takes lots of time such as removing, inserting and searching.
But what about using heaps?
There is no one answer until you decide what you are going to do with this matrix (or maybe matrices?).
If you are just going to store and remember it, then just store it sequentially, leaving out the redundant entries. (Your code knows how to access it, because that is all it does, right?)
More probably, you want to do normal matrix operations on it. In that case, are you trying to make the storage efficient, or the execution? In the later case, I don't see many opportunities based on it being symmetric--the multiplies are the expensive thing and you probably still need all of those. If it is the storage, then are you limiting yourself to operations that only take symmetric in and symmetric out? Sounds awfully specific. If so, then you only need to do the calculations for the part you are storing, because, by definition the other entries are symmetric, so just write your code to generate that part of the matrix and you are done.

c++ read matrix of unkown size

This is my problem, I need to read from a text file a certain matrix of int, without knowing the size of it
suppose for instance:
"matrix.dsv"
1,0,1,0,0,0,1,0
0,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0
Is there any way of knowing the size of the matrix without importing it?
Since I will choose a different way of memorizing it (vector, sparse matrix, full matrix) depending on the sparseness, is there also a way of counting the nonzero elements?
Thank you very much (sorry for the lame question, i'm quite new to managing files!)
EDIT:
Thank you coincoin!
One last question, how does cin reacts to the end of the line?
Or better, how do I increment one of the indexes when I finish a line?
Thanks ;)
Unfortunately, you have to read the file in order to know the size of the matrix unless you have information in the file name or at the beginning of your file.
The question is a little broad.
std::vector should be the way to go since you can dynamically add elements with push_back() for sparse or dense storage. Prefer flatten linear (one dimension) storage.
Processing the file should be one pass, you read element by element, check and add to a counter if it is zero.
When reaching end of file and know the size of your matrix, you can rearrange the data.
If you can you should try to use known libraries for convenient Matrix classes and methods such as Eigen, Armadillo if you are doing linear algebra.

3D-Grid of bins: nested std::vector vs std::unordered_map

pros, I need some performance-opinions with the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>> grid;// (SizeX, SizeY, SizeZ);
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add&delete deques when data is added/deleted. Probably less memory used, but maybe less fast? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, say object types ObjA, ObjB, ObjC per grid point, then my data essentially becomes 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of
grid-positions
Which is the nearest non-empty ObjC-bucket to
position x,y,z?
PS:
I had also thought about a tree based data-structure before, reading about nearest neighbour approaches. Since my data is so regular I had thought I'd save all the tree-building dividing of the cells into smaller pieces and just make a static 3D-grid of the final leafs. Thats how I came to ask about the best way to store this grid here.
Question associated with this, if I have a map<int, Obj> is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way the building of the above mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array that has fortran-ordering. It's a little bit like the chunks games like minecraft use. Which is a little like using a kd-tree with fixed leaf-size and fixed amount of leaves? Works pretty fast now so I'm happy with this approach.
Answer to 1st question
As #Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector or an array if you will. std::vector brings easier maintenance at a minimal overhead, so I'd prefer that. In terms of space it consumes O(N^3) space, where N is the number of grid points along one dimension. In order to get the best performance when iterating over the data, remember to resolve the indices in the reverse order as you defined it: innermost first, outermost last.
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in space O(N), with N being the number of elements. Here is a benchmark comparing several hash maps. I made good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add it to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
Answer to 2nd question
I'd say you data is 4D if you have a variable number of elements a long the 4th dimension, or a fixed large number of elements. With option 1B) you'd indeed add the bucket index, for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.