This is my problem: I need to read a matrix of ints from a text file without knowing its size in advance.
Suppose, for instance:
"matrix.dsv"
1,0,1,0,0,0,1,0
0,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0
Is there any way of knowing the size of the matrix without importing it?
Since I will choose a different way of storing it (vector, sparse matrix, full matrix) depending on the sparsity, is there also a way of counting the nonzero elements?
Thank you very much (sorry for the lame question, I'm quite new to managing files!)
EDIT:
Thank you coincoin!
One last question: how does cin react to the end of a line?
Or better: how do I increment one of the indices when I finish a line?
Thanks ;)
Unfortunately, you have to read the file in order to know the size of the matrix, unless that information is in the file name or at the beginning of the file.
The question is a little broad.
std::vector should be the way to go, since you can dynamically add elements with push_back() for either sparse or dense storage. Prefer flattened, one-dimensional (linear) storage.
Processing the file should take a single pass: read element by element, and increment a counter whenever an element is zero.
When you reach the end of the file and know the size of your matrix, you can rearrange the data.
If you can, you should use established libraries with convenient matrix classes and methods, such as Eigen or Armadillo, if you are doing linear algebra.
I have a (kind of) performance problem in my code that is rooted in the chosen architecture.
I will use multidimensional tensors (basically matrices with more dimensions) in the form of cubes to store my data.
Since the dimension is not known at compile time, I can't use Boost's MultiArray (IIRC), but have to come up with my own solution.
Right now, I store each dimension on its own. I have a tensor of dimension (let's say) 3 that holds a number of tensors of dimension 2 (in a std::vector), each of which has a std::vector of tensors of dimension 1, each of which holds a std::vector of (numerical) data. I use an abstract base class for my tensor, so everything in there is a pointer to the abstract class while being (secretly) multi- or one-dimensional.
I extract a single numerical data point by passing a std::list of indices to a tensor; it takes the first element, finds the corresponding sub-tensor, and passes the rest of the list on to that tensor in a (kind of) recursive call.
I now have to do a multi-dimensional fast Fourier transform on that data. I use a thread pool and job objects; each job copies data out of a tensor along one dimension, does an FFT, and writes the data back.
I already have logic to implement ThreadPool and organize the dimensions to FFT along, but there is one problem:
My data structure is the cache-unfriendliest beast one can think of. Copying data along the first dimension (the one whose data lives in a single 1-D tensor) is reasonably fast, but in the other directions I need to copy my data from all over the place.
Since there are no race conditions (I make sure every concurrent FFT works on distinct data points), I thought I would skip a mutex guard and let every thread copy at the same time. However, this heavily slows down the process ("I copy my data now!" "No, I copy my data now!" "But it's my turn now!"...).
Guarding the copy process with a mutex does not increase speed either. The FFT of a vector with 1024 elements is far faster than the copying needed to gather those elements, so nearly all of my threads end up waiting while one copies.
Long story short:
Is there any kind of multi-dimensional data structure that does not need its dimension set at compile time and that lets me traverse quickly along every axis? I have searched for a while now, but nothing came up besides Boost.MultiArray. Flattening everything into one long vector also does not work for me, since the indices would grow too large to hold in the usual int types.
I can't think of a good way to present code examples here, since most of the code is rather simple, but if needed I can add some.
Eigen has multi-dimensional tensor support (nominally unsupported, but written by the DeepMind people, so "somewhat" supported?), and FFTW has 1-D to 3-D FFTs. Using external libraries for those 1-D to 3-D FFTs would outsource most of the hard work.
Edit: actually, FFTW also has support for threaded n-dimensional FFTs.
I'm using the Yale representation of a sparse matrix in a power-iteration algorithm, and everything runs well and fast.
But now I have a problem: my professor will send the sparse matrix in an unordered data file, and since the matrix is symmetric, only one of each pair of indices will be there.
The problem is, in my implementation I need to insert the elements in order.
I tried a few things to read the file and then insert into my sparse matrix:
1) Using a dense matrix.
2) Using another sparse-matrix implementation, I tried with std::map.
3) A priority queue. I made an array of priority_queues and insert element (i, j) into priority_queue[i], so when I pop priority_queue[i] I get the lowest j index of row i.
But I need something really fast and memory-efficient, because the largest matrix I'll use will be around 100k x 100k, and the approaches I tried were very slow, almost 200 times slower than the power iteration itself.
Any suggestions? Sorry for the poor English :(
The way many sparse loaders work is to use an intermediate pure-triples structure. That is, whatever the file looks like, you load it into something like vector< tuple< row, column, value > >.
You then build the sparse structure from that. The reason is precisely what you're running into: your sparse matrix structure likely has constraints, such as needing to know the number of elements in each row/column, or requiring sorted input. You can massage your triples array into whatever form you need (e.g. by sorting it).
This also makes it trivial to solve your symmetry dilemma. For every triple in the source file, you insert both (row, column, value) and (column, row, value) into your intermediate structure.
Another option is to simply write a script that will sort your professor's file.
FYI, in the sparse world the number of elements (nonzeros) is what matters, not the dimensions of the matrix. 100k-by-100k on its own is a meaningless piece of information; that entire matrix could be totally empty, for example.
I've created my own Matrix class where, inside the class, the matrix data is stored in an STL vector. I've noticed while searching the web that some people use a vector of vectors to represent the matrix data. My best guess is that as long as the matrix is small or skinny (row_num >> column_num), the difference should be small. But what if the matrix is square or fat (row_num << column_num)? If I were to create a very large matrix, would I see a difference in run time? Are there other factors that need to be considered?
Thanks
Have you considered using an off-the-shelf matrix representation such as boost's instead of reinventing the wheel?
If you have a lot of empty rows for example, using the nested representation could save a lot of space. Unless you have specific information in actual use cases showing one way isn't meeting your requirements, code the way that's easiest to maintain and implement properly.
There are too many variables to answer your question.
Create an abstraction so that your code does not care how the matrix is represented. Then write your code using any implementation. Then profile it.
If your matrix is dense, the "vector of vectors" is very unlikely to be faster than a single big memory block and could be slower. (Chasing two pointers for random access + worse locality.)
If your matrices are large and sparse, the right answer to your question is probably "neither".
So create an abstract interface, code something up, and profile it. (And as @Mark says, there are lots of third-party libraries you should probably consider.)
If you store everything in a single vector, an iterator will traverse the entire matrix. If you use a vector of vectors, an iterator will only traverse a single dimension.
I'm trying to write a C++ program that needs to store and adjust data in a 3D array. The size is given by the user and doesn't change throughout the run, and I don't need to perform any complicated matrix operations on it. I just need it to be optimized for setting and getting from given 3D coordinates (I do quite a few iterations over all the members, and it's a big array). What's the best way to go about defining that array? A vector of vectors of vectors? Arrays of vectors? A CvMat/IplImage with multiple channels? Should I even keep it 3D, or just turn it into one very long interleaved vector and calculate the indexes accordingly?
Thanks!
I would go with your last option: a single large array with transformed indices. If all you want to do is read and write known indices, this is probably the most efficient structure, both in terms of storage and speed. You can also wrap it in a class and overload operator() to make 3D access easy; for example, you could write a(1,2,3) = 10; and the overloaded operator would take care of transforming the 3D coordinates into a linear index. Iterating over such an array is also quite simple, since there's only one dimension.
It depends on what you mean by efficient, but have you looked at KD Trees?
I am implementing a sparse matrix class in compressed row format. This means I have a fixed number of rows, and each row consists of a number of elements (this number can differ between rows but will not change after the matrix has been initialised).
Is it suitable to implement this via vectors of vectors, or will that fragment the memory?
How would I implement the allocation so that I get one big chunk of memory?
Thanks for sharing your wisdom!
Dirk
You can use existing libraries, like boost sparse matrix (http://www.boost.org/doc/libs/1_43_0/libs/numeric/ublas/doc/matrix_sparse.htm)
The general rule with sparse matrices is that you pick the structure that best fits the locations of the non-zeros in the matrix; so maybe you could write a bit more about the matrix structure, and also about what (sort of) algorithms will be invoked on it.
About the memory: if the matrix is not too big, it is better to allocate it as one big chunk and do some index magic; this may result in better cache use.
If it is huge, then the chunked version should be used, since it will fit into memory more easily.
You won't get 'one big chunk' with vector-of-vectors. Each row will be a contiguous chunk, but each row's data could be located at a different place on the heap.
Keep in mind that this might be a good thing if you're going to be dealing with huge matrices. The larger the matrix, the less likely you'll be able to acquire a contiguous chunk to fit the whole thing.
I've implemented the compressed column format with three std::vectors (two vectors of int for the row and column indices and one for the values). Why do you need std::vector<std::vector<int>>?
Are you sure you're implementing the right format? A description of the compressed column format (and code for matrix-vector multiplication) can be found here.