How to efficiently generate random subsets of rows from a matrix - c++

I have a large matrix M implemented as vector<vector<double>> with m rows, i.e. the matrix is a vector of m vectors of n column elements.
I have to create two subsets of the rows of this matrix, i.e. A holds k rows, and B the other m-k rows. The rows must be selected at random.
I do not want to use any libraries other than STL, so no boost either.
Two approaches that I considered are:
generate a std::random_shuffle of row indices, copy the rows indicated by the first k indices to A and the rows indicated by the other m-k to B
do a std::random_shuffle of M, then copy the first k rows to A and the other m-k rows to B
Are there other options, and how do the two options above compare in terms of memory consumption and processing time?
Thanks!

If you don't need B to be in random order, then random_shuffle does more work than you need.
If by "STL" you mean SGI's STL, then use random_sample.
If by "STL" you mean the C++ standard libraries, then you don't have random_sample. You might want to copy the implementation, except stop after the first n steps. This will reduce the time.
Note that these both modify a sequence in place. Depending on where you actually want A and B to end up, and who owns the original, this might mean that you end up making two copies of each row - once to get it into a mutable container for the shuffle, then again to get it into its final destination. That is more memory and processing time than is required. To fix this you could swap rows out of the temporary container and into A and B. Or copy the algorithm, but adapt it to:
Make a list of the indexes of the original vector
Partially shuffle the list of indexes
Copy the rows corresponding to the first k indexes to A, and the rest to B.
I'm not certain this is faster or uses less memory, but I suspect so; a sketch follows below.
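A minimal sketch of that adapted algorithm, using a partial Fisher-Yates shuffle over indices (splitRows is an illustrative name; C++11 <random> is assumed here, rand() would do in older code):

#include <numeric>
#include <random>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Shuffle only the first k positions of an index list (a partial
// Fisher-Yates), then copy the selected rows into A and the rest into B.
std::pair<Matrix, Matrix> splitRows(const Matrix& M, std::size_t k,
                                    std::mt19937& rng)
{
    std::vector<std::size_t> idx(M.size());
    std::iota(idx.begin(), idx.end(), 0);   // 0, 1, ..., m-1

    for (std::size_t i = 0; i < k; ++i) {
        std::uniform_int_distribution<std::size_t> pick(i, idx.size() - 1);
        std::swap(idx[i], idx[pick(rng)]);  // stop after the first k steps
    }

    Matrix A, B;
    A.reserve(k);
    B.reserve(M.size() - k);
    for (std::size_t i = 0; i < idx.size(); ++i)
        (i < k ? A : B).push_back(M[idx[i]]);
    return {A, B};
}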
The standard for random_shuffle says that it performs "swaps". I hope that means it's efficient for vectors, but you might want to check that it is actually using an optimised swap rather than doing any copying. I think it should mean that, especially since the natural implementation is Fisher-Yates, but I'm not sure whether the language in the standard should be taken to guarantee it. If it is copying, then your second approach is going to be very slow. If it's using swap then the two are roughly comparable. swap on a vector is going to be slightly slower than swap on an index, but there's not a whole lot in it. Swapping either a vector or an index is very quick compared with copying a row, and there are m of each operation, so I doubt it will make a huge difference to total run time.
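It's cheap to verify: a quick check (my own, not part of the original answer) that std::swap on vectors exchanges buffers rather than copying elements:

#include <cstdio>
#include <utility>
#include <vector>

int main() {
    std::vector<double> a(1000, 1.0), b(1000, 2.0);
    const double* pa = a.data();   // address of a's heap buffer
    std::swap(a, b);               // O(1): the buffers change owners
    std::printf("%s\n", b.data() == pa ? "buffers swapped, no copy"
                                       : "elements were copied");
}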
[Edit: Alex Martelli was complaining recently about misuse of the term "STL" to mean the C++ standard libraries. In this case it does make a difference :-)]

I think that the random_shuffle of indices makes sense.
If you need to avoid the overhead of copying the individual rows, and don't mind sharing data, you might be able to make the A and B matrices be vectors of pointers to rows in the original matrix.
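For example, a sketch where A and B just alias rows of M (they must not outlive it; splitByIndex is an illustrative name):

#include <cstddef>
#include <vector>

using Row = std::vector<double>;

// A and B hold pointers into M instead of copies; M must outlive them.
void splitByIndex(const std::vector<Row>& M,
                  const std::vector<std::size_t>& shuffledIdx, std::size_t k,
                  std::vector<const Row*>& A, std::vector<const Row*>& B)
{
    for (std::size_t i = 0; i < shuffledIdx.size(); ++i)
        (i < k ? A : B).push_back(&M[shuffledIdx[i]]);
}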

Easiest way: use a random whole number generator, and queue up the offsets of each row in a separate container (assuming that a row sits at the same offset in each column vector). The container you use will depend more on its eventual use. (Remember to take care of the size_t limit, and to tie the offset container's lifetime to the Matrix itself.)
Edit: replaced pointers with offsets - makes more sense and is safer.
Orig: Quick Q: is each (inner) vector a row or a column?
i.e. is M a vector of columns or a vector of rows?

Related

Get row vector of 2d vector in C++

I have a vector of vector in C++ defined with: vector < vector<double> > A;
Let's suppose that A has been filled with some values. Is there a quick way to extract a row vector from A?
For instance, A[0] will give me the first column vector, but how can I quickly get the first row vector?
There is no "quick" way with that data structure, you have to iterate each column vector and get the value for desired row and add it to temporary row vector. Wether this is fast enough for you or not depends on what you need. To make it as fast as possible, be sure to allocate right amount of space in the target row vector, so it doesn't need to be resized while you add the values to it.
Simple solution to performance problem is to use some existing matrix library, such as Eigen suggested in comments.
If you need to do this yourself (because it is assignment, or because of licensing issues, or whatever), you should probably create your own "Matrix 2D" class, and hide implementation details in it. Then depending on what exactly you need, you can employ tricks like:
have a "cache" for rows, so if same row is fetched many times, it can be fetched from the cache and a new vector does not need to be created
store data both as a vector of row vectors and a vector of column vectors, so you can get either rows or columns in constant time, at the cost of using more memory and making changes twice as expensive due to duplication of data
dynamically change the internal representation according to current needs, so you get the fixed memory usage, but need to pay the processing cost when you need to change the internal representation
store data in a flat vector of size rows*columns, and calculate the correct offset from row and column in your own code (see the sketch after this list)
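For illustration, a minimal sketch of the flat-storage idea in row-major order (Matrix2D and its methods are illustrative names, not an existing library):

#include <cstddef>
#include <vector>

class Matrix2D {
public:
    Matrix2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // element access: the offset is computed as r * cols + c
    double& at(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    const double& at(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    // a row is a contiguous range, so copying it out is cheap
    std::vector<double> row(std::size_t r) const {
        return std::vector<double>(data_.begin() + r * cols_,
                                   data_.begin() + (r + 1) * cols_);
    }

    // a column still has to stride through memory
    std::vector<double> col(std::size_t c) const {
        std::vector<double> out;
        out.reserve(rows_);
        for (std::size_t r = 0; r < rows_; ++r)
            out.push_back(data_[r * cols_ + c]);
        return out;
    }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};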
But it bears repeating: someone has already done this for you, so try to use an existing library, if you can...
There is no really fast way to do that. Also, as pointed out, I would say that the convention is the other way around, meaning that A[0] is actually the first row rather than the first column. However, even trying to get a column is not really trivial, since
{0, 1, 2, 3, 4}
{0}
{0, 1, 2}
is a perfectly possible vector<vector<double>> A, but there is no complete column 1, 2, 3 or 4. If you wish to enforce behaviour like same-length inner vectors, creating a Matrix class may be a good idea (or using a library).
You could write a function that would return a vector<double> by iterating over the rows, storing the appropriate column value. But you would have to be careful about whether you want to copy or point to the matrix values (vector<double> / vector<double *>). This is not very fast as the values are not next to each other in memory.
The answer is: in your case there is no option as simple as the one for columns. And one of the reasons is that vector<vector<double>> is a particularly poorly suited container for multi-dimensional data.
With multi-dimensional data it is one of the important design decisions: which dimension do you want to access most efficiently? Based on the answer you can define your container (which might be very specialized and complex).
For example in your case: it is entirely up to you to call A[0] a 'column' or a 'row'. You only need to do it consistently (best to define a small interface around it which makes this explicit). But stop, don't do that:
This brings you to the next level: for multi-dimensional data you would typically not use vector<vector<double>> at all (but this is a different issue). Look at smart and efficient solutions that already exist, e.g. in ublas https://www.boost.org/doc/libs/1_65_1/libs/numeric/ublas/doc/index.html or eigen3 https://eigen.tuxfamily.org/dox/
You will never be able to beat these highly optimized libraries.

Performance implications of using a list of vectors versus a vector of vectors when appending in parallel

It seems that in general, vectors are to be preferred over lists when appending simple types; see for example here.
What if I want to fill a matrix with simple types? Every vector is a column, so I am going to go through the outer vector, and append 1 item to each vector, repeatedly.
Do the latter vectors of the outer vector always have to be moved when the previous vectors increase their reserved space? That is, is the whole data in one contiguous block? Or do the vectors all just hold a pointer to their individual memory regions, so the outer vector's memory size remains unchanged even as the individual vectors grow?
Taken from the comments, it appears vectors of vectors can happily be used.
For small to medium applications, the efficiency of the vectors will seldom be anything to worry about.
There are a couple of cases where you might worry, but they will be uncommon.
class CData {}; // define this
typedef std::vector<CData> Column;
typedef std::vector<Column> Table;
Table tab;
To add a new row, you will append an item to every column. In a worst case, you might cause a reallocation of each column. That could be a problem if CData is extremely complex and the columns currently hold a very large number of CData cells (I'd say 10s of thousands, at least)
Similarly, if you add a new column and force the Table vector to reallocate, it might have to copy each column and again, for very large data sets, that might be a bit slow.
Note, however, that a new-ish compiler will probably be able to move the columns from the old table to the new (rather than copying them), making that trivially fast.
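For instance, using the Table/Column typedefs above, appending a row might look like this (addRow is an illustrative helper, assuming row has one entry per column):

#include <cstddef>

void addRow(Table& tab, const std::vector<CData>& row)
{
    // one push_back per column; each Column owns its own heap buffer, so a
    // reallocation in one column never moves the others, and the outer
    // Table only reallocates when a new *column* is added
    for (std::size_t c = 0; c < tab.size(); ++c)
        tab[c].push_back(row[c]);
}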
As @kkuryllo said in a comment, it generally isn't anything to worry about.
Work on making your code as clean, simple and correct as possible. Only if profiling reveals a performance problem should you worry about optimising for speed.

Quick(est) way to remove all the elements of a vector from another one

I'm coding a physical simulation and making heavy use of two vectors of the same elements (a home-baked struct). A necessary step that crucially slows down my computation is removing all the elements contained in vec2 from vec1 (there may also be many copies of each element in vec2). My current implementation runs with complexity size(vec1)*size(vec2), but this seems not too far from a sorting problem, and I suspect someone may already have implemented something much faster (N log(N)) that gets the job done. Have you heard of, or worked with, something like that?
If the vectors are not ordered then in any case the complexity will be O(m*n), where m and n are the sizes of the vectors.
Sorting the vectors is itself slower than what you can achieve if the items are hashable: it takes O(N+M) to build a hash table from one of the vectors and then look up the items from the other, as sketched below.
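A sketch of that approach, assuming the struct can expose a hashable key (here an int id, purely illustrative):

#include <algorithm>
#include <unordered_set>
#include <vector>

struct Elem { int id; /* other fields */ };  // stand-in for the home-baked struct

void removeAll(std::vector<Elem>& vec1, const std::vector<Elem>& vec2)
{
    std::unordered_set<int> doomed;            // O(M) to build
    doomed.reserve(vec2.size());
    for (const Elem& e : vec2)
        doomed.insert(e.id);

    // single O(N) pass with O(1) average lookups (erase-remove idiom)
    vec1.erase(std::remove_if(vec1.begin(), vec1.end(),
                              [&doomed](const Elem& e) { return doomed.count(e.id) != 0; }),
               vec1.end());
}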

Algorithm for merging short lists into a long vector

I have a sparse matrix class whose non-zeros and corresponding column indices are stored, in row-order, in what are basically STL-vector-like containers. They may have unused capacity, like vectors; and to insert/remove elements, existing elements must be moved.
Say I have an operation, insert_erase_replace, or ier for short. ier can do the following, given a position p, a column index j, and a value v:
if v==0, ier removes the entry at p and left-shifts all subsequent entries.
if v!=0, and j is already present at p, ier replaces the cell contents at p with v.
if v!=0, and j is not present at p, ier inserts the entry v and column index j at p after right-shifting all subsequent entries.
So all of that is trivial.
Now let's say I have ier2, which does the same thing, except that it takes a list containing multiple column indices j and corresponding values v. It also has a size n, which indicates how many index/value pairs are present in the list. But because the vector only stores non-zeros, sometimes the actual insertion size is smaller than n.
Still trivial.
But now let's say I have ier3, which takes not just one list like ier2, but multiple lists. This represents editing a slice of the sparse matrix.
At some point, it becomes more efficient to iterate through the vectors, copying them piece by piece and inserting/replacing/erasing the list indices/values ier2-style as we arrive at each insertion point. And if the total insertion size would cause my vector to need a resize anyway, then we do that.
Given that my vector is much, much larger than the total length of the lists, is there an algorithm for efficiently merging the lists into the vector?
So far, here's what I have:
Each list passed to ier3 represents either a net deletion of entries (a left shift), a net replacement (no movement, therefore cheap), or a net insertion of entries (a right shift). There may also be some re-arrangement of elements in there, but the expensive parts are the net deletions and net insertions.
It's not hard to figure out an algorithm for efficiently doing ONLY net insertions or net deletions.
It's harder when either of the two may be happening.
The only thing I can think to do is to handle it in two passes:
Erase/replace
Insert/replace
We erase first because it makes it more likely that any insertions will require fewer copies.
Is this the right approach? Does anyone know of a better one?
Okay, so I'm going to suppose the intervals covered in each list in ier3 are disjoint and given to you in order. If it's meant for editing slices of a matrix, this seems reasonable. I'm also assuming that you don't need to resize the vector, because that case is easily detectable and solvable.
Initialise a read pointer and a write pointer to the start of the vector you're editing. There'll be an instruction pointer into ier3 too, but I'll ignore that here for clarity's sake. You'll also need a queue. At each step, one of several things can happen:
Default: Neither read nor write is at a position detailed by ier3. In this case, add the element under read to the back of the queue and write the element at the front of the queue to the cell under write. Move both pointers forward one.
read is over a cell that needs to be deleted. In this case, simply move read forward one without adding anything to the queue.
read passes from one cell to the next such that an insertion should happen between them. In this case, add the insertion to the back of the queue and then continue with the next relevant case.
read is at a cell that needs to be modified. In this case, insert the modified cell at the back of the queue, write whatever's at the front of the queue to write, and step them both forwards.
read has arrived at the unused capacity of the vector. In which case just write whatever's left in the queue.
That's the basic outline, but a couple of optimizations can be made: first, if the queue is empty, step both pointers forward to the next position detailed by ier3 without doing anything. Second, minimize the buffer by doing extra writing steps whenever read is ahead of write and the queue is nonempty.
I'd go with your plan with a few important points highlighted.
The erase/replace step should start from the left and only move points within the affected range - it can leave a "gap". It should determine the size of the final vector. At the end of this step, use the determined size to shift the "tail" of the vector as needed, leaving the exact amount of space required for insertions free.
The insertions should start from the right and fill up the gap we left in step 1 by copying each point to its final position.
This will shift the main vector at most once and never copy any point (from the existing slice or insertion set) more than twice, so it's essentially linear.
Other data structures might be helpful too - reserving space at both the front and end, or building it out of multiple sections so a resize doesn't force a full copy.
One further optimisation would be to allow some insertions during step 1. If you've erased some, completing any insertion you come across immediately until it balances will prevent you needing to move any points until you reach another erase.
Let n be the size of the list and m be the size of the vector. It sounds like ier does a binary search for j every time, so the searching part is O(n*log(m)).
Assuming the elements in the list are sorted, once you find the first element, it's faster to just navigate up the vector to find the next one. That way searching becomes O(log(m) + n) = O(n).
Also, do a dry pass first to count net deletions/insertions, and a second pass to actually apply the changes. I think these two passes will run faster than the two you describe.
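For illustration, the dry pass over one row might look like this; the (index, value) pair layout and the names are my assumptions, not the asker's actual class:

#include <algorithm>
#include <utility>
#include <vector>

using Entry = std::pair<unsigned, double>;   // (column index, value)

// Count the net change in stored entries for one row before moving anything.
// 'row' is the sorted non-zero storage, 'ops' the sorted edit list.
int netSizeChange(const std::vector<Entry>& row, const std::vector<Entry>& ops)
{
    int delta = 0;
    for (const Entry& op : ops) {
        // one binary search per op, O(n log m); resuming a linear scan from
        // the previous hit (as suggested above) gives O(log m + n) overall
        bool present = std::binary_search(
            row.begin(), row.end(), op,
            [](const Entry& a, const Entry& b) { return a.first < b.first; });
        if (op.second == 0.0)
            delta -= present ? 1 : 0;   // erase of an existing entry
        else
            delta += present ? 0 : 1;   // insert (a replace costs nothing)
    }
    return delta;
}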
I can suggest a different design for a sparse matrix that should help you achieve performance and a low memory footprint for large sparse matrices.
Instead of vectors, why not use a 2D hash table? Something like this (std:: omitted for brevity):
typedef unordered_map< unsigned /* column index */, int /* value */ > col_type;
typedef unordered_map< unsigned /* row index */, col_type* > row_type;
the outer class (sparse_matrix) searches the row map in O(1). If the row is not found, it allocates a new one.
Then that row's col_type map is searched for the column index in O(1), and the entry is deleted/replaced or inserted based on the original logic. The class can check whether the row is now empty and delete it from the 'row' hash map.
all basic operations add/delete/replace are O(1).
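A minimal sketch of that design (for brevity it stores col_type by value rather than by pointer as suggested below; sparse_matrix, set, erase and get are illustrative names):

#include <unordered_map>

class sparse_matrix {
    typedef std::unordered_map<unsigned, int> col_type;   // column index -> value
    std::unordered_map<unsigned, col_type> rows_;         // row index -> columns

public:
    void set(unsigned r, unsigned c, int v) {
        if (v == 0) { erase(r, c); return; }  // only non-zeros are stored
        rows_[r][c] = v;                      // O(1) average insert/replace
    }

    void erase(unsigned r, unsigned c) {
        auto it = rows_.find(r);
        if (it == rows_.end()) return;
        it->second.erase(c);
        if (it->second.empty()) rows_.erase(it);  // drop the now-empty row
    }

    int get(unsigned r, unsigned c) const {
        auto it = rows_.find(r);
        if (it == rows_.end()) return 0;
        auto jt = it->second.find(c);
        return jt == it->second.end() ? 0 : jt->second;
    }
};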
If you need a fast ordered iteration of the matrix, you can replace the unordered_map with 'map'. If the matrix is very sparse, map's O(log(n)) operations will perform similarly to the hash map's O(1).
BTW I used a pointer to the col_type on purpose; the outer hash map grows much (much!) faster this way, since rehashing moves only pointers rather than whole column maps.

How to efficiently *nearly* sort a list?

I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the randomness is not especially good, e.g. if it is simply based on the chance ordering of the input, such as an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so the speed of both sorting and calling random() is a consideration.
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
std::swap(order[i],order[i+rand()%3]);
Use the first two passes of JSort: build the heap twice, but do not perform the insertion sort. If the element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has time complexity dependent on randomness (the more random result is needed, the less time complexity). Use heapsort with Soft heap. For detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate, introduce a rare (configurable) case where the answer is random, by using a random number. (Caveat: std::sort formally requires a consistent strict weak ordering, so a randomised comparator is safest with a sort routine you control.)
pseudocode:
if random(1000) == 0 then
return random(2)-1 <-- -1, 0 or 1, randomly chosen
Here we have a 1/1000 chance to "scramble" two elements, but the right number really depends on the size of the container you are sorting.
Another thing to add in that rare case could be to exclude the "right" answer, because returning it would not scramble the result!
Edit:
if random(100 * container_size) == 0 then <-- here I consider the container size
{
if element_1 < element_2
return random(1); <-- do not return the "correct" value of -1
else if element_1 > element_2
return random(1)-1; <-- do not return the "correct" value of 1
else
return random(1)==0 ? -1 : 1; <-- do not return 0
}
in my pseudocode:
random(x) = y where 0 <= y <= x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
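A sketch of that perturbed-key idea, so the comparator itself stays consistent (Record, name/nameMod and nearlySort are illustrative names):

#include <algorithm>
#include <random>
#include <string>
#include <vector>

struct Record {
    std::string name;     // real key
    std::string nameMod;  // noisy copy used only for sorting
};

void nearlySort(std::vector<Record>& v, std::mt19937& rng)
{
    std::uniform_int_distribution<int> pct(0, 99);
    std::uniform_int_distribution<int> delta(-2, 2);   // up to two chars either way

    for (Record& r : v) {
        r.nameMod = r.name;
        if (!r.nameMod.empty() && pct(rng) < 3)    // ~3% of the time,
            r.nameMod[0] += delta(rng);            // nudge the first character
        if (r.nameMod.size() > 1 && pct(rng) < 10) // ~10% of the time,
            r.nameMod[1] += delta(rng);            // nudge the second character
    }
    // the comparator is a plain strict weak ordering on the noisy key
    std::sort(v.begin(), v.end(),
              [](const Record& a, const Record& b) { return a.nameMod < b.nameMod; });
}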
If you are sure that each element is at most k positions away from where it should be, you can reduce quicksort's N log(N) sorting time complexity down to N log(k)....
edit
More specifically, you would create N/k buckets, each containing k elements.
You can quicksort each bucket, which takes k * log(k) time, and there are N/k buckets, so sorting them all takes (N/k) * k * log(k) = N log(k) time. (With k buckets of N/k elements instead, the total would be N log(N/k), hence the general bound N log(max(N/k, k)).)
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that any element in the list is at most k indices away from their correct position after the sorting.
but I do not think you meant any such restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing merged elements. For other merge iterations, do not compare the elements, but instead take the element from the same part as in the previous step. It is not necessary to use an RNG to decide how to treat each element; just ignore the sorting order for every N-th element.
Other variant of this approach nearly sorts an array nearly in-place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use standard C++ algorithm with appropriately modified iterator, like boost::permutation_iterator). Reserve some limited space at the end of the array. Merge parts, starting from the end. If merged part is going to overwrite one of the non-merged elements, just select this element. Otherwise select element in sorted order. Level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
pick a random index i
pick a random index k
if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For an unsorted array, you could pick a few random elements and bubble them up or down (maybe by rotation, which is a bit more efficient). It will be hard to control the amount of (dis)order: even if you pick all N elements, you are not sure that the whole array will be sorted, because elements are moved and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.