Eigen Sparse Vector : Find max coefficient - c++

I am working with sparse vector with Eigen, and I need to find an efficient way to compute the index of the max coefficient (or the nth max coefficient).
My initial method uses Eigen::SparseVector::InnerIterator, however it does not compute the right value in the case of vector containing only zeros and negative value because InnerIterator only iterate on non-zero values.
How to implement it in order to take into account zero values ?

To get the index of the largest non-zero element, you can use this function:
Eigen::Index maxRow(Eigen::SparseVector<double> const & v)
{
Eigen::Index nnz = v.nonZeros();
Eigen::Index rowIdx;
double value = Eigen::VectorXd::Map(v.valuePtr(), nnz).maxCoeff(&rowIdx);
// requires special handling if value <= 0.0
return v.innerIndexPtr()[rowIdx];
}
In case value <=0 (and v.nonZeros()<v.size()), you can iterate through innerIndexPtr() until you find a gap between consecutive elements (or write something more sophisticated using std::lower_bound)
For getting the nth largest element it depends on how large your n is relative to the vector size, how many non-zeros you have, if you can modify your SparseVector, etc.
Especially, if n is relatively large, consider to partition your elements into positive and negative elements, then using std::nth_element in the correct half.

Iterate over the index array (innerIndices I think) as well at the same time as the inner iterator.

Related

Is it possible to iterate over an array in random order?

So, suppose I have an array:
int arr[5];
Instead of iterating through the array from 0 to 4, is it possible to do that in random order, and in a way that the iterator wouldn't go through the same position twice?
All the below assumes you want to implement your iteration in O(1) space - not using e.g. an auxiliary array of indices, which you could shuffle. If you can afford O(n) space, the solution is relatively easy, so I assume you can only afford O(1).
To iterate over an array in a random manner, you can make a random number generator, and use its output as an index into the array. You cannot use the standard RNGs this way because they can output the same index twice too soon. However, the theory of RNGs is pretty extensive, and you have at least two types of RNG which have guarantees on their period.
Linear-feedback shift register
Lehmer random number generator
To use these theoretical ideas:
Choose a number N ≤ n (where n is the size of your array), but not much greater than n, such that it's possible to construct a RNG with period N
Call your RNG N times, so it generates a permutation of numbers from 0 to N-1
For each number, if it's smaller than the size of your array, output the element of your array with that index
The techniques for making a RNG with a given period may be too annoying to implement in code. So, if you know the size of your array in advance, you can choose N = n, and create your RNG manually (factoring n or n+1 will be the first step). If you don't know the size of your array, choose some easy-to-use number for N, like some suitable power of 2.
As mentioned in the comments, there are several ways you can do this. The most obvious way would be to just shuffle the range and then iterate over it:
std::shuffle(arr, arr + 5);
for (int a : arr)
// a is a new random element from arr in each iteration
This will modify the range of course, which may not be what you want.
For an option that doesn't modify the range: you can generate a range of indices, shuffle those indices and use those indices to index into the range:
int ind[5];
std::iota(ind, ind + 5, 0);
std::shuffle(ind, ind + 5);
for (int i : ind)
// arr[i] is a new random element from arr in each iteration
You could also implement a custom iterator that iterates over a range in random order without repeats, though this might be overkill for the functionality you're trying to achieve. It's still a good thing to try out, to learn how that works.

C++: Efficient way to check if elements in a vector are greater than elements in another having same indices?

I have a vector < vector <int> > like so:
v = {{1,2,3}, {4,2,1}, {3,1,1}....}}
All v's elements like v[0], v[1], v[2]... have the same size. There may be duplicate elements.
What I am trying to do is to find and delete vectors (like v[2]) that are "majorized" by another vector (like v[1]), i.e. all elements of v[1] are greater than/equal to the respective elements(in order of indices) in v[2].
A naive way of doing this would be to loop thorough v and compare each vector with another vector and further compare each element with another vector's element.
But I feel there must a better way to do this without getting O(n^3) in the number of elements of all the vectors in v.
If multiple vectors are equal, I need only one of them (i.e delete all except one). A random choice would be sufficient.
Any thoughts or ideas are appreciated!
This is called the maxima of a point set. For two and three dimensions, this can be solved in O(n log n) time. For more than three dimensions, this can be solved in O(n(log n)^(d − 3)  log log n) time. For random points, a linear expected time algorithm is available.

Most efficient way to find index of matching values in two sorted arrays using C++

I currently have a solution but I feel it's not as efficient as it could be to this problem, so I want to see if there is a faster method to this.
I have two arrays (std::vectors for example). Both arrays contain only unique integer values that are sorted but are sparse in value, ie: 1,4,12,13... What I want to ask is there fast way I can find the INDEX to one of the arrays where the values are the same. For example, array1 has values 1,4,12,13 and array2 has values 2,12,14,16. The first matching value index is 1 in array2. The index into the array is what is important as I have other arrays that contain data that will use this index that "matches".
I am not confined to using arrays, maps are possible to. I am only comparing the two arrays once. They will not be reused again after the first matching pass. There can be small to large number of values (300,000+) in either array, but DO NOT always have the same number of values (that would make things much easier)
Worse case is a linear search O(N^2). Using map would get me better O(log N) but I would still have convert an array to into a map of value, index pairs.
What I currently have to not do any container type conversions is this. Loop over the smaller of the two arrays. Compare current element of small array (array1) with the current element of large array (array2). If array1 element value is larger than array2 element value, increment the index for array2 until is it no longer larger than array1 element value (while loop). Then, if array1 element value is smaller than array2 element, go to next loop iteration and begin again. Otherwise they must be equal and I have my index to either arrays of the matching value.
So in this loop, I am at best O(N) if all values have matches and at worse O(2N) if none match. So I am wondering if there is something faster out there? It's hard to know for sure how often the two arrays will match, but I would way I would lean more toward most of the arrays will mostly have matches than not.
I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.
Code example:
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34,40};
for(unsigned int i=0, z=0; i < array1.size(); i++)
{
int value1 = array1[i];
while(value1 > array2[z] && z < array2.size())
z++;
if (z >= array2.size())
break; // reached end of array2
if (value1 < array2[z])
continue;
// we have a match, i and z indices have same value
}
Result will be matching indexes for array1 = [1,3] and for array2= [2,3]
I wrote an implementation of this function using an algorithm that performs better with sparse distributions, than the trivial linear merge.
For distributions, that are similar†, it has O(n) complexity but ranges where the distributions are greatly different, it should perform below linear, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case isn't better than O(n log n). On the other hand, I haven't been able to find that worst case either.
I templated it so that any type of ranges can be used, such as sub-ranges or raw arrays. Technically it works with non-random access iterators as well, but the complexity is much greater, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.
† By similar distribution, I mean that the pair of arrays have many crossings. By crossing, I mean a point where you would switch from one array to another if you were to merge the two arrays together in sorted order.
#include <algorithm>
#include <iterator>
#include <utility>
// helper structure for the search
template<class Range, class Out>
struct search_data {
// is any there clearer way to get iterator that might be either
// a Range::const_iterator or const T*?
using iterator = decltype(std::cbegin(std::declval<Range&>()));
iterator curr;
const iterator begin, end;
Out out;
};
template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
return search_data<Range, Out>{
std::begin(range),
std::begin(range),
std::end(range),
out,
};
}
template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
auto search_data1 = init_search_data(in1, out1);
auto search_data2 = init_search_data(in2, out2);
// initial order is arbitrary
auto lesser = &search_data1;
auto greater = &search_data2;
// if either range is exhausted, we are finished
while(lesser->curr != lesser->end
&& greater->curr != greater->end) {
// difference of first values in each range
auto delta = *greater->curr - *lesser->curr;
if(!delta) { // matching value was found
// store both results and increment the iterators
*lesser->out++ = std::distance(lesser->begin, lesser->curr++);
*greater->out++ = std::distance(greater->begin, greater->curr++);
continue; // then start a new iteraton
}
if(delta < 0) { // set the order of ranges by their first value
std::swap(lesser, greater);
delta = -delta; // delta is always positive after this
}
// next crossing cannot be farther than the delta
// this assumption has following pre-requisites:
// range is sorted, values are integers, values in the range are unique
auto range_left = std::distance(lesser->curr, lesser->end);
auto upper_limit =
std::min(range_left, static_cast<decltype(range_left)>(delta));
// exponential search for a sub range where the value at upper bound
// is greater than target, and value at lower bound is lesser
auto target = *greater->curr;
auto lower = lesser->curr;
auto upper = std::next(lower, upper_limit);
for(int i = 1; i < upper_limit; i *= 2) {
auto guess = std::next(lower, i);
if(*guess >= target) {
upper = guess;
break;
}
lower = guess;
}
// skip all values in lesser,
// that are less than the least value in greater
lesser->curr = std::lower_bound(lower, upper, target);
}
}
#include <iostream>
#include <vector>
int main() {
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34};
std::vector<std::size_t> indices1;
std::vector<std::size_t> indices2;
match_indices(array1, array2,
std::back_inserter(indices1),
std::back_inserter(indices2));
std::cout << "indices in array1: ";
for(std::vector<int>::size_type i : indices1)
std::cout << i << ' ';
std::cout << "\nindices in array2: ";
for(std::vector<int>::size_type i : indices2)
std::cout << i << ' ';
std::cout << std::endl;
}
Since the arrays are already sorted you can just use something very much like the merge step of mergesort. This just looks at the head element of each array, and discards the lower element (the next element becomes the head). Stop when you find a match (or when either array becomes exhausted, indicating no match).
This is O(n) and the fastest you can do for arbitrary distubtions. With certain clustered distributions a "skip ahead" approach could be used rather than always looking at the next element. This could result in better than O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14 an algorithm could determine there were no matches to be found in as few as one comparison (5 < 10).
What is the range of the stored numbers?
I mean, you say that the numbers are integers, sorted, and sparse (i.e. non-sequential), and that there may be more than 300,000 of them, but what is their actual range?
The reason that I ask is that, if there is a reasonably small upper limit, u, (say, u=500,000), the fastest and most expedient solution might be to just use the values as indices. Yes, you might be wasting memory, but is 4*u really a lot of memory? This depends on your application and your target platform (i.e. if this is for a memory-constrained embedded system, its less likely to be a good idea than if you have a laptop with 32GiB RAM).
Of course, if the values are more-or-less evenly spread over 0-2^31-1, this crude idea isn't attractive, but maybe there are properties of the input values that you can exploit other simply than the range. You might be able to hand-write a fairly simple hash function.
Another thing worth considering is whether you actually need to be able to retrieve the index quickly or if it helps just be able to tell if the index exists in the other array quickly. Whether or not a value exists at a particular index requires only one bit, so you could have a bitmap of the range of the input values using 32x less memory (i.e. mask off 5 LSBs and use that as a bit position, then shift the remaining 27 bits 5 places right and use that as an array index).
Finally, a hybrid approach might be worth considering, where you decide how much memory you're prepared to use (say you decide 256KiB, which corresponds to 64Ki 4-byte integers) then use that as a lookup-table to into much smaller sub-problems. Say you have 300,000 values whose LSBs are pretty evenly distributed. Then you could use 16 LSBs as indices into a lookup-table of lists that are (on average) only 4 or 5 elements long, which you can then search by other means. A couple of year ago, I worked on some simulation software that had ~200,000,000 cells, each with a cell id; some utility functionality used a binary search to identify cells by id. We were able to speed it up significantly and non-intrusively with this strategy. Not a perfect solution, but a great improvement. (If the LSBs are not evenly distributed, maybe that's a property that you can exploit or maybe you can choose a range of bits that are, or do a bit of hashing.)
I guess the upshot is “consider some kind of hashing”, even the “identity hash” or simple masking/modulo with a little “your solution doesn't have to be perfectly general” on the side and some “your solution doesn't have to be perfectly space efficient” sauce on top.

Finding nth largest element inplace

I need to find nth largest element in an array and currently I'm doing it the following way:
std::vector<double> buffer(sequence); // sequence is const std::vector<double>
std::nth_element(buffer.begin(), buffer.begin() + idx, buffer.end(), std::greater<double>());
nth_element = buffer[idx];
But is there any way to find the n-th largest element in an array without using an external buffer?
You can avoid copying the entire buffer without modifying the original range by using std::partial_sort_copy. Simply copy the partially sorted range into a smaller buffer of size n, and take the last element.
If you may modify the original buffer, then you can simply use std::nth_element in place.
You can use the partition function used in Quick Sort to find the nth largest elements.
The partition function will divide the original array into two parts.The first part will be smaller than A[i], The second part will be larger than A[i],partition will return i.If i is equal to n, then return A[i].If i is smaller than n,then partition the second part.If i is larger than n,then partition the first part.
It won't cost you extra buffer,and the average time cost is O(n).

Super long arrays in C++

I have two sets A and B. Set A contains unique elements. Set B contains all elements. Each element in the B is a 10 by 10 matrix where all entries are either 1 or 0. I need to scan through set B and everytime i encounter a new matrix i will add it to set A. Therefore set A is a subset of B containing only unique matrices.
It seems like you might really be looking for a way to manage a large, sparse array. Trivially, you could use a hash map with your giant index as your key, and your data as the value. If you talk more about your problem, we might be able to find a more appropriate data structure for your problem.
Update:
If set B is just some set of matrices and not the set of all possible 10x10 binary matrices, then you just want a sparse array. Every time you find a new matrix, you compute its key (which could simply be the matrix converted into a 100 digit binary value, or even a 100 character string!), look up that index. If no such key exists, insert the value 1 for that key. If the key does exist, increment and re-store the new value for that key.
Here is some code, maybe not very efficient :
# include <vector>
# include <bitset>
# include <algorithm>
// I assume your 10x10 boolean matrix is implemented as a bitset of 100 bits.
// Comparison of bitsets
template<size_t N>
class bitset_comparator
{
public :
bool operator () (const std::bitset<N> & a, const std::bitset<N> & b) const
{
for(size_t i = 0 ; i < N ; ++i)
{
if( !a[i] && b[i] ) return true ;
else if( !b[i] && a[i] ) return false ;
}
return false ;
}
} ;
int main(int, char * [])
{
std::set< std::bitset<100>, bitset_comparator<100> > A ;
std::vector< std::bitset<100> > B ;
// Fill B in some manner ...
// Keeping unique elements in A
std::copy(B.begin(), B.end(), std::inserter(A, A.begin())) ;
}
You can use std::listinstead of std::vector. The relative order of elements in B is not preserved in A (elements in A are sorted).
EDIT : I inverted A and B in my first post. It's correct now. Sorry for the inconvenience. I also corrected the comparison functor.
Each element in the B is a 10 by 10 matrix where all entries are either 1 or 0.
Good, that means it can be represented by a 100-bit number. Let's round that up to 128 bits (sixteen bytes).
One approach is to use linked lists - create a structure like (in C):
typedef struct sNode {
unsigned char bits[16];
struct sNode *next;
};
and maintain the entire list B as a sorted linked list.
The performance will be somewhat less (a) than using the 100-bit number as an array index into a truly immense (to the point of impossible given the size of the known universe) array.
When it comes time to insert a new item into B, insert it at its desired position (before one that's equal or greater). If it was a brand new one (you'll know this if the one you're inserting before is different), also add it to A.
(a) Though probably not unmanageably so - there are options you can take to improve the speed.
One possibility is to use skip lists, for faster traversal during searches. These are another pointer that references not the next element but one 10 (or 100 or 1000) elements along. That way you can get close to the desired element reasonably quickly and just do the one-step search after that point.
Alternatively, since you're talking about bits, you can divide B into (for example) 1024 sub-B lists. Use the first 10 bits of the 100-bit value to figure out which sub-B you need to use and only store the next 90 bits. That alone would increase search speed by an average of 1000 (use more leading bits and more sub-Bs if you need improvement on that).
You could also use a hash on the 100-bit value to generate a smaller key which you can use as an index into an array/list, but I don't think that will give you any real advantage over the method in the previous paragraph.
Convert each matrix into a string of 100 binary digits. Now run it through the Linux utilities:
sort | uniq
If you really need to do this in C++, it is possible to implement your own merge sort, then the uniq part becomes trivial.
You don't need N buckets where N is the number of all possible inputs. A binary tree will just do fine. This is implemented with set class in C++.
vector<vector<vector<int> > > A; // vector of 10x10 matrices
// fill the matrices in A here
set<vector<vector<int> > > B(A.begin(), A.end()); // voila!
// now B contains all elements in A, but only once for duplicates