Sort elements, but keep certain ones fixed - c++

The function
template <typename Container, typename Comparator, typename Predicate>
void sortButKeepSomeFixed (Container& c, const Comparator& comp, const Predicate& pred)
is to sort the container c according to the ordering criterion comp, but those elements that satisfy pred shall remain fixed in their original positions after the sort (i.e. unaffected by the sort).
I tried to adapt quicksort to do this but could not work out how. In the end, I adapted the crude selection sort to get the job done:
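#include <cstddef>    // std::size_t
#include <functional> // std::greater
#include <iostream>
#include <utility>    // std::swap
#include <vector>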
std::vector<int> numbers = {5,7,1,8,9,3,20,2,11};
template <typename Container, typename Comparator, typename Predicate>
void sortButKeepSomeFixed (Container& c, const Comparator& comp, const Predicate& pred) { // O(n^2), but want O(nlogn) on average (like quick sort or merge sort)
const std::size_t N = c.size();
std::size_t i, j, minIndex;
for (i = 0; i + 1 < N; i++) { // i + 1 < N avoids unsigned underflow when c is empty
if (pred(c[i]))
continue; // c[i] shall not swap with any element.
minIndex = i;
for (j = i + 1; j < N; j++) {
if (pred(c[j]))
continue; // c[j] shall not swap with any element.
if (comp(c[j], c[minIndex]))
minIndex = j;
}
if (minIndex != i)
std::swap(c[i], c[minIndex]);
}
}
int main() {
sortButKeepSomeFixed (numbers,
std::greater<int>(), // Ordering condition.
[](int x) {return x % 2 == 0;}); // Those that shall remain fixed.
for (int x : numbers) std::cout << x << ' '; // 11 9 7 8 5 3 20 2 1
}
But the time complexity is O(N^2) (I think). Can someone improve on this, perhaps to O(N log N) on average? In other words, is there an overall better algorithm, using recursion or something like that?
Or would it be better to take out the elements that satisfy pred, sort what is left with std::sort, and then put the extracted elements back in their original positions? Would that be any more efficient, or would it just make things worse?
Update:
This is based on Beta's suggestion (sorting the iterators that don't pass pred). But though the elements that pass pred do indeed remain fixed, the sorting at the end is not correct.
template <typename Container, typename Comparator, typename Predicate>
void sortButKeepSomeFixed (Container& c, const Comparator& comp, const Predicate& pred) {
std::vector<typename Container::iterator> iterators;
for (typename Container::iterator it = c.begin(); it != c.end(); ++it) {
if (!pred(*it))
iterators.emplace_back(it);
}
std::vector<typename Container::iterator> originalIterators = iterators;
std::sort(iterators.begin(), iterators.end(),
[comp](const typename Container::iterator& x, const typename Container::iterator& y)
{return comp(*x, *y);});
for (int i = 0; i < originalIterators.size(); i++)
*originalIterators[i] = *iterators[i];
}
The incorrect output is 11 9 9 8 11 3 20 2 9 when it should be 11 9 7 8 5 3 20 2 1.

That's a fun one. I first tried to code the IMO correct approach, using a custom iterator that just skips elements that satisfy the predicate. This turned out to be quite challenging, at least writing that on a mobile phone as I'm doing it.
Basically, this should lead to code similar to what you can find in Eric Niebler's ranges v3.
But there's also the simpler, direct approach that you're trying to use above. The problem with your non-working solution is that the assignments in the last for loop overwrite values that the remaining sorted iterators still point to. This can be avoided by sorting a copy of the values, as in my code:
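#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>
using namespace std;
int main(int, char **) {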
vector<int> input {1,2,3,4,5,6,7,8,9};
vector<reference_wrapper<int>> filtered{begin(input), end(input)};
filtered.erase(remove_if(begin(filtered), end(filtered),
[](auto e) {return e%2==0;}), end(filtered));
vector<int> sorted{begin(filtered), end(filtered)};
// change that to contain reference wrappers to see the issue
sort(begin(sorted), end(sorted),
greater<int>{});
transform(begin(filtered), end(filtered),
begin(sorted),
begin(filtered),
[](auto to, auto from) {
to.get() = from; return to;});
copy(begin(input), end(input),
ostream_iterator<int>{cout, ", "});
return 0;
}
Live example here. Forgot to fork before modifying, sorry.
(For types that hold heap-allocated data, the final step should probably move rather than copy. Assigning to a moved-from object is allowed for standard library types, which are left in a valid but unspecified state.)
Using a ... rather weird ... wrapper class instead of the std::reference_wrapper makes it possible to achieve the filtered sorting without having to use a vector with (copied or moved) elements of the value type:
template <class T>
class copyable_ref {
public:
copyable_ref(T& ref) noexcept
: _ptr(std::addressof(ref)), _copied(false) {}
copyable_ref(T&&) = delete;
copyable_ref(const copyable_ref& x) noexcept
: _ptr (new T(*x._ptr)), _copied (true) { // copy as T, not int
}
~copyable_ref() {
if (_copied) {
delete _ptr;
}
}
copyable_ref& operator=(const copyable_ref& x) noexcept {
*_ptr = *x._ptr;
return *this;
}
operator T& () const noexcept { return *_ptr; }
T& get() const noexcept { return *_ptr; }
private:
T* _ptr;
bool _copied;
};
Upon construction, this class stores a pointer to its argument, and the copy assignment operator writes through that pointer to the referenced object. When an instance is copy constructed, however, a heap-allocated copy of the value referenced by the other instance is made. This way, it's possible to swap two referenced values with code similar to
Value a, b;
copyable_ref<Value> ref_a{a}, ref_b{b};
copyable_ref<Value> temp{ref_a};
ref_a = ref_b;
ref_b = temp;
// a and b are swapped
This was necessary because std::sort doesn't seem to use swap (found through ADL or std::swap) but code equivalent to the one above.
Now it's possible to sort a filtered "view" by filling a vector with (not copy constructed) instances of the weird wrapper class and sorting that vector. As the output in the example is showing, there's at most one heap allocated copy of a value type. Not counting the needed size for the pointers inside of the wrapper, this class enables filtered sorting with constant space overhead:
vector<int> input {1,2,3,4,5,6,7,8,9};
vector<copyable_ref<int>> sorted;
sorted.reserve(input.size());
for (auto & e : input) {
if (e % 2 != 0) {
sorted.emplace_back(e);
}
}
sort(begin(sorted), end(sorted),
greater<int>{});
copy(begin(input), end(input),
ostream_iterator<int>{cout, ", "});
cout << endl;
// 9 2 7 4 5 6 3 8 1
Finally, while this works quite well, I probably wouldn't use this in production code. I was especially surprised that std::sort wasn't using my own swap implementation, which led to this adventurous copy constructor.
You cannot generalise your code to work for sets and maps: Those are sorted by design, and they need that fixed order to function properly. And the unordered variants are, well, unordered and thus cannot maintain an order. But you can always (as long as you don't modify the container) use std::reference_wrappers inside of a vector to provide a sorted "view" of your data.
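A minimal sketch of such a sorted "view", assuming the container is not modified while the view is in use. The name sortedView is hypothetical and not part of the answer above; it collects references to the elements that do not satisfy pred and sorts only those references, so it also accepts a std::set, whose elements are read-only.

```cpp
#include <algorithm>
#include <functional>
#include <set>
#include <vector>

// Hypothetical helper: returns references to the elements of `c` that do not
// satisfy `pred`, ordered by `comp`. The container itself is never modified,
// so the references stay valid only while `c` is left untouched.
template <typename Container, typename Comparator, typename Predicate>
std::vector<std::reference_wrapper<const typename Container::value_type>>
sortedView(const Container& c, Comparator comp, Predicate pred) {
    using T = typename Container::value_type;
    std::vector<std::reference_wrapper<const T>> view;
    for (const T& x : c) {
        if (!pred(x))
            view.emplace_back(x);
    }
    // reference_wrapper converts implicitly to const T&, so comp can be
    // applied to the wrappers through this lambda.
    std::sort(view.begin(), view.end(),
              [&comp](const T& a, const T& b) { return comp(a, b); });
    return view;
}
```

For a std::set<int> holding 5, 7, 1, 8, 9, 3, 20, 2, 11, calling sortedView with std::greater<int>() and an "is even" predicate yields references to 11 9 7 5 3 1, and the set itself keeps its own ordering.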

Based on Beta's idea to sort using iterators, though I'm not sure what the time-complexity is. It also does not work with all containers, e.g. std::set, std::map.
template <typename Container, typename Comparator, typename Predicate>
void sortButKeepSomeFixed (Container& c, const Comparator& comp, const Predicate& pred) {
std::vector<typename Container::value_type> toSort;
std::vector<typename Container::iterator> iterators;
for (typename Container::iterator it = c.begin(); it != c.end(); ++it) {
if (!pred(*it)) {
toSort.emplace_back(*it);
iterators.emplace_back(it);
}
}
std::sort(toSort.begin(), toSort.end(), comp);
for (std::size_t i = 0; i < toSort.size(); i++)
*iterators[i] = toSort[i];
}
std::vector<int> vector = {5,7,1,8,9,3,20,2,11};
std::array<int, 9> array = {5,7,1,8,9,3,20,2,11};
std::list<int> list = {5,7,1,8,9,3,20,2,11};
std::set<int> set = {5,7,1,8,9,3,20,2,11};
std::map<double, int> map = { {1.5,5}, {1.2,7}, {3.5,1}, {0.5,8}, {5.2,9}, {7.5,3}, {0.1,20}, {1.8,2}, {2.4,11} };
template <typename Container>
void test (Container& container) {
sortButKeepSomeFixed (container,
std::greater<int>(), // Ordering condition.
[](int x) {return x % 2 == 0;}); // Those that shall remain fixed.
for (int x : container) std::cout << x << ' ';
std::cout << '\n';
}
int main() {
test(vector); // 11 9 7 8 5 3 20 2 1
test(array); // 11 9 7 8 5 3 20 2 1
test(list); // 11 9 7 8 5 3 20 2 1
test(set); // Does not compile.
sortButKeepSomeFixed (map,
[](const std::pair<double, int>& x, const std::pair<double, int>& y) {return x.second > y.second;},
[](const std::pair<double, int>& x) {return x.second % 2 == 0;});
for (const std::pair<double, int>& x : map)
std::cout << "(" << x.first << "," << x.second << ") "; // Does not compile.
}
Error for set and map is "Assignment of read-only location".
Anyone know how to generalize this to work with sets and maps?
Update: For sets, maps, etc., I suggest simply removing the elements that satisfy pred and then creating a new set/map/... with Comparator as its key_compare type, as below. But this works only for set. How can it be generalized to other containers that have key_compare types?
template <typename Container, typename Comparator, typename Predicate>
std::set<typename Container::value_type, Comparator, typename Container::allocator_type>
sortButRemoveSomeElements (Container& c, const Comparator&, const Predicate& pred) {
std::set<typename Container::value_type, Comparator, typename Container::allocator_type> set;
std::vector<typename Container::value_type> keep;
for (typename Container::iterator it = c.begin(); it != c.end(); ++it) {
if (!pred(*it))
keep.emplace_back(*it);
}
for (typename Container::value_type x : keep)
set.emplace(x); // Sorted by Comparator automatically due to std::set's insertion property.
return set;
}
The test:
struct GreaterThan { bool operator()(int x, int y) const {return x > y;} };
std::set<int, GreaterThan> newSet = sortButRemoveSomeElements (set,
GreaterThan{}, // Ordering condition.
[](int x) {return x % 2 == 0;}); // Those that shall be removed.
for (int x : newSet) std::cout << x << ' '; // 11 9 7 5 3 1

Related

Hashing std::vector independent of items order

I am looking for a hash function for std::vector that is independent of the order of the vector's items.
In other words, I am looking for a hash implementation
that would give me the same result for
std::vector<int> v1{1,2,3};
std::vector<int> v2{2,3,1};
std::vector<int> v3{1,3,2};
Any ideas on how I might accomplish this?
template<template<class...>class element_hash=std::hash>
struct symmetric_range_hash {
template<class T>
std::size_t operator()( T const& t ) const {
std::size_t r = element_hash<int>{}(0); // seed with the hash of 0.
for (auto&& x:t) {
using element_type = std::decay_t<decltype(x)>;
auto next = element_hash<element_type>{}(x);
r = r + next;
}
return r;
}
};
That should do it. We gather the hashes via + which is symmetric.
+ is better than ^ because it takes longer to get a cycle. With ^, {1,1} and {2,2} would hash the same (and in general even numbers of anything "disappear"). With + they instead get multiplied.
So the end result is the sum, for each distinct value in the array, of the hash of that value times its count, mod "max(size_t)+1".
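The difference between the two combiners can be seen in a small sketch. The function names xorHash and sumHash are hypothetical, and both are simplified to std::vector<int> and seeded with 0 rather than with the hash of 0.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Combine element hashes with XOR: equal elements cancel in pairs.
inline std::size_t xorHash(const std::vector<int>& v) {
    std::size_t r = 0;
    for (int x : v)
        r ^= std::hash<int>{}(x);
    return r;
}

// Combine element hashes with +: unsigned overflow wraps modulo 2^N,
// so the result is well defined and independent of element order.
inline std::size_t sumHash(const std::vector<int>& v) {
    std::size_t r = 0;
    for (int x : v)
        r += std::hash<int>{}(x);
    return r;
}
```

With XOR, {1,1} and {2,2} both hash to 0 whatever std::hash<int> returns, because h ^ h == 0. With +, {1,1} hashes to twice the hash of 1, so duplicates still contribute to the result.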
Note that an unordered_map requires both a hash and an equality. If you want two permutations of the same elements to be treated as the same key, you'll also need to write a matching ==.
struct unordered_equal {
template<class C>
bool operator()(C const& lhs, C const& rhs)const {
using std::begin;
using K = std::decay_t<decltype(*begin(lhs))>;
std::unordered_map< K, std::size_t > counts;
for (auto&& k : lhs) {
counts[k]++;
}
for (auto&& k : rhs) {
counts[k]--;
}
for (auto&& kv : counts)
if (kv.second != 0) return false;
return true;
}
};
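As a usage sketch, an order-independent hash and a matching equality can be plugged into std::unordered_set so that permutations of the same elements collapse into one key. The names BagHash, BagEqual and BagSet are hypothetical, the types are simplified to std::vector<int>, and the equality uses std::is_permutation in place of the counting map above. That is shorter but can take quadratic time on long vectors.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Order-independent hash: sum of the element hashes.
struct BagHash {
    std::size_t operator()(const std::vector<int>& v) const {
        std::size_t r = 0;
        for (int x : v)
            r += std::hash<int>{}(x);
        return r;
    }
};

// Matching equality: same elements with the same multiplicities.
struct BagEqual {
    bool operator()(const std::vector<int>& a, const std::vector<int>& b) const {
        return a.size() == b.size() &&
               std::is_permutation(a.begin(), a.end(), b.begin());
    }
};

// A set in which {1,2,3}, {2,3,1} and {1,3,2} are the same key.
using BagSet = std::unordered_set<std::vector<int>, BagHash, BagEqual>;
```

Inserting {1,2,3}, {2,3,1} and {1,3,2} into a BagSet leaves a single element, while {1,2,2} is stored as a separate key.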

Efficient way to find frequencies of each unique value in the std::vector

Given a vector std::vector<double> v, we can find unique elements efficiently by:
std::vector<double> uv(v.begin(), v.end());
std::sort(uv.begin(), uv.end());
uv.erase(std::unique(uv.begin(), uv.end()), uv.end());
What would be the nicest way (without loops, with STL or lambdas) to create a vector:
std::vector<double> freq_uv(uv.size());
which would contain frequencies of each distinct element appearing in v (order the same as sorted unique values)?
Note: type can be anything, not just double
After you sort, before you erase:
std::vector<int> freq_uv;
freq_uv.push_back(0);
auto prev = uv[0]; // you should ensure !uv.empty() if previous code did not already ensure it.
for (auto const & x : uv)
{
if (prev != x)
{
freq_uv.push_back(0);
prev = x;
}
++freq_uv.back();
}
Note that, while I generally like to count occurrences with a map, as Yakk is doing, in this case I think it does a lot of unnecessary work, as we already know the vector is sorted.
Another possibility is to use a std::map (not unordered), instead of sorting. This will get your frequencies first. Then, since the map is ordered, you can just create the sorted, unique vector, and the frequency vector directly from the map.
// uv not yet created
std::map<T, int> freq_map;
for (auto const & x : v)
++freq_map[x];
std::vector<T> uv;
std::vector<int> freq_uv;
for (auto const & p : freq_map)
{
uv.push_back(p.first);
freq_uv.push_back(p.second);
}
First, note that == and to a lesser extent < on double is often a poor idea: often you'll have values that logically "should" be equal if the double was infinite precision, but are slightly different.
However, collecting the frequencies is easy:
template<typename T, typename Allocator>
std::unordered_map< T, std::size_t > frequencies( std::vector<T, Allocator> const& src ) {
std::unordered_map< T, std::size_t > retval;
for (auto&& x:src)
++retval[x];
return retval;
}
assuming std::hash<T> is defined (which it is for double). If not, there is more boilerplate, so I'll skip it. Note that this does not care if the vector is sorted.
If you want it in the form of std::vector<std::size_t> in sync with your sorted vector, you can just do this:
template<typename T, typename Hash, typename Equality, typename Allocator>
std::vector<std::size_t> collate_frequencies(
std::vector<T, Allocator> const& order,
std::unordered_map<T, std::size_t, Hash, Equality> const& frequencies
) {
std::vector<std::size_t> retval;
retval.reserve(order.size());
for( auto&& x : order )
retval.push_back( frequencies.at(x) ); // operator[] is non-const, so use at() on the const map
return retval;
}
I took the liberty of making these functions overly generic, so they support more than just doubles.
using equal_range:
std::vector<int> results;
for(auto i = begin(v); i != end(v);)
{
auto r = std::equal_range(i, end(v), *i);
results.emplace_back( std::distance(r.first, r.second) );
i = r.second;
}
SSCCE:
#include <vector>
#include <algorithm>
#include <iostream>
#include <iterator>
int main()
{
std::vector<double> v{1.0, 2.0, 1.0, 2.0, 1.0, 3.0};
std::sort(begin(v), end(v));
std::vector<int> results;
for(auto i = begin(v); i != end(v);)
{
auto r = std::equal_range(i, end(v), *i);
results.emplace_back( std::distance(r.first, r.second) );
i = r.second;
}
for(auto const& e : results) std::cout << e << "; ";
}
An O(n) solution when the range of values is limited, for example chars. Using less than the CPU level 1 cache for the counter leaves room for other values.
(untested code)
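#include <array>
#include <vector>
constexpr int ProblemSize = 256; // assumes 8-bit char
using CountArray = std::array<int, ProblemSize>;
CountArray CountUnique(const std::vector<char>& vec) {
CountArray count{}; // value-initialize so every counter starts at 0
for (const auto ch : vec)
count[static_cast<unsigned char>(ch)]++; // plain char may be negative
return count;
}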

c++ overloading operator[] for std::pair

I work a lot with pairs of values: std::pair<int, int> my_pair. Sometimes I need to perform the same operation on both my_pair.first and my_pair.second.
My code would be much smoother if I could do my_pair[j] and loop over j=0,1.
(I am avoiding using arrays because I don't want to bother with allocating memory, and I use pair extensively with other things).
Thus, I would like to define operator[] for std::pair<int, int>.
And I can't get it to work, (I'm not very good with templates and such)...
#include <utility>
#include <stdlib.h>
template <class T1> T1& std::pair<T1, T1>::operator[](const uint &indx) const
{
if (indx == 0)
return first;
else
return second;
};
int main()
{
// ....
return 0;
}
fails to compile. Other variations fail as well.
As far as I can tell, I am following the Stack Overflow operator overloading FAQ, but I guess I am missing something...
you cannot overload operator[] as a non-member
you cannot define a member function which has not been declared in the class definition
you cannot modify the class definition of std::pair
Here's a non-member implementation:
#include <cassert>
#include <utility>
/// @return the nth element in the pair. n must be 0 or 1.
template <class T>
const T& pair_at(const std::pair<T, T>& p, unsigned int n)
{
assert((n == 0 || n == 1) && "Pair index must be 0 or 1!");
return n == 0 ? p.first: p.second;
}
/// @return the nth element in the pair. n must be 0 or 1.
template <class T>
T& pair_at(std::pair<T, T>& p, unsigned int index)
{
assert((index == 0 || index == 1) && "Pair index must be 0 or 1!");
return index == 0 ? p.first: p.second;
}
// usage:
pair<int, int> my_pair(1, 2);
for (int j=0; j < 2; ++j)
++pair_at(my_pair, j);
Note that we need two versions: one for read-only pairs and one for mutable pairs.
Don't be afraid to use non-member functions liberally. As Stroustrup himself said, there is no need to model everything with an object or augment everything through inheritance. If you do want to use classes, prefer composition to inheritance.
You can also do something like this:
/// Applies func to p.first and p.second.
template <class T, class Func>
void for_each_pair(const std::pair<T, T>& p, Func func)
{
func(p.first);
func(p.second);
}
/// Applies func to p.first and p.second.
template <class T, class Func>
void for_each_pair(std::pair<T, T>& p, Func func)
{
func(p.first);
func(p.second);
}
// usage:
pair<int, int> my_pair(1, 2);
for_each_pair(my_pair, [&](int& x){
++x;
});
That isn't too unwieldy to use if you have C++11 lambdas and is at least a bit safer since it has no potential to access out of bounds.
You cannot add functions to an existing class like this. And you certainly can't do it to things in the std namespace.
So you should define your own wrapper class:
class MyPair {
private:
std::pair<int,int> p;
public:
int &operator[](int i) { return (i == 0) ? p.first : p.second; }
// etc.
};
You should probably check out Boost.Fusion. It lets you apply algorithms to sequences, and std::pair can be treated as a sequence. So, for example, you can use for_each like this:
std::pair<int, int> my_pair;
for_each(my_pair, [] (int i)
{
cout << i;
});
You can also access an element by its index like this:
int sum = at_c<0>(my_pair) + at_c<1>(my_pair);
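Without Boost, the standard library offers similar compile-time indexed access through std::get, which works on std::pair since C++11. The sketch below uses a hypothetical helper, addToBoth, to apply the same operation to both members. Because the index is a template argument, it has to be a compile-time constant, so this does not replace a runtime loop over j.

```cpp
#include <utility>

// Hypothetical helper: adds `delta` to both members of a pair using the
// standard compile-time accessors std::get<0> and std::get<1>.
template <class T>
void addToBoth(std::pair<T, T>& p, const T& delta) {
    std::get<0>(p) += delta;  // same object as p.first
    std::get<1>(p) += delta;  // same object as p.second
}
```

For a std::pair<int, int> holding (1, 2), addToBoth(my_pair, 10) leaves it holding (11, 12).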

C++ Sort vector of <T> according to vector of double

I want to sort a vector of T according to a vector of double. That is, if I have
vector<T> a;
vector<double>b;
If a is {t1, t2, t3, t4} and b is {3, 1, 5, 2}, I want to obtain {t2, t4, t1, t3}.
I don't know how to declare the template. I'm trying something like
template<vector<class T>> vector<T> sortByArray(vector<T> a, vector<double>b)
And I don't have any idea of how to write the function body either.
Thanks.
EDIT: This is the usage of my algorithm. I don't get it right.
template <typename T> struct dataPair
{
dataPair(double s, T o)
: m_sortData(s)
, m_otherData(o)
{
}
bool operator< (const dataPair &rhs) { return (m_sortData < rhs.m_sortData); }
double m_sortData;
T m_otherData;
}
template <class T> vector<T> sortByArrayStuff(vector<T> objects, vector<double> sortNumber) {
vector<dataPair<T>> v;
for (unsigned int i = 0; i < objects.size(); i++) {
v.push_back(dataPair<T>(objects[i], sortNumber[i]));
}
sort(v.begin(), v.end());
vector<T> retVal;
for (unsigned int i = 0; i < objects.size(); i++) {
retVal.push_back(dataPair<T>(objects[i], sortNumber[i]));
}
return retVal;
};
I want to use the same template for vectors of "Points" and vectors of vectors of "Points":
vector<double> sortedAreas;
vector<Point> sortedPoints = sortByArray<vector<Point>>(points, sortedAreas);
vector<vector<Point>> sortedContours = sortByArray<vector<vector<Point>>>(contours, sortedAreas);
Error is
cannot convert parameter 1 from 'dataPair<T>' to 'cv::Point &&'
with
[
_Ty=cv::Point
]
and
[
T=cv::Point
]
Reason: cannot convert from 'dataPair<T>' to 'cv::Point'
with
[
T=cv::Point
]
What you should do is create a struct or class like this:
template <typename T> struct dataPair
{
dataPair(double s, T o)
: m_sortData(s)
, m_otherData(o)
{
}
bool operator< (const dataPair &rhs) const { return (m_sortData < rhs.m_sortData); }
double m_sortData;
T m_otherData;
};
Then, you create a vector of these dataPair types
{
// your code ...
// this assumes a is a std::vector<YourType> and b is a std::vector<double>
// create vector and populate it (sort key first, then the object)
std::vector<dataPair<YourType>> v;
v.push_back(dataPair<YourType>(b[0],a[0]));
v.push_back(dataPair<YourType>(b[1],a[1]));
v.push_back(dataPair<YourType>(b[2],a[2]));
v.push_back(dataPair<YourType>(b[3],a[3]));
std::sort(v.begin(),v.end());
// your code (now they will be sorted how you like in v)
}
EDIT: had some typos
EDIT2: You can also do this with functors for more efficiency, but this is the basic idea.
EDIT3: Using functors with sort is described very nicely here. See where they use the functor myclass in which they overload operator(). This allows compile-time optimizations to be made (because from std::sort's perspective the sorting criterion is a template type)
The easiest way I can think of is to include the double in the declaration of your class T and use it as the sort key. Sorry if my template syntax is not that great; it's been a while since I've used them:
class YourClass
{
public:
//Some stuff...
double sortVal;
};
// std::sort uses operator< by default, so define it for YourClass.
bool operator<(const YourClass& left, const YourClass& right)
{
return left.sortVal < right.sortVal;
}
I was just doing something like this the other day, and here was my idea. Take both vectors, and combine them into a multimap. The sorting will be done automatically just by inserting them into the map, then you extract them from the map back into the vectors. I came up with 2 function templates for this job, here they are:
// This function basically does the reverse of a transform. Whereas transform takes
// two inputs and by some method merges them into one, this function takes one input
// and by some method splits it in two.
template<typename InIt, typename Out1It, typename Out2It, typename Fn>
void fork_transform(InIt ibegin, InIt iend, Out1It o1begin, Out2It o2begin, Fn fork)
{
while(ibegin != iend)
{
fork(*ibegin, *o1begin, *o2begin);
++o1begin;
++o2begin;
++ibegin;
}
}
template<typename ItPrimary, typename ItSecondary>
void simul_sort(ItPrimary begin1, ItPrimary end1, ItSecondary begin2)
{
typedef typename std::iterator_traits<ItPrimary>::value_type T1;
typedef typename std::iterator_traits<ItSecondary>::value_type T2;
typedef std::multimap<T1,T2> Map_t;
typedef typename Map_t::value_type Pair_t;
Map_t m;
// this was necessary for me because of a bug in VC10, see my most recent question
auto MakePair = [](const T1 & first, const T2 & second) { return std::make_pair(first,second); };
std::transform(begin1, end1, begin2, std::inserter(m,m.begin()), MakePair);
auto Fork = [](const Pair_t & p, T1 & first, T2 & second) { first = p.first; second = p.second; };
fork_transform(m.begin(), m.end(), begin1, begin2, Fork);
}
This actually sorts both vectors simultaneously. The first is sorted normally, the second is sorted according to the order of the first:
simul_sort(b.begin(), b.end(), a.begin());
If you need a generic solution to this problem, take a look at the zipper template in one of the answers here:
number of matches in two sequences with STL
You will need something close to that zipper: some entity that zips two sequences into one.
Here's a generic solution - a function which returns a vector of the indexes into an array. You can use these indexes on either of your a or b to get them in sorted order.
template<class RandomAccessIterator>
struct IndirectCompare // std::binary_function is deprecated in C++11 and removed in C++17
{
IndirectCompare(RandomAccessIterator first) : m_first(first)
{
}
// const so the comparator can be called through a const object
bool operator()(const size_t &left, const size_t &right) const
{
return *(m_first + left) < *(m_first + right);
}
RandomAccessIterator m_first;
};
template<class RandomAccessIterator>
std::vector<size_t> ordered_index(RandomAccessIterator first, RandomAccessIterator last)
{
size_t n = last - first;
std::vector<size_t> result;
result.reserve(n);
for (size_t i = 0; i < n; ++i)
result.push_back(i);
IndirectCompare<RandomAccessIterator> comp(first);
std::sort(result.begin(), result.end(), comp);
return result;
}
P.S. I've tested this code now, and it works.
If you want to sort two vectors simultaneously, you should rather create a std::vector (say c) of std::pair. First component must be the one which is to be sorted normally and second must be the one to be sorted accordingly.
std::vector<std::pair<double, T>> c;
std::sort(c.begin(), c.end());
Hope this helps.
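A fuller sketch of the pair-based approach, including filling the vector of pairs and copying the sorted objects back. The name sortByKeysInPlace is hypothetical. Unlike a plain std::sort on pairs, it compares only the double key, so T does not need its own operator<.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: sorts `a` in place by the keys in `b`.
// Assumes a.size() == b.size(); `b` is left unchanged.
template <class T>
void sortByKeysInPlace(std::vector<T>& a, const std::vector<double>& b) {
    std::vector<std::pair<double, T>> c;
    c.reserve(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c.emplace_back(b[i], std::move(a[i]));
    // Compare keys only, so T does not need its own operator<.
    std::sort(c.begin(), c.end(),
              [](const std::pair<double, T>& l, const std::pair<double, T>& r) {
                  return l.first < r.first;
              });
    // Move the objects back in key order.
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = std::move(c[i].second);
}
```

The trade-off against an index-based sort is one temporary vector of pairs, in exchange for sorting the objects and keys together in a single pass.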

Determining the unique rows of a 2D array (vector<vector<T> >)

I am using a datatype of std::vector<std::vector<T> > to store a 2D matrix/array. I would like to determine the unique rows of this matrix. I am looking for any advice or pointers on how to go about doing this operation.
I have tried two methods.
Method 1: slightly convoluted. I keep an index for each row with 0/1 indicating whether the row is a duplicate value, and work through the matrix, storing the index of each unique row in a deque. I want to store the results in a vector<vector<T> >, and so from this deque of indices, I pre-allocate and then assign the rows from the matrix into the return value.
Method 2: Is easier to read, and in many cases faster than method 1. I keep a deque of the unique rows that have been found, and just loop through the rows and compare each row to all the entries in this deque.
I am comparing both of these methods to matlab, and these C++ routines are orders of magnitude slower. Does anyone have any clever ideas on how I might speed this operation up? I am looking to do this operation on matrices that potentially have millions of rows.
I am storing the unique rows in a deque during the loop to avoid the cost of resizing a vector, and then copying the deque to the vector<vector<T> > for the results. I've benchmarked this operation closely, and it is nowhere near being the bottleneck: it accounts for less than 0.5% of the runtime on a matrix with 100,000 rows, for example.
Thanks,
Bob
Here is the code. If anyone is interested in a more complete example showing the usage, drop me a comment and I can put something together.
Method 1:
template <typename T>
void uniqueRows( const std::vector<std::vector<T> > &A,
std::vector<std::vector<T> > &ret) {
// Go through a vector<vector<T> > and find the unique rows
// have a value ind for each row that is 1/0 indicating if a value
// has been previously searched.
// cur : current item being compared to every item
// num : number of values searched for. Once all the values in the
// matrix have been searched, terminate.
size_t N = A.size();
size_t num=1,cur=0,it=1;
std::vector<unsigned char> ind(N,0);
std::deque<size_t> ulist; // create a deque to store the unique inds
ind[cur] = 1;
ulist.push_back(0); // ret.push_back(A[0]);
while(num < N ) {
if(it >= N ) {
++cur; // find next non-duplicate value, push back
while(ind[cur])
++cur;
ulist.push_back(cur); //ret.push_back(A[cur]);
++num;
it = cur+1; // start search for duplicates at the next row
if(it >= N && num == N)
break;
}
if(!ind[it] && A[cur]==A[it]) {
ind[it] = 1; // mark as duplicate
++num;
}
++it;
} // ~while num
// loop over the deque and .push_back the unique vectors
std::deque<size_t>::iterator iter;
const std::deque<size_t>::iterator end = ulist.end();
ret.reserve(ulist.size());
for(iter= ulist.begin(); iter != end; ++iter) {
ret.push_back(A[*iter]);
}
}
Here is the code for method 2:
template <typename T>
inline bool isInList(const std::deque< std::vector<T> > &A,
const std::vector<T> &b) {
typename std::deque<std::vector<T> >::const_iterator it;
const typename std::deque<std::vector<T> >::const_iterator end = A.end();
for(it = A.begin(); it != end; ++it) {
if(*it == b)
return true;
}
return false;
}
template <typename T>
void uniqueRows1(const::std::vector<std::vector<T> > &A,
std::vector<std::vector<T> > &ret) {
typename std::deque<std::vector<T> > ulist;
typename std::vector<std::vector<T> >::const_iterator it = A.begin();
const typename std::vector<std::vector<T> >::const_iterator end = A.end();
ulist.push_back(*it);
for(++it; it != end; ++it) {
if(!isInList(ulist,*it)) {
ulist.push_back(*it);
}
}
ret.reserve(ulist.size());
for(size_t i = 0; i != ulist.size(); ++i) {
ret.push_back(ulist[i]);
}
}
You should also consider using hashing, it preserves row ordering and could be faster (amortized O(m*n) if alteration of the original is permitted, O(2*m*n) if a copy is required) than sort/unique -- especially noticeable for large matrices (on small matrices you are probably better off with Billy's solution since his requires no additional memory allocation to keep track of the hashes.)
Anyway, taking advantage of Boost.Unordered, here's what you can do:
#include <vector>
#include <boost/foreach.hpp>
#include <boost/ref.hpp>
#include <boost/typeof/typeof.hpp>
#include <boost/unordered_set.hpp>
namespace boost {
template< typename T >
size_t hash_value(const boost::reference_wrapper< T >& v) {
return boost::hash_value(v.get());
}
template< typename T >
bool operator==(const boost::reference_wrapper< T >& lhs, const boost::reference_wrapper< T >& rhs) {
return lhs.get() == rhs.get();
}
}
// destructive, but fast if the original copy is no longer required
template <typename T>
void uniqueRows_inplace(std::vector<std::vector<T> >& A)
{
boost::unordered_set< boost::reference_wrapper< std::vector< T > const > > unique(A.size());
for (BOOST_AUTO(it, A.begin()); it != A.end(); ) {
if (unique.insert(boost::cref(*it)).second) {
++it;
} else {
it = A.erase(it); // erase invalidates it; continue from the returned iterator
}
}
}
// returning a copy (extra copying cost)
template <typename T>
void uniqueRows_copy(const std::vector<std::vector<T> > &A,
std::vector< std::vector< T > > &ret)
{
ret.reserve(A.size());
boost::unordered_set< boost::reference_wrapper< std::vector< T > const > > unique;
BOOST_FOREACH(const std::vector< T >& row, A) {
if (unique.insert(boost::cref(row)).second) {
ret.push_back(row);
}
}
}
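The same idea can be sketched with only the standard library, assuming C++11 and that std::hash<T> exists for the element type. The names RowHash, RowEqual and uniqueRowsHashed are hypothetical, and the mixing step in the hash is one commonly used combiner, not the only possible choice. The set stores pointers into the const input, so no row is copied just for the lookup.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical hash for a row: folds the element hashes together with a
// commonly used mixing step. Any reasonable combiner would do here.
template <typename T>
struct RowHash {
    std::size_t operator()(const std::vector<T>* row) const {
        std::size_t seed = row->size();
        for (const T& x : *row)
            seed ^= std::hash<T>{}(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// Two pointers are "equal" when the rows they point to are equal.
template <typename T>
struct RowEqual {
    bool operator()(const std::vector<T>* a, const std::vector<T>* b) const {
        return *a == *b;
    }
};

// Returns the first occurrence of each distinct row, in original order.
template <typename T>
std::vector<std::vector<T>> uniqueRowsHashed(const std::vector<std::vector<T>>& A) {
    std::unordered_set<const std::vector<T>*, RowHash<T>, RowEqual<T>> seen;
    seen.reserve(A.size());
    std::vector<std::vector<T>> ret;
    for (const std::vector<T>& row : A) {
        if (seen.insert(&row).second)  // true only for a row not seen before
            ret.push_back(row);
    }
    return ret;
}
```

Each row is hashed once and compared only against rows in the same bucket, so the expected cost is proportional to the total number of elements when collisions are rare.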
EDIT: I forgot std::vector already defines operator< and operator== so you need not even use that:
template <typename t>
std::vector<std::vector<t> > GetUniqueRows(std::vector<std::vector<t> > input)
{
std::sort(input.begin(), input.end());
input.erase(std::unique(input.begin(), input.end()), input.end());
return input;
}
Use std::unique in concert with a custom functor which calls std::equal on the two vectors.
std::unique only removes adjacent duplicates, so the input must be sorted first. Use a custom functor calling std::lexicographical_compare on the two input vectors. If you need to recover the original order, you'll need to store it somehow. The sort takes O(m * n log n) time (where m is the length of the inner vectors and n is the number of inner vectors), while the std::unique call takes O(m * n) time.
For comparison, both of your existing approaches take O(m * n^2) time.
EDIT: Example:
template <typename t>
struct VectorEqual // std::binary_function is deprecated in C++11 and removed in C++17
{
bool operator()(const std::vector<t>& lhs, const std::vector<t>& rhs) const
{
if (lhs.size() != rhs.size()) return false;
return std::equal(lhs.begin(), lhs.end(), rhs.begin());
}
};
template <typename t>
struct VectorLess
{
bool operator()(const std::vector<t>& lhs, const std::vector<t>& rhs) const
{
return std::lexicographical_compare(lhs.begin(), lhs.end(), rhs.begin(), rhs.end());
}
};
template <typename t>
std::vector<std::vector<t> > GetUniqueRows(std::vector<std::vector<t> > input)
{
std::sort(input.begin(), input.end(), VectorLess<t>());
input.erase(std::unique(input.begin(), input.end(), VectorEqual<t>()), input.end());
return input;
}