How to efficiently select a random element from a std::set

How can I efficiently select a random element from a std::set?
A std::set::iterator is not a random-access iterator, so I can't directly index a randomly chosen element as I could for a std::deque or std::vector.
I could take the iterator returned from std::set::begin() and increment it a random number of times in the range [0,std::set::size()), but that seems to be doing a lot of unnecessary work. For an "index" close to the set's size, I would end up traversing the entire first half of the internal tree structure, even though it's already known the element won't be found there.
Is there a better approach?
In the name of efficiency, I am willing to define "random" as less random than whatever approach I might have used to choose a random index in a vector. Call it "reasonably random".
Edit...
Many insightful answers below.
The short version is that even though you can find a specific element in log(n) time, you can't find an arbitrary element in that time through the std::set interface.

Use boost::container::flat_set instead:
boost::container::flat_set<int> set;
// ...
auto it = set.begin() + rand() % set.size();
Insertions and deletions become O(N), though; I don't know if that's a problem. You still have O(log N) lookups, and the fact that the container is contiguous gives an overall improvement that often outweighs the loss of O(log N) insertions and deletions.

What about a predicate for find (or lower_bound) which causes a random tree traversal? You'd have to tell it the size of the set so it could estimate the height of the tree and sometimes terminate before leaf nodes.
Edit: I realized the problem with this is that std::lower_bound takes a predicate but does not have any tree-like behavior (internally it uses std::advance which is discussed in the comments of another answer). std::set<>::lower_bound uses the predicate of the set, which cannot be random and still have set-like behavior.
Aha, you can't use a different predicate, but you can use a mutable predicate. Since std::set passes the predicate object around by value, you must use a predicate reference (Predicate &) as the comparator type so you can reach in and modify it (setting it to "randomize" mode).
Here's a quasi-working example. Unfortunately I can't wrap my brain around the right random predicate so my randomness is not excellent, but I'm sure someone can figure that out:
#include <iostream>
#include <set>
#include <stdlib.h>
#include <time.h>

using namespace std;

template <typename T>
struct RandomPredicate {
    RandomPredicate() : size(0), randomize(false) { }

    bool operator () (const T& a, const T& b) {
        if (!randomize)
            return a < b;

        int r = rand();
        if (size == 0)
            return false;
        else if (r % size == 0) {
            size = 0;
            return false;
        } else {
            size /= 2;
            return r & 1;
        }
    }

    size_t size;
    bool randomize;
};

int main()
{
    srand(time(0));

    RandomPredicate<int> pred;
    set<int, RandomPredicate<int> & > s(pred);
    for (int i = 0; i < 100; ++i)
        s.insert(i);

    pred.randomize = true;
    for (int i = 0; i < 100; ++i) {
        pred.size = s.size();
        set<int, RandomPredicate<int> & >::iterator it = s.lower_bound(0);
        cout << *it << endl;
    }
}
My half-baked randomness test is ./demo | sort -u | wc -l to see how many unique integers I get out. With a larger sample set try ./demo | sort | uniq -c | sort -n to look for unwanted patterns.

If you could access the underlying red-black tree (assuming that one exists) then you could access a random node in O(log n) choosing L/R as the successive bits of a ceil(log2(n))-bit random integer. However, you can't, as the underlying data structure is not exposed by the standard.
Xeo's solution of placing iterators in a vector is O(n) time and space to set up, but amortized constant per pick after that. This compares favourably with std::next, which is O(n) time per pick.
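For illustration, here is a minimal sketch of that iterator-vector approach (the names and the toy data are mine, not from the answer; the setup is done once and every later pick reuses it):

#include <cstddef>
#include <iostream>
#include <random>
#include <set>
#include <vector>

int main()
{
    std::set<int> s = {1, 2, 3, 5, 8, 13, 21};

    // One-time O(n) setup: remember an iterator to every element.
    std::vector<std::set<int>::const_iterator> iters;
    iters.reserve(s.size());
    for (auto it = s.begin(); it != s.end(); ++it)
        iters.push_back(it);

    // Every subsequent pick is O(1); the iterators stay valid until the
    // referenced elements are erased from the set.
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, iters.size() - 1);
    for (int i = 0; i < 5; ++i)
        std::cout << *iters[pick(rng)] << '\n';
}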

You can use the std::advance function:
set<int> myset;
// insert some elements into myset

int rnd = rand() % myset.size();
set<int>::const_iterator it(myset.begin());
advance(it, rnd);
// now 'it' points to your random element
Another way to do this, probably less random:
int mini = *myset.begin(), maxi = *myset.rbegin();
int rnd = rand() % (maxi - mini + 1) + mini;
int rndresult = *myset.lower_bound(rnd);

If either the set doesn't update frequently or you don't need to run this algorithm frequently, keep a mirrored copy of the data in a vector (or just copy the set to a vector on need) and randomly select from that.
Another approach, as seen in a comment, is to keep a vector of iterators into the set (they're only invalidated on element deletion for sets) and randomly select an iterator.
Finally if you don't need a tree-based set, you could use vector or deque as your underlying container and sort/unique-ify when needed.

You can do this by maintaining a normal array of values; when you insert to the set, you append the element to the end of the array (O(1)), then when you want to generate a random number you can grab it from the array in O(1) as well.
The issue comes when you want to remove elements from the array. The most naive method would take O(n), which might be efficient enough for your needs. However, this can be improved to O(log n) using the following method:
Keep, for each index i in the array, prfx[i], which represents the number of non-deleted elements in the range 0...i in the array. Keep a segment tree, where you keep the maximum prfx[i] contained in each range.
Updating the segment tree can be done in O(log n) per deletion. Now, when you want to access the random number, you query the segment tree to find the "real" index of the number (by finding the earliest range in which the maximum prfx is equal to the random index). This makes the random-number generation of complexity O(log n).
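As a hedged sketch (not code from this answer): the same prefix-count bookkeeping can also be done with a Fenwick (binary indexed) tree of 0/1 "alive" flags, which keeps O(log n) deletion and O(log n) selection of the k-th surviving element. The class name, fixed capacity, and layout below are assumptions for illustration; insertion here costs O(log n) rather than the O(1) append described above.

#include <cassert>
#include <random>
#include <vector>

class IndexedPool {
    std::vector<int> values;   // append-only store of inserted values
    std::vector<int> bit;      // Fenwick tree over "alive" flags, 1-based
    std::vector<bool> alive;
    int aliveCount = 0;

    void add(int i, int delta) {                 // point update, 1-based index
        for (; i < (int)bit.size(); i += i & -i) bit[i] += delta;
    }
    int kth(int k) const {                       // 0-based index of the k-th alive element
        int pos = 0, step = 1;
        while (step * 2 < (int)bit.size()) step *= 2;
        for (; step > 0; step >>= 1)
            if (pos + step < (int)bit.size() && bit[pos + step] < k) {
                pos += step;
                k -= bit[pos];
            }
        return pos;
    }
public:
    explicit IndexedPool(int capacity) : bit(capacity + 1, 0), alive(capacity, false) {}

    int insert(int v) {                          // O(log n); assumes at most `capacity` inserts
        int idx = (int)values.size();
        assert(idx < (int)alive.size());
        values.push_back(v);
        alive[idx] = true;
        add(idx + 1, +1);
        ++aliveCount;
        return idx;                              // slot index, usable for erase()
    }
    void erase(int idx) {                        // O(log n)
        assert(alive[idx]);
        alive[idx] = false;
        add(idx + 1, -1);
        --aliveCount;
    }
    int sample(std::mt19937& rng) const {        // uniform over the surviving values
        std::uniform_int_distribution<int> d(1, aliveCount);
        return values[kth(d(rng))];
    }
};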

Average O(1)/O(log N) (hashable/unhashable) insert/delete/sample with off-the-shelf containers
The idea is simple: use rejection sampling while upper-bounding the rejection rate, which is achievable with an amortized O(1) compaction operation.
However, unlike solutions based on augmented trees, this approach cannot be extended to support weighted sampling.
#include <cstddef>
#include <map>
#include <unordered_map>
#include <unordered_set>

template <typename T>
class UniformSamplingSet {
    size_t max_id = 0;
    std::unordered_set<size_t> unused_ids;
    std::unordered_map<size_t, T> id2value;
    std::map<T, size_t> value2id;

    void compact() {
        size_t id = 0;
        std::map<T, size_t> new_value2id;
        std::unordered_map<size_t, T> new_id2value;
        for (auto [_, value] : id2value) {
            new_value2id.emplace(value, id);
            new_id2value.emplace(id, value);
            ++id;
        }
        max_id = id;
        unused_ids.clear();
        std::swap(id2value, new_id2value);
        std::swap(value2id, new_value2id);
    }

public:
    size_t size() {
        return id2value.size();
    }

    void insert(const T& value) {
        size_t id;
        if (!unused_ids.empty()) {
            id = *unused_ids.begin();
            unused_ids.erase(unused_ids.begin());
        } else {
            id = max_id++;
        }
        if (!value2id.emplace(value, id).second) {
            unused_ids.insert(id);
        } else {
            id2value.emplace(id, value);
        }
    }

    void erase(const T& value) {
        auto it = value2id.find(value);
        if (it == value2id.end()) return;
        unused_ids.insert(it->second);
        id2value.erase(it->second);
        value2id.erase(it);
        if (unused_ids.size() * 2 > max_id) {
            compact();
        }
    }

    // uniform(n): uniform random in [0, n)
    template <typename F>
    T sample(F&& uniform) {
        size_t i;
        do { i = uniform(max_id); } while (unused_ids.find(i) != unused_ids.end());
        return id2value.at(i);
    }
};
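For illustration, a hypothetical way to drive sample with a std::mt19937 (the lambda supplies the uniform(n) callback the class expects; the setup values are mine):

#include <random>

int main() {
    UniformSamplingSet<int> s;
    for (int i = 0; i < 1000; ++i) s.insert(i);

    std::mt19937 rng{std::random_device{}()};
    auto uniform = [&rng](size_t n) {
        return std::uniform_int_distribution<size_t>(0, n - 1)(rng);
    };
    int x = s.sample(uniform);   // uniformly chosen element of the set
    (void)x;
}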

Iterating over the result of a member function

I have code that creates several object instances (each instance having a fitness value, among other things) from which I want to sample N unique objects using weighted selection based on their fitness values. All objects not sampled are then discarded (but they need to be initially created to determine their fitness value).
My current code looks something like this:
vector<Item> getItems(..) {
    std::vector<Item> items;
    .. // generate N values for items

    int N = items.size();
    std::vector<double> fitnessVals;
    for(auto it = items.begin(); it != items.end(); ++it)
        fitnessVals.push_back(it->getFitness());

    std::mt19937& rng = getRng();
    for(int i = 0; i < N; ++i) {
        std::discrete_distribution<int> dist(fitnessVals.begin() + i, fitnessVals.end());
        unsigned int pick = dist(rng);
        std::swap(fitnessVals.at(i), fitnessVals.at(pick));
        std::swap(items.at(i), items.at(pick));
    }

    items.erase(items.begin() + N, items.end());
    return items;
}
Typically ~10,000 instances are initially created, with N being ~200. The fitness value is non-negative, usually valued at ~70. It could go as high as ~3000, but higher values are increasingly more unlikely.
Is there an elegant way to get rid of the fitnessVals vector? Or perhaps a better way to do this in general? Efficiency is important, but I'm also wondering about good C++ coding practices.
If you're asking whether you can do this just with the items in your items vector, the answer is yes. The following is a rather hideous but nonetheless effective way to do that; I apologize in advance for the density.
This wraps the unsuspecting container iterator in another iterator of our own devising, one that pairs it with a member function of your choice. You may have to dance with const to get it to work correctly with your member function choice. That task I leave to you.
template<typename Iter, typename R>
struct memfn_iterator_s :
    public std::iterator<std::input_iterator_tag, R>
{
    using value_type = typename std::iterator_traits<Iter>::value_type;

    memfn_iterator_s(Iter it, R(value_type::*fn)())
        : mem_fn(fn), m_it(it) {}

    R operator*()
    {
        return ((*m_it).*mem_fn)();
    }

    bool operator ==(const memfn_iterator_s& arg) const
    {
        return m_it == arg.m_it;
    }

    bool operator !=(const memfn_iterator_s& arg) const
    {
        return m_it != arg.m_it;
    }

    memfn_iterator_s& operator ++() { ++m_it; return *this; }

private:
    R (value_type::*mem_fn)();
    Iter m_it;
};
A generator function follows to create the above monstrosity:
template<typename Iter, typename R>
memfn_iterator_s<Iter,R> memfn_iterator(
    Iter it,
    R (std::iterator_traits<Iter>::value_type::*fn)())
{
    return memfn_iterator_s<Iter,R>(it, fn);
}
What this buys you is the ability to do this:
auto it_end = memfn_iterator(items.end(), &Item::getFitness);
for(unsigned int i = 0; i < N; ++i)
{
    auto it_begin = memfn_iterator(items.begin()+i, &Item::getFitness);
    std::discrete_distribution<unsigned int> dist(it_begin, it_end);
    std::swap(items.at(i), items.at(i+dist(rng)));
}
items.erase(items.begin() + N, items.end());
No temporary array is required. The member function is called for the respective item when required by the discrete distribution (which usually keeps its own vector of weights, so replicating that effort would be redundant).
Dunno if you'll get anything helpful or useful out of that, but it was fun to think about.
It's pretty nice that they have a discrete distribution in STL. As far as I know, the most efficient algorithm for sampling from a set of weighted objects (i.e., with probability proportional to weights) is the alias method. There's a Java implementation here: http://www.keithschwarz.com/interesting/code/?dir=alias-method
I suspect that's what the STL discrete_distribution uses anyway. If you're going to be calling your getItems function frequently, you might want to create a "FitnessSet" class or something so that you don't have to build your distribution every time you want to sample from the same set.
EDIT: Another suggestion... If you want to be able to delete items, you could instead store your objects in a binary tree. Each node would contain the sum of the weights in the subtree beneath it, and the objects themselves could be in the leaves. You could select an object through a series of log(N) coin tosses: at a given node, choose a random number between 0 and node.subtreeweight. If it's less than node.left.subtreeweight, go left; otherwise go right. Continue recursively until you reach a leaf.
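As a rough, hypothetical sketch of that edit's idea (names and layout are mine, not from the answer): a complete binary tree stored in an array, with internal nodes holding the sum of the weights beneath them and items sitting at the leaves. Selection walks down in O(log N), and zeroing a weight "deletes" an item in O(log N).

#include <cstddef>
#include <random>
#include <vector>

class WeightTree {
    int leaves;                  // number of leaf slots, rounded up to a power of two
    std::vector<double> node;    // node[1] is the root; node[i] = total weight below i
public:
    explicit WeightTree(const std::vector<double>& weights) {
        leaves = 1;
        while (leaves < (int)weights.size()) leaves *= 2;
        node.assign(2 * leaves, 0.0);
        for (std::size_t i = 0; i < weights.size(); ++i) node[leaves + i] = weights[i];
        for (int i = leaves - 1; i >= 1; --i) node[i] = node[2 * i] + node[2 * i + 1];
    }
    void setWeight(int index, double w) {        // set to 0 to "delete" an item, O(log N)
        int i = leaves + index;
        node[i] = w;
        for (i /= 2; i >= 1; i /= 2) node[i] = node[2 * i] + node[2 * i + 1];
    }
    int sample(std::mt19937& rng) const {        // returns the index of the chosen leaf
        std::uniform_real_distribution<double> coin(0.0, node[1]);
        double r = coin(rng);
        int i = 1;
        while (i < leaves) {
            if (r < node[2 * i]) i = 2 * i;               // go left
            else { r -= node[2 * i]; i = 2 * i + 1; }     // go right
        }
        return i - leaves;
    }
};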
I would try something like the following (see code comments):
#include <algorithm>  // For std::swap and std::transform
#include <functional> // For std::mem_fun_ref
#include <random>     // For std::discrete_distribution
#include <vector>     // For std::vector

size_t
get_items(std::vector<Item>& results, const std::vector<Item>& items)
{
    // Copy the items to the results vector. All operations should be
    // done on it, rather than the original items vector.
    results.assign(items.begin(), items.end());

    // Create the fitness values vector, immediately allocating
    // the number of doubles required to match the size of the
    // input item vector.
    std::vector<double> fitness_vals(results.size());

    // Use some STL "magic" ...
    // This will iterate over the items vector, calling the
    // getFitness() method on each item, and storing the result
    // in the fitness_vals vector.
    std::transform(results.begin(), results.end(),
                   fitness_vals.begin(),
                   std::mem_fun_ref(&Item::getFitness));

    std::mt19937& rng = getRng();
    for (size_t i = 0; i < results.size(); ++i) {
        std::discrete_distribution<int> dist(fitness_vals.begin() + i, fitness_vals.end());
        unsigned int pick = dist(rng);
        std::swap(fitness_vals[i], fitness_vals[pick]);
        std::swap(results[i], results[pick]);
    }

    return results.size();
}
Instead of returning the results vector, the caller provides a vector into which the results should be added. Also, the original vector (passed as the second parameter) remains unchanged. If this is not something that concerns you, you can always pass just the one vector and work with it directly.
I don't see a way to not have the fitness values vector; the discrete_distribution constructor needs to have the begin and end iterators, so from what I can tell, you will need to have this vector.
The rest of it is basically the same, with the return value being the number of items in the result vector, rather than the vector itself.
This example makes use of a number of STL features (algorithms, containers, functors) which I have found to be useful and part of my day-to-day development.
Edit: the call to items.erase() is superfluous; items.begin() + N where N == items.size() is equivalent to items.end(). The call to items.erase() would equate to a no-op.

What is the most efficient way of removing duplicates from a container only using almost equality criteria (no sort)

How do I remove duplicates from an unsorted container (mainly a vector) when I do not have the possibility to define operator<, e.g. when I can only define a fuzzy compare function?
This answer using sort does not work since I cannot define a function for ordering the data.
template <typename T>
void removeDuplicatesComparable(T& cont){
    for(auto iter = cont.begin(); iter != cont.end(); ++iter){
        cont.erase(std::remove(boost::next(iter), cont.end(), *iter), cont.end());
    }
}
This is O(n²) and should be quite localized concerning cache hits.
Is there a faster or at least neater solution?
Edit: on why I cannot use sets: I do geometric comparisons. An example could be this, but I have other entities besides polygons as well.
bool match(SegPoly const& left, SegPoly const& right, double epsilon){
    double const cLengthCompare = 0.1; // just an example
    if(!isZero(left.getLength() - right.getLength(), cLengthCompare))
        return false;

    double const interArea = areaOfPolygon(left.intersected(right)); // this is a geometric intersection
    if(!isZero(interArea - right.getArea(), epsilon))
        return false;
    else
        return true;
}
So for such comparisons I would not know how to formulate sorting or a neat hash function.
First, don't remove elements one at a time.
Next, use a hash table (or similar structure) to detect duplicates.
If you don't need to preserve order, then copy all elements into a hashset (this destroys duplicates), then recreate the vector using the values left in the hashset.
If you need to preserve order, then:
Set read and write iterators to the beginning of the vector.
Start moving the read iterator through, checking elements against a hashset or octree or something that allows finding nearby elements quickly.
For each element that collides with one in the hashset/octree, advance the read iterator only.
For elements that do not collide, move from read iterator to write iterator, copy to the hashset/octree, then advance both.
When the read iterator reaches the end, call erase to truncate the vector at the write iterator position.
The key advantage of the octree is that while it doesn't let you immediately determine whether there is something close enough to be a "duplicate", it allows you to test against only near neighbors, excluding most of your dataset. So your algorithm might be O(N lg N) or even O(N lg lg N) depending on the spatial distribution.
Again, if you don't care about the ordering, you can actually move survivors into the hashset/octree and at the end move them back into the vector (compactly).
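As a minimal sketch of the read/write-iterator pass for the simple, hashable, exact-equality case (an assumption; for the fuzzy geometric criterion you would query an octree or grid for nearby elements instead of an unordered_set):

#include <unordered_set>
#include <utility>
#include <vector>

template <class T>
void removeDuplicatesStable(std::vector<T>& v)
{
    std::unordered_set<T> seen;
    auto write = v.begin();
    for (auto read = v.begin(); read != v.end(); ++read) {
        if (seen.insert(*read).second) {   // first time we see this value: keep it
            if (write != read)
                *write = std::move(*read);
            ++write;
        }                                   // duplicates: only the read iterator advances
    }
    v.erase(write, v.end());
}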
If you don't want to rewrite your code to prevent duplicates from being placed in the vector to begin with, you can do something like this:
std::vector<Type> myVector;
// fill in the vector's data
std::unordered_set<Type> mySet(myVector.begin(), myVector.end());
myVector.assign(mySet.begin(), mySet.end());
This will be O(2n) = O(n).
std::set (or std::unordered_set - which uses a hash instead of a comparison) doesn't allow for duplicates, so it will eliminate them as the set is initialized. Then you re-assign the vector with the non-duplicated data.
Since you are insisting that you cannot create a hash, another alternative is to create a temporary vector:
std::vector<Type> vec1;
// fill vec1 with your data
std::vector<Type> vec2;
vec2.reserve(vec1.size()); // vec1.size() will be the maximum possible size for vec2
std::for_each(vec1.begin(), vec1.end(), [&](const Type& t)
{
    bool is_unique = true;
    for (std::vector<Type>::iterator it = vec2.begin(); it != vec2.end(); ++it)
    {
        if (YourCustomEqualityFunction(*it, t))
        {
            is_unique = false;
            break;
        }
    }
    if (is_unique)
    {
        vec2.push_back(t);
    }
});
vec1.swap(vec2);
If copies are a concern, switch to a vector of pointers, and you can decrease the memory reallocations:
std::vector<std::shared_ptr<Type>> vec1;
// fill vec1 with your data
std::vector<std::shared_ptr<Type>> vec2;
vec2.reserve(vec1.size()); // vec1.size() will be the maximum possible size for vec2
std::for_each(vec1.begin(), vec1.end(), [&](const std::shared_ptr<Type>& t)
{
    bool is_unique = true;
    for (std::vector<std::shared_ptr<Type>>::iterator it = vec2.begin(); it != vec2.end(); ++it)
    {
        if (YourCustomEqualityFunction(**it, *t))
        {
            is_unique = false;
            break;
        }
    }
    if (is_unique)
    {
        vec2.push_back(t);
    }
});
vec1.swap(vec2);

Efficient way to get the indices of the k highest values in vector<float>

How can I create a std::map<int, float> from a vector<float>, so that the map contains the k highest values from the vector, with the keys being the index of the value in the vector?
A naive approach would be to traverse the vector (O(n)) and extract and erase (O(n)) the highest element k times, leading to a complexity of O(k*n), which is suboptimal, I guess.
Another option would be to copy the vector (O(n)) and repeatedly remove the smallest element until the size is k, which would lead to O(n^2). Still polynomial...
Any ideas?
Following should do the job:
#include <cstdint>
#include <algorithm>
#include <iostream>
#include <map>
#include <tuple>
#include <vector>

// Compare: greater T2 first.
struct greater_by_second
{
    template <typename T1, typename T2>
    bool operator () (const std::pair<T1, T2>& lhs, const std::pair<T1, T2>& rhs)
    {
        return std::tie(lhs.second, lhs.first) > std::tie(rhs.second, rhs.first);
    }
};

std::map<std::size_t, float> get_index_pairs(const std::vector<float>& v, int k)
{
    std::vector<std::pair<std::size_t, float>> indexed_floats;
    indexed_floats.reserve(v.size());
    for (std::size_t i = 0, size = v.size(); i != size; ++i) {
        indexed_floats.emplace_back(i, v[i]);
    }
    std::nth_element(indexed_floats.begin(),
                     indexed_floats.begin() + k,
                     indexed_floats.end(), greater_by_second());
    return std::map<std::size_t, float>(indexed_floats.begin(), indexed_floats.begin() + k);
}
Let's test it:
int main(int argc, char *argv[])
{
    const std::vector<float> fs {45.67f, 12.34f, 67.8f, 4.2f, 123.4f};
    for (const auto& elem : get_index_pairs(fs, 2)) {
        std::cout << elem.first << " " << elem.second << std::endl;
    }
    return 0;
}
Output:
2 67.8
4 123.4
You can keep a list of the k-highest values so far, and update it for each of the values in your vector, which takes you down to O(n*log k) (assuming log k for each update of the list of highest values) or, for a naive list, O(kn).
You can probably get closer to O(n), but assuming k is probably pretty small, may not be worth the effort.
Your optimal solution will have a complexity of O(n+k*log(k)), since sorting the k elements can be reduced to this, and you will have to look at each of the elements at least once.
Two possible solutions come to mind:
Iterate through the vector while adding elements to a bounded (size k) priority queue/heap, also keeping their indices (see the sketch after this list).
Create a copy of your vector including the original indices, i.e. std::vector<std::pair<float, std::size_t>>, and use std::nth_element to move the k highest values to the front, using a comparator that compares only the first element. Then insert those elements into your target map. Ironically, that last step adds the k*log(k) term to the overall complexity, while nth_element itself is linear (but it will permute your indices).
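A minimal sketch of the first option, assuming a min-heap of (value, index) pairs capped at size k (the function name is mine):

#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

std::map<std::size_t, float> top_k(const std::vector<float>& v, std::size_t k)
{
    if (k == 0) return {};

    // Min-heap ordered by value, so the smallest of the current top-k sits on top.
    typedef std::pair<float, std::size_t> entry;
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> heap;

    for (std::size_t i = 0; i < v.size(); ++i) {
        if (heap.size() < k) {
            heap.push(entry(v[i], i));
        } else if (v[i] > heap.top().first) {   // better than the current k-th largest
            heap.pop();
            heap.push(entry(v[i], i));
        }
    }

    std::map<std::size_t, float> result;        // key: original index, value: element
    while (!heap.empty()) {
        result.insert(std::make_pair(heap.top().second, heap.top().first));
        heap.pop();
    }
    return result;
}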
Maybe I did not get it, but in case the incremental approach is not an option, why not use std::partial_sort (rather than a full std::sort)?
That should be O(n log k), and since k is very likely to be small, that is practically O(n).
Edit: thanks to Mike Seymour for the update.
Edit (bis):
The idea is to use an intermediate vector for sorting and then put it into the map. Trying to reduce the order of the computation would only be justified for a significant amount of data, so I guess the copy time (O(n)) would be lost in the background noise.
Edit (bis):
That's actually what the selected answer does, without the theoretical explanations :).
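For illustration, a hedged sketch of how std::partial_sort could slot into the same indexed-pair setup used in the accepted answer (the function name and pair layout are mine; it assumes k <= v.size()):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

std::map<std::size_t, float> top_k_partial_sort(const std::vector<float>& v, std::size_t k)
{
    std::vector<std::pair<float, std::size_t>> indexed;    // (value, original index)
    indexed.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        indexed.push_back(std::make_pair(v[i], i));

    // Sort only the first k positions, largest values first: O(n log k).
    std::partial_sort(indexed.begin(), indexed.begin() + k, indexed.end(),
                      std::greater<std::pair<float, std::size_t>>());

    std::map<std::size_t, float> result;
    for (std::size_t i = 0; i < k; ++i)
        result.insert(std::make_pair(indexed[i].second, indexed[i].first));
    return result;
}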

How to efficiently compare vectors with C++?

I need advice on micro-optimization in C++ for a vector comparison function.
It compares two vectors for equality, and the order of the elements does not matter.
template <class T>
static bool compareVectors(const vector<T> &a, const vector<T> &b)
{
    int n = a.size();
    std::vector<bool> free(n, true);
    for (int i = 0; i < n; i++) {
        bool matchFound = false;
        for (int j = 0; j < n; j++) {
            if (free[j] && a[i] == b[j]) {
                matchFound = true;
                free[j] = false;
                break;
            }
        }
        if (!matchFound) return false;
    }
    return true;
}
This function is used heavily and I am thinking of possible ways to optimize it.
Can you please give me some suggestions? By the way I use C++11.
Thanks
I just realized that this code only does kind of a "set equivalency" check (and now I see that you actually did say that, what a lousy reader I am!). This can be achieved much more simply:
template <class T>
static bool compareVectors(vector<T> a, vector<T> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return (a == b);
}
You'll need to include the <algorithm> header.
If your vectors are always of same size, you may want to add an assertion at the beginning of the method:
assert(a.size() == b.size());
This will be handy in debugging your program if you once perform this operation for unequal lengths by mistake.
Otherwise, the vectors can't be the same if they have unequal length, so just add
if ( a.size() != b.size() )
{
    return false;
}
before the sort instructions. This will save you lots of time.
The complexity of this is technically O(n*log(n)) because it's mainly dependent on the sorting, which (usually) is of that complexity. This is better than your O(n^2) approach, but might be worse due to the needed copies. This is irrelevant if you are allowed to sort your original vectors in place.
If you want to stick with your approach, but tweak it, here are my thoughts on this:
You can use std::find for this:
template <class T>
static bool compareVectors(const vector<T> &a, const vector<T> &b)
{
    const size_t n = a.size(); // make it const and unsigned!
    std::vector<bool> free(n, true);
    for ( size_t i = 0; i < n; ++i )
    {
        bool matchFound = false;
        auto start = b.cbegin();
        while ( true )
        {
            const auto position = std::find(start, b.cend(), a[i]);
            if ( position == b.cend() )
            {
                break; // nothing found
            }
            const auto index = position - b.cbegin();
            if ( free[index] )
            {
                // free pair found
                free[index] = false;
                matchFound = true;
                break;
            }
            else
            {
                start = position + 1; // search in the rest
            }
        }
        if ( !matchFound )
        {
            return false;
        }
    }
    return true;
}
Another possibility is replacing the structure to store free positions. You may try a std::bitset or just store the used indices in a vector and check if a match isn't in that index-vector. If the outcome of this function is very often the same (so either mostly true or mostly false) you can optimize your data structures to reflect that. E.g. I'd use the list of used indices if the outcome is usually false since only a handful of indices might needed to be stored.
This method has the same complexity as your approach. Using std::find to search for things is sometimes better than a manual search. (E.g. if the data is sorted and the compiler knows about it, this can be a binary search).
You can probabilistically compare two unsorted vectors (u, v) in O(n):
Calculate:
U= xor(h(u[0]), h(u[1]), ..., h(u[n-1]))
V= xor(h(v[0]), h(v[1]), ..., h(v[n-1]))
If U==V then the vectors are probably equal.
h(x) is any non-cryptographic hash function - such as MurmurHash. (Cryptographic functions would work as well but would usually be slower).
(This would work even without hashing, but it would be much less robust when the values have a relatively small range).
A 128-bit hash function would be good enough for many practical applications.
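A minimal sketch of that check, with std::hash standing in for MurmurHash or whichever non-cryptographic h(x) you prefer (an assumption, not part of the answer); equal XORs only mean "probably equal", so a definite answer still needs one of the exact methods:

#include <cstddef>
#include <functional>
#include <vector>

template <class T>
bool probablyEqual(const std::vector<T>& u, const std::vector<T>& v)
{
    if (u.size() != v.size()) return false;

    std::hash<T> h;
    std::size_t U = 0, V = 0;
    for (const T& x : u) U ^= h(x);
    for (const T& x : v) V ^= h(x);
    return U == V;   // collisions (or cancelling duplicates) can give false positives
}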
I noticed that most proposed solutions involve sorting both of the input vectors. I think sorting computes more than is strictly necessary for evaluating the equality of the two vectors (and if the input vectors are constant, a copy needs to be made).
Another way would be to build an associative container to count the elements in each vector... It's also possible to do the reduction of the two vectors in parallel. For very large vectors that could give a nice speed-up.
#include <algorithm>
#include <unordered_map>
#include <vector>

template <typename T>
bool compareVector(const std::vector<T> & vec1, const std::vector<T> & vec2) {
    if (vec1.size() != vec2.size())
        return false;

    // Here we assume that T is hashable...
    auto count_set = std::unordered_map<T, int>();

    // We count the elements in each vector...
    for (unsigned int count = 0; count < vec1.size(); ++count)
    {
        count_set[vec1[count]]++;
        count_set[vec2[count]]--;
    }

    // If everything balances out we should have zero everywhere
    return std::all_of(count_set.begin(), count_set.end(),
                       [](const std::pair<T, int> p) { return p.second == 0; });
}
This way, depending on the performance of your hashing function, we might get linear complexity in the length of both vectors (vs. n*log n with sorting).
NB: the code might have some bugs; I didn't have time to check it...
Benchmarking this way of comparing two vectors against sort-based comparison, I get the following on Ubuntu 13.10 (VMware, Core i7 gen 3):
Comparing 200 vectors of 500 elements by counting takes 0.184113 seconds
Comparing 200 vectors of 500 elements by sorting takes 0.276409 seconds
Comparing 200 vectors of 1000 elements by counting takes 0.359848 seconds
Comparing 200 vectors of 1000 elements by sorting takes 0.559436 seconds
Comparing 200 vectors of 5000 elements by counting takes 1.78584 seconds
Comparing 200 vectors of 5000 elements by sorting takes 2.97983 seconds
As others suggested, sorting your vectors beforehand will improve performance.
As an additional optimization you can make heaps out of the vectors to compare (with complexity O(n), instead of sorting with O(n*log(n))).
Afterwards you can pop elements from both heaps (complexity O(log(n))) until you get a mismatch.
This has the advantage that you only heapify instead of sort your vectors if they are not equal.
Below is a code sample. To know what is really fastest, you will have to measure with some sample data for your usecase.
#include <algorithm>
#include <vector>

typedef std::vector<int> myvector;

bool compare(myvector& l, myvector& r)
{
    bool possibly_equal = l.size() == r.size();
    if (possibly_equal)
    {
        std::make_heap(l.begin(), l.end());
        std::make_heap(r.begin(), r.end());
        for (int i = l.size(); i != 0; --i)
        {
            possibly_equal = l.front() == r.front();
            if (!possibly_equal)
                break;
            std::pop_heap(l.begin(), l.begin() + i);
            std::pop_heap(r.begin(), r.begin() + i);
        }
    }
    return possibly_equal;
}
If you use this function a lot on the same vectors, it might be better to keep sorted copies for comparison.
In theory it might even be better to sort the vectors and compare the sorted vectors if each one is compared just once (sorting is O(n*log(n)), comparing sorted vectors is O(n), while your function is O(n^2)).
But I suppose the time spent allocating memory for the sorted vectors will dwarf any theoretical gains if you don't compare the same vectors often.
As with all optimisations, profiling is the only way to make sure, I'd try some std::sort / std::equal combo.
Like stefan says, you need to sort to get better complexity.
Then you can use
the == operator (thanks for the correction in the comments: std::equal will also work, but it is more appropriate for comparing ranges, not entire containers).
If that is not fast enough, only then bother with micro-optimization.
Also, are the vectors guaranteed to be of the same size?
If not, put that check at the beginning.
Another possible solution (viable only if all elements are unique), which should improve somewhat on the solution of @stefan (although the complexity would remain O(N log N)), is this:
template <class T>
static bool compareVectors(vector<T> a, const vector<T> & b)
{
    // You should probably check this outside as it can
    // avoid you the copy of a
    if (a.size() != b.size()) return false;

    std::sort(a.begin(), a.end());
    for (const auto & v : b)
        if ( !std::binary_search(a.begin(), a.end(), v) ) return false;
    return true;
}
This should be faster since it performs the search directly as an O(NlogN) operation, instead of sorting b (O(NlogN)) and then searching both vectors (O(N)).

C++ Standard Library approach to removing one of a pair of items in a list that satisfy a criterion

Imagine you have an std::list with a set of values in it. For demonstration's sake, we'll say it's just std::list<int>, but in my case they're actually 2D points. Anyway, I want to remove one of a pair of ints (or points) which satisfy some sort of distance criterion. My question is how to approach this as an iteration that doesn't do more than O(N^2) operations.
Example
Source is a list of ints containing:
{ 16, 2, 5, 10, 15, 1, 20 }
If I gave this a distance criterion of 1 (i.e. no item in the list should be within 1 of any other), I'd like to produce the following output:
{ 16, 2, 5, 10, 20 } if I iterated forward or
{ 20, 1, 15, 10, 5 } if I iterated backward
I feel that there must be some awesome way to do this, but I'm stuck with this double loop of iterators and trying to erase items while iterating through the list.
Make a map of "regions": basically, a std::map<coordinates/len, std::vector<point>>.
Add each point to its region and to each of the 8 neighboring regions, O(N*log N). Run the "naive" algorithm on each of these smaller lists (technically O(N^2), unless there's a maximum density, in which case it becomes O(N*density)). Finally: iterate through each point in your original list, and if it has been removed from any of the 8 mini-lists it was put in, remove it from the list. O(N)
With no limit on density, this is O(N^2), and slow. But it gets faster and faster the more spread out the points are. If the points are somewhat evenly distributed in a known boundary, you can switch to a two-dimensional array, making this significantly faster, and if there's a constant limit to the density, that technically makes this an O(N) algorithm.
That is how you sort a list of two variables, by the way: the grid/map/2D-vector thing.
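A rough sketch of that grid idea for 2D points (the Point type, cell size, and distance check below are my assumptions for illustration):

#include <cmath>
#include <map>
#include <utility>
#include <vector>

struct Point { double x, y; };

std::vector<Point> filterByDistance(const std::vector<Point>& pts, double dist)
{
    // Bucket kept points by cell = floor(coordinate / dist); a new point only
    // needs to be checked against its own and the 8 neighbouring cells.
    std::map<std::pair<int, int>, std::vector<Point>> grid;
    std::vector<Point> kept;
    for (const Point& p : pts) {
        int cx = (int)std::floor(p.x / dist);
        int cy = (int)std::floor(p.y / dist);
        bool tooClose = false;
        for (int dx = -1; dx <= 1 && !tooClose; ++dx) {
            for (int dy = -1; dy <= 1 && !tooClose; ++dy) {
                auto cell = grid.find(std::make_pair(cx + dx, cy + dy));
                if (cell == grid.end()) continue;
                for (const Point& q : cell->second) {
                    if (std::hypot(p.x - q.x, p.y - q.y) < dist) { tooClose = true; break; }
                }
            }
        }
        if (!tooClose) {
            kept.push_back(p);
            grid[std::make_pair(cx, cy)].push_back(p);
        }
    }
    return kept;
}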
[EDIT] You mentioned you were having trouble with the "naive" method too, so here's that:
template<class iterator, class criterion>
iterator RemoveCriterion(iterator begin, iterator end, criterion criter) {
    iterator actend = end;
    for(iterator L = begin; L != actend; ++L) {
        iterator R(L);
        for(++R; R != actend;) {
            if (criter(*L, *R)) {
                iterator N(R);
                std::rotate(R, ++N, actend);
                --actend;
            } else
                ++R;
        }
    }
    return actend;
}
This should work on linked lists, vectors, and similar containers, and works in reverse. Unfortunately, it's kinda slow due to not taking into account the properties of linked lists. It's possible to make much faster versions that only work on linked lists in a specific direction. Note that the return value is important, like with the other mutating algorithms. It can only alter contents of the container, not the container itself, so you'll have to erase all elements after the return value when it finishes.
Cubbi had the best answer, though he deleted it for some reason:
Sounds like it's a sorted list, in which case std::unique will do the job of removing the second element of each pair:
#include <list>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <cstdlib> // for abs

int main()
{
    std::list<int> data = {1,2,5,10,15,16,20};
    std::unique_copy(data.begin(), data.end(),
                     std::ostream_iterator<int>(std::cout, " "),
                     [](int n, int m){ return abs(n-m) <= 1; });
    std::cout << '\n';
}
demo: https://ideone.com/OnGxk
That trivially extends to other types -- either by changing int to something else, or by defining a template:
template<typename T> void remove_close(std::list<T> &data, int distance)
{
    data.erase(std::unique(data.begin(), data.end(),
                           [distance](T n, T m){ return abs(n-m) <= distance; }),
               data.end());
}
Which will work for any type that defines operator - and abs to allow finding a distance between two objects.
As a mathematician I am pretty sure there is no 'awesome' way to approaching this problem for an unsorted list. It seems to me that it is a logical necessity to check the criterion for any one element against all previous elements selected in order to determine whether insertion is viable or not. There may be a number of ways to optimize this, depending on the size of the list and the criterion.
Perhaps you could maintain a bitset based on the criterion. E.g. suppose abs(n-m) <= 1 is the criterion. Suppose the first element has value 5. This is carried over into the new list, so flip bitset[5] to 1. Then, when you encounter an element of value 6, say, you need only test
!( bitset[5] | bitset[6] | bitset[7])
This would ensure no element is within magnitude 1 of the resulting list. This idea may be difficult to extend for more complicated (non-discrete) criteria, however.
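A small, hypothetical sketch of that bitset idea, assuming non-negative int values below some fixed bound and a distance criterion of 1 (both assumptions are mine):

#include <bitset>
#include <list>

void filterWithinOne(std::list<int>& data)
{
    constexpr int MAX = 1024;          // assumed upper bound on the values
    std::bitset<MAX> taken;
    for (auto it = data.begin(); it != data.end(); ) {
        int v = *it;
        bool tooClose = taken[v] ||
                        (v > 0 && taken[v - 1]) ||
                        (v + 1 < MAX && taken[v + 1]);
        if (tooClose) {
            it = data.erase(it);       // drop the later element of the pair
        } else {
            taken[v] = true;           // keep it and remember its value
            ++it;
        }
    }
}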
What about:
struct IsNeighbour : public std::binary_function<int,int,bool>
{
    IsNeighbour(int dist)
      : distance(dist) {}

    bool operator()(int a, int b) const
    { return abs(a-b) <= distance; }

    int distance;
};

std::list<int>::iterator iter = lst.begin();
while(iter != lst.end())
{
    iter = std::adjacent_find(iter, lst.end(), IsNeighbour(some_distance));
    if(iter != lst.end())
        iter = lst.erase(iter);
}
This should have O(n). It searches for the first pair of neighbours (which are at maximum some_distance away from each other) and removes the first of this pair. This is repeated (starting from the found item and not from the beginning, of course) until no pairs are found anymore.
EDIT: Oh sorry, you said any other and not just its next element. In this case the above algorithm only works for a sorted list, so you should sort it first, if necessary.
You can also use std::unique instead of this custom loop above:
lst.erase(std::unique(lst.begin(), lst.end(), IsNeighbour(some_distance)), lst.end());
but this removes the second item of each equal pair, and not the first, so you may have to reverse the iteration direction if this matters.
For 2D points instead of ints (1D points) it is not that easy, as you cannot just sort them by their euclidean distance. So if your real problem is to do it on 2D points, you might rephrase the question to point that out more clearly and remove the oversimplified int example.
I think this will work, as long as you don't mind making copies of the data; but if it's just a pair of integers/floats, that should be pretty low-cost. You're making n^2 comparisons, but you're using standard algorithms and can declare the input vector const.
//calculates the distance between two points and returns true if said distance is
//under its threshold
bool isTooClose(const Point& lhs, const Point& rhs, int threshold = 1);

vector<Point>& vec; //the original vector, passed in
vector<Point>& out; //the output vector, returned however you like

for(auto b = vec.begin(), e = vec.end(); b != e; ++b) {
    Point& candidate = *b;
    if(find_if(out.begin(), out.end(),
               [&](const Point& p) { return isTooClose(candidate, p); }) == out.end())
    {   //we didn't find anyone too close to us in the output vector. Let's add!
        out.push_back(candidate);
    }
}
std::list<>.erase(remove_if(...)) using functors
http://en.wikipedia.org/wiki/Erase-remove_idiom
Update(added code):
struct IsNeighbour : public std::unary_function<int,bool>
{
    IsNeighbour(int dist)
      : m_distance(dist), m_old_value(0) {}

    bool operator()(int a)
    {
        bool result = abs(a - m_old_value) <= m_distance;
        m_old_value = a;
        return result;
    }

    int m_distance;
    int m_old_value;
};
main function...
std::list<int> data = {1,2,5,10,15,16,20};
data.erase(std::remove_if(data.begin(), data.end(), IsNeighbour(1)), data.end());