Find the unique elements of a vector in C++

Is there a fast way to find all the single elements (those that appear only once) in a vector? All the elements in the vector are either single (appear once) or dual (appear twice). My answer would be to sort all the elements and then remove the ones that appear twice. Is there any faster way to do it?

So for small enough n (<= 1e8), the sort-and-remove approach (using std::sort() and std::unique()) is still faster than hash tables.
Sample code, O(n log n):
vector<int> A = {1, 2, 3, 1, 2, 5};
sort(A.begin(), A.end());
A.erase(unique(A.begin(), A.end()), A.end());
for (int& x : A)
    cout << x << " ";
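Note that the snippet above deduplicates the vector (it prints 1 2 3 5); if the goal is the elements that appear exactly once, a run-length pass over the sorted data does it in the same O(n log n). A minimal sketch (not from the original answer):
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> A = {1, 2, 3, 1, 2, 5};
    std::sort(A.begin(), A.end());
    std::vector<int> singles;
    for (std::size_t i = 0; i < A.size(); ) {
        std::size_t j = i;
        while (j < A.size() && A[j] == A[i]) ++j;  // run of equal values
        if (j - i == 1) singles.push_back(A[i]);   // keep only values that appeared once
        i = j;
    }
    for (int x : singles) std::cout << x << " ";   // prints 3 5
}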

If your elements are hashable, you can use a std::unordered_map<T, int> to store the count of each element, which takes amortized linear time:
template<typename T>
std::vector<T> uniqueElements(const std::vector<T>& v) {
    std::unordered_map<T, int> counts;
    for (const auto& elem : v)
        ++counts[elem];
    std::vector<T> result;
    for (auto [elem, count] : counts)
        if (count == 1)
            result.push_back(elem);
    return result;
}
For small lists, sorting and then doing a linear pass might still be faster.
Also note that this copies your elements, which might be expensive in some cases.

Related

c++ improve vector sorting by presorting with old vector

I have a vector of pairs with the following typedefs:
typedef std::pair<double, int> myPairType;
typedef std::vector<myPairType> myVectorType;
myVectorType myVector;
I fill this vector with double values and the int part of the pair is an index.
The vector then looks like this
0.6594 1
0.5434 2
0.5245 3
0.8431 4
...
My program has a number of time steps with slight variations in the double values and every time step I sort this vector with std::sort to something like this.
0.5245 3
0.5434 2
0.6594 1
0.8431 4
The idea is now to somehow use the vector from the last time step (the "old" vector, already sorted) to presort the current vector (the "new" vector, not yet sorted), and then use an insertion sort or timsort to sort the "rest" of the presorted vector.
Is this somehow possible? I couldn't find a function to order the "new" vector of pairs by one part (the int part).
And if it is possible, could this be faster than sorting the whole unsorted "new" vector?
Thanks for any pointers into the right direction.
tiom
UPDATE
First of all thanks for all the suggestions and code examples. I will have a look at each of them and do some benchmarking if they will speed up the process.
Since there were some questions regarding the vectors, I will try to explain in more detail what I want to accomplish.
As I said, I have a number of time steps 1 to n. For every time step I have a vector of double data values with approximately 260000 elements.
In every time step I add an index to this vector which will result in a vector of pairs <double, int>. See the following code snippet.
typedef typename myVectorType::iterator myVectorTypeIterator; // iterator for myVector
std::vector<double> vectorData;            // holds the double data values
myVectorType myVector(vectorData.size());  // vector of pairs <double, int>
myVectorTypeIterator myVectorIter = myVector.begin();

// generating of the index
for (int i = 0; i < vectorData.size(); ++i) {
    myVectorIter->first = vectorData[i];
    myVectorIter->second = i;
    ++myVectorIter;
}

std::sort(myVector.begin(), myVector.end());
(The index is 0 based. Sorry for my initial mistake in the example above)
I do this for every time step and then sort this vector of pairs with std::sort.
The idea was now to use the sorted vector of pairs of time step j-1 (let's call it vectorOld) in time step j as a "presorter" for the "new" myVector, since I assume the ordering of the sorted "new" myVector of time step j will only differ in some places from the already sorted vectorOld of time step j-1.
With "presorter" I mean to rearrange the pairs in the "new" myVector into a vector presortedVector of type myVectorType by the same index order as the vectorOld and then let a tim sort or some similar sorting algorithm that is good in presorted date do the rest of the sorting.
Some data examples:
This is what the beginning of myVector looks like in time step j-1 before the sorting.
0.0688015 0
0.0832928 1
0.0482259 2
0.142874 3
0.314859 4
0.332909 5
...
And after the sorting
0.000102207 23836
0.000107378 256594
0.00010781 51300
0.000109315 95454
0.000109792 102172
...
So in the next time step j this is my vectorOld, and I would like to take the element with index 23836 of the "new" myVector and put it in the first place of presortedVector; the element with index 256594 should be the second element in presortedVector, and so on. But the elements have to keep their original index, so 256594 will not be index 0 but only element 0 in presortedVector, still with index 256594.
I hope this is a better explanation of my plan.
First, scan through the sequence to find the first element that's smaller than the preceding one (either a loop, or C++11's std::is_sorted_until). This is the start of the unsorted portion. Use std::sort on the remainder, then merge the two halves with std::inplace_merge.
template<class RandomIt, class Compare>
void sort_new_elements(RandomIt first, RandomIt last, Compare comp)
{
    RandomIt mid = std::is_sorted_until(first, last, comp);
    std::sort(mid, last, comp);
    std::inplace_merge(first, mid, last, comp);
}
This should be more efficient than sorting the whole sequence indiscriminately, as long as the presorted sequence at the front is significantly larger than the unsorted part.
Using the sorted vector would likely result in more comparisons (just to find a matching item).
What you seem to be looking for is a self-ordering container.
You could use a set (and remove/re-insert on modification).
Alternatively you could use Boost Multi Index which affords a bit more convenience (e.g. use a struct instead of the pair)
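For illustration, a minimal sketch of the set-based (erase and re-insert on modification) idea, using the question's (value, index) pairs and its example data:
#include <iostream>
#include <set>
#include <utility>

using Entry = std::pair<double, int>;   // (value, original index); std::pair orders by value first

int main() {
    std::set<Entry> s = {{0.6594, 1}, {0.5434, 2}, {0.5245, 3}, {0.8431, 4}};

    // A time step changes one value: erase the old entry, insert the new one.
    s.erase({0.5434, 2});                // O(log n)
    s.insert({0.5601, 2});               // O(log n), lands in its sorted position

    for (const auto& e : s)
        std::cout << e.first << " " << e.second << '\n';
}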
I have no idea if this could be faster than sorting the whole unsorted "new" vector. It will depend on the data.
But this will create a sorted copy of a new vector based on the order of an old vector:
myVectorType getSorted(const myVectorType& unsorted, const myVectorType& old) {
    myVectorType sorted(unsorted.size());
    auto matching_value = [&unsorted](const myPairType& value) {
        // indices are 0-based (see the update above)
        return unsorted[value.second];
    };
    std::transform(old.begin(), old.end(), sorted.begin(), matching_value);
    return sorted;
}
You will then need to "finish" sorting this vector. I don't know how much quicker (if at all) this will be than sorting it from scratch.
Well, you can create a new vector in the order of the old one and then use an algorithm that has good complexity for (nearly) sorted inputs to restore the order.
Below I put an example of how it works, with Mark's function as restore_order:
#include <iostream>
#include <algorithm>
#include <vector>
#include <utility>

using namespace std;

typedef std::pair<double, int> myPairType;
typedef std::vector<myPairType> myVectorType;

void outputMV(const myVectorType& vect, std::ostream& out)
{
    for (const auto& element : vect)
        out << element.first << " " << element.second << '\n';
}

// https://stackoverflow.com/a/28813905/1133179
template<class RandomIt, class Compare>
void restore_order(RandomIt first, RandomIt last, Compare comp)
{
    RandomIt mid = std::is_sorted_until(first, last, comp);
    std::sort(mid, last, comp);
    std::inplace_merge(first, mid, last, comp);
}

int main() {
    myVectorType myVector = {{3.5, 0}, {1.4, 1}, {2.5, 2}, {1.0, 3}};
    myVectorType mv2 = {{3.6, 0}, {1.35, 1}, {2.6, 2}, {1.36, 3}};
    auto comparer = [](const auto& lhs, const auto& rhs) { return lhs.first < rhs.first; };

    // make sure we didn't mess with the initial indexing
    int i = 0;
    for (auto& element : myVector) element.second = i++;
    i = 0;
    for (auto& element : mv2) element.second = i++;

    // sort the initial vector
    std::sort(myVector.begin(), myVector.end(), comparer);
    outputMV(myVector, cout);

    // this will replace each element of myVector with a corresponding
    // value from mv2 using the old sorted order
    std::for_each(myVector.begin(), myVector.end(),
                  [mv2](auto& el) { el = mv2[el.second]; });

    // restore order in case it was different for the new vector
    restore_order(myVector.begin(), myVector.end(), comparer);
    outputMV(myVector, cout);

    return 0;
}
This works in O(n) up to the restore step. The trick then is to use a good function for that step: a nice candidate will have good complexity for nearly sorted inputs. I used the function Mark Ransom posted, which works, but still isn't perfect.
It could be outperformed by a bubble-sort-inspired method: iterate over the elements and, whenever the current and next element are out of order, keep swapping the element backwards. However, there is a bet on how much the order changes: if the order doesn't vary much you will stay close to O(2n); if it does, you can go up to O(n^2).
I think the best would be an implementation of natural merge sort. That has best case (sorted input) O(n), and worst O(n log n).
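For reference, here is one way such a natural merge sort can be put together from standard algorithms (my own formulation of the idea, not code from this thread): each pass detects adjacent ascending runs with std::is_sorted_until and merges neighbouring pairs with std::inplace_merge, so already-sorted input finishes in a single O(n) pass, while the worst case stays around O(n log n) comparisons when std::inplace_merge can obtain a buffer.
#include <algorithm>

template <class RandomIt, class Compare>
void natural_merge_sort(RandomIt first, RandomIt last, Compare comp)
{
    for (;;) {
        RandomIt runStart = first;
        bool merged = false;
        while (runStart != last) {
            RandomIt runMid = std::is_sorted_until(runStart, last, comp);
            if (runMid == last) break;                        // last (unpaired) run of this pass
            RandomIt runEnd = std::is_sorted_until(runMid, last, comp);
            std::inplace_merge(runStart, runMid, runEnd, comp); // merge two adjacent runs
            runStart = runEnd;
            merged = true;
        }
        if (!merged) return;                                  // whole range is one sorted run
    }
}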

What is the most efficient way of removing duplicates from a container using only an almost-equality criterion (no sort)

How do I remove duplicates from an unsorted container (mainly a vector) when I do not have the possibility to define operator<, e.g. when I can only define a fuzzy compare function?
This answer using sort does not work, since I cannot define a function for ordering the data.
template <typename T>
void removeDuplicatesComparable(T& cont) {
    for (auto iter = cont.begin(); iter != cont.end(); ++iter) {
        cont.erase(std::remove(boost::next(iter), cont.end(), *iter), cont.end());
    }
}
This is O(n²) but should be fairly cache-friendly.
Is there a faster or at least neater solution?
Edit: On why I cannot use sets. I do geometric comparisons. An example could be this but I have other entities different from polygons as well.
bool match(SegPoly const& left, SegPoly const& right, double epsilon) {
    double const cLengthCompare = 0.1; // just an example
    if (!isZero(left.getLength() - right.getLength(), cLengthCompare))
        return false;
    double const interArea = areaOfPolygon(left.intersected(right)); // geometric intersection
    if (!isZero(interArea - right.getArea(), epsilon))
        return false;
    else
        return true;
}
So for such comparisons I would not know how to formulate sorting or a neat hash function.
First, don't remove elements one at a time.
Next, use a hash table (or similar structure) to detect duplicates.
If you don't need to preserve order, then copy all elements into a hashset (this destroys duplicates), then recreate the vector using the values left in the hashset.
If you need to preserve order, then:
Set read and write iterators to the beginning of the vector.
Start moving the read iterator through, checking elements against a hashset or octree or something that allows finding nearby elements quickly.
For each element that collides with one in the hashset/octree, advance the read iterator only.
For elements that do not collide, move them from the read iterator to the write iterator, copy them into the hashset/octree, then advance both.
When the read iterator reaches the end, call erase to truncate the vector at the write iterator position.
The key advantage of the octree is that while it doesn't let you immediately determine whether there is something close enough to be a "duplicate", it allows you to test against only near neighbours, excluding most of your dataset. So your algorithm might be O(N lg N) or even O(N lg lg N) depending on the spatial distribution.
Again, if you don't care about the ordering, you can actually move survivors into the hashset/octree and at the end move them back into the vector (compactly).
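A minimal sketch of the order-preserving read/write-iterator scheme described above; the linear scan over the kept elements stands in for the hashset/octree lookup (with a spatial index the inner loop would only visit near neighbours), and almostEqual is just a placeholder for the real fuzzy comparison:
#include <cmath>
#include <vector>

bool almostEqual(double a, double b) { return std::fabs(a - b) < 1e-6; }  // placeholder predicate

void removeFuzzyDuplicates(std::vector<double>& v) {
    auto write = v.begin();
    for (auto read = v.begin(); read != v.end(); ++read) {
        bool duplicate = false;
        for (auto it = v.begin(); it != write; ++it) {       // survivors kept so far
            if (almostEqual(*it, *read)) { duplicate = true; break; }
        }
        if (!duplicate) {
            *write = *read;                                   // keep the element, preserving order
            ++write;
        }
    }
    v.erase(write, v.end());                                  // truncate at the write position
}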
If you don't want to rewrite your code to prevent duplicates from being placed in the vector to begin with, you can do something like this:
std::vector<Type> myVector;
// fill in the vector's data
std::unordered_set<Type> mySet(myVector.begin(), myVector.end());
myVector.assign(mySet.begin(), mySet.end());
This will be O(2n) = O(n).
std::set (or std::unordered_set - which uses a hash instead of a comparison) doesn't allow for duplicates, so it will eliminate them as the set is initialized. Then you re-assign the vector with the non-duplicated data.
Since you are insisting that you cannot create a hash, another alternative is to create a temporary vector:
std::vector<Type> vec1;
// fill vec1 with your data
std::vector<Type> vec2;
vec2.reserve(vec1.size()); // vec1.size() is the maximum possible size for vec2
std::for_each(vec1.begin(), vec1.end(), [&](const Type& t)
{
    bool is_unique = true;
    for (std::vector<Type>::iterator it = vec2.begin(); it != vec2.end(); ++it)
    {
        if (YourCustomEqualityFunction(*it, t))
        {
            is_unique = false;
            break;
        }
    }
    if (is_unique)
    {
        vec2.push_back(t);
    }
});
vec1.swap(vec2);
If copies are a concern, switch to a vector of pointers, and you can decrease the memory reallocations:
std::vector<std::shared_ptr<Type>> vec1;
// fill vec1 with your data
std::vector<std::shared_ptr<Type>> vec2;
vec2.reserve(vec1.size()); // vec1.size() is the maximum possible size for vec2
std::for_each(vec1.begin(), vec1.end(), [&](const std::shared_ptr<Type>& t)
{
    bool is_unique = true;
    for (std::vector<std::shared_ptr<Type>>::iterator it = vec2.begin(); it != vec2.end(); ++it)
    {
        if (YourCustomEqualityFunction(**it, *t))
        {
            is_unique = false;
            break;
        }
    }
    if (is_unique)
    {
        vec2.push_back(t);
    }
});
vec1.swap(vec2);

Efficient way to get the indices of the k highest values in vector<float>

How can I create a std::map<int, float> from a vector<float>, so that the map contains the k highest values from the vector, with the keys being the indices of those values in the vector?
A naive approach would be to traverse the vector (O(n)), extract and erase (O(n)) the highest element k times (O(k)), leading to a complexity of O(k*n^2), which is suboptimal, I guess.
Even better would be to just copy (O(n)) and remove the smallest until the size is k, which would lead to O(n^2). Still polynomial...
Any ideas?
The following should do the job:
#include <cstdint>
#include <algorithm>
#include <iostream>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Compare: greater T2 first.
struct greater_by_second
{
    template <typename T1, typename T2>
    bool operator () (const std::pair<T1, T2>& lhs, const std::pair<T1, T2>& rhs)
    {
        return std::tie(lhs.second, lhs.first) > std::tie(rhs.second, rhs.first);
    }
};

std::map<std::size_t, float> get_index_pairs(const std::vector<float>& v, int k)
{
    std::vector<std::pair<std::size_t, float>> indexed_floats;
    indexed_floats.reserve(v.size());
    for (std::size_t i = 0, size = v.size(); i != size; ++i) {
        indexed_floats.emplace_back(i, v[i]);
    }
    std::nth_element(indexed_floats.begin(),
                     indexed_floats.begin() + k,
                     indexed_floats.end(), greater_by_second());
    return std::map<std::size_t, float>(indexed_floats.begin(), indexed_floats.begin() + k);
}
Let's test it:
int main(int argc, char *argv[])
{
    const std::vector<float> fs {45.67f, 12.34f, 67.8f, 4.2f, 123.4f};
    for (const auto& elem : get_index_pairs(fs, 2)) {
        std::cout << elem.first << " " << elem.second << std::endl;
    }
    return 0;
}
Output:
2 67.8
4 123.4
You can keep a list of the k-highest values so far, and update it for each of the values in your vector, which takes you down to O(n*log k) (assuming log k for each update of the list of highest values) or, for a naive list, O(kn).
You can probably get closer to O(n), but assuming k is probably pretty small, may not be worth the effort.
Your optimal solution will have a complexity of O(n+k*log(k)), since sorting the k elements can be reduced to this, and you will have to look at each of the elements at least once.
Two possible solutions come to mind:
Iterate through the vector while adding all elements to a bounded (size k) priority-queue/heap, also keeping their indices (a sketch follows after these two options).
Create a copy of your vector with including the original indices, i.e. std::vector<std::pair<float, std::size_t>> and use std::nth_element to move the k highest values to the front using a comparator that compares only the first element. Then insert those elements into your target map. Ironically, that last step adds you the k*log(k) in the overall complexity, while nth_element is linear (but will permute your indices).
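A sketch of the first option, the bounded (size k) min-heap of (value, index) pairs; kHighest is a made-up helper name, each push/pop costs O(log k), and the whole pass is O(n log k):
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <utility>
#include <vector>

std::map<std::size_t, float> kHighest(const std::vector<float>& v, std::size_t k) {
    using Entry = std::pair<float, std::size_t>;                               // (value, index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;  // min-heap
    for (std::size_t i = 0; i < v.size(); ++i) {
        heap.emplace(v[i], i);
        if (heap.size() > k) heap.pop();          // drop the current minimum, keep the k largest
    }
    std::map<std::size_t, float> result;
    for (; !heap.empty(); heap.pop())
        result.emplace(heap.top().second, heap.top().first);
    return result;
}

int main() {
    const std::vector<float> fs {45.67f, 12.34f, 67.8f, 4.2f, 123.4f};
    for (const auto& e : kHighest(fs, 2))
        std::cout << e.first << " " << e.second << "\n";      // 2 67.8 and 4 123.4
}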
Maybe I did not get it, but in case the incremental approach is not an option, why not use std::partial_sort instead of a full std::sort?
That should be O(n log k), and since k is very likely to be small, that makes it practically O(n).
Edit: thanks to Mike Seymour for the update.
Edit (bis):
The idea is to use an intermediate vector for sorting, and then put it into the map. Trying to reduce the order of the computation would only be justified for a significant amount of data, so I guess the copy time (O(n)) could be lost in background noise.
Edit (ter):
That's actually what the selected answer does, without the theoretical explanations :).
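A sketch of that std::partial_sort idea on the same sample data as the accepted answer: an intermediate vector of (value, index) pairs, partially sorted so the k largest come first, then copied into the map.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

int main() {
    const std::vector<float> fs {45.67f, 12.34f, 67.8f, 4.2f, 123.4f};
    const std::size_t k = 2;

    // Intermediate vector of (value, index) pairs.
    std::vector<std::pair<float, std::size_t>> tmp;
    tmp.reserve(fs.size());
    for (std::size_t i = 0; i < fs.size(); ++i) tmp.emplace_back(fs[i], i);

    // Partially sort so the k largest values come first (descending by value).
    std::partial_sort(tmp.begin(), tmp.begin() + k, tmp.end(),
                      [](const std::pair<float, std::size_t>& a,
                         const std::pair<float, std::size_t>& b) { return a.first > b.first; });

    std::map<std::size_t, float> result;
    for (std::size_t i = 0; i < k; ++i) result.emplace(tmp[i].second, tmp[i].first);

    for (const auto& e : result) std::cout << e.first << " " << e.second << "\n";  // 2 67.8, 4 123.4
}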

How to efficiently select a random element from a std::set

How can I efficiently select a random element from a std::set?
A std::set::iterator is not a random access iterator, so I can't directly index a randomly chosen element like I could with a std::deque or std::vector.
I could take the iterator returned from std::set::begin() and increment it a random number of times in the range [0,std::set::size()), but that seems to be doing a lot of unnecessary work. For an "index" close to the set's size, I would end up traversing the entire first half of the internal tree structure, even though it's already known the element won't be found there.
Is there a better approach?
In the name of efficiency, I am willing to define "random" as less random than whatever approach I might have used to choose a random index in a vector. Call it "reasonably random".
Edit...
Many insightful answers below.
The short version is that even though you can find a specific element in log(n) time, you can't find an arbitrary element in that time through the std::set interface.
Use boost::container::flat_set instead:
boost::container::flat_set<int> set;
// ...
auto it = set.begin() + rand() % set.size();
Insertions and deletions become O(N) though, I don't know if that's a problem. You still have O(log N) lookups, and the fact that the container is contiguous gives an overall improvement that often outweighs the loss of O(log N) insertions and deletions.
What about a predicate for find (or lower_bound) which causes a random tree traversal? You'd have to tell it the size of the set so it could estimate the height of the tree and sometimes terminate before leaf nodes.
Edit: I realized the problem with this is that std::lower_bound takes a predicate but does not have any tree-like behavior (internally it uses std::advance which is discussed in the comments of another answer). std::set<>::lower_bound uses the predicate of the set, which cannot be random and still have set-like behavior.
Aha, you can't use a different predicate, but you can use a mutable predicate. Since std::set passes the predicate object around by value, you must use a predicate reference as the comparator type so you can reach in and modify it (setting it to "randomize" mode).
Here's a quasi-working example. Unfortunately I can't wrap my brain around the right random predicate so my randomness is not excellent, but I'm sure someone can figure that out:
#include <iostream>
#include <set>
#include <stdlib.h>
#include <time.h>

using namespace std;

template <typename T>
struct RandomPredicate {
    RandomPredicate() : size(0), randomize(false) { }
    bool operator () (const T& a, const T& b) {
        if (!randomize)
            return a < b;
        int r = rand();
        if (size == 0)
            return false;
        else if (r % size == 0) {
            size = 0;
            return false;
        } else {
            size /= 2;
            return r & 1;
        }
    }
    size_t size;
    bool randomize;
};

int main()
{
    srand(time(0));
    RandomPredicate<int> pred;
    set<int, RandomPredicate<int> & > s(pred);
    for (int i = 0; i < 100; ++i)
        s.insert(i);
    pred.randomize = true;
    for (int i = 0; i < 100; ++i) {
        pred.size = s.size();
        set<int, RandomPredicate<int> & >::iterator it = s.lower_bound(0);
        cout << *it << endl;
    }
}
My half-baked randomness test is ./demo | sort -u | wc -l to see how many unique integers I get out. With a larger sample set try ./demo | sort | uniq -c | sort -n to look for unwanted patterns.
If you could access the underlying red-black tree (assuming that one exists) then you could access a random node in O(log n) choosing L/R as the successive bits of a ceil(log2(n))-bit random integer. However, you can't, as the underlying data structure is not exposed by the standard.
Xeo's solution of placing iterators in a vector is O(n) time and space to set up, but amortized constant overall. This compares favourably to std::next, which is O(n) time.
You can use std::advance:
set <int> myset;
//insert some elements into myset
int rnd = rand() % myset.size();
set <int> :: const_iterator it(myset.begin());
advance(it, rnd);
//now 'it' points to your random element
Another way to do this, probably less random:
int mini = *myset.begin(), maxi = *myset.rbegin();
int rnd = rand() % (maxi - mini + 1) + mini;
int rndresult = *myset.lower_bound(rnd);
If either the set doesn't update frequently or you don't need to run this algorithm frequently, keep a mirrored copy of the data in a vector (or just copy the set to a vector on need) and randomly select from that.
Another approach, as seen in a comment, is to keep a vector of iterators into the set (they're only invalidated on element deletion for sets) and randomly select an iterator.
Finally if you don't need a tree-based set, you could use vector or deque as your underlying container and sort/unique-ify when needed.
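A minimal sketch of the mirrored-copy suggestion above: copy the set into a vector (once, or whenever the set changes) and index it uniformly with <random>.
#include <iostream>
#include <random>
#include <set>
#include <vector>

int main() {
    std::set<int> s = {2, 3, 5, 7, 11, 13};

    // Mirror the set into a vector, then pick uniformly by index.
    std::vector<int> mirror(s.begin(), s.end());

    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<std::size_t> dist(0, mirror.size() - 1);
    std::cout << mirror[dist(gen)] << '\n';
}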
You can do this by maintaining a normal array of values; when you insert to the set, you append the element to the end of the array (O(1)), then when you want to generate a random number you can grab it from the array in O(1) as well.
The issue comes when you want to remove elements from the array. The most naive method would take O(n), which might be efficient enough for your needs. However, this can be improved to O(log n) using the following method:
Keep, for each index i in the array, prfx[i], which represents the number of non-deleted elements in the range 0...i in the array. Keep a segment tree, where you keep the maximum prfx[i] contained in each range.
Updating the segment tree can be done in O(log n) per deletion. Now, when you want to access the random number, you query the segment tree to find the "real" index of the number (by finding the earliest range in which the maximum prfx is equal to the random index). This makes the random-number generation of complexity O(log n).
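A sketch of that bookkeeping, using a sum-based segment tree over "alive" flags rather than the max-of-prefix formulation described above (the two are equivalent): erase marks a slot dead in O(log n), and kth walks down the tree to find the position of the k-th live slot in O(log n). Pair it with a plain std::vector<T> of values; sampling then picks r = uniform in [0, liveCount()) and returns values[kth(r)].
#include <cstddef>
#include <vector>

class LiveIndex {
    std::size_t size_;                          // number of leaves, rounded up to a power of two
    std::vector<std::size_t> tree;              // tree[i] = count of live slots in node i's range
public:
    explicit LiveIndex(std::size_t count) {
        size_ = 1;
        while (size_ < count) size_ *= 2;
        tree.assign(2 * size_, 0);
        for (std::size_t i = 0; i < count; ++i) tree[size_ + i] = 1;     // all slots start alive
        for (std::size_t i = size_ - 1; i > 0; --i) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }
    std::size_t liveCount() const { return tree[1]; }
    void erase(std::size_t pos) {               // mark array slot `pos` as deleted
        std::size_t i = size_ + pos;
        if (tree[i] == 0) return;
        tree[i] = 0;
        for (i /= 2; i >= 1; i /= 2) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }
    std::size_t kth(std::size_t k) const {      // array position of the k-th (0-based) live slot
        std::size_t i = 1;
        while (i < size_) {
            if (k < tree[2 * i]) i = 2 * i;     // answer is in the left child
            else { k -= tree[2 * i]; i = 2 * i + 1; }
        }
        return i - size_;
    }
};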
Average O(1)/O(log N) (hashable/unhashable) insert/delete/sample with off-the-shelf containers
The idea is simple: use rejection sampling while upper-bounding the rejection rate, which is achievable with an amortized O(1) compaction operation.
However, unlike solutions based on augmented trees, this approach cannot be extended to support weighted sampling.
#include <cstddef>
#include <map>
#include <unordered_map>
#include <unordered_set>
#include <utility>

template <typename T>
class UniformSamplingSet {
    size_t max_id = 0;
    std::unordered_set<size_t> unused_ids;
    std::unordered_map<size_t, T> id2value;
    std::map<T, size_t> value2id;

    void compact() {
        size_t id = 0;
        std::map<T, size_t> new_value2id;
        std::unordered_map<size_t, T> new_id2value;
        for (auto [_, value] : id2value) {
            new_value2id.emplace(value, id);
            new_id2value.emplace(id, value);
            ++id;
        }
        max_id = id;
        unused_ids.clear();
        std::swap(id2value, new_id2value);
        std::swap(value2id, new_value2id);
    }

public:
    size_t size() {
        return id2value.size();
    }

    void insert(const T& value) {
        size_t id;
        if (!unused_ids.empty()) {
            id = *unused_ids.begin();
            unused_ids.erase(unused_ids.begin());
        } else {
            id = max_id++;
        }
        if (!value2id.emplace(value, id).second) {
            unused_ids.insert(id);
        } else {
            id2value.emplace(id, value);
        }
    }

    void erase(const T& value) {
        auto it = value2id.find(value);
        if (it == value2id.end()) return;
        unused_ids.insert(it->second);
        id2value.erase(it->second);
        value2id.erase(it);
        if (unused_ids.size() * 2 > max_id) {
            compact();
        }
    }

    // uniform(n): uniform random in [0, n)
    template <typename F>
    T sample(F&& uniform) {
        size_t i;
        do { i = uniform(max_id); } while (unused_ids.find(i) != unused_ids.end());
        return id2value.at(i);
    }
};
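A possible usage sketch for the class above; the uniform lambda is just one way to satisfy the sample(F&&) interface (a callable returning a uniform value in [0, n)):
#include <iostream>
#include <random>
#include <string>

int main() {
    UniformSamplingSet<std::string> s;
    s.insert("apple");
    s.insert("banana");
    s.insert("cherry");
    s.erase("banana");

    std::mt19937 gen(std::random_device{}());
    auto uniform = [&gen](std::size_t n) {
        return std::uniform_int_distribution<std::size_t>(0, n - 1)(gen);
    };
    std::cout << s.sample(uniform) << '\n';   // prints "apple" or "cherry"
}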

C++, fast remove elements from vector unique to another vector

There are two unsorted vectors, a vector of int and a vector of pairs <int, float>,
std::vector <int> v1;
std::vector <std::pair<int, float> > v2;
containing millions of items.
How can I remove, as fast as possible, those items from v1 that also appear as the first member of a pair in v2 (so that only the items of v1 not present in v2.first remain)?
Example:
v1: 5 3 2 4 7 8
v2: {2,8} {7,10} {5,0} {8,9}
----------------------------
v1: 3 4
There are two tricks I would use to do this as quickly as possible:
Use some sort of associative container (probably std::unordered_set) to store all of the integers in the second vector to make it dramatically more efficient to look up whether some integer in the first vector should be removed.
Optimize the way in which you delete elements from the initial vector.
More concretely, I'd do the following. Begin by creating a std::unordered_set and adding all of the integers that are the first integer in the pair from the second vector. This gives (expected) O(1) lookup time to check whether or not a specific int exists in the set.
Now that you've done that, use the std::remove_if algorithm to delete everything from the original vector that exists in the hash table. You can use a lambda to do this:
std::unordered_set<int> toRemove = /* ... */;
v1.erase(std::remove_if(v1.begin(), v1.end(), [&toRemove] (int x) -> bool {
    return toRemove.find(x) != toRemove.end();
}), v1.end());
This first step of storing everything in the unordered_set takes expected O(n) time. The second step does a total of expected O(n) work by bunching all the deletes up to the end and making lookups take small time. This gives a total of expected O(n)-time, O(n) space for the entire process.
If you are allowed to sort the second vector (the pairs), then you could alternatively do this in O(n log n) worst-case time, O(log n) worst-case space by sorting the vector by the key, then using std::binary_search to check whether a particular int from the first vector should be eliminated or not. Each binary search takes O(log n) time, so the total time required is O(n log n) for the sorting, O(log n) time per element in the first vector (for a total of O(n log n)), and O(n) time for the deletion, giving a total of O(n log n).
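For completeness, a sketch of that sorted-v2 plus std::binary_search alternative (removeCommon is a made-up name; it sorts v2 by its int key, then drops every element of v1 found among those keys):
#include <algorithm>
#include <utility>
#include <vector>

void removeCommon(std::vector<int>& v1, std::vector<std::pair<int, float>>& v2) {
    auto byKey = [](const std::pair<int, float>& a, const std::pair<int, float>& b) {
        return a.first < b.first;
    };
    std::sort(v2.begin(), v2.end(), byKey);                        // O(n log n)
    v1.erase(std::remove_if(v1.begin(), v1.end(),
                            [&v2, &byKey](int x) {
                                // O(log n) lookup against the sorted keys of v2
                                return std::binary_search(v2.begin(), v2.end(),
                                                          std::make_pair(x, 0.0f), byKey);
                            }),
             v1.end());
}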
Hope this helps!
Assuming that neither container is sorted and that sorting is actually too expensive or memory is scarce:
v1.erase(std::remove_if(v1.begin(), v1.end(),
    [&v2](int i) {
        return std::find_if(v2.begin(), v2.end(),
                            [i](const std::pair<int, float>& p) {
                                return p.first == i;
                            }) != v2.end();
    }), v1.end());
Alternatively, sort v2 on first and use a binary search instead. If there is enough memory, use an unordered_set to store the first elements of v2.
Complete C++03 version:
#include <iostream>
#include <vector>
#include <utility>
#include <algorithm>

struct find_func {
    find_func(int i) : i(i) {}
    int i;
    bool operator()(const std::pair<int, float>& p) {
        return p.first == i;
    }
};

struct remove_func {
    remove_func(std::vector< std::pair<int, float> >* v2)
        : v2(v2) {}
    std::vector< std::pair<int, float> >* v2;
    bool operator()(int i) {
        return std::find_if(v2->begin(), v2->end(), find_func(i)) != v2->end();
    }
};

int main()
{
    // c++11 here
    std::vector<int> v1 = {5, 3, 2, 4, 7, 8};
    std::vector< std::pair<int, float> > v2 = {{2,8}, {7,10}, {5,0}, {8,9}};

    v1.erase(std::remove_if(v1.begin(), v1.end(), remove_func(&v2)), v1.end());

    // and here
    for (auto x : v1) {
        std::cout << x << std::endl;
    }
    return 0;
}