Removing duplicates from a non-sortable vector

I'm looking for a way to remove duplicates from a vector (let's call it theGreatVector :D).
I can't use std::sort followed by std::unique because there is no way to sort my objects.
theGreatVector contains some vector<Item*> (smallVectors).
I have an overload of == for vector<Item*>, so I can use it.
I'm able to write something in O(n²), but I need time efficiency
(theGreatVector.size() could be 10⁵ or 10⁶).
Right now what I have is something like this
(I fill my vector only if smallOne isn't in it):
for(i=0;i<size;i++)
{
vector<Item*> smallOne = FindFacets(i);
if(smallOne doesn't belong to theGreatVector) // this line is already O(n) :/
{
theGreatVector.push_back(smallOne);
}
}
If there is a way to do that even in nlog(n) + n or anything lower than n², that'd be great !
Thanks a lot
Azh

You can always std::tie every data member into a std::tuple and use lexicographic ordering on that to sort a vector of pointers to your big data structure. You can then do std::unique on that data structure before copying the output. With a small modification you could also remove the duplicates in place by sorting the big Item vector directly.
#include <algorithm>
#include <iterator>
#include <memory>
#include <tuple>
#include <vector>
// tuples have builtin lexicographic ordering,
// I'm assuming all your Item's data members also have operator<
bool operator<(Item const& lhs, Item const& rhs)
{
return std::tie(lhs.first_data, /*...*/ lhs.last_data) < std::tie(rhs.first_data, /*...*/ rhs.last_data);
}
int main()
{
// In the Beginning, there was some data
std::vector<Item> vec;
// fill it
// init helper vector with addresses of vec, complexity O(N)
std::vector<Item*> pvec;
pvec.reserve(vec.size());
std::transform(std::begin(vec), std::end(vec), std::back_inserter(pvec), [](Item& item){ return std::addressof(item); });
// sort to put duplicates in adjacent positions, complexity O(N log N)
std::sort(std::begin(pvec), std::end(pvec), [](Item const* lhs, Item const* rhs){
return *lhs < *rhs; // delegates to operator< for Item
});
// remove duplicates, complexity O(N)
auto it = std::unique(std::begin(pvec), std::end(pvec), [](Item const* lhs, Item const* rhs){
return *lhs == *rhs; // assumes Item has operator==, if not use std::tuple::operator==
});
pvec.erase(it, std::end(pvec));
// copy result, complexity O(N)
std::vector<Item> result;
result.reserve(pvec.size());
std::transform(std::begin(pvec), std::end(pvec), std::back_inserter(result), [](Item const* pelem){
return *pelem;
});
// And it was good, and done in O(N log N) complexity
}

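Take a look at std::unordered_set:
http://www.cplusplus.com/reference/unordered_set/unordered_set/
It seems to do what you want. Insertions of single elements are O(1) on average and O(n) in the worst case. Besides the equality operator, you also need to provide a hash function for the element type, since the standard library does not supply std::hash for std::vector<Item*>.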
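A minimal sketch of that idea, under some assumptions: `Item`, `Facet`, `FacetHash` and `uniqueFacets` are hypothetical names, and equality is the default element-wise comparison of the stored pointers. If your own == treats reordered vectors as equal, the hash would have to be order-independent as well.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

struct Item { int id; };  // hypothetical stand-in for the asker's Item

using Facet = std::vector<Item*>;

// Hypothetical hash: combines the hashes of the stored pointers, in order.
struct FacetHash {
    std::size_t operator()(Facet const& f) const {
        std::size_t seed = f.size();
        for (Item* p : f) {
            // hash_combine-style mixing step (same idea as boost::hash_combine)
            seed ^= std::hash<Item*>()(p) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Keeps the first occurrence of each facet and preserves the input order.
// Runs in O(N) on average, where N is the number of facets.
std::vector<Facet> uniqueFacets(std::vector<Facet> const& all) {
    std::unordered_set<Facet, FacetHash> seen;
    std::vector<Facet> result;
    for (Facet const& f : all) {
        if (seen.insert(f).second) {  // true only the first time f is inserted
            result.push_back(f);
        }
    }
    return result;
}
```

In the asker's loop, the same `seen.insert(smallOne).second` check could replace the O(n) "doesn't belong to" test.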

Related

Sorting a Vector of Vector in Cpp

Say I have this vector of vector [[5,10],[2,5],[4,7],[3,9]] and I want to sort it using the sort() method of cpp, such that it becomes this [[5,10],[3,9],[4,7],[2,5]] after sorting. That is I want to sort based on the second index.
Now I have written this code to sort this vector of vector, but it is not working correctly.
static bool compareInterval( vector<vector<int>> &v1, vector<vector<int>> &v2)
{
return (v1[0][1]>v2[0][1]);
}
sort(boxTypes.begin(), boxTypes.end(), compareInterval);
Can anyone tell me where I am going wrong and how I can correct it? Thanks in advance.

Your sort could look like
std::sort(boxTypes.begin(), boxTypes.end(), [](auto const& lhs, auto const& rhs) {
return lhs[1] > rhs[1];
});
in other words sorting by the [1] element of each vector and using > to sort in descending order. Note that in the lambda function lhs and rhs are of type const std::vector<int>&.

When std::sort sorts a vector of vectors, it passes two of the inner vectors (not vectors of vectors) to the comparison function and uses the result to decide their relative order.
Hence you only need to compare two vector<int> objects (you tried to compare vectors of vectors).
The change you need to make in compareInterval is:
static bool compareInterval(const vector<int> &v1, const vector<int> &v2)
{
return (v1[1]>v2[1]);
}
Find my testing code below:
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;
// Compares two rows by their second element, in descending order
static bool compareInterval(const vector<int> &v1, const vector<int> &v2)
{
return (v1[1]>v2[1]);
}
int main() {
vector<vector<int>> boxTypes = {{5,10},{2,5},{4,7},{3,9}};
sort(boxTypes.begin(), boxTypes.end(), compareInterval);
for(const auto &box : boxTypes)
cout<<box[0]<<" "<<box[1]<<"\n";
}

Range projections come in handy for this.
The std::ranges::sort algorithm takes:
just the vector to sort; no iterators to the begin and end.
(optionally) the function you want to use for sorting, greater in this case.
(optionally) the projection: for every element t of the original vector, which happens to be another vector of two elements, get its second element, i.e. t[1], and sort on that.
std::ranges::sort(boxTypes, std::ranges::greater{}, [](auto&& bt) { return bt[1]; });
Note I have only been able to have this compiling on msvc, not on gcc or clang (and with /std:c++latest, not even with /std:c++20; https://godbolt.org/z/9Kqfa9vhx).

What is the most efficient way of removing duplicates from a container only using almost equality criteria (no sort)

How do I remove duplicates from a non sorted container (mainly vector) when I do not have the possibility to define operator< e.g. when I can only define a fuzzy compare function.
This answer using sort does not work since I cannot define a function for ordering the data.
template <typename T>
void removeDuplicatesComparable(T& cont){
for(auto iter=cont.begin();iter!=cont.end();++iter){
cont.erase(std::remove(boost::next(iter),cont.end(),*iter),cont.end());
}
}
This is O(n²) and should be quite localized concerning cache hits.
Is there a faster or at least neater solution?
Edit: On why I cannot use sets. I do geometric comparisons. An example could be this but I have other entities different from polygons as well.
bool match(SegPoly const& left,SegPoly const& right,double epsilon){
double const cLengthCompare = 0.1; //just an example
if(!isZero(left.getLength()- right.getLength(), cLengthCompare)) return false;
double const interArea =areaOfPolygon(left.intersected(right)); //this is a geometric intersection
if(!isZero(interArea-right.getArea(),epsilon)) return false;
else return true;
}
So for such comparisons I would not know how to formulate sorting or a neat hash function.

First, don't remove elements one at a time.
Next, use a hash table (or similar structure) to detect duplicates.
If you don't need to preserve order, then copy all elements into a hashset (this destroys duplicates), then recreate the vector using the values left in the hashset.
If you need to preserve order, then:
Set read and write iterators to the beginning of the vector.
Start moving the read iterator through, checking elements against a hashset or octree or something that allows finding nearby elements quickly.
For each element that collides with one in the hashset/octree, advance the read iterator only.
For elements that do not collide, move from read iterator to write iterator, copy to hashset/octree, then advance both.
When read iterator reaches the end, call erase to truncate the vector at the write iterator position.
The key advantage of the octree is that while it doesn't let you immediately determine whether there is something close enough to be a "duplicate", it allows you to test against only near neighbors, excluding most of your dataset. So your algorithm might be O(N lg N) or even O(N lg lg N) depending on the spatial distribution.
Again, if you don't care about the ordering, you can actually move survivors into the hashset/octree and at the end move them back into the vector (compactly).
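The order-preserving read/write steps above can be sketched roughly as follows. This is only an illustration for exact equality with a hashable `T`; `removeDuplicatesStable` is a hypothetical name, and for fuzzy geometric matching the `std::unordered_set` would have to be replaced by a spatial index such as an octree.

```cpp
#include <unordered_set>
#include <utility>
#include <vector>

// Order-preserving duplicate removal, assuming exact equality and std::hash<T>.
template <typename T>
void removeDuplicatesStable(std::vector<T>& v) {
    std::unordered_set<T> seen;
    auto write = v.begin();
    for (auto read = v.begin(); read != v.end(); ++read) {
        if (seen.insert(*read).second) {   // no collision: this element survives
            if (write != read) {
                *write = std::move(*read); // compact survivors towards the front
            }
            ++write;                       // advance both read and write
        }                                  // collision: advance read only
    }
    v.erase(write, v.end());               // truncate once, at the very end
}
```

The single erase at the end is what avoids the repeated element shifting of the one-at-a-time approach.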

If you don't want to rewrite your code to prevent duplicates from being placed in the vector to begin with, you can do something like this:
std::vector<Type> myVector;
// fill in the vector's data
std::unordered_set<Type> mySet(myVector.begin(), myVector.end());
myVector.assign(mySet.begin(), mySet.end());
This is O(2 * n) = O(n) on average, provided Type is hashable.
std::set (or std::unordered_set - which uses a hash instead of a comparison) doesn't allow for duplicates, so it will eliminate them as the set is initialized. Then you re-assign the vector with the non-duplicated data.
Since you are insisting that you cannot create a hash, another alternative is to create a temporary vector:
std::vector<Type> vec1;
// fill vec1 with your data
std::vector<Type> vec2;
vec2.reserve(vec1.size()); // vec1.size() will be the maximum possible size for vec2
std::for_each(vec1.begin(), vec1.end(), [&](const Type& t)
{
bool is_unique = true;
for (std::vector<Type>::iterator it = vec2.begin(); it != vec2.end(); ++it)
{
if (YourCustomEqualityFunction(*it, t)) // t matches an element we already kept
{
is_unique = false;
break;
}
}
if (is_unique)
{
vec2.push_back(t);
}
});
vec1.swap(vec2);
If copies are a concern, switch to a vector of pointers, and you can decrease the memory reallocations:
std::vector<std::shared_ptr<Type>> vec1;
// fill vec1 with your data
std::vector<std::shared_ptr<Type>> vec2;
vec2.reserve(vec1.size()); // vec1.size() will be the maximum possible size for vec2
std::for_each(vec1.begin(), vec1.end(), [&](const std::shared_ptr<Type>& t)
{
bool is_unique = true;
for (std::vector<std::shared_ptr<Type>>::iterator it = vec2.begin(); it != vec2.end(); ++it)
{
if (YourCustomEqualityFunction(**it, *t)) // t matches an element we already kept
{
is_unique = false;
break;
}
}
if (is_unique)
{
vec2.push_back(t);
}
});
vec1.swap(vec2);

how to find common words between two vectors of std::string

I am trying to find common words between 2 vectors of std::string. I want to get those into a sorted list which is sorted by length, and then words of each length to be sorted alphabetically. I need to use stl functions and functors.
My thoughts:
using a for_each go through first vector and for each word, compare it to the other vector using a functor (if common, append to a list in functor). Then the resulting list will have only common words in it. Here is where I am stuck, I know how to sort alphabetically, but how do I sort them by length and then sort the same length chunks alphabetically? I have looked around stl, but I am not finding what I need. Or, I am just thinking about this the wrong way. Any ideas?
Example:
vec1: "and", "thus", "it", "has", "a", "beginning", "and", "end"
vec2: "and", "therefore", "stars", "are", "beginning", "to", "fall", "to", "their", "end"
result: "and", "end", "beginning"

If you are allowed to sort vec1 and vec2, you can sort both vectors according to the criteria you specify and then use std::set_intersection to obtain the common elements, ordered by the same criteria:
#include <algorithm>
#include <iterator>
#include <list>
#include <string>
#include <vector>
std::sort(vec1.begin(), vec1.end(), funny_comp);
std::sort(vec2.begin(), vec2.end(), funny_comp);
std::list<std::string> intersection;
std::set_intersection(vec1.begin(), vec1.end(),
vec2.begin(), vec2.end(),
std::back_inserter(intersection),
funny_comp);
where funny_comp compares by string length, and performs a lexicographical comparison of strings if these have the same length:
bool funny_comp(const std::string &lhs, const std::string &rhs)
{
return (lhs.size() == rhs.size()) ? lhs < rhs
: lhs.size() < rhs.size();
}

If the vectors are sorted you can use std::set_intersection() to find the words common to each. std::set_intersection() is O(N) time on the number of items. Sort of course, is O(N log N).

Your solution is O(n^2). This means that if the length of the vectors is n, you're doing n*n operations: going over one vector and, for each element, going over the other vector to look for it.
If you can sort the vectors (using the plain sort function; no need for a fancy sort like you mentioned), the intersection itself takes O(n) using set_intersection, so the total is dominated by the O(n log n) sorts. Even if you can't sort them in place, copy them into new vectors and sort those. It's still much faster than what you're proposing.

To sort by length, then lexically, you need to define a comparison function (or functor) to do that:
struct by_len_lex {
bool operator()(std::string const &a, std::string const &b) const {
if (a.length() < b.length())
return true;
if (a.length() > b.length())
return false;
return a < b;
}
};
// ...
std::sort(strings1.begin(), strings1.end(), by_len_lex());
std::sort(strings2.begin(), strings2.end(), by_len_lex());
// find intersection:
std::set_intersection(strings1.begin(), strings1.end(),
strings2.begin(), strings2.end(),
std::back_inserter(results),
by_len_lex());
Note that since you're defining the sort criteria, you need to specify the same criteria both when sorting and when doing the intersection.
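Putting the pieces together, a self-contained sketch might look like this. `ByLenLex` and `commonWords` are hypothetical names, and dropping repeated words with `std::unique` is an assumption based on the example result in the question.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Orders strings by length first, then alphabetically within one length.
struct ByLenLex {
    bool operator()(std::string const& a, std::string const& b) const {
        if (a.length() != b.length()) return a.length() < b.length();
        return a < b;
    }
};

// Takes copies so the caller's vectors are left untouched.
std::vector<std::string> commonWords(std::vector<std::string> a,
                                     std::vector<std::string> b) {
    std::sort(a.begin(), a.end(), ByLenLex());
    std::sort(b.begin(), b.end(), ByLenLex());

    std::vector<std::string> out;
    // Same comparator as the sorts, otherwise the result is undefined.
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out), ByLenLex());

    // A word occurring several times in both inputs appears several times
    // in the intersection; keep one copy of each.
    out.erase(std::unique(out.begin(), out.end()), out.end());
    return out;
}
```

With the two vectors from the question this yields "and", "end", "beginning".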

This might not be the best solution, but you can use a map like the following:
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <algorithm>
using namespace std;
int main()
{
vector <string> v1{"and", "thus", "it", "has",
"a", "beginning", "and", "end"};
vector <string> v2{"and" ,"therefore", "stars",
"are", "beginning", "to","fall","to",
"their", "end"};
map <string,int> m;
auto check=[&](const string& x) { return m.find(x) != m.end() ; } ;
for_each(v1.begin(),
v1.end(),
[&](const string& x){
m[x] =1;
}
);
for_each(v2.begin(),
v2.end(),
[&](const string& x){
if(check(x))
cout<<x<<endl;
}
);
}

How to efficiently compare vectors with C++?

I need advice for micro optimization in C++ for a vector comparison function,
it compares two vectors for equality and order of elements does not matter.
template <class T>
static bool compareVectors(const vector<T> &a, const vector<T> &b)
{
int n = a.size();
std::vector<bool> free(n, true);
for (int i = 0; i < n; i++) {
bool matchFound = false;
for (int j = 0; j < n; j++) {
if (free[j] && a[i] == b[j]) {
matchFound = true;
free[j] = false;
break;
}
}
if (!matchFound) return false;
}
return true;
}
This function is used heavily and I am thinking of possible way to optimize it.
Can you please give me some suggestions? By the way I use C++11.
Thanks

I just realized that this code only does a kind of "set equivalency" check (and now I see that you actually did say that, what a lousy reader I am!). This can be achieved much more simply:
template <class T>
static bool compareVectors(vector<T> a, vector<T> b)
{
std::sort(a.begin(), a.end());
std::sort(b.begin(), b.end());
return (a == b);
}
You'll need to include the header algorithm.
If your vectors are always of same size, you may want to add an assertion at the beginning of the method:
assert(a.size() == b.size());
This will be handy in debugging your program if you once perform this operation for unequal lengths by mistake.
Otherwise, the vectors can't be the same if they have unequal length, so just add
if ( a.size() != b.size() )
{
return false;
}
before the sort instructions. This will save you lots of time.
The complexity of this is technically O(n*log(n)), because it mainly depends on the sorting, which (usually) has that complexity. Asymptotically this beats your O(n^2) approach, although for small vectors the required copies may make it slower in practice. The copies can be avoided if you are allowed to sort the original vectors in place.
If you want to stick with your approach, but tweak it, here are my thoughts on this:
You can use std::find for this:
template <class T>
static bool compareVectors(const vector<T> &a, const vector<T> &b)
{
const size_t n = a.size(); // make it const and unsigned!
std::vector<bool> free(n, true);
for ( size_t i = 0; i < n; ++i )
{
bool matchFound = false;
auto start = b.cbegin();
while ( true )
{
const auto position = std::find(start, b.cend(), a[i]);
if ( position == b.cend() )
{
break; // nothing found
}
const auto index = position - b.cbegin();
if ( free[index] )
{
// free pair found
free[index] = false;
matchFound = true;
break;
}
else
{
start = position + 1; // search in the rest
}
}
if ( !matchFound )
{
return false;
}
}
return true;
}
Another possibility is replacing the structure to store free positions. You may try a std::bitset or just store the used indices in a vector and check if a match isn't in that index-vector. If the outcome of this function is very often the same (so either mostly true or mostly false) you can optimize your data structures to reflect that. E.g. I'd use the list of used indices if the outcome is usually false since only a handful of indices might needed to be stored.
This method has the same complexity as your approach. Using std::find to search for things is sometimes better than a hand-written loop. (Note that std::find is always a linear search; if the data is known to be sorted, use std::lower_bound or std::binary_search to get a binary search.)

You can probabilistically compare two unsorted vectors (u,v) in O(n):
Calculate:
U= xor(h(u[0]), h(u[1]), ..., h(u[n-1]))
V= xor(h(v[0]), h(v[1]), ..., h(v[n-1]))
If U==V then the vectors are probably equal.
h(x) is any non-cryptographic hash function - such as MurmurHash. (Cryptographic functions would work as well but would usually be slower).
(This would work even without hashing, but it would be much less robust when the values have a relatively small range).
A 128-bit hash function would be good enough for many practical applications.
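As a rough sketch of the XOR scheme, assuming 64-bit integer elements: `mix` below is the SplitMix64 finalizer, used here only as a stand-in for h(x) (the answer suggests something like MurmurHash), and `probablySameElements` is a hypothetical name. One caveat of XOR worth knowing: equal values cancel in pairs, so {1, 1} and {2, 2} both reduce to 0 and are reported as a match.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for h(x): the SplitMix64 finalizer, a cheap non-cryptographic mixer.
inline std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Returns false when the vectors certainly differ as multisets,
// true when they probably (not certainly) contain the same elements.
inline bool probablySameElements(std::vector<std::uint64_t> const& u,
                                 std::vector<std::uint64_t> const& v) {
    if (u.size() != v.size()) return false;
    std::uint64_t hu = 0;
    std::uint64_t hv = 0;
    for (std::uint64_t x : u) hu ^= mix(x);  // XOR is order-independent
    for (std::uint64_t x : v) hv ^= mix(x);
    return hu == hv;
}
```

A `true` result could be confirmed with one of the exact methods when false positives are not acceptable.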

I notice that most proposed solutions involve sorting both input vectors. I think sorting computes more than is strictly necessary to decide whether the two vectors are equal (and if the input vectors are const, a copy needs to be made).
Another way would be to build an associative container that counts the elements of each vector. It's also possible to do the reduction of the two vectors in parallel; for very large vectors that could give a nice speed-up.
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>
template <typename T>
bool compareVector(const std::vector<T>& vec1, const std::vector<T>& vec2) {
if (vec1.size() != vec2.size())
return false;
// Here we assume that T is hashable (std::hash<T> exists)
std::unordered_map<T, int> count_set;
// Count each element up for vec1 and down for vec2
for (std::size_t i = 0; i < vec1.size(); ++i)
{
++count_set[vec1[i]];
--count_set[vec2[i]];
}
// If everything balances out, every count is zero
return std::all_of(count_set.begin(), count_set.end(),
[](const std::pair<const T, int>& p) { return p.second == 0; });
}
That way, depending on the performance of your hash function, we might get linear complexity in the length of both vectors (vs. n*log(n) with sorting).
NB: the code might have some bugs, I didn't have time to check it.
Benchmarking this counting approach against the sort-based comparison, I get on Ubuntu 13.10 (VMware, 3rd-gen Core i7):
Comparing 200 vectors of 500 elements by counting takes 0.184113 seconds
Comparing 200 vectors of 500 elements by sorting takes 0.276409 seconds
Comparing 200 vectors of 1000 elements by counting takes 0.359848 seconds
Comparing 200 vectors of 1000 elements by sorting takes 0.559436 seconds
Comparing 200 vectors of 5000 elements by counting takes 1.78584 seconds
Comparing 200 vectors of 5000 elements by sorting takes 2.97983 seconds

As others suggested, sorting your vectors beforehand will improve performance.
As an additional optimization you can make heaps out of the vectors to compare (with complexity O(n), instead of sorting with O(n*log(n))).
Afterwards you can pop elements from both heaps (complexity O(log(n)) per pop) until you get a mismatch.
This has the advantage that you only heapify, rather than fully sort, your vectors if they are not equal.
Below is a code sample. To know what is really fastest, you will have to measure with some sample data for your usecase.
#include <algorithm>
#include <vector>
typedef std::vector<int> myvector;
// Note: reorders both arguments in place when their sizes match
bool compare(myvector& l, myvector& r)
{
bool possibly_equal=l.size()==r.size();
if(possibly_equal)
{
std::make_heap(l.begin(),l.end());
std::make_heap(r.begin(),r.end());
for(int i=l.size();i!=0;--i)
{
possibly_equal=l.front()==r.front();
if(!possibly_equal)
break;
std::pop_heap(l.begin(),l.begin()+i);
std::pop_heap(r.begin(),r.begin()+i);
}
}
return possibly_equal;
}

If you use this function a lot on the same vectors, it might be better to keep sorted copies for comparison.
In theory it might even be better to sort the vectors and compare the sorted vectors even if each one is compared just once (sorting is O(n*log(n)) and comparing sorted vectors is O(n), while your function is O(n^2)).
But I suppose the time spent allocating memory for the sorted vectors will dwarf any theoretical gains if you don't compare the same vectors often.
As with all optimisations, profiling is the only way to make sure, I'd try some std::sort / std::equal combo.

Like stefan says, you need to sort to get better complexity.
Then you can use the == operator (thanks for the correction in the comments: std::equal will also work, but it is more appropriate for comparing ranges than entire containers).
Only if that is not fast enough should you bother with micro-optimization.
Also, are the vectors guaranteed to be of the same size?
If not, put that check at the beginning.

Another possible solution (viable only if all elements are unique), which should somewhat improve on the solution of @stefan (although the complexity would remain O(N log N)), is this:
template <class T>
static bool compareVectors(vector<T> a, const vector<T> & b)
{
// You should probably check this outside as it can
// avoid you the copy of a
if (a.size() != b.size()) return false;
std::sort(a.begin(), a.end());
for (const auto & v : b)
if ( !std::binary_search(a.begin(), a.end(), v) ) return false;
return true;
}
This should be faster since it performs the search directly as an O(NlogN) operation, instead of sorting b (O(NlogN)) and then searching both vectors (O(N)).

vector sort and erase won't work

When using this code to remove duplicates I get "invalid operands to binary expression" errors. I think this is down to using a vector of a struct, but I am not sure. I have Googled my question and I get this code over and over again, which suggests that it is right, but it isn't working for me.
std::sort(vec.begin(), vec.end());
vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
Any help will be appreciated.
EDIT:
fileSize = textFile.size();
vector<wordFrequency> words (fileSize);
int index = 0;
for(int i = 0; i <= fileSize - 1; i++)
{
for(int j = 0; j < fileSize - 1; j++)
{
if(string::npos != textFile[i].find(textFile[j]))
{
words[i].Word = textFile[i];
words[i].Times = index++;
}
}
index = 0;
}
sort(words.begin(), words.end());
words.erase(unique(words.begin(), words.end(), words.end()));

First problem.
unique used wrongly
unique(words.begin(), words.end(), words.end()));
You are calling the three operand form of unique, which takes a start, an end, and a predicate. The compiler will pass words.end() as the predicate, and the function expects that to be your comparison functor. Obviously, it isn't one, and you enter the happy world of C++ error messages.
Second problem.
either use the predicate form or define an ordering
See the definitions of sort and unique.
You can either provide a
bool operator< (wordFrequency const &lhs, wordFrequency const &rhs)
{
return lhs.val_ < rhs.val_;
}
, but only do this if a less-than operation makes sense for that type, i.e. if there is a natural ordering, and if it's not just arbitrary (maybe you want other sort orders in the future?).
In the general case, use the predicate forms. Note that sort needs a less-than predicate, while unique needs an equality predicate:
auto less = [](wordFrequency const &lhs, wordFrequency const &rhs)
{
return lhs.foo < rhs.foo;
};
auto equal = [](wordFrequency const &lhs, wordFrequency const &rhs)
{
return lhs.foo == rhs.foo;
};
sort (words.begin(), words.end(), less);
words.erase (unique (words.begin(), words.end(), equal), words.end());
If you can't use C++11, write a functor:
struct FreqAscending { // should make it adaptable with std::binary_function
bool operator() (wordFrequency const &lhs, wordFrequency const &rhs) const
{ ... };
};
I guess in your case ("frequency of words"), operator< makes sense.
Also note vector::erase: the single-iterator overload removes only the one element that the iterator points to. std::unique returns an iterator to the new logical end of the range, so pass words.end() as the second argument, as in the call above, to erase the whole leftover tail.
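For reference, a small self-contained sketch of the sort / unique / erase idiom on a struct. The `Word` and `Times` members are taken from the question's code; `removeDuplicateWords` is a hypothetical name, and comparing by `Word` only is an assumption about what counts as a duplicate.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct wordFrequency {  // members as used in the question
    std::string Word;
    int Times;
};

// Sorts by Word and keeps one entry per distinct Word.
inline void removeDuplicateWords(std::vector<wordFrequency>& words) {
    auto less = [](wordFrequency const& lhs, wordFrequency const& rhs) {
        return lhs.Word < rhs.Word;   // ordering, for std::sort
    };
    auto equal = [](wordFrequency const& lhs, wordFrequency const& rhs) {
        return lhs.Word == rhs.Word;  // equality, for std::unique
    };
    std::sort(words.begin(), words.end(), less);
    // unique returns the new logical end; erase everything from there
    // up to the old end.
    words.erase(std::unique(words.begin(), words.end(), equal), words.end());
}
```

Since std::sort is not stable, which of several equal-`Word` entries survives is unspecified; std::stable_sort would keep the first one from the input.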
Third problem.
If you only need top ten, don't sort
C++ comes with different sorting algorithms (based on this). For top 10, you can use:
nth_element: gives you the top elements without sorting them
partial_sort: gives you the top elements, sorted
This wastes less watts on your CPU, will contribute to overall desktop performance, and your laptop batteries last longer so can do even more sorts.
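A possible sketch of the top-k idea with std::partial_sort, again using the question's `Word` and `Times` members; `topWords` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct wordFrequency {  // members as used in the question
    std::string Word;
    int Times;
};

// Returns the k most frequent entries, most frequent first.
// Takes a copy so the caller's vector keeps its order.
inline std::vector<wordFrequency> topWords(std::vector<wordFrequency> words,
                                           std::size_t k) {
    k = std::min(k, words.size());
    auto middle = words.begin() + static_cast<std::ptrdiff_t>(k);
    // Only the first k positions end up sorted; the rest is left unordered.
    std::partial_sort(words.begin(), middle, words.end(),
                      [](wordFrequency const& lhs, wordFrequency const& rhs) {
                          return lhs.Times > rhs.Times;  // descending by count
                      });
    words.erase(middle, words.end());
    return words;
}
```

If the order among the top k does not matter, std::nth_element with the same comparator would be enough.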

The most probable answer is that operator< is not declared for the type of object vec contains. Have you overloaded it? It should look something like that:
bool operator<(const YourType& _a, const YourType& _b)
{
//... comparison check here
}

That code should work: std::unique returns an iterator to the new logical end of the range, i.e. the start of the leftover elements that erase then removes. What type does your vector contain? Perhaps you need to implement the equality operator (and operator< for std::sort).
