I have an std::vector of floats that I want to not contain duplicates, but the math that populates the vector isn't 100% precise. The vector has values that differ by a few hundredths but should be treated as the same point. For example, here are some values from one of them:
...
X: -43.094505
X: -43.094501
X: -43.094498
...
What would be the best/most efficient way to remove duplicates from a vector like this?
First sort your vector using std::sort. Then use std::unique with a custom predicate to identify the duplicates, and erase the tail it leaves behind:
std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end(),
                    // treats any numbers that differ by less than 0.01 as equal
                    [](double l, double r) { return std::abs(l - r) < 0.01; }),
        v.end());
Sorting is always a good first step: use std::sort().
Then remove the not-sufficiently-unique elements with std::unique() and a tolerance predicate.
As a last step, erase() the tail that std::unique leaves behind (or call resize()), and maybe also shrink_to_fit().
If you want to preserve the original order, do the previous three steps on a copy (omit the shrinking, though).
Then use std::remove_if with a lambda that binary-searches for each element in the copy, retains the element only if it is found, and removes it from the copy once found so that later duplicates are dropped, as sketched below.
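A minimal sketch of that order-preserving variant, assuming a 0.01 tolerance (the eps constant and the dedup_keep_order name are mine, for illustration, not from the original answer):
#include <algorithm>
#include <cmath>
#include <vector>

constexpr double eps = 0.01;  // assumed tolerance

void dedup_keep_order(std::vector<double>& v) {
    // Steps 1-3 on a copy: sort, unique-with-tolerance, erase the tail.
    std::vector<double> rep = v;
    std::sort(rep.begin(), rep.end());
    rep.erase(std::unique(rep.begin(), rep.end(),
                          [](double l, double r) { return std::abs(l - r) < eps; }),
              rep.end());

    // Keep an element only while its representative is still in the copy;
    // erase the representative once used so later duplicates are removed.
    v.erase(std::remove_if(v.begin(), v.end(),
                [&rep](double x) {
                    auto it = std::lower_bound(rep.begin(), rep.end(), x - eps);
                    if (it != rep.end() && std::abs(*it - x) < eps) {
                        rep.erase(it);
                        return false;  // first occurrence: keep
                    }
                    return true;       // representative already consumed: remove
                }),
            v.end());
}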
I say std::sort() it, then go through it one by one and remove the values within a certain margin.
For better performance and lower memory usage, you can keep a separate write position into the same vector and do a single resize() at the end, instead of calling erase() for each removed element or writing to a separate destination copy. For example:
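A sketch of that single-pass approach (again assuming a 0.01 tolerance):
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

void dedup_in_place(std::vector<double>& v) {
    std::sort(v.begin(), v.end());
    std::size_t write = 0;                 // next slot to write
    for (std::size_t read = 0; read < v.size(); ++read) {
        // Keep the value if it is the first, or far enough from the last kept one.
        if (write == 0 || std::abs(v[read] - v[write - 1]) >= 0.01)
            v[write++] = v[read];
    }
    v.resize(write);                       // one resize at the end, no per-element erase()
}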
If your vector cannot contain duplicates, it may be more appropriate to use an std::set. You can then use a custom comparison object that treats small differences as inconsequential. For example:
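A sketch of such a comparison object, with the 0.01 tolerance assumed. Note that "equal within a tolerance" is not transitive, so this comparator is not formally a strict weak ordering; it behaves well only when the clusters of near-equal values are well separated:
#include <set>
#include <vector>

// Values within 0.01 of each other compare "equivalent", so the set keeps only one.
struct ApproxLess {
    bool operator()(double a, double b) const { return a < b - 0.01; }
};

int main() {
    std::vector<double> v = {-43.094505, -43.094501, -43.094498};
    std::set<double, ApproxLess> points(v.begin(), v.end());  // ends up with one element
}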
Hi, you could compare like this:
bool isAlmostEquals(const double &f1, const double &f2)
{
    double allowedDif = xxxx;              // pick a tolerance suited to your data
    return (std::abs(f1 - f2) <= allowedDif); // std::abs from <cmath>
}
but it depends on your comparison range, and double precision is not on your side.
If your vector is sorted, you could use std::unique with this function as the predicate.
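For instance, a short sketch combining the two (v being the vector from the question, sorted first so near-equal values are adjacent):
std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end(), isAlmostEquals), v.end());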
I would do the following:
Create a set<double>
Go through your vector in a loop or using a functor
Round each element and insert it into the set
Then you can swap your vector with an empty vector
Copy all elements from the set back into the empty vector
The complexity of this approach is O(n log n), but it's simple and can be done in a few lines of code. Memory consumption roughly doubles compared to just storing the vector, and a set uses slightly more memory per element than a vector, but the set is destroyed after use.
std::vector<double> v;
v.push_back(-43.094505);
v.push_back(-43.094501);
v.push_back(-43.094498);
v.push_back(-45.093435);

std::set<double> s;
for (std::vector<double>::const_iterator it = v.begin(); it != v.end(); ++it)
    s.insert(std::round(*it * 100.0) / 100.0); // round to two decimals; plain floor(*it) would collapse to whole numbers

std::vector<double>().swap(v); // note: v.swap(std::vector<double>()) does not compile, since swap takes an lvalue reference
v.resize(s.size());
std::copy(s.begin(), s.end(), v.begin());
The problem with most answers so far is that you have an unusual "equality". If A and B are similar but not identical, you want to treat them as equal. Basically, A and A+epsilon still compare as equal, but A+2*epsilon does not (for some unspecified epsilon). Or, depending on your algorithm, A*(1+epsilon) does and A*(1+2*epsilon) does not.
That does mean that A+epsilon compares equal to A+2*epsilon. Thus A = B and B = C does not imply A = C. This breaks common assumptions in <algorithm>.
You can still sort the values; that is a sane thing to do. But you have to consider what to do with a long run of similar values in the result. If the run is long enough, the difference between the first and last can still be large: with a 0.01 tolerance, for example, values spaced 0.009 apart each compare equal to their neighbors, yet the ends of the run may differ by far more than 0.01. There's no simple answer.
Related
I have a std::vector<double> that's the output from a simulation code. The size can be anywhere from O(10^1) to O(10^4). I need to create a new vector that's a copy of this vector with an additional element at the beginning, so I can either write:
// old_vec is some std::vector<double> from a simulation code
auto new_vec = old_vec;
double val = 1.0;
new_vec.insert(new_vec.begin(), val);
or
std::vector<double> new_vec{val};
new_vec.insert(new_vec.end(), old_vec.begin(), old_vec.end());
I believe the first approach will cause a reallocation due to the insertion at the beginning of a vector, whereas the second one will just append everything to the end, so the latter seems better. Is there any chance the compiler will optimize the first version into the second?
I wouldn't rely on directly using the "=" operator to copy the vector; instead, use a combination of your two methods. Construct from the iterator range first, then use insert() to add the first element:
std::vector<double> new_vec(old_vec.begin(), old_vec.end());
new_vec.insert(new_vec.begin(), val);
Behavior may vary across compilers, so you may or may not see a problem. However, if you would like a foolproof way, that would be outright inserting and copying:
std::vector<double> new_vec;
new_vec.push_back(val);
for (double i : old_vec) { new_vec.push_back(i); }
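For what it's worth, a sketch that avoids both the element shifting of inserting at the beginning and any reallocation, by reserving the final size up front (old_vec and val as in the question):
std::vector<double> new_vec;
new_vec.reserve(old_vec.size() + 1);  // one allocation, no later reallocation
new_vec.push_back(val);               // the prepended element goes in first
new_vec.insert(new_vec.end(), old_vec.begin(), old_vec.end());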
So I have some code like this; I want to sort the vector based on id and put the last overridden element first:
struct Data {
    int64_t id;
    double value;
};

std::vector<Data> v;
// add some Datas to v
// add some 'override' Datas with duplicated `id`s

std::sort(v.begin(), v.end(),
          [](const Data& a, const Data& b) {
              if (a.id < b.id) {
                  return true;
              } else if (b.id < a.id) {
                  return false;
              }
              return &a > &b;
          });
Since vectors are contiguous, &a > &b should work to put the appended overrides first in the sorted vector, which should be equivalent to using std::stable_sort, but I am not sure if there is a state in the std::sort implementation where the equal values would be swapped such that the address of an element that appeared later in the original vector is earlier now. I don't want to use stable_sort because it is significantly slower for my use case. I have also considered adding a field to the struct that keeps track of the original index, but I will need to copy the vector for that.
It seems to work here: https://onlinegdb.com/Hk8z1giqX
std::sort gives no guarantees whatsoever on when elements are compared, and in practice, I strongly suspect most implementations will misbehave for your comparator.
The common std::sort implementation is either plain quicksort or a hybrid sort (quicksort switching to a different sort for small ranges), implemented in-place to avoid using extra memory. As such, the comparator will be invoked with the same element at different memory addresses as the sort progresses; you can't use memory addresses to implement a stable sort.
Either add the necessary info to make the sort innately stable (e.g. the suggested initial index value) or use std::stable_sort. Using memory addresses to stabilize the sort won't work.
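A sketch of the decorated approach (the IndexedData wrapper and its field names are illustrative; it assumes, as in the question, that the override entries were appended after the originals):
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct IndexedData {
    Data d;            // Data as defined in the question
    std::size_t orig;  // position in the original vector
};

void sort_overrides_first(const std::vector<Data>& v, std::vector<IndexedData>& w) {
    w.clear();
    w.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        w.push_back({v[i], i});
    std::sort(w.begin(), w.end(), [](const IndexedData& a, const IndexedData& b) {
        if (a.d.id != b.d.id) return a.d.id < b.d.id;
        return a.orig > b.orig;  // later (override) entries sort first
    });
}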
For the record, having experimented a bit, I suspect your test case is too small to trigger the issue. At a guess, the hybrid sorting strategy works coincidentally for smallish vectors, but breaks down when the vector gets large enough for an actual quicksort to occur. Once I increase your vector size with some more filler, the stability disappears. Try it online!
I am relatively new to C++ and I have a little problem. I have a vector, and in that vector are vectors with 3 integers.
Each inner vector represents one person. The 3 integers inside are distance from start, velocity, and original index (because the integers aren't sorted in the input, and in the output I need to print the original index, not the index in this sorted vector).
Now I am given some points representing distance from start, and I need to find which person will be first at that point. My first step would be to find the closest person to the given point, so basically I need lower_bound/upper_bound.
How can I use lower_bound if I want to find the lower_bound of the first item in the inner vectors? Or should I use a struct/class instead of inner vectors?
You would use the version of std::lower_bound which takes a custom comparator (the versions marked "(2)" at the link), and you would write a comparator that compares the inner vectors by their first item (or whatever other way you like); see the sketch below.
However:
As #doctorlove points out, std::lower_bound doesn't compare the vectors to each other; it compares them to a given value (be it a vector or a scalar). So it's possible you actually want to do something else.
It's usually not a good idea to keep fixed-length sequences of elements in std::vector's. Have you considered std::array?
It's very likely that your "vectors with 3 integers" actually stand for something else, e.g. points in a 3-dimensional geometric space; in which case, yes, they should be in some sort of class.
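A sketch of what that comparator might look like, assuming each inner vector is {distance, velocity, original index} and the outer vector is sorted by distance (the sample data and query point are made up):
#include <algorithm>
#include <vector>

int main() {
    // {distance, velocity, original index}, sorted by distance
    std::vector<std::vector<int>> people = {{5, 2, 1}, {10, 3, 0}, {20, 1, 2}};
    int point = 10;  // hypothetical query point

    // Compare inner vectors to the scalar query by their first item only.
    auto it = std::lower_bound(people.begin(), people.end(), point,
        [](const std::vector<int>& person, int value) {
            return person[0] < value;
        });
    // it now points at the first person whose distance is >= point
}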
I am not sure that your inner things should be std::vector-s of 3 elements.
I believe they should be std::array-s of 3 elements (because you know that the size is 3 and won't change).
So you probably want to have
typedef std::array<double,3> element_ty;
then use std::vector<element_ty> and for the rest (your lower_bound point) do like in einpoklum's answer.
BTW, you probably want to use std::min_element with an explicit compare.
Maybe you want something like:
std::vector<element_ty> vec;
auto minit =
    std::min_element(vec.begin(), vec.end(),
                     [](const element_ty& x, const element_ty& y) {
                         return x[0] < y[0];
                     });
Suppose you have 2 vectors, say v1 and v2, with the following values:
v1 = {8,4,9,9,1,3};
v2 = {9,4,3,8,1,9};
What is the most STL-like approach to check if they are "equal"? I am defining "equal" to mean the contents are the same regardless of order. I would prefer to do this without sorting.
I was leaning towards building two std::map<double, int> to count up each of the vector's elements.
All I need is a boolean Yes/No from the algorithm.
What say you?
Other conversations on Stack Overflow resort to sorting the vectors; I'd prefer to avoid that. Hence this new thread.
I was leaning towards building two std::map to count up each of the vector's elements.
This will be far slower than just creating sorted vectors. (Note also that std::map is powered by sorting; it just does so using red-black trees or AVL trees) Maps are data structures optimized for an even mix of inserts and lookups; but your use case is a whole bunch of inserts followed by a whole bunch of lookups with no overlap.
I would just sort the vectors (or make copies and sort those, if you are not allowed to destroy the source copies) and then use vector's built in operator ==.
Sorting the vectors and then comparing (or calling std::set_difference, if you also need the differing elements) is still the best way. Comparing two unsorted arrays directly would be even more expensive than the copy. If you want the current arrays untouched, make copies and sort those:
bool sameContents(const std::vector<int>& v1, const std::vector<int>& v2)
{
    // cheap size check before the copy/sort heavy work
    if (v1.size() != v2.size())
        return false;

    std::vector<int> v3(v1);
    std::vector<int> v4(v2);
    std::sort(v3.begin(), v3.end());
    std::sort(v4.begin(), v4.end());
    return v3 == v4;
}
I assume for some reason you can't sort the vectors, most likely because you still need them in their original order or they're expensive to copy. Otherwise, just sort them.
Create a "view" into each vector that allows you to see the vector in any order. You can do this with a vector of pointers that starts out pointing to the elements in order. Then sort the two views, producing a sorted view into each vector. Then compare the two views, comparing the two vectors in their view order. This avoids sorting the vectors themselves.
I was originally thinking of working in terms of sets, since that's what you're actually thinking in terms of, but that would necessitate sorting. This can be done in O(n) on average by converting both vectors to hash maps and checking for equality there, as sketched below.
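A sketch of the hash-map version, counting occurrences so duplicates are handled correctly (the function name is illustrative):
#include <unordered_map>
#include <vector>

bool sameMultiset(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) return false;
    std::unordered_map<int, int> counts;
    for (int x : a) ++counts[x];   // count up from a
    for (int x : b)
        if (--counts[x] < 0)       // count down from b
            return false;          // b has more of x than a does
    return true;                   // equal sizes, so all counts balance out
}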
Just take the first vector and compare it with each element in the second vector.
If one value from the first cannot be found in the second, the vectors are different.
In the worst case this takes O(n*m) time, where n is the size of the first vector and m is the size of the second.
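That check could be written with standard algorithms like this (note that, as written, it does not account for duplicate counts, e.g. {9,9,1} versus {9,1,1}):
#include <algorithm>
#include <vector>

bool allFoundInOther(const std::vector<int>& v1, const std::vector<int>& v2) {
    // O(n*m): for each element of v1, scan v2 for it.
    return std::all_of(v1.begin(), v1.end(), [&v2](int x) {
        return std::find(v2.begin(), v2.end(), x) != v2.end();
    });
}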
This util method will help you compare two int[] arrays (Java; Ints.asList comes from Guava). Let me know in case of any issues:
public static boolean compareArray(int[] v1, int[] v2) {
    // different lengths (or empty input) can never match
    if (v1.length != v2.length || v1.length == 0)
        return false;
    List<Integer> intList = Ints.asList(v2);
    for (int element : v1) {
        if (!intList.contains(element)) {
            return false;
        }
    }
    return true;
}
I'm currently developing stochastic optimization algorithms and have encountered the following issue (which I imagine also appears in other places). It could be called a totally unstable partial sort:
Given a container of size n and a comparator such that entries may be equally valued:
Return the best k entries, but if values are equal, it should be (nearly) equally probable to receive any of them.
(Output order is irrelevant to me, i.e. equal values entirely within the best k need not be shuffled. Having all equal values shuffled is, however, a related and interesting question, and would suffice!)
A very (!) inefficient way would be to use std::random_shuffle and then partial_sort, but one actually only needs to shuffle the block of equally valued entries "at the selection border" (or all blocks of equally valued entries; both are much faster). Maybe that observation is where to start...
I would very much prefer it if someone could provide a solution using STL algorithms (at least to a large extent), both because they're usually very fast, well encapsulated, and OMP-parallelized.
Thanks in advance for any ideas!
You want to partial_sort first. Then, while elements are not equal, return them. If you meet a sequence of equal elements that is larger than the remaining k, shuffle it and return the first k; otherwise return all of them and continue.
Not fully understanding your issue, but if it were me solving it (if I am reading it correctly)...
Since it appears you will have to traverse the given object anyway, you might as well build a copy of it for your results, sort it upon insert, and randomize your "equal" items as you insert.
In other words, copy the items from the given container into a new list, but overload the comparison operator to create a B-tree, and if two items are equal on insert, randomly choose to insert one before or after the other.
This way it's optimally traversed (since it's a tree) and you get a random order of the equal items each time the list is built.
It's double the memory, but I was reading this as you didn't want to alter the original list. If you don't care about losing the original, delete each item from the original as you insert into your new list. The worst traversal will be the first time you call your function, since the passed-in list might be unsorted. But since you are replacing the list with your sorted copy, future runs should be much faster, and you can pick a better pivot point for your tree by assigning the root node as the element at length() / 2.
Hope this is helpful, sounds like a neat project. :)
If you really mean that output order is irrelevant, then you want std::nth_element, rather than std::partial_sort, since it is generally somewhat faster. Note that std::nth_element puts the nth element in the right position, so you can do the following, which is 100% standard algorithm invocations (warning: not tested very well; fencepost error possibilities abound):
template<typename RandomIterator, typename Compare>
void best_n(RandomIterator first,
            RandomIterator nth,
            RandomIterator limit,
            Compare cmp) {
    using ref = typename std::iterator_traits<RandomIterator>::reference;
    std::nth_element(first, nth, limit, cmp);
    auto p = std::partition(first, nth, [&](ref a) { return cmp(a, *nth); });
    auto q = std::partition(nth + 1, limit, [&](ref a) { return !cmp(*nth, a); });
    std::random_shuffle(p, q);  // See note
}
The function takes three iterators, like nth_element, where nth is an iterator to the nth element, which means that it is begin() + (n - 1).
Edit: Note that this is different from most STL algorithms, in that it is effectively an inclusive range. In particular, it is UB if nth == limit, since it is required that *nth be valid. Furthermore, there is no way to request the best 0 elements, just as there is no way to ask for the 0th element with std::nth_element. You might prefer it with a different interface; do feel free to do so.
Or you might call it like this, after requiring that 0 < k <= n:
best_n(container.begin(), container.begin()+(k-1), container.end(), cmp);
It first uses nth_element to put the "best" k elements in positions 0..k-1, guaranteeing that the kth element (or one of them, anyway) is at position k-1. It then repartitions the elements preceding position k-1 so that the equal elements are at the end, and the elements following position k-1 so that the equal elements are at the beginning. Finally, it shuffles the equal elements.
nth_element is O(n); the two partition operations sum up to O(n); and random_shuffle is O(r) where r is the number of equal elements shuffled. I think that all sums up to O(n) so it's optimally scalable, but it may or may not be the fastest solution.
Note: You should use std::shuffle instead of std::random_shuffle, passing a uniform random number generator through to best_n. But I was too lazy to write all the boilerplate to do that and test it. Sorry.
If you don't mind sorting the whole list, there is a simple answer: randomize the result in your comparator for equivalent elements. (Be aware that a comparator returning random results does not satisfy std::sort's strict-weak-ordering requirement, so the standard gives no guarantees for this.)
std::sort(validLocations.begin(), validLocations.end(),
          [&](const Point& i_point1, const Point& i_point2)
          {
              if (i_point1.mX == i_point2.mX)
              {
                  // Rand(1.0f) is assumed to return a uniform float in [0, 1).
                  return Rand(1.0f) < 0.5f;
              }
              else
              {
                  return i_point1.mX < i_point2.mX;
              }
          });