How do I remove duplicates from a C++ array?

I have an array of structs; the array is of size N.
I want to remove duplicates from the array; that is, do an in-place change, converting the array to have a single appearance of each struct. Additionally, I want to know the new size M (highest index in the reduced array).
The structs include primitives so it's trivial to compare them.
How can I do that efficiently in C++?
I have implemented the following operators:
bool operator==(const A &rhs1, const A &rhs2)
{
    return ( ( rhs1.x == rhs2.x ) &&
             ( rhs1.y == rhs2.y ) );
}
bool operator<(const A &rhs1, const A &rhs2)
{
    if ( rhs1.x == rhs2.x )
        return ( rhs1.y < rhs2.y );
    return ( rhs1.x < rhs2.x );
}
However, I get an error when running:
std::sort(array, array+ numTotalAvailable);
* array will have all elements here valid.
std::unique_copy(
array,
array+ numTotalAvailable,
back_inserter(uniqueElements));
* uniqueElements will have non-valid elements.
What is wrong here?

You could use a combination of the std::sort and std::unique algorithms to accomplish this:
std::sort(elems.begin(), elems.end()); // Now in sorted order.
auto itr = std::unique(elems.begin(), elems.end()); // Duplicates overwritten
elems.erase(itr, elems.end()); // Space reclaimed
If you are working with a raw array (not, say, a std::vector), then you can't actually reclaim the space without copying the elements over to a new range. However, if you're okay starting off with a raw array and ending up with something like a std::vector or std::deque, you can use unique_copy and an iterator adapter to copy over just the unique elements:
std::sort(array, array + size); // Now in sorted order
std::vector<T> uniqueElements;
std::unique_copy(array, array + size,
back_inserter(uniqueElements)); // Append unique elements
At this point, uniqueElements now holds all the unique elements.
Finally, to more directly address your initial question: if you want to do this in-place, you can get the answer by using the return value from unique to determine how many elements remain:
std::sort(elems, elems + N); // Now in sorted order.
T* endpoint = std::unique(elems, elems + N); // Duplicates overwritten
std::ptrdiff_t M = endpoint - elems; // Find number of elements left
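Putting the in-place variant together for a struct like the one in the question, here is a minimal sketch. The struct A with int members x and y and the helper name removeDuplicates are assumptions for illustration; the question only shows the two operators.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical struct standing in for the question's A.
struct A {
    int x;
    int y;
};

bool operator==(const A& lhs, const A& rhs) {
    return lhs.x == rhs.x && lhs.y == rhs.y;
}

bool operator<(const A& lhs, const A& rhs) {
    if (lhs.x == rhs.x) return lhs.y < rhs.y;
    return lhs.x < rhs.x;
}

// Sorts array[0..n) and compacts the unique elements to the front.
// Returns M, the number of unique elements. The elements in
// array[M..n) are left in a valid but unspecified state.
std::size_t removeDuplicates(A* array, std::size_t n) {
    std::sort(array, array + n);
    A* end = std::unique(array, array + n);
    return static_cast<std::size_t>(end - array);
}
```

Note that M here is a count, so the highest valid index in the reduced array is M - 1.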
Hope this helps!

std::set<T> uniqueItems(v.begin(), v.end());
Now uniqueItems contains only the unique items. Do whatever you want to do with it. Maybe, you would like v to contain all the unique items. If so, then do this:
//assuming v is std::vector<T>
std::vector<T>(uniqueItems.begin(), uniqueItems.end()).swap(v);
Now v contains all the unique items. The swap also shrinks v's capacity to the minimum needed; this is the shrink-to-fit idiom.

You could use the flyweight pattern. The easiest way to do so would be to use the Boost Flyweight library.
Edit: I'm not sure whether there is a way to find out how many objects are stored by the Boost Flyweight implementation; if there is, I can't find it in the documentation.

An alternative approach to applying algorithms to your array would be to insert its elements in a std::set. Whether it is reasonable to do it this way depends on how you plan to use your items.

Related

Fast search and delete in a std::list of objects

I have a very large list of objects (nodes), and I want to be able to remove/delete elements of the list based on a set of values inside of them.
Preferably in constant time...
The objects (among other things) have values like:
long long int nodeID;
int depth;
int numberOfClusters;
double [] points;
double [][] clusters;
What I need to do is to look through the list, and check if there are any elements that have the same values in all fields except for nodeID.
Right now I'm doing something like this:
for(i = nodes.begin(); i != nodes.end(); i++)
{
    for(j = nodes.begin(); j != nodes.end(); j++)
    {
        if(i != j)
        {
            if(compareNodes((*i), (*j)))
            {
                j = nodes.erase (j);
            }
        }
    }
}
Where compareNodes() compares the values inside the two nodes. But this is wildly inefficient.
I'm using erase because that seems to be the only way to delete an element in the middle of a std::list.
Optimally, I would like to be able to find an element based on these values, and remove it from the list if it exists.
I am thinking some sort of hash map to find the element (a pointer to the element) in constant time, but even if I can do that, I can't find a way to remove the element without iterating through the list.
It seems that I have to use erase, but that requires iterating through the list, which means linear complexity in the list size.
There is also remove_if, but again it has the same problem: linear complexity in the list size.
Is there no way to remove an element from a std::list without iterating through the whole list?

First off, you can speed up your existing solution by starting j at std::next(i) instead of nodes.begin() (assuming your compareNodes function is symmetric, i.e. compareNodes(a, b) gives the same result as compareNodes(b, a)).
Second, the hashmap approach sounds viable. But why keep a pointer to the element as a value in the map, when you can keep an iterator? They're both "a thing which references the element," but you can use the iterator to erase the element. And std::list iterators aren't invalidated when the list is modified, except for iterators to the erased element itself (they're most probably just pointers under the hood).
Thirdly, if you want to encapsulate/automate the lookup & sequential access, you can look into Boost.Multi-index to build a container with both sequential and hashed access.
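As a rough sketch of the iterator-in-a-hash-map idea: the Node type below is deliberately simplified, and its int key is a hypothetical stand-in for whatever combination of fields (everything except nodeID) identifies a node. The class name NodeList is also just for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>

// Hypothetical, simplified node: `key` stands in for the fields that
// decide whether two nodes count as duplicates.
struct Node {
    long long nodeID;
    int key;
};

class NodeList {
public:
    // Appends a node unless one with the same key is already stored.
    // Returns true if the node was inserted.
    bool insert(const Node& node) {
        if (index_.count(node.key) != 0) return false;
        nodes_.push_back(node);
        index_[node.key] = std::prev(nodes_.end());
        return true;
    }

    // Removes the node with the given key in average constant time.
    bool erase(int key) {
        auto found = index_.find(key);
        if (found == index_.end()) return false;
        nodes_.erase(found->second);  // O(1) given the iterator
        index_.erase(found);
        return true;
    }

    std::size_t size() const { return nodes_.size(); }

private:
    std::list<Node> nodes_;
    std::unordered_map<int, std::list<Node>::iterator> index_;
};
```

Because duplicates are rejected on insertion, the quadratic comparison loop is not needed at all in this arrangement.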

How to remove almost duplicates from a vector in C++

I have an std::vector of floats that I want to not contain duplicates but the math that populates the vector isn't 100% precise. The vector has values that differ by a few hundredths but should be treated as the same point. For example here's some values in one of them:
...
X: -43.094505
X: -43.094501
X: -43.094498
...
What would be the best/most efficient way to remove duplicates from a vector like this.

First sort your vector using std::sort. Then use std::unique with a custom predicate to move the duplicates to the end, and erase them.
v.erase(std::unique(v.begin(), v.end(),
            [](double l, double r) { return std::abs(l - r) < 0.01; }),
        v.end());
// treats any numbers that differ by less than 0.01 as equal

Sorting is always a good first step. Use std::sort().
Remove elements that are not sufficiently unique: std::unique().
Last step, call resize() and maybe also shrink_to_fit().
If you want to preserve the order, do the previous 3 steps on a copy (omit shrinking though).
Then use std::remove_if on the original with a lambda that looks each element up in the copy (binary search): retain the element only if it is found, and don't forget to remove it from the copy once found.

I say std::sort() it, then go through it one by one and remove the values within certain margin.
For better performance and lower memory usage, keep a separate write iterator into the same vector and do a single resize at the end, instead of calling erase() for each removed element or copying into a second vector.
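A minimal sketch of that single-pass compaction, assuming a vector of floats as in the question. The function name removeNearDuplicates and the tolerance parameter are made up for illustration, and comparing against the last kept value (rather than the previous neighbour) is one possible choice, not the only one.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sorts v, then keeps a value only if it differs from the last kept
// value by more than `tolerance` (a hypothetical parameter).
void removeNearDuplicates(std::vector<float>& v, float tolerance) {
    if (v.empty()) return;
    std::sort(v.begin(), v.end());

    std::size_t write = 0;  // index of the last kept element
    for (std::size_t read = 1; read < v.size(); ++read) {
        if (std::abs(v[read] - v[write]) > tolerance) {
            v[++write] = v[read];
        }
    }
    v.resize(write + 1);  // single resize instead of repeated erase()
}
```

Comparing against the last kept value means every dropped value lies within the tolerance of the value that represents it.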

If your vector cannot contain duplicates, it may be more appropriate to use an std::set. You can then use a custom comparison object to consider small changes as being inconsequential.

Hi, you could compare like this:
bool isAlmostEquals(const double &f1, const double &f2)
{
    double allowedDif = xxxx;
    return (std::abs(f1 - f2) <= allowedDif); // std::abs from <cmath>, so the double overload is used
}
but it depends on the range of values you compare, and double precision is not on your side.
If your vector is sorted, you could use std::unique with this function as the predicate.

I would do the following:
Create a set<double>
go through your vector in a loop or using a functor
Round each element and insert into the set
Then you can swap your vector with an empty vector
Copy all elements from the set to the empty vector
The complexity of this approach is O(n log n), but it's simple and can be done in a few lines of code. Memory consumption roughly doubles compared with storing just the vector, and a set uses somewhat more memory per element than a vector. However, you destroy the set after using it.
std::vector<double> v;
v.push_back(-43.094505);
v.push_back(-43.094501);
v.push_back(-43.094498);
v.push_back(-45.093435);
std::set<double> s;
std::vector<double>::const_iterator it = v.begin();
for(;it != v.end(); ++it)
    s.insert(std::floor(*it)); // rounds down to a whole number
std::vector<double>().swap(v); // swap with an empty temporary to clear v
v.resize(s.size());
std::copy(s.begin(), s.end(), v.begin());

The problem with most answers so far is that you have an unusual "equality". If A and B are similar but not identical, you want to treat them as equal. Basically, A and A+epsilon still compare as equal, but A+2*epsilon does not (for some unspecified epsilon). Or, depending on your algorithm, A*(1+epsilon) does and A*(1+2*epsilon) does not.
That does mean that A+epsilon compares equal to A+2*epsilon. Thus A = B and B = C does not imply A = C. This breaks common assumptions in <algorithm>.
You can still sort the values, that is a sane thing to do. But you have to consider what to do with a long range of similar values in the result. If the range is long enough, the difference between the first and last can still be large. There's no simple answer.
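To make the broken transitivity concrete, here is a small sketch with a hypothetical almostEqual helper that takes the tolerance as an explicit parameter.

```cpp
#include <cmath>

// Hypothetical tolerance-based comparison with an explicit epsilon.
bool almostEqual(double a, double b, double epsilon) {
    return std::abs(a - b) <= epsilon;
}
```

With epsilon = 0.5, almostEqual(1.0, 1.5, 0.5) and almostEqual(1.5, 2.0, 0.5) both hold, yet almostEqual(1.0, 2.0, 0.5) does not, so a chain of "equal" neighbours can span an arbitrarily large range.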

checking for difference between two vector<T>

Suppose you have 2 vectors say v1 and v2 with the following values:
v1 = {8,4,9,9,1,3};
v2 = {9,4,3,8,1,9};
What is the most STL approach to check if they are "equal"? I am defining "equal" to mean the contents are the same regardless of the order. I would prefer to do this without sorting.
I was leaning towards building two std::map<double, int> to count up each of the vector's elements.
All, I need is a boolean Yes/No from the algorithm.
What say you?
Other conversations on Stack Overflow resort to sorting the vectors, I'd prefer to avoid that. Hence this new thread.

I was leaning towards building two std::map to count up each of the vector's elements.
This will be far slower than just creating sorted vectors. (Note also that std::map is powered by sorting; it just does so using red-black trees or AVL trees.) Maps are data structures optimized for an even mix of inserts and lookups; but your use case is a whole bunch of inserts followed by a whole bunch of lookups with no overlap.
I would just sort the vectors (or make copies and sort those, if you are not allowed to destroy the source copies) and then use vector's built in operator ==.

Sorting the vectors and calling set_difference is still the best way.
If copying is too expensive for you, note that comparing two unsorted arrays is even worse.
If you want the current arrays left untouched, you can sort copies of them:
std::vector<int> v1 = {8,4,9,9,1,3};
std::vector<int> v2 = {9,4,3,8,1,9};
// cheap size check before the heavy copy/sort work
if (v1.size() != v2.size()) {
    return false;
}
std::vector<int> v3(v1);
std::vector<int> v4(v2);
std::sort(v3.begin(), v3.end());
std::sort(v4.begin(), v4.end());
return v3 == v4;

I assume for some reason you can't sort the vectors, most likely because you still need them in their original order or they're expensive to copy. Otherwise, just sort them.
Create a "view" into each vector that allows you to see the vector in any order. You can do this with a vector of pointers that starts out pointing to the elements in order. Then sort the two views, producing a sorted view into each vector. Then compare the two views, comparing the two vectors in their view order. This avoids sorting the vectors themselves.
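The view idea could look roughly like this for vectors of int. The helper names sortedView and sameContents are invented for this sketch.

```cpp
#include <algorithm>
#include <vector>

// Builds a vector of pointers to the elements of v, sorted by the
// values they point to. v itself is not modified.
std::vector<const int*> sortedView(const std::vector<int>& v) {
    std::vector<const int*> view;
    view.reserve(v.size());
    for (const int& element : v) view.push_back(&element);
    std::sort(view.begin(), view.end(),
              [](const int* a, const int* b) { return *a < *b; });
    return view;
}

// Compares the two vectors element by element in sorted-view order.
bool sameContents(const std::vector<int>& v1, const std::vector<int>& v2) {
    if (v1.size() != v2.size()) return false;
    const std::vector<const int*> a = sortedView(v1);
    const std::vector<const int*> b = sortedView(v2);
    return std::equal(a.begin(), a.end(), b.begin(),
                      [](const int* x, const int* y) { return *x == *y; });
}
```

For small element types such as int this saves nothing over sorting copies; it only pays off when the elements are expensive to copy or move.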

Was originally thinking of working in terms of sets since that's what you're actually thinking in terms of but that does necessitate sorting. This can be done in O(n) by converting both to hashmaps and checking for equality there.
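One way to sketch the hash-based check is to count the elements of the first vector in a single std::unordered_map and then consume those counts with the second vector. The function name sameContentsHashed is an assumption, and the O(n) bound is an average-case figure that depends on the hash behaving well.

```cpp
#include <unordered_map>
#include <vector>

// Returns true if v1 and v2 hold the same values with the same
// multiplicities, regardless of order.
bool sameContentsHashed(const std::vector<int>& v1, const std::vector<int>& v2) {
    if (v1.size() != v2.size()) return false;

    std::unordered_map<int, int> counts;
    for (int value : v1) ++counts[value];

    for (int value : v2) {
        auto found = counts.find(value);
        if (found == counts.end() || found->second == 0) return false;
        --found->second;
    }
    return true;  // sizes match and every element of v2 was accounted for
}
```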

Just take each element of the first vector and look for it in the second vector.
If a value from the first one can't be found in the second, the vectors are different.
In the worst case this takes O(n*m) time, where n is the size of the first vector and m is the size of the second. Note that a plain membership test like this does not account for how many times a value occurs.

This utility method (Java, using Guava's Ints.asList) will help you compare two int[] arrays; let me know in case of any issues.
public static boolean compareArray(int[] v1, int[] v2) {
    // arrays of different length, or empty arrays, are treated as not equal
    if (v1.length != v2.length)
        return false;
    if (v1.length == 0 || v2.length == 0)
        return false;
    List<Integer> intList = Ints.asList(v2);
    for (int element : v1) {
        if (!intList.contains(element))
            return false;
    }
    return true;
}

Using boost::random to select from an std::list where elements are being removed

See this related question on more generic use of the Boost Random library.
My question involves selecting a random element from an std::list, doing some operation, which could potentially include removing the element from the list, and then choosing another random element, until some condition is satisfied.
The boost code and for loop look roughly like this:
// create and insert elements into list
std::list<MyClass> myList;
//[...]
// select uniformly from list indices
boost::uniform_int<> indices( 0, myList.size()-1 );
boost::variate_generator< boost::mt19937, boost::uniform_int<> >
selectIndex(boost::mt19937(), indices);
for( int i = 0; i <= maxOperations; ++i ) {
int index = selectIndex();
MyClass & mc = myList.begin() + index;
// do operations with mc, potentially removing it from myList
//[...]
}
My problem is as soon as the operations that are performed on an element result in the removal of an element, the variate_generator has the potential to select an invalid index in the list. I don't think it makes sense to completely recreate the variate_generator each time, especially if I seed it with time(0).

I assume that MyClass & mc = myList.begin() + index; is just pseudo code, as begin returns an iterator and I don't think list iterators (non-random-access) support operator+.
As far as I can tell, with variate generator your three basic options in this case are:
Recreate the generator when you remove an item.
Do filtering on the generated index and if it's >= the current size of the list, retry until you get a valid index. Note that if you remove a lot of indexes this could get pretty inefficient as well.
Leave the node in the list but mark it invalid so if you try to operate on that index it safely no-ops. This is just a different version of the second option.
Alternately you could devise a different index generation algorithm that's able to adapt to the container changing size.
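As one possible shape for an index generator that adapts to the container's size, the sketch below keeps a single engine alive and builds only the cheap distribution object from the current size on every call. It uses the standard <random> header rather than Boost.Random, which is a substitution made for this example; the helper name pickRandom is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <random>

// Picks a uniformly random element of a non-empty list. The range of
// the distribution is taken from the list's current size on each call,
// so removals between calls cannot produce an out-of-range index.
template <typename T>
typename std::list<T>::iterator pickRandom(std::list<T>& items,
                                           std::mt19937& engine) {
    std::uniform_int_distribution<std::size_t> pick(0, items.size() - 1);
    return std::next(items.begin(),
                     static_cast<std::ptrdiff_t>(pick(engine)));
}
```

The engine is seeded once by the caller, so reseeding with time(0) on every iteration is avoided. Walking to the chosen position is still linear in the list size.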

You could create your own uniform_contained_int distribution class that accepts a container in its constructor, aggregates a uniform_int, and recreates the uniform_int each time the container changes size. See the description of uniform_int for the methods you need to implement to create your distribution.

I think you have more to worry about performance-wise. Particularly this:
std::list<MyClass> myList;
myList.begin() + index;
is not a particularly fast way of getting the index-th element.
I would transform it into something like this (which should operate on a random subsequence of the list):
X_i ~ U(0, 1) for all i
left <- max_ops
N <- list size
for each element
if X_i < left/N
process element
left--
N--
provided you don't need the random permutation of the elements.
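The pseudocode above might translate to C++ along these lines. This is a sketch using the standard <random> header instead of Boost; the name processRandomSubset and the convention that the callback returns true to request removal of the element are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <random>

// Walks the list once and processes a random subset of at most maxOps
// elements. `process` is called with a reference to the element and
// returns true if the element should be removed from the list.
// Returns the number of elements processed.
template <typename T, typename Process>
std::size_t processRandomSubset(std::list<T>& items, std::size_t maxOps,
                                std::mt19937& engine, Process process) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::size_t left = std::min(maxOps, items.size());
    std::size_t remaining = items.size();  // N in the pseudocode
    std::size_t processed = 0;

    for (auto it = items.begin(); it != items.end() && left > 0; --remaining) {
        // Select the current element with probability left / remaining.
        if (uniform(engine) < static_cast<double>(left) / remaining) {
            --left;
            ++processed;
            if (process(*it)) {
                it = items.erase(it);  // erase returns the next element
                continue;
            }
        }
        ++it;
    }
    return processed;
}
```

Each element is visited at most once and in list order, so no indexing into the list is needed.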

Container with two indexes (or a compound index)

I have a class like this
class MyClass
{
    int Identifier;
    int Context;
    int Data;
};
and I plan to store it in a STL container like
vector<MyClass> myVector;
but I will need to access it either by the external index (using myVector[index]) or by the combination of Identifier and Context, in which case I would perform a search with something like
vector<MyClass>::iterator myIt;
for( myIt = myVector.begin(); myIt != myVector.end(); myIt++ )
{
    if( ( myIt->Identifier == target_id ) &&
        ( myIt->Context == target_context ) )
        return *myIt; //or do something else...
}
Is there a better way to store or index the data?

Boost::Multi-Index has this exact functionality if you can afford the boost dependency (header only). You would use a random_access index for the array-like index, and either hashed_unique, hashed_non_unique, ordered_unique, or ordered_non_unique (depending on your desired traits) with a functor that compares Identifier and Context together.

We need to know your usage. Why do you need to be able to get them by index, and how often do you need to search the container for a specific element.
If you store it in an std::set, your search time will be O(log n), but you cannot reference them by index.
If you use an std::vector, you can index them, but you have to use std::find to get a specific element, which will be O(n).
But if you need an index to pass it around to other things, you could use a pointer. That is, use a set for faster look-up, and pass pointers (not indices) to specific elements.

Yes, but if you want speed, you'll need to sacrifice space. Store it in a collection (like an STL set) with the identifier/context as key, and simultaneously store it in a vector. Of course, you don't want two copies of the data itself, so store it in the set using a smart pointer (such as shared_ptr; auto_ptr itself cannot be safely stored in standard containers) and store it in the vector using a dumb pointer.
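A minimal sketch of the vector-plus-keyed-lookup arrangement follows. It departs from the description above in one respect, chosen for this example: the lookup structure stores positions in the vector instead of pointers, so vector reallocation cannot leave it dangling. The wrapper name MyClassStore is hypothetical, MyClass is written as a struct so its members are accessible, and the wrapper is append-only.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct MyClass {
    int Identifier;
    int Context;
    int Data;
};

class MyClassStore {
public:
    // Appends an item and records its position under (Identifier, Context).
    // Returns false if that key is already present.
    bool add(const MyClass& item) {
        const auto key = std::make_pair(item.Identifier, item.Context);
        if (!byKey_.emplace(key, items_.size()).second) return false;
        items_.push_back(item);
        return true;
    }

    // Access by external index, as with myVector[index].
    const MyClass& at(std::size_t index) const { return items_.at(index); }

    // Access by (Identifier, Context); returns nullptr if absent.
    const MyClass* find(int identifier, int context) const {
        auto found = byKey_.find(std::make_pair(identifier, context));
        return found == byKey_.end() ? nullptr : &items_[found->second];
    }

private:
    std::vector<MyClass> items_;
    std::map<std::pair<int, int>, std::size_t> byKey_;
};
```

Lookup by key is O(log n) with std::map. Supporting removal would require updating the stored positions, which is the kind of bookkeeping a multi-index container handles for you.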
