Checking for duplicates in a vector [duplicate] - c++

Possible Duplicate:
Determining if an unordered vector<T> has all unique elements
I have to check a vector for duplicates. What is the best way to approach this:
I take the first element, compare it against all other elements in the vector. Then take the next element and do the same and so on.
Is this the best way to do it, or is there a more efficient way to check for dups?

Since your container is a std::vector, the solution is easy:
std::sort(myvec.begin(), myvec.end());
myvec.erase(std::unique(myvec.begin(), myvec.end()), myvec.end());
According to cppreference (https://en.cppreference.com/w/cpp/algorithm/unique), the elements are shifted around so that the values from myvec.begin() to the iterator returned by std::unique are all unique. Note that std::unique only collapses consecutive duplicates, which is why the sort comes first. The elements after the returned iterator are left with unspecified values (useless in every use case I've seen), so remove them from the std::vector<A> using std::vector<A>::erase.

Use a hash table in which you insert each element. Before you insert an element, check if it's already there. If it is, you have yourself a duplicate. This is O(n) on average, but the worst case is just as bad as your current method.
Alternatively, you can use a set to do the same thing in O(n log n) worst case. This is as good as the sorting solution, except it doesn't change the order of the elements (uses more memory though since you create a set).
Another way is to copy your vector to another vector, sort that and check the adjacent elements there. I'm not sure if this is faster than the set solution, but I think sorting adds less overhead than the balanced search trees a set uses so it should be faster in practice.
Of course, if you don't care about keeping the original order of the elements, just sort the initial vector.
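Here is a minimal sketch of the hash-table approach, assuming the element type works with std::hash (the has_duplicates helper is just an illustrative name):

#include <unordered_set>
#include <vector>

// Returns true if v contains at least one duplicate.
// Average O(n); only degenerate hashing pushes it toward O(n^2).
template <typename T>
bool has_duplicates(const std::vector<T>& v) {
    std::unordered_set<T> seen;
    seen.reserve(v.size());
    for (const T& x : v) {
        // insert().second is false when the value was already present
        if (!seen.insert(x).second)
            return true;
    }
    return false;
}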

If you don't care about an occasional false positive, you can use a Bloom Filter to detect probable duplicates in the collection. If false positives can't be accepted, take the values that fail the filter and run a second detection pass on those. The list of failed values should be fairly small, although they will need to be checked against the full input.
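A rough sketch of that two-pass idea; the BloomFilter class below, its 2^20-bit size, and the three seeded hashes are all illustrative choices rather than any standard API:

#include <bitset>
#include <cstddef>
#include <functional>
#include <vector>

// Minimal Bloom filter: testAndSet() returns true when the value was
// *probably* seen before (false positives possible, no false negatives).
template <typename T, std::size_t Bits = 1u << 20>
class BloomFilter {
    std::bitset<Bits> bits_;  // ~128 KB of bits; size is an arbitrary choice
    static std::size_t slot(const T& v, std::size_t seed) {
        // Mixing a seed into std::hash stands in for independent hash functions.
        return (std::hash<T>{}(v) ^ (seed * 0x9e3779b97f4a7c15ull)) % Bits;
    }
public:
    bool testAndSet(const T& v) {
        bool maybeSeen = true;
        for (std::size_t seed = 1; seed <= 3; ++seed) {
            std::size_t s = slot(v, seed);
            if (!bits_[s]) maybeSeen = false;
            bits_[s] = true;
        }
        return maybeSeen;
    }
};

// First pass: collect the values the filter flags. A second, exact pass
// (e.g. with the hash-set approach above) would confirm real duplicates.
std::vector<int> probable_duplicates(const std::vector<int>& v) {
    BloomFilter<int> filter;
    std::vector<int> suspects;
    for (int x : v)
        if (filter.testAndSet(x))
            suspects.push_back(x);
    return suspects;
}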

Sorting and then comparing adjacent elements is the way to go. A sort takes O(n log n) comparisons, and then an additional n-1 to compare adjacent elements.
The scheme in the question would take n(n-1)/2 comparisons.
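For a pure yes/no check, a sketch of sort-then-scan using std::adjacent_find (taking the vector by value keeps the caller's order intact):

#include <algorithm>
#include <vector>

// O(n log n) sort plus an O(n) scan for equal neighbours.
bool has_duplicates_sorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}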

You can also use std::binary_search (keep in mind it requires a sorted range).
Here are two good examples that will help you:
http://www.cplusplus.com/reference/algorithm/binary_search/
http://www.cplusplus.com/reference/algorithm/unique_copy/
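As a sketch, std::unique_copy on a sorted copy can also answer the question by counting distinct values (the helper name is illustrative):

#include <algorithm>
#include <iterator>
#include <vector>

// Copies only the first of each run of equal values; requires sorted
// input, hence the sort on a local copy. Fewer distinct values than
// total values means the original contained duplicates.
bool has_duplicates_via_unique_copy(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    std::vector<int> distinct;
    std::unique_copy(v.begin(), v.end(), std::back_inserter(distinct));
    return distinct.size() != v.size();
}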

Related

slow std::map for large entries

We have 4,816,703 entries in this format.
1 abc
2 def
...
...
4816702 blah
4816703 blah_blah
Since the number of entries is quite large, I am worried that std::map would take a long time during insertion, since it also needs to rebalance on every insertion.
Only inserting these entries into the map takes a lot of time. I am doing
map[first] = second;
Two questions:
1. Am I correct in using std::map for this kind of case?
2. Am I correct to insert like the above, or should I use map.insert()?
I am sorry for not doing the experiments and writing absolute numbers, but we want a general consensus on whether we are doing the right thing or not.
Also, the keys are not always consecutive.
P.S. Of course, later we will also need to access the map to get the values corresponding to the keys.
If you don’t need to insert into the map afterwards, you can construct an unsorted vector of your data, sort it according to the key, and then search using functions like std::equal_range.
It’s the same complexity as std::map, but with far fewer allocations.
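A sketch of that sorted-vector approach; the Entry layout, key_less comparator, and find_value helper are illustrative names, not part of any library:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;

// Order entries by key only (values don't participate in the sort).
bool key_less(const Entry& a, const Entry& b) { return a.first < b.first; }

// Returns a pointer to the value for key, or nullptr if absent.
const std::string* find_value(const std::vector<Entry>& sorted, int key) {
    auto range = std::equal_range(sorted.begin(), sorted.end(),
                                  Entry{key, std::string()}, key_less);
    return range.first == range.second ? nullptr : &range.first->second;
}

int main() {
    std::vector<Entry> data;
    data.reserve(4816703);         // one allocation up front
    // ... push_back all {key, value} pairs in any order ...
    std::sort(data.begin(), data.end(), key_less);  // one O(n log n) pass
    const std::string* v = find_value(data, 42);
    return v ? 0 : 1;
}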
Use an std::unordered_map, which has much better insertion time complexity than std::map, as the reference mentions:
Complexity
Single element insertions:
Average case: constant.
Worst case: linear in container size.
Multiple elements insertion:
Average case: linear in the number of elements inserted.
Worst case: N*(size+1): number of elements inserted times the container size plus one.
May trigger a rehash (not included in the complexity above).
That's better than the logarithmic time complexity of std::map's insertion.
Note: std::map's insertion can enjoy "amortized constant if a hint is given and the position given is the optimal.". If that's the case for you, then use a map (if a vector is not applicable).
@n.m. provides a representative live demo.
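A sketch of the unordered_map variant; build_map is an illustrative helper, and the reserve() call is the main point, since it avoids repeated rehashing during the bulk load:

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Bulk-loads key/value pairs; reserve() does one rehash up front
// instead of several during insertion.
std::unordered_map<int, std::string>
build_map(const std::vector<std::pair<int, std::string>>& input) {
    std::unordered_map<int, std::string> m;
    m.reserve(input.size());
    for (const auto& kv : input)
        m.emplace(kv.first, kv.second);  // average O(1) per insert
    return m;
}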

Which container should I use to quickly find an element, knowing that by construction it's already ordered?

I have to use a container that by construction is already ordered and I need a quick find-function.
My idea is to use a set to exploit its set.find, which should be quick enough, but constructing my set with set.insert could be slow because I already know that the elements are ordered!
For this part I could just use a vector with vector.push_back, but then I would have to write the find function myself, and it would also be slower than set::find... What do you suggest?
EDIT:
The container needs to be modified, but will always be ordered, and its size will always be the same (but I don't know it a priori).
No repeated values.
A lot more finds than inserts.
I just need to know if it exists, not its position.
I found the std::binary_search. Is it good?
Constructing a set from a pair of iterators is guaranteed to be O(N) if the input range is already sorted, so you could just use that. (See e.g. http://en.cppreference.com/w/cpp/container/set/set.)
Subsequent inserts are O(log N).
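A sketch of that constructor in use, assuming the input really is sorted and duplicate-free:

#include <set>
#include <vector>

int main() {
    std::vector<int> sorted = {1, 3, 5, 8, 13};     // ordered by construction
    std::set<int> s(sorted.begin(), sorted.end());  // O(N) because input is sorted
    bool found = s.find(8) != s.end();              // O(log N) lookup
    return found ? 0 : 1;
}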
Although if the total size is not that large a sorted vector tends to beat everything else because it is so cache friendly. std::binary_search is O(log N), the same as set::find.
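A sketch of that lookup on a sorted vector (contains is an illustrative name):

#include <algorithm>
#include <vector>

// The data is already ordered by construction, so no sort is needed.
bool contains(const std::vector<int>& sorted, int value) {
    return std::binary_search(sorted.begin(), sorted.end(), value);  // O(log N)
}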
std::unordered_set doesn't care about your ordering, but does give you O(1) lookup (on average, no guarantees).
std::vector with std::make_heap(), push_heap(), pop_heap() and sort_heap() could help you deal with insertions about as fast as std::set. But note that a heap is not fully sorted, so you would need to run sort_heap() before using std::binary_search manually.

Remove duplicated members from vector while maintaining order [duplicate]

Possible duplicate: How to make elements of vector unique? (remove non adjacent duplicates)
I know this question has been asked a lot, but I could not find the best (most efficient) way to remove duplicate elements (of type double) from a vector while keeping one copy of each and maintaining the order of the original vector.
If your data were not doubles, simply doing a pass with an unordered_set to keep track of what you have already seen, via the remove_if/erase idiom, would work.
doubles are, however, bad news when checking for equality: two computations that you might expect to produce the same value can generate slightly different results. A set lets you look for nearby values instead: use equal_range with plus or minus an epsilon, rather than find, to see whether another value approximately equal to yours appears earlier in the vector, and then apply the same remove/erase idiom.
The remove/erase idiom looks like:
vec.erase(std::remove_if(vec.begin(), vec.end(), [&](double x) -> bool {
    // return true if you want to remove the double x
}), vec.end());
In C++03 there are no lambdas, so the predicate cannot be written inline; you would pass a hand-written functor instead.
The lambda above will be called for each element in order, like the body of a loop.
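Putting the pieces together for the easy (non-double) case, a sketch of the order-preserving pass with an unordered_set (dedupe_keep_order is an illustrative name):

#include <algorithm>
#include <unordered_set>
#include <vector>

// Keeps the first occurrence of each value, preserving the original order.
// Uses exact equality; for doubles, the epsilon caveat above still applies.
template <typename T>
void dedupe_keep_order(std::vector<T>& vec) {
    std::unordered_set<T> seen;
    vec.erase(std::remove_if(vec.begin(), vec.end(),
                             [&](const T& x) {
                                 // true (remove) if x was already seen
                                 return !seen.insert(x).second;
                             }),
              vec.end());
}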
If you have to/wish to use a vector* then it might be easiest to trap duplicates on insertion: if the value to be inserted is already there, bin it.
Another approach for a really large collection would be to do a sort and search for duplicates after every N insertions, where N is the perfect number of insertions to wait before doing a sort and search for duplicates. (Calculating N is left as an exercise for the reader.)
Your approach, and the value of N if relevant, depends on the number of elements, how often the array is changed, how often the contents are examined, and the likelihood of duplicates occurring.
(*Apparently, vectors are great because their disadvantages lie where modern computers tend to kick butt so hard it doesn't matter, and they are blisteringly fast with linear searches. At least I think that's what Bjarne is saying here, comparing a vector to a linked list.)

How to efficiently compare two vectors in C++, whose content can't be meaningfully sorted?

I read many posts, but most talk about first sorting the two vectors and then comparing the elements. In my case I can't sort the vectors in a way that would help the comparison. That means I will have to do an O(N^2) operation rather than O(N): I will have to compare each element from the first vector and try to find a unique match for it in the second vector. So I will have to match up all the elements, and if I can find a unique match for each, then the vectors are equal.
Is there an efficient and simple way to do this? Will I have to code it myself?
Edit: by meaningful sorting I mean a way to sort them so that later you can compare the vectors in linear fashion.
Thank you
If the elements can be hashed in some meaningful way, you can get expected O(n) performance by using a hashmap: Insert all elements from list A into the hashmap, and for each element in list B, check if it exists in the hashmap.
In C++, I believe that unordered_map is the standard hashmap implementation (though I haven't used it myself).
Put all elements of vector A into a hash table, where the element is the key, and the value is a counter of how many times you’ve added the element. [O(n) expected]
Iterate over vector B and decrease the counters in the hash table for each element. [O(n) expected]
Iterate over the hash table and check that each counter is 0. [O(n) expected]
= O(n) expected runtime.
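A sketch of that counting scheme, assuming the element type works with std::hash (same_elements is an illustrative name):

#include <unordered_map>
#include <vector>

// True if a and b contain the same elements with the same multiplicities.
template <typename T>
bool same_elements(const std::vector<T>& a, const std::vector<T>& b) {
    if (a.size() != b.size()) return false;
    std::unordered_map<T, int> counts;
    for (const T& x : a) ++counts[x];      // step 1: count A's elements
    for (const T& x : b) {                 // step 2: decrement for B's
        auto it = counts.find(x);
        if (it == counts.end() || it->second == 0)
            return false;                  // B has something A lacks
        --it->second;
    }
    return true;  // sizes match and nothing went negative, so all zero
}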
No reason to use a map, since there are no values to associate with the keys (the elements themselves are the keys). In this case you should look at using a set or an unordered_set. Put all elements from A into a new set, and then loop through B. For each element, if( set.find(element) == set.end() ) return false; (Note this ignores multiplicities; if duplicate elements matter, the counting approach above handles them.)
If you're set on sorting the array against some arbitrary value, you might want to look at C++11's std::hash, which returns a size_t. Using it you can write a comparator that hashes the two objects and compares the hashes, which you can use with std::sort to perform an O(n log n) sort.
If you really can't sort the vectors you could try this. C++ gurus please feel free to point out flaws, obvious failures to exploit the STL and other libraries, failures to comprehend previous answers, etc, etc :) Apologies in advance as necessary.
Have a vector of ints, 0..n-1, called C. These ints are the indices of the elements in vector B. For each element in vector A, compare it against the elements of B at the indices that remain in C. If you find a match, remove that index from C. C is now one shorter. For the next element of A you again search B according to the indices in C, which, being one shorter, takes less time. And if you get lucky it will be quite quick.
Or you could build up a vector of B indices that you have already checked, so that you ignore those Bs next time around the loop; this saves building a complete C first.
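A sketch of the index-elimination idea with ints standing in for the real element type (matched is an illustrative name); it is still O(n^2) in the worst case, but the candidate list shrinks with every match:

#include <cstddef>
#include <numeric>
#include <vector>

// True if every element of a can be paired with a distinct element of b.
bool matched(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) return false;
    std::vector<std::size_t> c(b.size());   // indices of b not yet matched
    std::iota(c.begin(), c.end(), 0);
    for (int x : a) {
        bool found = false;
        for (std::size_t i = 0; i < c.size(); ++i) {
            if (b[c[i]] == x) {
                c[i] = c.back();            // O(1) removal: swap with last
                c.pop_back();
                found = true;
                break;
            }
        }
        if (!found) return false;           // no unique match for x
    }
    return true;
}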

std::list or std::multimap

Hey, right now I have a list of a struct that I made, and I sort this list every time I add a new object, using the std::list sort method.
I want to know which would be faster: using a std::multimap for this, or a std::list, given that I'm iterating over the whole list every frame (I am making a game).
I would like to hear your opinions on what I should use in this case.
std::multimap will probably be faster, as it is O(log n) per insertion, whereas an insert and sort of the list is O(n log n).
Depending on your usage pattern, you might be better off with sorted vectors. If you insert a whole bunch of items at once and then do a bunch of reads -- i.e. reads and writes aren't interleaved -- then you'll have better performance with vector, std::sort, and std::binary_search.
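A sketch of that batched pattern with ints standing in for the game objects (add_batch and contains are illustrative names):

#include <algorithm>
#include <vector>

std::vector<int> items;

// Writes arrive together: append the whole batch, then sort once.
void add_batch(const std::vector<int>& batch) {
    items.insert(items.end(), batch.begin(), batch.end());
    std::sort(items.begin(), items.end());
}

// Reads between batches are cheap binary searches on the sorted vector.
bool contains(int value) {
    return std::binary_search(items.begin(), items.end(), value);
}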
You might consider using the lower_bound algorithm to find where to insert into your list. http://stdcxx.apache.org/doc/stdlibref/lower-bound.html
Edit: In light of Neil's comment, note that this will work with any sequence container (vector, deque, etc.)
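A sketch of such a sorted insert (sorted_insert is an illustrative name); note that on a vector the insert itself is O(n) because of element shifting:

#include <algorithm>
#include <vector>

// Inserts value while keeping v sorted. The same idea works with any
// sequence container.
void sorted_insert(std::vector<int>& v, int value) {
    auto pos = std::lower_bound(v.begin(), v.end(), value);  // O(log n) search
    v.insert(pos, value);                                    // O(n) shift on a vector
}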
If you do not need Key/Value pairs std::set or std::multiset is probably better than using std::multimap.
Reference for std::set:
http://www.cplusplus.com/reference/stl/set/
Reference for std::multiset:
http://www.cplusplus.com/reference/stl/multiset/
Edit: (seems like it was unclear before)
It is in general better to use a container like std::(multi)set or std::(multi)map than to use std::list and sort it every time an element is inserted, because std::list does not perform very well when inserting elements in the middle of the container.
Generally speaking, iterating over one container takes about as much time as iterating over another, so if you keep adding to a container and then iterating over it, it's mainly a question of picking a container that avoids constantly reallocating memory and that inserts the way you want quickly.
Both list and multimap will avoid reallocating themselves simply from adding an element (as a vector might), so it's primarily a question of how long insertion takes. Adding to the end of a list is O(1), while adding to a multimap is O(log n). However, the multimap inserts elements in sorted order, whereas to keep the list sorted you would have to either re-sort it in O(n log n) or insert each element in sorted position with something like lower_bound, which is O(n). Either way, the list is far worse (in the worst case at least).
Generally, if you're maintaining a container in sorted order and continually adding to it, rather than creating it and sorting it once, sets and maps are more efficient, since they're designed to be kept sorted. Of course, as always, if you really care about performance, profile your specific application and see which works better. In this case, though, it's almost a guarantee that multimap will be faster (especially with many elements).