Remove duplicated members from vector while maintaining order [duplicate] - c++

Duplicate of: How to make elements of vector unique? (remove non adjacent duplicates)
I know this question has been asked a lot, but I could not find the best (most efficient) way to remove duplicate elements (of type double) from a vector while keeping one copy of each and preserving the original order.

If your data were not doubles, a single pass with an unordered_set to track the values you have already seen, combined with the remove_if/erase idiom, would work.
doubles, however, are bad news when checking for equality: two computations that you might expect to produce the same value can generate slightly different results. A set allows looking for nearby values: use equal_range with plus/minus an epsilon, instead of find, to see whether a value approximately equal to yours appeared earlier in the vector, and apply the same remove/erase idiom.
The remove erase idiom looks like:
vec.erase( std::remove_if( vec.begin(), vec.end(), [&](double x)->bool{
    // return true if you want to remove the double x
}), vec.end());
In C++03 the predicate cannot be written inline as a lambda; you would use a named function object instead.
The lambda above will be called for each element in order, like the body of a loop.
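Putting the pieces together, here is a minimal sketch of that approach; the tolerance of 0.001 and the helper name dedupe_keep_order are assumptions for illustration:

#include <algorithm>
#include <set>
#include <vector>

// Keep the first occurrence of each value, treating values within +/- eps
// as equal. `seen` holds the values kept so far; lower_bound finds the first
// kept value >= x - eps, so anything in [x - eps, x + eps] is a duplicate.
std::vector<double> dedupe_keep_order(std::vector<double> vec, double eps = 0.001)
{
    std::set<double> seen;
    vec.erase(std::remove_if(vec.begin(), vec.end(),
        [&](double x) -> bool {
            auto it = seen.lower_bound(x - eps);
            if (it != seen.end() && *it <= x + eps)
                return true;          // a nearby value was already kept: drop x
            seen.insert(x);
            return false;             // first (approximate) occurrence: keep x
        }),
        vec.end());
    return vec;
}

The predicate is stateful (it captures seen by reference), which works in practice with the sequential overload of std::remove_if, since it applies the predicate once per element.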

If you have to/wish to use a vector*, then it might be easiest to trap duplicates on insertion: if the point to be inserted is already there, bin it (see the sketch below).
Another approach for a really large collection would be to do a sort and search for duplicates after every N insertions, where N is the perfect number of insertions to wait before doing a sort and search for duplicates. (Calculating N is left as an exercise for the reader.)
Your approach, and the value of N if relevant, depends on the number of elements, how often the array is changed, how often the contents are examined, and the likelihood of duplicates occurring.
(*Apparently, vectors are great because their disadvantages lie where modern computers tend to kick butt so hard it doesn't matter, and they are blisteringly fast at linear searches. At least I think that's what Bjarne is saying here, comparing a vector to a linked list.)
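A hypothetical helper sketching the trap-on-insertion idea, reusing the epsilon comparison from the first answer (the name insert_unique and the tolerance are made up):

#include <algorithm>
#include <cmath>
#include <vector>

// Append value only if no element within eps of it is already present.
// The linear scan is the price of keeping a plain, ordered vector.
void insert_unique(std::vector<double>& vec, double value, double eps = 0.001)
{
    bool present = std::any_of(vec.begin(), vec.end(),
        [&](double x) { return std::fabs(x - value) < eps; });
    if (!present)
        vec.push_back(value);
}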

Related

Appropriate std container for (only) insertion and iteration [duplicate]

Duplicate of: General use cases for C++ containers
Which std container should I use if I only want to insert a lot (10,000-100,000) of small items (ints, floats, and doubles) and then iterate over all of them (order is not important)? (Note: the number of items is unknown at the start.)
I have noticed that unordered_set, list, and forward_list have O(1) insertion and O(n) iteration. Is there any other container with that complexity? Which of those is the fastest? (If there are significant differences in memory use, I would also be interested in knowing about them.)
(I'm only interested in std containers, not Boost or other libraries ones)
My bet is std::vector with a call to std::vector::reserve().
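A minimal sketch of that suggestion; the element count and values are invented for illustration:

#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> values;
    values.reserve(100000);          // reserve the guessed upper bound: no reallocations
    for (int i = 0; i < 100000; ++i)
        values.push_back(i * 0.5);   // amortized O(1) insertion at the end
    double sum = 0.0;
    for (double v : values)          // O(n) iteration over contiguous memory
        sum += v;
    std::printf("sum = %f\n", sum);
    return 0;
}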
vector might be a good choice. Its memory consumption is the same as an array's: O(1) insertion (if you are inserting at the end) and O(n) iteration. It is the simplest of the containers and a good choice if deletion or random insertion is not on your list.
All of the classes in your question have optimal complexity for what you want, but std::vector may be faster, in particular if you know how many elements you want to insert approximately, i.e. you have an upper bound available.
I suppose that performance may differ depending on your compiler, compiler version and the number of elements you want to insert. Sorry for not providing more specific help.
It depends on what you want to do with the elements in the container.
If you want to insert elements at arbitrary positions, then list is the fastest, because arrays and vectors have to move all the items after the insertion position one step toward the end.
On the other hand, if you want to access an item at a position, then arrays and vectors are faster.
Moreover, other than these there are container adapters as well like queue, stack, priority_queue etc.
So again, it all depends on your implementation and what you want from the container of elements.

c++ multiset iterator sorting

I have a multiset mymulti where I sort according to a class member m_a.
I then want to check, for all sorted elements, whether the difference in m_a between neighbouring elements of mymulti is less than a given threshold, say 0.001. If so, I want to prefer the one with the smaller value of another class member, m_b.
Here I am stuck; I have no experience with multisets or iterators. I don't know how to compare iterators across two iterations. If you can provide me with the right code for what I want to do, I will be very grateful!
My try, not too much, just my concept:
//all before I got stuck
for(it = mymulti.begin(); it != mymulti.end() && std::next(it) != mymulti.end(); ++it) //or it++?
    if( std::next(it)->m_a - it->m_a < 0.001) // (it+1) does not compile: multiset iterators are bidirectional, so std::next from <iterator> is used
        if (std::next(it)->m_b < it->m_b)
            //swap them. but how to swap two fields in a multiset, not two multisets?
            // otherwise do nothing
You cannot (or if you can, depending on your STL implementation, should not) modify items once they have been inserted into a multiset, as it could violate the provided ordering of the items in the multiset. So swapping would be a bad idea, even if you could do it.
See https://stackoverflow.com/a/2038534/713961 and http://www.cplusplus.com/reference/set/multiset/
If you would like to remove items, use multiset::erase, which takes an iterator. I believe the standard practice for "modifying" an item in a multiset is to remove it, then insert the modified version.
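A minimal sketch of that erase-then-reinsert practice, assuming a struct Item with members m_a and m_b as in the question (the names update and ByA are made up):

#include <set>

struct Item { double m_a; double m_b; };

struct ByA {
    bool operator()(const Item& lhs, const Item& rhs) const {
        return lhs.m_a < rhs.m_a;    // the multiset is ordered by m_a
    }
};

// "Modify" an element by erasing it and inserting the changed copy, so the
// multiset's ordering invariant is never violated.
void update(std::multiset<Item, ByA>& ms,
            std::multiset<Item, ByA>::iterator it, double new_a)
{
    Item copy = *it;     // copy the element out
    ms.erase(it);        // erase invalidates `it`, so do not reuse it
    copy.m_a = new_a;
    ms.insert(copy);     // reinserted at its correct sorted position
}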
As a side note, I noticed you're checking if two floating point numbers are close enough in value by using a fixed epsilon (0.001). As explained in this article, this only works if all the floats you are comparing are sufficiently small. See the article for a comparison that works equally well for large and small floating-point values.

C++ Set: No match for - operator

I have a set, namely one of type multiset, and I'm trying to use the upper_bound function to find the index of the element the returned iterator points to. With vectors this usually works: I take the iterator and subtract vector.begin() from it to get the answer.
However, when I try this with a set it gives an STL error, saying "no match for 'operator-' in ..." (omitting the STL details).
Is there a fundamental reason for this (sets being implemented as RB-trees and all)? If so, can anyone suggest an alternative? (I'm trying to solve a question on a programming site.)
Thanks!
Yes, there are different types of iterators, and operator- is not supported for set iterators, which are bidirectional rather than random access.
You can use std::distance( mySet.begin(), iter );
Note that for std::set (and multiset) this is an O(N) operation, since the iterators must be advanced one step at a time, compared to constant time for vector and linear for list.
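A minimal sketch (the container contents are made up):

#include <cstddef>
#include <cstdio>
#include <iterator>
#include <set>

int main()
{
    std::multiset<int> ms = {1, 3, 3, 5, 9};
    auto it = ms.upper_bound(3);      // first element greater than 3, i.e. 5
    // it - ms.begin() does not compile: set iterators are bidirectional.
    // std::distance walks the range instead, which is O(n) here.
    std::ptrdiff_t index = std::distance(ms.begin(), it);
    std::printf("index = %td\n", index);   // prints 3
    return 0;
}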
Are you sure you want to be storing your data in a std::multiset? You could use a sorted vector instead. Where the vector would be slower is if it is regularly edited, i.e. you are trying to insert and remove elements from anywhere, whilst retaining its sorted state.
If the data is built once then accessed many times, a sorted vector can sometimes be more efficient.
If the data set is very large, consider using std::deque rather than std::vector, because a deque scales better by not requiring one contiguous memory block.

Choosing between std::map and std::unordered_map [duplicate]

Duplicate of: Is there any advantage of using map over unordered_map in case of trivial keys?
Now that std has a real hash map in unordered_map, why (or when) would I still want to use the good old map over unordered_map on systems where it actually exists? Are there any obvious situations that I cannot immediately see?
As already mentioned, map allows iterating over the elements in sorted order, but unordered_map does not. This is very important in many situations, for example when displaying a collection (e.g. an address book). It also manifests in indirect ways, such as: (1) starting iteration from the iterator returned by find(), or (2) the existence of member functions like lower_bound().
Also, there is a difference in worst-case search complexity:
For map, it is O(log N).
For unordered_map, it is O(N) [this can happen when the hash function is poor, leading to too many hash collisions].
The same applies to the worst-case deletion complexity.
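A minimal sketch of the sorted-iteration point, using the address-book example above (contents invented):

#include <cstdio>
#include <map>
#include <string>

int main()
{
    // map iterates in key order; unordered_map gives no such guarantee.
    std::map<std::string, std::string> address_book = {
        {"Carol", "555-0103"}, {"Alice", "555-0101"}, {"Bob", "555-0102"}};
    for (const auto& entry : address_book)                  // Alice, Bob, Carol
        std::printf("%s: %s\n", entry.first.c_str(), entry.second.c_str());

    // lower_bound exists only on the ordered container:
    auto it = address_book.lower_bound("B");                // first key not less than "B"
    std::printf("first >= B: %s\n", it->first.c_str());     // Bob
    return 0;
}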
In addition to the answers above, you should also note that just because unordered_map is constant time (O(1)) doesn't mean it's faster than map (of order log(N)). The constant may be bigger than log(N), especially since N is limited by 2^32 (or 2^64).
So in addition to the other answers (map maintains order, and hash functions may be difficult to get right), it may be that map is more performant.
For example in a program I ran for a blog post I saw that for VS10 std::unordered_map was slower than std::map (although boost::unordered_map was faster than both).
(The original answer included a benchmark chart here; note its 3rd through 5th bars.)
This one is due to Google's Chandler Carruth, from his CppCon 2014 lecture:
std::map is (considered by many to be) not useful for performance-oriented work: if you want O(1)-amortized access, use a proper associative array (or, for lack of one, std::unordered_map); if you want sorted sequential access, use something based on a vector.
Also, std::map is a balanced tree; and you have to traverse it, or re-balance it, incredibly often. These are cache-killer and cache-apocalypse operations respectively... so just say NO to std::map.
You might be interested in this SO question on efficient hash map implementations.
(PS - std::unordered_map is cache-unfriendly because it uses linked lists as buckets.)
I think it's obvious that you'd use std::map when you need to iterate across the items in the map in sorted order.
You might also use it when you'd prefer to write a comparison operator (which is intuitive) instead of a hash function (which is generally very unintuitive).
Say you have very large keys, perhaps large strings. Computing a hash value for a large string means going through the whole string from beginning to end, which takes time at least linear in the length of the key. When you search a binary tree using the key's < operator, however, each string comparison can return as soon as the first mismatch is found, which for large strings is typically very early.
This reasoning applies to the find function of both std::unordered_map and std::map. If the nature of the key is such that producing a hash (in the case of std::unordered_map) takes longer than locating an element by binary search (in the case of std::map), it should be faster to look up a key in the std::map. It's quite easy to think of scenarios where this would be the case, but I believe they are quite rare in practice.
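A small sketch of why the comparison can win for long keys (the sizes are arbitrary; how much of the string std::hash reads is implementation-defined, but typical implementations consume all of it):

#include <cstdio>
#include <functional>
#include <string>

int main()
{
    // Two long keys that differ at the very first character.
    std::string a(1000, 'x'); a[0] = 'a';
    std::string b(1000, 'x'); b[0] = 'b';

    bool less = a < b;                            // stops at the first mismatch
    std::size_t h = std::hash<std::string>{}(a);  // typically reads all 1000 chars

    std::printf("less = %d, hash = %zu\n", static_cast<int>(less), h);
    return 0;
}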

Checking for duplicates in a vector [duplicate]

Duplicate of: Determining if an unordered vector<T> has all unique elements
I have to check a vector for duplicates. What is the best way to approach this:
I take the first element, compare it against all other elements in the vector. Then take the next element and do the same and so on.
Is this the best way to do it, or is there a more efficient way to check for dups?
If your vector is an STL container, the solution is easy:
std::sort(myvec.begin(), myvec.end());
myvec.erase(std::unique(myvec.begin(), myvec.end()), myvec.end());
According to cppreference (https://en.cppreference.com/w/cpp/algorithm/unique), the elements are shifted around so that the values from myvec.begin() to the return value of std::unique are all unique. The elements after the iterator returned by std::unique are unspecified (useless in every use-case I've seen) so remove them from the std::vector<A> using std::vector<A>::erase.
Use a hash table in which you insert each element. Before you insert an element, check if it's already there. If it is, you have yourself a duplicate. This is O(n) on average, but the worst case is just as bad as your current method.
Alternatively, you can use a set to do the same thing in O(n log n) worst case. This is as good as the sorting solution, except it doesn't change the order of the elements (uses more memory though since you create a set).
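A minimal sketch covering both suggestions; the name has_duplicates is made up:

#include <unordered_set>
#include <vector>

// Hash-set version: O(n) on average, O(n^2) in the worst case. Swapping in
// std::set<int> gives the O(n log n) worst case mentioned above.
bool has_duplicates(const std::vector<int>& vec)
{
    std::unordered_set<int> seen;
    for (int x : vec)
        if (!seen.insert(x).second)   // insert reports false if x is already present
            return true;
    return false;
}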
Another way is to copy your vector to another vector, sort that and check the adjacent elements there. I'm not sure if this is faster than the set solution, but I think sorting adds less overhead than the balanced search trees a set uses so it should be faster in practice.
Of course, if you don't care about keeping the original order of the elements, just sort the initial vector.
If you don't care about an occasional false positive, you can use a Bloom Filter to detect probable duplicates in the collection. If false positives can't be accepted, take the values that fail the filter and run a second detection pass on those. The list of failed values should be fairly small, although they will need to be checked against the full input.
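A toy sketch of that two-pass idea; the bit count and hash mixing here are arbitrary choices, not a tuned filter:

#include <cstddef>
#include <functional>
#include <vector>

// test_and_set can return false positives but never false negatives, so any
// value it flags still needs the exact second pass described above.
class BloomFilter {
    std::vector<bool> bits_;
public:
    explicit BloomFilter(std::size_t nbits = 1u << 20) : bits_(nbits) {}
    bool test_and_set(int x) {
        std::size_t h  = std::hash<int>{}(x);
        std::size_t i1 = h % bits_.size();
        std::size_t i2 = (h ^ ((h >> 16) * 0x9e3779b9u)) % bits_.size();
        bool maybe = bits_[i1] && bits_[i2];   // both bits set: probable duplicate
        bits_[i1] = bits_[i2] = true;          // record x as seen
        return maybe;
    }
};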
Sorting and then comparing adjacent elements is the way to go. A sort takes O(n log n) comparisons, and then an additional n-1 comparisons to check adjacent elements.
The scheme in the question would take n^2/2 comparisons.
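A minimal sketch of the sort-then-scan approach that leaves the original untouched (the argument is taken by value, which is the copy step mentioned earlier):

#include <algorithm>
#include <vector>

bool has_duplicates_sorted(std::vector<int> vec)   // by value: sorts a copy
{
    std::sort(vec.begin(), vec.end());                              // O(n log n)
    return std::adjacent_find(vec.begin(), vec.end()) != vec.end(); // n-1 comparisons
}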
You can also use std::binary_search (on a sorted range).
Here are two good examples that will help you:
http://www.cplusplus.com/reference/algorithm/binary_search/
http://www.cplusplus.com/reference/algorithm/unique_copy/