How to erase duplicate elements in a vector efficiently - c++

I have
vector<string> data ; // I hold some usernames in it
In that vector I have duplicate elements, and I want to erase them. Is there an algorithm or library function to erase duplicate elements?
Example:
In data:
abba, abraham, edie, Abba, edie
After operation:
abba, abraham, edie, Abba

If you can sort the elements in the container, the straightforward and relatively efficient solution would be:
std::sort(data.begin(), data.end());
data.erase(std::unique(data.begin(), data.end()), data.end());
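For completeness, here is the same idea as a small self-contained sketch. The helper name dedupe_sorted is made up for illustration. Keep in mind that sorting changes the order of the elements, and that the comparison is case-sensitive, so "abba" and "Abba" both survive.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts the vector, then erases adjacent duplicates.
// The original order of the elements is not preserved.
void dedupe_sorted(std::vector<std::string>& data) {
    std::sort(data.begin(), data.end());
    // std::unique moves the unique elements to the front and returns
    // an iterator to the new logical end; erase trims the leftover tail.
    data.erase(std::unique(data.begin(), data.end()), data.end());
}
```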

I'm not sure there is a really good way to do it.
What I would do is sort (in a different array, if you need the original intact) and then run through it.

"set" does not allow duplicates. You can use that to filter out duplicates.
Create a set
Add all usernames to set
Create a new vector
Add all elements from set to vector
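Those four steps could look roughly like this. The function name dedupe_with_set is only for illustration, and the result comes back in sorted order rather than in the original order.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper following the steps above: copy everything into a
// std::set (which silently drops duplicates), then copy the set back out.
std::vector<std::string> dedupe_with_set(const std::vector<std::string>& names) {
    std::set<std::string> unique_names(names.begin(), names.end());
    return std::vector<std::string>(unique_names.begin(), unique_names.end());
}
```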

If you really need to do it efficiently, then you should do an in-place sort first, and then go through the container yourself instead of using std::unique: fetch the unique items into a new vector, and at the end do a swap.
I just checked the source code of std::unique: it moves a lot of elements once it finds a duplicate, and moves hurt a vector's performance.
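A rough sketch of that hand-rolled variant, with a made-up function name. Whether it actually beats std::unique depends on the element type and the standard library implementation, so measure before relying on it.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: sort in place, collect unique items into a new
// vector, then swap the new vector into the original.
void dedupe_into_new_vector(std::vector<std::string>& data) {
    std::sort(data.begin(), data.end());
    std::vector<std::string> result;
    result.reserve(data.size());  // upper bound, avoids reallocations
    for (std::string& s : data) {
        // After sorting, duplicates are adjacent, so comparing with the
        // last kept element is enough.
        if (result.empty() || result.back() != s) {
            result.push_back(std::move(s));
        }
    }
    data.swap(result);
}
```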

Related

How to move all the pointers from one vector to another?

Basically what I want to do is remove some of the pointers inside my vector, but I found out that it can be quite slow to do that in the middle of the vector.
So I have a vector that already has data inside:
std::vector<Class*> vec1; // This already contains pointers
I'll iterate through vec1 and will add some of the pointers to another vector (vec2): vec2.push_back(vec1.at(index))
Now what I would like to do is something like vec1 = vec2, but I don't know if this is the best (most efficient) way to do that.
What would be the best way to do that?
I tried:
While looping through vec1 simply erasing what I need to remove from it:
it = vec1.erase(it)
While looping through vec1, moving the last item to the current index and calling pop_back
vec1.at(index) = vec1.back();
vec1.pop_back();
Setting some attribute on the object the pointer is pointing to while looping through vec1 and then using std::remove_if
vec1.erase(std::remove_if(vec1.begin(), vec1.end(), shouldBeRemoved), vec1.end());
Now I'm trying to generate a new vector while looping through vec1 and adding the pointers I want to keep, then "swapping" or "moving" the contents of this new vector to vec1.
Apparently when doing it the 4th way, the pointers get invalidated :(
I would love to see what you guys suggest. A big thank you to everyone who is willing to help!
You can just use std::remove_if to conditionally remove items from a vector. This algorithm will shift items that need to be kept over to the front. Follow it up with a std::vector::erase call to actually remove the items not shifted to the front.
This is similar to your option 3, but you don't need to set an attribute first - just use a predicate that determines if the item should be kept or not, and avoid having to pass over the vector twice.
If you don't want to do it in-place, but want to fill a new vector, then std::copy_if does that.
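Here is a minimal sketch of both variants. The Item struct and its expired flag are hypothetical stand-ins for your Class and whatever condition you check. Note that neither function deletes the pointed-to objects; they only drop or copy the pointers.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for the asker's Class.
struct Item {
    int id;
    bool expired;
};

// In-place: the predicate is evaluated during the single remove_if pass.
void remove_expired(std::vector<Item*>& vec) {
    vec.erase(std::remove_if(vec.begin(), vec.end(),
                             [](const Item* p) { return p->expired; }),
              vec.end());
}

// Not in-place: copy the pointers worth keeping into a new vector.
std::vector<Item*> copy_live(const std::vector<Item*>& vec) {
    std::vector<Item*> kept;
    std::copy_if(vec.begin(), vec.end(), std::back_inserter(kept),
                 [](const Item* p) { return !p->expired; });
    return kept;
}
```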
Removing things from a vector should be done with the erase-remove idiom.
It is well covered here: https://en.wikipedia.org/wiki/Erase%E2%80%93remove_idiom
The basic idea is to shift the elements to keep first and then erase the unneeded items in one go, which is faster than erasing and shifting for each individual element. From the example there:
v.erase( std::remove( v.begin(), v.end(), 5 ), v.end() );
But in general: if you have a lot of add/erase steps in your algorithm, you should consider a std::list, where removing elements in the middle is much cheaper.
Your attempt #2 suggests that you're not interested in the order of the elements. remove_if will suffer performance problems as it will maintain the order of the items that you don't delete; meaning you could do a substantial number of shifts to maintain this order.
The swap-and-pop approach has the problem that popping the back after every swap isn't required: each call changes the vector's size and does work you can defer.
As such, combine the ideas: swap each removed item with the "last not yet swapped out" item (i.e. the last the first time, the 2nd last the 2nd time, etc.), and erase all the end items in one go once the loop is complete. That should give you the fastest algorithm.
Some of the comments suggest that a copy is faster than a swap. That is true for a single copy, but when you're copying multiple elements of a vector multiple times, the swap will be significantly faster.

How to sort a multimap of strings as the value

Right now I have a
multimap<size_t, <string>> mymap;
It stores words keyed by the word size.
I want to be able to grab all the words with key = 5. Then I want to sort the strings with that key value from lowest to highest.
How would I do this and is a multimap the most efficient way?
I mean is it possible to use a different container where I could sort them by string values as well as categorize them by key values?
Basically I have a vector where the first element must not be moved, but the rest of the vector should be organized alphabetically. How would I do this? Once the vectors are sorted, I would then want to organize them inside the multimap alphabetically, using only the first word. Any ideas?
The most efficient way depends on how you will use this container.
If you want to keep the strings sorted while you are inserting/deleting them, then the most efficient way is std::unordered_map<std::size_t, std::set<std::string> >.
However, if it is possible to collect all the data first and then sort all the strings, then the most efficient way is to use std::unordered_map<std::size_t, std::vector<std::string> >.
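A small sketch of the collect-then-sort variant. The alias WordsByLength and the function name are invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using WordsByLength =
    std::unordered_map<std::size_t, std::vector<std::string>>;

// Hypothetical helper: bucket the words by length first, then sort each
// bucket once after all the data has been collected.
WordsByLength group_and_sort(const std::vector<std::string>& words) {
    WordsByLength groups;
    for (const std::string& w : words) {
        groups[w.size()].push_back(w);
    }
    for (auto& entry : groups) {
        std::sort(entry.second.begin(), entry.second.end());
    }
    return groups;
}
```

Grabbing all the words with key 5 is then a lookup such as groups.find(5), and the vector it points to is already sorted.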

Keeping vector of iterators of the data

I have a function :
void get_good_items(const std::vector<T>& data,std::vector<XXX>& good_items);
This function should check all the data, find the items that satisfy a condition, and return where they are in good_items.
What is the best choice for std::vector<XXX>?
std::vector<size_t> that contains all good indices.
std::vector<T*> that contain a pointers to the items.
std::vector<std::vector<T>::iterator> that contains iterators to the items.
other ??
EDIT:
What will I do with the good_items?
Many things... one of them is to delete them from the vector and save them somewhere else. Maybe something else later.
EDIT 2:
One of the most important things for me is how fast accessing the items in data will be, depending on the structure of good_items.
EDIT 3:
I have just realized that my thinking was wrong. Isn't it better to keep raw (or smart) pointers as the items of the vector, so that I keep the real values of the vector (which are pointers) and don't have to worry about heavy copies, because they are just pointers?
If you remove items from the original vector, every one of the methods you listed will be a problem.
If you add items to the original vector, the second and third will be problematic. The first one won't be a problem if you use push_back to add items.
All of them will be fine if you don't modify the original vector.
Given that, I would recommend using std::vector<size_t>.
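A sketch of what the index-based version might look like. The extra is_good predicate parameter is an assumption added for the example, since the question does not say what the condition is.

```cpp
#include <cstddef>
#include <vector>

// Sketch: records the index of every element that satisfies is_good.
// The predicate parameter is hypothetical; the original signature has
// only the two vector parameters.
template <typename T, typename Pred>
void get_good_items(const std::vector<T>& data,
                    std::vector<std::size_t>& good_items,
                    Pred is_good) {
    good_items.clear();
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (is_good(data[i])) {
            good_items.push_back(i);
        }
    }
}
```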
I would go with std::vector<size_t> or std::vector<T*> because they are easier to type. Otherwise, those three vectors are pretty much equivalent, they all identify positions of elements.
std::vector<size_t> can be made to use a smaller type for indexes if you know the limits.
If you expect that there are going to be many elements in this vector, you may like to consider using boost::dynamic_bitset instead to save memory and increase CPU cache utilization. A bit per element, bit position being the index into the original vector.
If you intend to remove the elements that satisfy the predicate, then the erase-remove idiom is the simplest solution.
If you intend to copy such elements, then std::copy_if is the simplest solution.
If you intend to end up with two partitions of the container i.e. one container has the good ones and another the bad ones, then std::partition_copy is a good choice.
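As an illustration of the last option, here is a minimal sketch built on std::partition_copy. The function name split_good_bad and the predicate parameter are made up for the example.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper: returns {good, bad}, leaving data untouched.
// Elements keep their relative order within each output vector.
template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>>
split_good_bad(const std::vector<T>& data, Pred is_good) {
    std::vector<T> good;
    std::vector<T> bad;
    std::partition_copy(data.begin(), data.end(),
                        std::back_inserter(good),
                        std::back_inserter(bad),
                        is_good);
    return {std::move(good), std::move(bad)};
}
```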
For generally allowing the iteration of such elements, an efficient solution is returning a range of such iterators that will check the predicate while iterating. I don't think there are such iterators in the standard library, so you'll need to implement them yourself. Luckily boost already has done that for you: http://www.boost.org/doc/libs/release/libs/iterator/doc/filter_iterator.html
The problem you are solving, from my understanding, is the intersection of two sets, and I would go for the solution from standard library: std::set_intersection

Fast string search?

I have a vector of strings and have to check if each element in vector is present in a given list of 5000 words.
Besides the mundane method of two nested loops, is there any faster way to do this in C++?
You should put the list of strings into an std::set. It's a data structure optimized for searching. Finding if a given element is in the set or not is an operation which is much faster than iterating all entries.
If you are already using C++11, you can also use std::unordered_set, which is even faster for lookup because it's implemented as a hash table.
Should this be for school/university: Be prepared to explain how these data structures manage to be faster. When your instructor asks you to explain why you used them, "some guys on the internet told me" is unlikely to earn you a sticker in the class book.
You could put the list of words in an std::unordered_set. Then, for each element in the vector, you just have to test if it is in the unordered_set in O(1). You would have an expected complexity of O(n) (look at the comment to see why it is only expected).
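Something along these lines, as a sketch. The function name and the std::vector<bool> result are just one way to report the outcome; adapt them to whatever you need to do with each match.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: result[i] is true when items[i] is in the dictionary.
std::vector<bool> find_in_dictionary(const std::vector<std::string>& items,
                                     const std::vector<std::string>& dictionary) {
    // Build the hash set once; each lookup is then O(1) on average.
    const std::unordered_set<std::string> lookup(dictionary.begin(),
                                                 dictionary.end());
    std::vector<bool> present;
    present.reserve(items.size());
    for (const std::string& s : items) {
        present.push_back(lookup.count(s) > 0);
    }
    return present;
}
```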
You could sort the vector; then you can solve this with one "loop" (given that your dictionary is sorted too), which means O(n), not counting the cost of the sort.
So you have a vector of strings, with each string having one or more words, and you have a vector that's a dictionary, and you're supposed to determine which words in the vector of strings are also in the dictionary? The vector of strings is an annoyance, since you need to look at each word. I'd start by creating a new vector, splitting each string into words, and pushing each word into the new vector. Then sort the new vector and run it through the std::unique algorithm to eliminate duplicates. Then sort the dictionary. Then run both ranges through std::set_intersection to write the result.
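Roughly like this, as a sketch under the assumption that words are separated by whitespace. The function name is invented, and the dictionary is taken by value so it can be sorted without touching the caller's copy.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the distinct words from `lines` that also
// appear in `dictionary`, in sorted order.
std::vector<std::string> words_in_dictionary(const std::vector<std::string>& lines,
                                             std::vector<std::string> dictionary) {
    // Split each string into whitespace-separated words.
    std::vector<std::string> words;
    for (const std::string& line : lines) {
        std::istringstream in(line);
        std::string w;
        while (in >> w) {
            words.push_back(w);
        }
    }
    // Sort and drop duplicates, then sort the dictionary as well:
    // std::set_intersection requires both ranges to be sorted.
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());
    std::sort(dictionary.begin(), dictionary.end());

    std::vector<std::string> found;
    std::set_intersection(words.begin(), words.end(),
                          dictionary.begin(), dictionary.end(),
                          std::back_inserter(found));
    return found;
}
```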

A data type like vector but sorted

Is there a data type like vector or queue where you can easily add items, but when they get added, they are automatically inserted in the right order?
Also is there an easy way to delete an item from a vector or queue if you know what it is, without having to actually search through and find it?
Sounds like you want std::set or std::multiset.
I know no such container.
std::sort exists, and you can specify the comparison function for it, but it is often even more efficient to insert items directly in the right place.
If you always do that, the only "problem" you have to solve is to add an item into an already sorted list, which can be done at worst in linear time.
Note that std::vector<T>::insert() takes an iterator as a parameter, to indicate where to do the insertion. You might want to write a findPosition() function that returns such an iterator. Then, writing a sorted_insert() function is trivial and becomes something like:
#include <algorithm>
#include <vector>
// Returns the first position whose element is not less than v (binary search).
std::vector<int>::iterator findPosition(std::vector<int>& vec, int v)
{
    return std::lower_bound(vec.begin(), vec.end(), v);
}
void sorted_insert(std::vector<int>& vec, int v) { vec.insert(findPosition(vec, v), v); }
void foo()
{
    std::vector<int> vec;
    sorted_insert(vec, 4);
}
It sounds like you are looking for a set and not a vector. It will be sorted according to the natural ordering (the < operator). To remove an element by value call erase.
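A tiny sketch of that behaviour; the function is only there to make the example self-contained. Note that std::set also drops duplicate values; use std::multiset if you need to keep them.

```cpp
#include <set>
#include <vector>

// Hypothetical demo: inserts every value into a std::set, erases one value
// by key, and returns what is left in the set's (sorted) iteration order.
std::vector<int> insert_then_erase(const std::vector<int>& values, int to_erase) {
    std::set<int> s;
    for (int v : values) {
        s.insert(v);  // each insert lands in sorted position automatically
    }
    s.erase(to_erase);  // erase by value, no manual search needed
    return std::vector<int>(s.begin(), s.end());
}
```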
Alternately you can just use sort on a vector to sort the elements. If you need random access to the elements then you will want this approach; the sorted containers do not provide random access.
To find an element in a sorted vector you can use binary_search.
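A short sketch of the sorted-vector route, with made-up helper names. std::binary_search only tells you whether the value is present; to get at the matching elements themselves (for example to erase them), std::equal_range or std::lower_bound is the usual tool.

```cpp
#include <algorithm>
#include <vector>

// Both helpers assume `sorted` is already sorted in ascending order.

bool contains_sorted(const std::vector<int>& sorted, int v) {
    return std::binary_search(sorted.begin(), sorted.end(), v);
}

// Erases every element equal to v, locating them by binary search.
void erase_sorted(std::vector<int>& sorted, int v) {
    auto range = std::equal_range(sorted.begin(), sorted.end(), v);
    sorted.erase(range.first, range.second);
}
```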
It depends on what you need and why you need it.
No, in the standard lib, there's no "sorted" vector or queue. You have 2 options, if you want to use only the standard library:
sort the container(vector or queue) on each insertion/deletion
implement your own insert/delete, by implementing the insertion sort
Another option is to use map or set, if they are OK for your problem (we don't know what it is).
Yet another option is to look for some 3rd party lib - I guess boost has such a container, but I don't really know that.
"Also is there an easy way to delete an item from a vector or queue if you know what it is"
- what do you mean by that? Do you have an iterator to it? Or its index? I'll edit my answer when you update (: