C++ std::map or std::set - efficiently insert duplicates - c++

I have a bunch of data full of duplicates and I want to eliminate the duplicates. You know, e.g. [1, 1, 3, 5, 5, 5, 7] becomes [1, 3, 5, 7].
It looks like I can use either std::map or std::set to handle this. However I'm not sure whether it's faster to (a) simply insert all the values into the container, or (b) check whether they already exist in the container and only insert if they don't - are inserts very efficient? Even if there's a better way... can you suggest a fast way to do this?
Another question - if the data I'm storing in them isn't as trivial as integers, and instead is a custom class, how does the std::map manage to properly store (hash?) the data for fast access via operator[]?

std::map doesn't use hashing. std::unordered_map does, but that's C++11. std::map and std::set both use a comparator that you provide. The class templates have defaults for this comparator, which boils down to an operator< comparison, but you can provide your own.
If you don't need both a key and a value to be stored (looks like you don't) you should just use a std::set, as that's more appropriate.
The Standard doesn't say what data structures maps and sets use under the hood, only that certian actions have certain time complexities. In reality, most implementations I'm aware of use a tree.
It makes no difference time-complexity-wise if you use operator[] or insert, but I would use insert or operator[] before I did a search followed by an insert if the item isn't found. The later would imply two seperate searches to insert an item in to the set.

An insert() on any of the associated containers does a find() to see if the object exists and then inserts the object. Simply inserting the elements into an std::set<T> should get rid of the duplicates reasonably efficiently.
Depending on the size of your set and the ratio of duplicates to unique values, it may be faster to put the objects into std::vector<T>, std::sort() then, and then use std::unique() together with std::vector<T>::erase() to get rid of the duplicates.

How many times should you do it?
If insert is usual:
//*/
std::set<int> store;
/*/
// for hash:
std::unordered_set<int> store;
//*/
int number;
if ( store.insert(number).second )
{
// was not in store
}
If you fill once:
std::vector<int> store;
int number;
store.push_back(number);
std::sort(store.begin(),store.end());
store.erase(std::unique(store.begin(),store.end()),store.end() );
// elements are unique

Assuming the common implementation strategy for std::map and std::set, i.e. balanced binary search trees, both insertion and lookup have to do a tree traversal to find the spot where the key should be. So failed lookup followed by insertion would be roughly twice as slow as just inserting.
how does the std::map manage to properly store (hash?) the data for fast access via operator[]?
By means of a comparison function that you specify (or std::less, which works if you overload operator< on your custom type). In any case, std::map and std::set are not hash tables.

std::set and std::map are both implemented as red black tree as far as I know. And probably using only insertion would be faster (then both because you would be doubling the lookup time).
Also map and set use operator <. As long as your class has defined operator < It would be able to use them as keys.

Related

Best STL container for fast lookups

I need to store a list of integers, and very quickly determine if an integer is already in the list. No duplicates will be stored.
I won't be inserting or deleting any values, but I will be appending values (which might be implemented as inserting).
What is the best STL container class for this? I found std::multimap on Google, but I that seems to require a key/value pair, which I don't have.
I'm fairly new to STL. Can I get some recommendations?
Instead of a map, you can use a set when the value and the key aren't separate.
Instead of a multiset/-map, you can use the non-multi version which doesn't store duplicates.
Instead of a set, you have the std::unordered_set as an alternative. It may be faster for your use case.
There are other, less generic, data structures that can be used to represent sets of integers, and some of those may be more efficient depending on the use case. But those other data structures aren't necessarily provided for you by the standard library.
But I'm not clear which have the fastest lookup.
Unordered set has better asymptotic complexity for lookups than the ordered one. Whether it is faster for your use case, you can find out by measuring.
not likely to exceed a thousand or so
In that case, asymptotic complexity is not necessarily relevant.
Especially for small-ish sets like this, a sorted vector can be quite efficient. Given that you "won't be inserting or deleting any values", the vector shouldn't have significant downsides either. The standard library doesn't provide a set container implemented internally using a sorted vector, but it does provide a vector container as well as all necessary algorithms.
I don't know how the containers compute hashes.
You can provide a hash function for the unordered container. By default it uses std::hash. std::hash uses an implementation defined hash function.
std::unordered_set<int> is a good choice for keeping track of duplicates of ints, since both insertion and lookup can be achieved in constant time on average.
Insertion
Inserting a new int into the collection can be achieved with the insert() member function:
std::unordered_set<int> s;
s.insert(7);
Lookup
Checking whether a given int is present in the collection can be done with the find() member function:
bool is_present(const std::unordered_set<int>& s, int value) {
return s.find(value) != s.end();
}

Keep the order of unordered_map as we insert a new key

I inserted the elements to the unordered_map with this code:
myMap.insert(std::make_pair("A", 10));
myMap.insert(std::make_pair("B", 11));
myMap.insert(std::make_pair("C", 12));
myMap.insert(std::make_pair("D", 13));
But when I used this command to print the keys
for (const auto i : myMap)
{
cout << i.first << std::endl;
}
they are not in the same order as I inserted them.
Is it possible to keep the order?
No, it is not possible.
Usage of std::unordered_map doesn't give you any guarantee on element order.
If you want to keep elements sorted by map keys (as seems from your example) you should use std::map.
If you need to keep list of ordered pairs you can use std::vector<std::pair<std::string,int>>.
Not with an unordered associative data structure. However, other data structures preserve order, such as std::map which keeps the data sorted by their keys. If you search Stackoverflow a little, you will find many solutions for a data structure with fast key-based lookup and ordered access, e.g. using boost::multi_index.
If it is just about adding values to a container, and taking them out in the order of insertion, then you can go with something that models a queue, e.g. std::dequeue. Just push_back to add a new value, and pop_front to remove the oldest value. If there is no need to remove the values from the container then just go with a std::vector.
Remarkably common request without too many clean,simple solutions. But here they are:
The big library: https://www.boost.org/doc/libs/1_71_0/libs/multi_index/doc/tutorial/index.html Use unique index std::string (or is it char?) and sequenced index to retain the order.
Write it yourself using 2 STL containers: Use a combination of std::unordered_map (or std::map if you want sorted key order traverse as well) and a vector to retain the sequenced order. There are several ways to set this up, depending on the types on your keys/values. Usually keys in the map and the values in the vector. then the map is map<"key_type",int> where the int points to the element in the vector and therefore the value.
I might have a play with a simple template for a wrapper to tie the 2 STL containers together and post it here later...
I have put a proof of concept up for review here:
I went with std::list to store the order in the end, because I wanted efficient delete. But You might choose std::vector if you wanted random access by insertion order.
I used a a list of Pair pointers to avoid duplicate key storage.

How to retrieve the elements from map in the order of insertion?

I have a map which stores <int, char *>. Now I want to retrieve the elements in the order they have been inserted. std::map returns the elements ordered by the key instead. Is it even possible?
If you are not concerned about the order based on the int (IOW you only need the order of insertion and you are not concerned with the normal key access), simply change it to a vector<pair<int, char*>>, which is by definition ordered by the insertion order (assuming you only insert at the end).
If you want to have two indices simultaneously, you need, well, Boost.MultiIndex or something similar. You'd probably need to keep a separate variable that would only count upwards (would be a steady counter), though, because you could use .size()+1 as new "insertion time key" only if you never erased anything from your map.
Now I want to retrieve the elements in the order they have been inserted. [...] Is it even possible?
No, not with std::map. std::map inserts the element pair into an already ordered tree structure (and after the insertion operation, the std::map has no way of knowing when each entry was added).
You can solve this in multiple ways:
use a std::vector<std::pair<int,char*>>. This will work, but not provide the automatic sorting that a map does.
use Boost stuff (#BartekBanachewicz suggested Boost.MultiIndex)
Use two containers and keep them synchronized: one with a sequential insert (e.g. std::vector) and one with indexing by key (e.g. std::map).
use/write a custom container yourself, so that supports both type of indexing (by key and insert order). Unless you have very specific requirements and use this a lot, you probably shouldn't need to do this.
A couple of options (other than those that Bartek suggested):
If you still want key-based access, you could use a map, along with a vector which contains all the keys, in insertion order. This gets inefficient if you want to delete elements later on, though.
You could build a linked list structure into your values: instead of the values being char*s, they're structs of a char* and the previously- and nextly-inserted keys**; a separate variable stores the head and tail of the list. You'll need to do the bookkeeping for that on your own, but it gives you efficient insertion and deletion. It's more or less what boost.multiindex will do.
** It would be nice to store map iterators instead, but this leads to circular definition problems.

key-value data structure to search key in O(1) and get Max value in O(1)

I need to implement a key-value data structure that search for a unique key in O(lgn) or O(1) AND get max value in O(1). I am thinking about
boost::bimap< unordered_set_of<key> ,multiset_of<value> >
Note that there is no duplicated key in my key-value data sets. However, two keys may have the same value. Therefore I used multiset to store values.
I need to insert/remove/update key-value pair frequently
How does it sounds ?
It depends on what you want to do. So it is clear that you want to use it for some get-the-maximum-values-in-an-iteration construction.
Question 1: Do you access the elements by their keys as well?
If yes, I can think of two solutions for you:
use boost::bimap - easy, tested solution with logarithmic runtimes.
create a custom container that contains an std::map (or for even faster by key access std::unordered_map) and a heap implementation (e.g. std::priority_queue<std::map<key, value>::iterator> + custom comparator) as well, keeps them in sync, etc. This is the hard way, but maybe faster. Most operations on both will be O(logn) (insert, get&pop max, get by key, remove) but the constant sometimes do matter. Although is you use std::unordered_map the access by key operation will be O(1).
You may want to write tests for the new container as well and optimize it for the operation you use the most.
If no, you really just access using elements using the maximum value
Question 2: do you insert/remove/update elements randomly or you first put in all elements in one round and then remove them one by one?
for random insert/remove/updates use std::priority_queue<std::pair<value, key>>
if you put in the elements first, and then remove them one-by-one, use and std::vector<std::pair<value, key>> and std::sort() it before the first remove operation. Do not actually remove the elements, just iterate on them.
You could build this using a std::map and a std::set.
One map to hold the actual values, i.e. std::map<key, value> m;. This is where your values are stored. Inserting elements into a map is O(log n).
A set of iterators pointing into the map; this set is sorted by the value of the respective map entry, i.e. std::set<std::map<key, value>::iterator, CompareBySecondField> s; with something like (untested):
template <class It>
struct CompareBySecondField : std::binary_function<It, It, bool> {
bool operator() ( const T &lhs, const T &rhs ) const {
return lhs->second > rhs->second;
}
};
You can then get an iterator to the map entry with the largest value using *s.begin();.
This is rather easy to build, but you have to make sure to update both containers whenever you add/remove elements.

c++ std::map question about iterator order

I am a C++ newbie trying to use a map so I can get constant time lookups for the find() method.
The problem is that when I use an iterator to go over the elements in the map, elements do not appear in the same order that they were placed in the map.
Without maintaining another data structure, is there a way to achieve in order iteration while still retaining the constant time lookup ability?
Please let me know.
Thanks,
jbu
edit: thanks for letting me know map::find() isn't constant time.
Without maintaining another data structure, is there a way to achieve in order iteration while still retaining the constant time lookup ability?
No, that is not possible. In order to get an efficient lookup, the container will need to order the contents in a way that makes the efficient lookup possible. For std::map, that will be some type of sorted order; for std::unordered_map, that will be an order based on the hash of the key.
In either case, the order will be different then the order in which they were added.
First of all, std::map guarantees O(log n) lookup time. You might be thinking about std::tr1::unordered_map. But that by definitions sacrifices any ordering to get the constant-time lookup.
You'd have to spend some time on it, but I think you can bash boost::multi_index_container to do what you want.
What about using a vector for the keys in the original order and a map for the fast access to the data?
Something like this:
vector<string> keys;
map<string, Data*> values;
// obtaining values
...
keys.push_back("key-01");
values["key-01"] = new Data(...);
keys.push_back("key-02");
values["key-02"] = new Data(...);
...
// iterating over values in original order
vector<string>::const_iterator it;
for (it = keys.begin(); it != keys.end(); it++) {
Data* value = values[*it];
}
I'm going to actually... go backward.
If you want to preserve the order in which elements were inserted, or in general to control the order, you need a sequence that you will control:
std::vector (yes there are others, but by default use this one)
You can use the std::find algorithm (from <algorithm>) to search for a particular value in the vector: std::find(vec.begin(), vec.end(), value);.
Oh yes, it has linear complexity O(N), but for small collections it should not matter.
Otherwise, you can start looking up at Boost.MultiIndex as already suggested, but for a beginner you'll probably struggle a bit.
So, shirk the complexity issue for the moment, and come up with something that work. You'll worry about speed when you are more familiar with the language.
Items are ordered by operator< (by default) when applied to the key.
PS. std::map does not gurantee constant time look up.
It gurantees max complexity of O(ln(n))
First off, std::map isn't constant-time lookup. It's O(log n). Just thought I should set that straight.
Anyway, you have to specify your own comparison function if you want to use a different ordering. There isn't a built-in comparison function that can order by insertion time, but, if your object holds a timestamp field, you can arrange to set the timestamp at the time of insertion, and using a by-timestamp comparison.
Map is not meant for placing elements in some order - use vector for that.
If you want to find something in map you should "search" by the key using [the operator
If you need both: iteration and search by key see this topic
Yes you can create such a data structure, but not using the standard library... the reason is that standard containers can be nested but cannot be mixed.
There is no problem implementing for example a map data structure where all the nodes are also in a doubly linked list in order of insertion, or for example a map where all nodes are in an array. It seems to me that one of these structures could be what you're looking for (depending which operation you prefer to be fast), but neither of them is trivial to build using standard containers because every standard container (vector, list, set, ...) wants to be the one and only way to access contained elements.
For example I found useful in many cases to have nodes that were at the same time in multiple doubly-linked lists, but you cannot do that using std::list.