What is the most efficient std container for non-duplicated items? - c++

What is the most efficient way of adding non-repeated elements into STL container and what kind of container is the fastest? I have a large amount of data and I'm afraid each time I try to check if it is a new element or not, it takes a lot of time. I hope map be very fast.
// 1- Map
map<int, int> Map;
...
if(Map.find(Element)!=Map.end()) Map[Element]=ID;
// 2-Vector
vector<int> Vec;
...
if(find(Vec.begin(), Vec.end(), Element)!=Vec.end()) Vec.push_back(Element);
// 3-Set
// Edit: I made a mistake: set::find is O(LogN) not O(N)

Both set and map has O(log(N)) performance for looking up keys. vector has O(N).
The difference between set and map, as far as you should be concerned, is whether you need to associate a key with a value, or just store a value directly. If you need the former, use a map, if you need the latter, use a set.
In both cases, you should just use insert() instead of doing a find().
The reason is insert() will insert the value into the container if and only if the container does not already contain that value (in the case of map, if the container does not contain that key). This might look like
Map.insert(std::make_pair(Element, ID));
for a map or
Set.insert(Element);
for a set.
You can consult the return value to determine whether or not an insertion was actually performed.
If you're using C++11, you have two more choices, which are std::unordered_map and std::unordered_set. These both have amortized O(1) performance for insertions and lookups. However, they also require that the key (or value, in the case of set) be hashable, which means you'll need to specialize std::hash<> for your key. Conversely, std::map and std::set require that your key (or value, in the case of set) respond to operator<().

If you're using C++11, you can use std::unordered_set. That would allow you O(1) existence-checking (technically amortized O(1) -- O(n) in the worst case).
std::set would probably be your second choice with O(lg n).
Basically, std::unordered_set is a hash table and std::set is a tree structure (a red black tree in every implementation I've ever seen)1.
Depending on how well your hashes distribute and how many items you have, a std::set might actually be faster. If it's truly performance critical, then as always, you'll want to do benchmarking.
1) Technically speaking, I don't believe either are required to be implemented as a hash table or as a balanced BST. If I remember correctly, the standard just mandates the run time bounds, not the implementation -- it just turns out that those are the only viable implementations that fit the bounds.

You should use a std::set; it is a container designed to hold a single (equivalent) copy of an object and is implemented as a binary search tree. Therefore, it is O(log N), not O(N), in the size of the container.
std::set and std::map often share a large part of their underlying implementation; you should check out your local STL implementation.
Having said all this, complexity is only one measure of performance. You may have better performance using a sorted vector, as it keeps the data local to one another and, therefore, more likely to hit the caches. Cache coherence is a large part of data structure design these days.

Sounds like you want to use a std::set. It's elements are unique, so you don't need to care about uniqueness when adding elements, and a.find(k) (where a is an std::set and k is a value) is defined as being logarithmic in complexity.

if your elements can be hashed for O(1), then better to use an index in a unordered_map or unordered_set (not in a map/set because they use RB tree in implementation which is O(logN) find complexity)

Your examples show a definite pattern:
check if the value is already in container
if not, add the value to the container.
Both of these operation would potentially take some time. First, looking up an element can be done in O(N) time (linear search) if the elements are not arranged in any particular manner (e.g., just a plain std::vector), it could be done in O(logN) time (binary search) if the elements are sorted (e.g., either std::map or std::set), and it could be done in O(1) time if the elements are hashed (e.g., either std::unordered_map or std::unordered_set).
The insertion will be O(1) (amortized) for a plain vector or an unordered container (hash container), although the hash container will be a bit slower. For a sorted container like set or map, you'll have log-time insertions because it needs to look for the place to insert it before inserting it.
So, the conclusion, use std::unordered_set or std::unordered_map (if you need the key-value feature). And you don't need to check before doing the insertion, these are unique-key containers, they don't allow duplicates.
If std::unordered_set / std::unordered_map (from C++11) or std::tr1::unordered_set / std::tr1::unordered_map (since 2007) are not available to you (or any equivalent), then the next best alternative is std::set / std::map.

Related

What is the fastest C++ container for lookup in a collection of <integer, std::string>

Lookup is to be done either by integer or by string.
Both integer and strings are guaranteed to be unique in the collection.
The size of the collection is ~500 of those pairs.
The interface of the collection would be 2 functions, one to translate the integer to string and one to translate the string to the corresponding integer.
It seems there is no methodology to pick up the "proper" container.
An idea would be to use std::vector, which is the default and since it will be stored in a contiguous memory we ensure cache hits? But I know the size of the collection at compile time, so std::array?
Or since order is not important but the look up needs to be fast we can use a hash based solution like an std::map? But then what is the key and what is the value? Or one can have 2 std::maps (std::map<int, std::string> and std::map(std::string, int)) with duplicated the information, since anyway the collection is small.
I know that the ultimate answer would be to benchmark it, but I am wondering if there is any actual methodology to know what to pick based on principle.
std::map is a sorted associative container where keys are sorted. Its implemented using height balanced tree where search, removal, and insertion operations have logarithmic complexity.
As key ordering is not required, using std::unordered_map will be more efficient. Unordered map is an associative container where search, insertion, and removal of elements have average constant-time complexity.
Having 2 unordered_maps would give O(1) time for each search operation.

What map-like std container performs the fewest copies during insert and erase?

I'm trying to get a better understanding of the implementations of map-like std container. By map-like, I mean something with a key/value pair. Which container(s) performs the fewest copies during insert and erase (or which is better at each if it's not the same)?
There are two standard map-like containers. std::map and std::unordered_map. There're also their multi-map variants, but I presume that those don't count as "map-like" for the purpose of this question.
Neither standard map like container performs any copies of elements during insert and move. They will perform operations, such as copies on their internal structure however.
std::map insert complexity is logarithmic unless you use a good hint in which case it's amortised constant time. std::map erase complexity is constant with an iterator; otherwise logarithmic. std::unordered_map insert and erase complexity is linear in worst case; constant on average.

In practice, when `std::unordered_map` must be used instead of `std::map`?

In practice, is there any circumstance which std::unordered_map must be used instead of std::map?
I know the differences between them, say internal implementation,time complexity for searching element and so on.
But I really can't find a circumstance where std::unordered_map could not be replaced by std::map indeed.
Yes, for example if the key type does not have a sensible strict weak ordering but does have sensible equality and is hashable.
A strict weak order is required for the key type on the ordered associative containers std::set and std::map.
I know the difference between them, say internal implementation,time complexity for searching element
In that case, you should know that that the average asymptotic element lookup time complexity of unordered map is constant, while the complexity of ordered map is logarithmic. This means that there is some size of container at which point the lookup will be faster when using unordered map.
But I really can't find a circumstance where std::unordered_map could not be replaced by std::map indeed.
If the container is large enough, and if you cannot afford the cost of ordered map lookup, then you cannot choose to replace the faster unordered map lookup.
Another case where ordered map cannot be used is where there doesn't exist a cheap function to compare relative order of the key.
My opinion is that you should change question in:
when std::map must be used instead of std::unordered_map?
Indeed, insertion, deletion, search of std::unordered_map are less complex than std::map. The table of this question resumes the complexity for each operation.
So, using std::map is recommended in two cases at least:
When you need ordering
std::unordered_map are hash-based. When you have too many collisions and you can't find a suitable hash function, you may go for a std::map.
However, in normal conditions, for a single element operation, I recommend std::unordered_map.

When should I use unordered_map and not std::map

I'm wondering in which case I should use unordered_map instead of std::map.
I have to use unorderd_map each time I don't pay attention of order of element in the map ?
map
Usually implemented using red-black tree.
Elements are sorted.
Relatively small memory usage (doesn't need additional memory for the hash-table).
Relatively fast lookup: O(log N).
unordered_map
Usually implemented using hash-table.
Elements are not sorted.
Requires additional memory to keep the hash-table.
Fast lookup O(1), but constant time depends on the hash-function which could be relatively slow. Also keep in mind that you could meet with the Birthday problem.
Compare hash table (undorded_map) vs. binary tree (map), remember your CS classes and adjust accordingly.
The hash map usually has O(1) on lookups, the map has O(logN). It can be a real difference if you need many fast lookups.
The map keeps the order of the elements, which is also useful sometimes.
map allows to iterate over the elements in a sorted way, but unordered_map does not.
So use the std::map when you need to iterate across items in the map in sorted order.
The reason you'd choose one over the other is performance. Otherwise they'd only have created std::map, since it does more for you :)
Use std::map when you need elements to automatically be sorted. Use std::unordered_map other times.
See the SGI STL Complexity Specifications rationale.
unordered_map is O(1) but quite high constant overhead for lookup, insertion, and deletion. map is O(log(n)), so pick the complexity that best suits your needs. In addition, not all keys can be placed into both kinds of map.

c++ container for checking whether ordered data is in a collection

I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++, what container will have the lowest complexity upon retrieval? In this case, the data does not grow after initiailization. In C# I would use a dictionary, in c++, I could either use a hash_map or set. If the data were unordered, I would use boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to detail a bit over what have already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees for my few versions of the STL) because of the requirements on insertion, retrieval and deletion operation (and notably because of the invalidation of iterators requirements).
However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:
minimal overhead (in term of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken done in 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
If you don't want to reinvent the wheel, you can also check Alexandrescu's [AssocVector][1]. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then a unordered_set will be faster, especially because integers are so trivial to hash size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the repartition of its elements
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(logn) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.