Best STL container for fast lookups - c++

I need to store a list of integers, and very quickly determine if an integer is already in the list. No duplicates will be stored.
I won't be inserting or deleting any values, but I will be appending values (which might be implemented as inserting).
What is the best STL container class for this? I found std::multimap on Google, but I that seems to require a key/value pair, which I don't have.
I'm fairly new to STL. Can I get some recommendations?

Instead of a map, you can use a set when the value and the key aren't separate.
Instead of a multiset/-map, you can use the non-multi version which doesn't store duplicates.
Instead of a set, you have the std::unordered_set as an alternative. It may be faster for your use case.
There are other, less generic, data structures that can be used to represent sets of integers, and some of those may be more efficient depending on the use case. But those other data structures aren't necessarily provided for you by the standard library.
But I'm not clear which have the fastest lookup.
Unordered set has better asymptotic complexity for lookups than the ordered one. Whether it is faster for your use case, you can find out by measuring.
not likely to exceed a thousand or so
In that case, asymptotic complexity is not necessarily relevant.
Especially for small-ish sets like this, a sorted vector can be quite efficient. Given that you "won't be inserting or deleting any values", the vector shouldn't have significant downsides either. The standard library doesn't provide a set container implemented internally using a sorted vector, but it does provide a vector container as well as all necessary algorithms.
I don't know how the containers compute hashes.
You can provide a hash function for the unordered container. By default it uses std::hash. std::hash uses an implementation defined hash function.

std::unordered_set<int> is a good choice for keeping track of duplicates of ints, since both insertion and lookup can be achieved in constant time on average.
Insertion
Inserting a new int into the collection can be achieved with the insert() member function:
std::unordered_set<int> s;
s.insert(7);
Lookup
Checking whether a given int is present in the collection can be done with the find() member function:
bool is_present(const std::unordered_set<int>& s, int value) {
return s.find(value) != s.end();
}

Related

What is the most efficient std container for non-duplicated items?

What is the most efficient way of adding non-repeated elements into STL container and what kind of container is the fastest? I have a large amount of data and I'm afraid each time I try to check if it is a new element or not, it takes a lot of time. I hope map be very fast.
// 1- Map
map<int, int> Map;
...
if(Map.find(Element)!=Map.end()) Map[Element]=ID;
// 2-Vector
vector<int> Vec;
...
if(find(Vec.begin(), Vec.end(), Element)!=Vec.end()) Vec.push_back(Element);
// 3-Set
// Edit: I made a mistake: set::find is O(LogN) not O(N)
Both set and map has O(log(N)) performance for looking up keys. vector has O(N).
The difference between set and map, as far as you should be concerned, is whether you need to associate a key with a value, or just store a value directly. If you need the former, use a map, if you need the latter, use a set.
In both cases, you should just use insert() instead of doing a find().
The reason is insert() will insert the value into the container if and only if the container does not already contain that value (in the case of map, if the container does not contain that key). This might look like
Map.insert(std::make_pair(Element, ID));
for a map or
Set.insert(Element);
for a set.
You can consult the return value to determine whether or not an insertion was actually performed.
If you're using C++11, you have two more choices, which are std::unordered_map and std::unordered_set. These both have amortized O(1) performance for insertions and lookups. However, they also require that the key (or value, in the case of set) be hashable, which means you'll need to specialize std::hash<> for your key. Conversely, std::map and std::set require that your key (or value, in the case of set) respond to operator<().
If you're using C++11, you can use std::unordered_set. That would allow you O(1) existence-checking (technically amortized O(1) -- O(n) in the worst case).
std::set would probably be your second choice with O(lg n).
Basically, std::unordered_set is a hash table and std::set is a tree structure (a red black tree in every implementation I've ever seen)1.
Depending on how well your hashes distribute and how many items you have, a std::set might actually be faster. If it's truly performance critical, then as always, you'll want to do benchmarking.
1) Technically speaking, I don't believe either are required to be implemented as a hash table or as a balanced BST. If I remember correctly, the standard just mandates the run time bounds, not the implementation -- it just turns out that those are the only viable implementations that fit the bounds.
You should use a std::set; it is a container designed to hold a single (equivalent) copy of an object and is implemented as a binary search tree. Therefore, it is O(log N), not O(N), in the size of the container.
std::set and std::map often share a large part of their underlying implementation; you should check out your local STL implementation.
Having said all this, complexity is only one measure of performance. You may have better performance using a sorted vector, as it keeps the data local to one another and, therefore, more likely to hit the caches. Cache coherence is a large part of data structure design these days.
Sounds like you want to use a std::set. It's elements are unique, so you don't need to care about uniqueness when adding elements, and a.find(k) (where a is an std::set and k is a value) is defined as being logarithmic in complexity.
if your elements can be hashed for O(1), then better to use an index in a unordered_map or unordered_set (not in a map/set because they use RB tree in implementation which is O(logN) find complexity)
Your examples show a definite pattern:
check if the value is already in container
if not, add the value to the container.
Both of these operation would potentially take some time. First, looking up an element can be done in O(N) time (linear search) if the elements are not arranged in any particular manner (e.g., just a plain std::vector), it could be done in O(logN) time (binary search) if the elements are sorted (e.g., either std::map or std::set), and it could be done in O(1) time if the elements are hashed (e.g., either std::unordered_map or std::unordered_set).
The insertion will be O(1) (amortized) for a plain vector or an unordered container (hash container), although the hash container will be a bit slower. For a sorted container like set or map, you'll have log-time insertions because it needs to look for the place to insert it before inserting it.
So, the conclusion, use std::unordered_set or std::unordered_map (if you need the key-value feature). And you don't need to check before doing the insertion, these are unique-key containers, they don't allow duplicates.
If std::unordered_set / std::unordered_map (from C++11) or std::tr1::unordered_set / std::tr1::unordered_map (since 2007) are not available to you (or any equivalent), then the next best alternative is std::set / std::map.

Dynamic array width id?

I need some sort of dynamic array in C++ where each element have their own id represented by an int.
The datatype needs these functions:
int Insert() - return ID
Delete(int ID)
Get(ID) - return Element
What datatype should I use? I'we looked at Vector and List, but can't seem to find any sort of ID. Also I'we looked at map and hastable, these may be usefull. I'm however not sure what to chose.
I would probably use a vector and free id list to handle deletions, then the index is the id. This is really fast to insert and get and fairly easy to manage (the only trick is the free list for deleted items).
Otherwise you probably want to use a map and just keep track of the lowest unused id and assign it upon insertion.
A std::map could work for you, which allows to associate a key to a value. The key would be your ID, but you should provide it yourself when adding an element to the map.
An hash table is a sort of basic mechanism that can be used to implement an unordered map. It corresponds to std::unordered_map.
It seems that the best container to use is unordered_map.
It is based on hash. You can insert, delete or searche for elements in O(n).
Currently unordered_map is not in STL. If you want to use STL container use std::map.
It is based on tree. Inserts, deletes and searches for elements in O(n*log(n)).
Still the container choice depends much on the usage intensity. For example, if you will find for elements rare, vector and list could be ok. These containers do not have find method, but <algorithm> library include it.
A vector gives constant-time random access, the "id" can simply be the offset (index) into the vector. A deque is similar, but doesn't store all items contiguously.
Either of these would be appropriate, if the ID values can start at 0 (or a known offset from 0 and increment monotonically). Over time if there are a large amount of removals, either vector or deque can become sparsely populated, which may be detrimental.
std::map doesn't have the problem of becoming sparsely populated, but look ups move from constant time to logarithmic time, which could impact performance.
boost::unordered_map may be the best yet, as the best case scenario as a hash table will likely have the best overall performance characteristics given the question. However, usage of the boost library may be necessary -- but there are also unordered container types in std::tr1 if available in your STL implementation.

Maintaining an Ordered collection of objects

I have the following requirements for a collection of objects:
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
Ordered, but allowing reorder and insertion at arbitrary locations.
Allows for deletion
Indexed Access - Random Access
Count
The objects I am storing are not large, a couple of properties and a small array or two (256 booleans)
Is there any built in classes I should know about before I go writing a linked list?
Original answer: That sounds like std::list (a doubly linked list) from the Standard Library.
New answer:
After your change to the specs, a std::vector might work as long as there aren't more than a few thousand elements and not a huge number of insertions and deletions in the middle of the vector. The linear complexity of insertion and deletion in the middle may be outweighed by the low constants on the vector operations. If you are doing a lot of insertions and deletions just at the beginning and end, std::deque might work as well.
-Insertion and Deletion: This is possible for any STL container, but the question is how long it takes to do it. Any linked-list container (list, map, set) will do this in constant time, while array-like containers (vector) will do it in linear time (with constant-amortized allocation).
-Sorting: Considering that you can keep a sorted collection at all times, this isn't much of an issue, any STL container will allow that. For map and set, you don't have to do anything, they already take care of keeping the collection sorted at all times. For vector or list, you have to do that work, i.e. you have to do binary search for the place where the new elements go and insert them there (but STL Algorithms has all the pieces you need for that).
-Resorting: If you need to take the current collection you have sorted with respect to rule A, and resort the collection with respect to rule B, this might be a problem. Containers like map and set are parametrized (as a type) by the sorting rule, this means that to resort it, you would have to extract every element from the original collection and insert them in a new collection with a new sorting rule. However, if you use a vector container, you can just use the STL sort function anytime to resort with whatever rule you like.
-Random Access: You said you needed random access. Personally, my definition of random access means that any element in the collection can be accessed (by index) in constant time. With that definition (which I think is quite standard), any linked-list implementation does not qualify and it leaves you with the only option of using an array-like container (e.g. std::vector).
Conclusion, to have all those properties, it would probably be best to use a std::vector and implement your own sorted insertion and sorted deletion (performing binary search into the vector to find the element to delete or the place to insert the new element). If your objects that you need to store are of significant size, and the data according to which they are sorted (name, ID, etc.) is small, you might consider splitting the problem by holding a unsorted linked-list of objects (with full information) and keeping a sorted vector of keys along with a pointer to the corresponding node in the linked-list (in that case, of course, use std::list for the former, and std::vector for the latter).
BTW, I'm no grand expert with STL containers, so I might have been wrong in the above, just think for yourself. Explore the STL for yourself, I'm sure you will find what you need, or at least all the pieces that you need. Maybe, look at Boost libraries too.
You haven't specified enough of your requirements to select the best container.
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
STL containers are designed to grow as needed.
Ordered, but allowing reorder and insertion at arbitrary locations.
Allowing reorder? A std::map can't be reordered: you can delete from one std::map and insert into another using a different ordering, but as different template instantiations the two variables will have different types. std::list has a sort() member function [thanks Blastfurnace for pointing this out], particularly efficient for large objects. A std::vector can be resorted easily using the non-member std::sort() function, particularly efficient for tiny objects.
Efficient insertion at arbitrary locations can be done in a map or list, but how will you find those locations? In a list, searching is linear (you must start from somewhere you already know about and scan forwards or backwards element by element). std::map provides efficient searching, as does an already-sorted vector, but inserting into a vector involves shifting (copying) all the subsequent elements to make space: that's a costly operation in the scheme of things.
Allows for deletion
All containers allow for deletion, but you have the exact-same efficiency issues as you do for insertion (i.e. fast for list if you already know the location, fast for map, deletion in vectors is slow, though you can "mark" elements deleted without removing them, e.g. making a string empty, having a boolean flag in a struct).
Indexed Access - Random Access
vector is indexed numerically (but can be binary searched), map by an arbitrary key (but no numerical index). list is not indexed and must be searched linearly from a known element.
Count
std::list provides an O(n) size() function (so that it can provide O(1) splice), but you can easily track the size yourself (assuming you won't splice). Other STL containers already have O(1) time for size().
Conclusions
Consider whether using a std::list will result in lots of inefficient linear searches for the element you need. If not, then a list does give you efficient insertion and deletion. Resorting is good.
A map or hash map will allow quick lookup and easy insertion/deletion, can't be resorted, but you can easily move the data out to another map with another sort criteria (with moderate efficiency.
A vector allows fast searching and in-place resorting, but the worst insert/deletion. It's the fastest for random-access lookup using the element index.

c++ container for checking whether ordered data is in a collection

I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++, what container will have the lowest complexity upon retrieval? In this case, the data does not grow after initiailization. In C# I would use a dictionary, in c++, I could either use a hash_map or set. If the data were unordered, I would use boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to detail a bit over what have already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees for my few versions of the STL) because of the requirements on insertion, retrieval and deletion operation (and notably because of the invalidation of iterators requirements).
However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:
minimal overhead (in term of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken done in 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
If you don't want to reinvent the wheel, you can also check Alexandrescu's [AssocVector][1]. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then a unordered_set will be faster, especially because integers are so trivial to hash size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the repartition of its elements
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(logn) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.

Dynamically sorted STL containers

I'm fairly new to the STL, so I was wondering whether there are any dynamically sortable containers? At the moment my current thinking is to use a vector in conjunction with the various sort algorithms, but I'm not sure whether there's a more appropriate selection given the (presumably) linear complexity of inserting entries into a sorted vector.
To clarify "dynamically", I am looking for a container that I can modify the sorting order at runtime - e.g. sort it in an ascending order, then later re-sort in a descending order.
You'll want to look at std::map
std::map<keyType, valueType>
The map is sorted based on the < operator provided for keyType.
Or
std::set<valueType>
Also sorted on the < operator of the template argument, but does not allow duplicate elements.
There's
std::multiset<valueType>
which does the same thing as std::set but allows identical elements.
I highly reccomend "The C++ Standard Library" by Josuttis for more information. It is the most comprehensive overview of the std library, very readable, and chock full of obscure and not-so-obscure information.
Also, as mentioned by 17 of 26, Effective Stl by Meyers is worth a read.
If you know you're going to be sorting on a single value ascending and descending, then set is your friend. Use a reverse iterator when you want to "sort" in the opposite direction.
If your objects are complex and you're going to be sorting in many different ways based on the member fields within the objects, then you're probably better off with using a vector and sort. Try to do your inserts all at once, and then call sort once. If that isn't feasible, then deque may be a better option than the vector for large collections of objects.
I think that if you're interested in that level of optimization, you had better be profiling your code using actual data. (Which is probably the best advice anyone here can give: it may not matter that you call sort after each insert if you're only doing it once in a blue moon.)
It sounds like you want a multi-index container. This allows you to create a container and tell that container the various ways you may want to traverse the items in it. The container then keeps multiple lists of the items, and those lists are updated on each insert/delete.
If you really want to re-sort the container, you can call the std::sort function on any std::deque, std::vector, or even a simple C-style array. That function takes an optional third argument to determine how to sort the contents.
The stl provides no such container. You can define your own, backed by either a set/multiset or a vector, but you are going to have to re-sort every time the sorting function changes by either calling sort (for a vector) or by creating a new collection (for set/multiset).
If you just want to change from increasing sort order to decreasing sort order, you can use the reverse iterator on your container by calling rbegin() and rend() instead of begin() and end(). Both vector and set/multiset are reversible containers, so this would work for either.
std::set is basically a sorted container.
You should definitely use a set/map. Like hazzen says, you get O(log n) insert/find. You won't get this with a sorted vector; you can get O(log n) find using binary search, but insertion is O(n) because inserting (or deleting) an item may cause all existing items in the vector to be shifted.
It's not that simple. In my experience insert/delete is used less often than find. Advantage of sorted vector is that it takes less memory and is more cache-friendly. If happen to have version that is compatible with STL maps (like the one I linked before) it's easy to switch back and forth and use optimal container for every situation.
in theory an associative container (set, multiset, map, multimap) should be your best solution.
In practice it depends by the average number of the elements you are putting in.
for less than 100 elements a vector is probably the best solution due to:
- avoiding continuous allocation-deallocation
- cache friendly due to data locality
these advantages probably will outperform nevertheless continuous sorting.
Obviously it also depends on how many insertion-deletation you have to do. Are you going to do per-frame insertion/deletation?
More generally: are you talking about a performance-critical application?
remember to not prematurely optimize...
The answer is as always it depends.
set and multiset are appropriate for keeping items sorted but are generally optimised for a balanced set of add, remove and fetch. If you have manly lookup operations then a sorted vector may be more appropriate and then use lower_bound to lookup the element.
Also your second requirement of resorting in a different order at runtime will actually mean that set and multiset are not appropriate because the predicate cannot be modified a run time.
I would therefore recommend a sorted vector. But remember to pass the same predicate to lower_bound that you passed to the previous sort as the results will be undefined and most likely wrong if you pass the wrong predicate.
Set and multiset use an underlying binary tree; you can define the <= operator for your own use. These containers keep themselves sorted, so may not be the best choice if you are switching sort parameters. Vectors and lists are probably best if you are going to be resorting quite a bit; in general list has it's own sort (usually a mergesort) and you can use the stl binary search algorithm on vectors. If inserts will dominate, list outperforms vector.
STL maps and sets are both sorted containers.
I second Doug T's book recommendation - the Josuttis STL book is the best I've ever seen as both a learning and reference book.
Effective STL is also an excellent book for learning the inner details of STL and what you should and shouldn't do.
For "STL compatible" sorted vector see A. Alexandrescu's AssocVector from Loki.