Is there a data structure in C++ with O(1) lookup?
A std::map has O(log(n)) lookup time (right?).
I'm looking from something in std preferably (so not Boost pls). Also, if there is, how does it work?
EDIT: Ok, I wasn't clear enough I guess. I want to associate values, kind of like in a map. So I want something like std::map<int,string>, and find and insert should take O(1).
Arrays have O(1) lookup.
Hashtable (std::unordered_map) for c++11 has O(1) lookup. (Amortized, but more or less constant.)
I would also like to mention that tree based data structures like maps come with great advantages and are only log(n) which is more often than not sufficient.
Answer to your edit -> You can literally associate an index of an array to one of the values. Also hash tables are associative but perfect hash (each key maps to exactly 1 value) is really difficult to get.
One more thing worth mentioning: Arrays have great cache performance (due to locality, aka. elements being right next to each other so they can be prefetched to cache by the prefecthing engine). Trees, not so much. With reasonable amount of elements, hash performance can be more critical than asymptotic performance.
Data structures with O(1) lookup (ignoring the size of the key) include:
arrays
hash tables
For complex types, balanced trees will be fine at O(log n), or sometimes you can get away with a patricia trie at O(k).
For reference:complexity of search structures
An array has O(1) lookup.
Related
Is the insertion/deletion/lookup time of a C++ std::map O(log n)? Is it possible to implement an O(1) hash table?
Is the insertion/deletion/lookup time of a C++ map O(log n)?
Yes.
Is it possible to implement an O(1) hash table?
Definitely. The standard library also provides one as std::unordered_map.
C++ has a unordered_map type. The STL also contains a hash_map type, although this is not in the C++ standard library.
Now, for a bit of algorithmic theory. It is possible to implement an O(1) hash table under perfect conditions, and technically, hash tables are O(1) insertion and lookup. The perfect conditions in this case are that the hash function must be perfect (i.e. collision free), and you have infinite storage.
In practise, let's take a dumb hash table. For any input key, it returns 1. In this case, when there is a collision (i.e. on the second and subsequent insertions), it will have to chain further to find some free space. It can either go to the next storage location, or use a linked list for this.
In any case, in the best case, yes, hash tables are O(1) (until you have exhausted all of your hash values, of course, since it is impractical to have a hash function with an infinite amount of output). In the worst case (e.g. with my completely dumb hash function), hash tables are O(n), since you will have to traverse over the storage in order to find your actual value from the given hash, since the initial value is not the correct value.
The implementation of std::map is a tree. This is not directly specified in the standard, but as some good books are saying: "It is difficult to imagine that it can be anything else". This means that the insertion/deletion/lookup time for map is O(log n).
Classic hash tables have lookup time O(n/num_slots). Once the expected number of items in the table is comparable with the number of slots, you will have saturated O(1).
I'm wondering in which case I should use unordered_map instead of std::map.
I have to use unorderd_map each time I don't pay attention of order of element in the map ?
map
Usually implemented using red-black tree.
Elements are sorted.
Relatively small memory usage (doesn't need additional memory for the hash-table).
Relatively fast lookup: O(log N).
unordered_map
Usually implemented using hash-table.
Elements are not sorted.
Requires additional memory to keep the hash-table.
Fast lookup O(1), but constant time depends on the hash-function which could be relatively slow. Also keep in mind that you could meet with the Birthday problem.
Compare hash table (undorded_map) vs. binary tree (map), remember your CS classes and adjust accordingly.
The hash map usually has O(1) on lookups, the map has O(logN). It can be a real difference if you need many fast lookups.
The map keeps the order of the elements, which is also useful sometimes.
map allows to iterate over the elements in a sorted way, but unordered_map does not.
So use the std::map when you need to iterate across items in the map in sorted order.
The reason you'd choose one over the other is performance. Otherwise they'd only have created std::map, since it does more for you :)
Use std::map when you need elements to automatically be sorted. Use std::unordered_map other times.
See the SGI STL Complexity Specifications rationale.
unordered_map is O(1) but quite high constant overhead for lookup, insertion, and deletion. map is O(log(n)), so pick the complexity that best suits your needs. In addition, not all keys can be placed into both kinds of map.
I am a complete novice in C++. I am trying to read a file and build a lookup table (more like a hashtable just to check the existence of a string value). The file has about 300 thousand entries that I will use to build a lookup table. And after this, I will be performing some 1 million lookups on this. What is the most efficient way of doing this? Is it the map (google's first result) or is there a better structure for this purpose?
Based on the scenario, you probably also want to look at Tries
What you need is TRIE data structure. The dictionary is implemented widely using this data structure. Moreover it has O(n) lookup time where n is the length of the string and occupies less space. Trie has the abilities to quickly search for, insert, and delete entries.
map has log(n) lookups, but you can achieve O(1) with a hash table, as you suggested. It looks like STL implements one, called hash_map.
C++ std::map is not a hash table, but you could use it for a lookup table if you wanted.
Its performance characteristics as guaranteed by the C++ standard are:
O(log n) for searching for an element
O(log n) for inserting a new element
O(log n) for removing an element
There will definitely be memory overhead because the std::map is generally implemented with trees (and quite possibly a red-black tree), and pointers will be kept for each node in the map.
For better performance characteristics, you might want to look into Google's Sparsehash
Try: http://en.wikipedia.org/wiki/Unordered_map_%28C%2B%2B%29
In general hash tables are good, but if you want "the most efficient way" you'll have to provide more details.
If you want to check just the existance of a string value set is suffiecient as you don't have any key-value pairs. See here for documentation.
If your biggest concern is look up time (and it sounds like it is) strongly consider a hashmap. The amortized look up time is O(1) which is notably better than a regular map at O(log n).
If you have a very good hash function (no collision on your dataset) and you just need to check if entry exists or not, you try a bitset (say from http://bmagic.sourceforge.net/)
i believe it can reduce memory requirements and it's very fast.
I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++, what container will have the lowest complexity upon retrieval? In this case, the data does not grow after initiailization. In C# I would use a dictionary, in c++, I could either use a hash_map or set. If the data were unordered, I would use boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to detail a bit over what have already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees for my few versions of the STL) because of the requirements on insertion, retrieval and deletion operation (and notably because of the invalidation of iterators requirements).
However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:
minimal overhead (in term of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken done in 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
If you don't want to reinvent the wheel, you can also check Alexandrescu's [AssocVector][1]. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then a unordered_set will be faster, especially because integers are so trivial to hash size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the repartition of its elements
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(logn) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.
vs2005 support
::stdext::hash_map
::std::map.
however it seems ::stdext::hash_map's insert and remove OP is slower then ::std::map in my test.
( less then 10000 items)
Interesting....
Can anyone offored a comparision article about them?
Normally you look to the complexities of the various operations, and that's a good guide: amortized O(1) insert, O(1) lookup, delete for a hashmap as against O(log N) insert, lookup, delete for a tree-based map.
However, there are certain situations where the complexities are misleading because the constant terms involved are extreme. For example, suppose that your 10k items are keyed off strings. Suppose further that those strings are each 100k characters long. Suppose that different strings typically differ near the beginning of the string (for example if they're essentially random, pairs will differ in the first byte with probability 255/256).
Then to do a lookup the hashmap has to hash a 100k string. This is O(1) in the size of the collection, but might take quite a long time since it's probably O(M) in the length of the string. A balanced tree has to do log N <= 14 comparisons, but each one only needs to look at a few bytes. This might not take very long at all.
In terms of memory access, with a 64 byte cache line size, the hashmap loads over 1500 sequential lines, and does 100k byte operations, whereas the tree loads 15 random lines (actually probably 30 due to the indirection through the string) and does 14 * (some small number) byte operations. You can see that the former might well be slower than the latter. Or it might be faster: how good are your architecture's FSB bandwidth, stall time, and speculative read caching?
If the lookup finds a match, then of course in addition to this both structures need to perform a single full-length string comparison. Also the hashmap might do additional failed comparisons if there happens to be a collision in the bucket.
So assuming that failed comparisons are so fast as to be negligible, while successful comparisons and hashing ops are slow, the tree might be roughly 1.5-2 times as fast as the hash. If those assumptions don't hold, then it won't be.
An extreme example, of course, but it's pretty easy to see that on your data, a particular O(log N) operation might be considerably faster than a particular O(1) operation. You are of course right to want to test, but if your test data is not representative of the real world, then your test results may not be representative either. Comparisons of data structures based on complexity refer to behaviour in the limit as N approaches infinity. But N doesn't approach infinity. It's 10000.
It is not just about insertion and removal. You must consider that memory is allocated differently in a hash_map vs map and you every time have to calculate the hash of the value being searched.
I think this Dr.Dobbs article will answer your question best:
C++ STL Hash Containers and Performance
It depends upon your usage and your hash collisions. One is a binary tree and the other is a hashtable.
Ideally the hash map will have O(1) insertion and lookup, and the map O(ln n), but it presumes non-clashing hashes.
hash_map uses a hash table, something that offers almost constant time O(1) operations assuming a good hash function.
map uses a BST, it offers O(lg(n)) operations, for 10000 elements that's 13 which is very acceptable
I'd say stay with map, it's safer.
Hash tables are supposed to be faster than binary trees (i.e. std::map) for lookup. Nobody has ever suggested that they are faster for insert and delete.
A hash map will create a hash of the string/key for indexing. Though while proving the complexity it is mentioned as O(1), hash_map does collision detection for every insert as a hash of a string can produce the same index as the hash of another string. A hash map hence has complexity for managing these collisions & you konw these collisions are based on the input data.
However, if you are going to perform lot of look-ups on the structure, opt for hash_map.