How can I build a lookup table in C++?

I am a complete novice in C++. I am trying to read a file and build a lookup table (more like a hashtable just to check the existence of a string value). The file has about 300 thousand entries that I will use to build a lookup table. And after this, I will be performing some 1 million lookups on this. What is the most efficient way of doing this? Is it the map (google's first result) or is there a better structure for this purpose?

Based on the scenario, you probably also want to look at Tries

What you need is a trie data structure. Dictionaries are widely implemented with it. Lookup takes O(n) time, where n is the length of the string rather than the number of entries, and shared prefixes are stored only once, which can save space. A trie can also quickly search for, insert, and delete entries.
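A minimal sketch of the trie idea described above, for existence checks only. The names `Trie`, `insert`, and `contains` are made up for this example, and each node keeps its children in a `std::map`, which is simple but not the most compact layout. It assumes C++14 for `std::make_unique`.

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal trie storing a set of strings; one node per character.
class Trie {
public:
    // Inserts a word, creating nodes along its path as needed.
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->is_word = true;
    }

    // Returns true only if this exact word was inserted
    // (a mere prefix of an inserted word does not count).
    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return node->is_word;
    }

private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool is_word = false;
    };
    Node root_;
};
```

Both operations walk one node per character, so their cost depends on the length of the string and not on how many entries are stored.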

map has O(log n) lookups, but you can achieve O(1) with a hash table, as you suggested. It looks like the STL implements one, called hash_map.

C++ std::map is not a hash table, but you could use it for a lookup table if you wanted.
Its performance characteristics as guaranteed by the C++ standard are:
O(log n) for searching for an element
O(log n) for inserting a new element
O(log n) for removing an element
There will definitely be memory overhead because the std::map is generally implemented with trees (and quite possibly a red-black tree), and pointers will be kept for each node in the map.
For better performance characteristics, you might want to look into Google's Sparsehash

Try: http://en.wikipedia.org/wiki/Unordered_map_%28C%2B%2B%29
In general hash tables are good, but if you want "the most efficient way" you'll have to provide more details.

If you just want to check the existence of a string value, a set is sufficient, as you don't have any key-value pairs. See here for documentation.
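A small sketch of the set-based existence check suggested above. The helper names `build_lookup` and `exists` are invented for illustration.

```cpp
#include <set>
#include <string>
#include <vector>

// Builds a set from a list of entries; duplicates collapse automatically.
std::set<std::string> build_lookup(const std::vector<std::string>& entries) {
    return std::set<std::string>(entries.begin(), entries.end());
}

// Existence check. For a std::set, count() is 0 or 1 and runs in O(log n).
bool exists(const std::set<std::string>& lookup, const std::string& value) {
    return lookup.count(value) != 0;
}
```

No mapped value is stored anywhere, which is the point of preferring a set over a map here.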

If your biggest concern is lookup time (and it sounds like it is), strongly consider a hash map. The average lookup time is O(1), which is notably better than a regular map at O(log n).
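For the scenario in the question, a hash-based version might look roughly like the sketch below. It uses `std::unordered_set` because only existence matters. The function names are hypothetical, and the one-entry-per-line file format is an assumption, since the question does not describe the file.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_set>
#include <vector>

// Loads one entry per line. Assumption: the file holds one string per line.
// An unreadable or missing file simply yields an empty set.
std::unordered_set<std::string> load_entries(const std::string& path) {
    std::unordered_set<std::string> entries;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        entries.insert(line);
    }
    return entries;
}

// Counts how many queries are present; each find() is O(1) on average.
std::size_t count_hits(const std::unordered_set<std::string>& entries,
                       const std::vector<std::string>& queries) {
    std::size_t hits = 0;
    for (const std::string& q : queries) {
        if (entries.find(q) != entries.end()) ++hits;
    }
    return hits;
}
```

With about 300 thousand entries and a million lookups, measuring both this and a `std::set` version on the real data is the only reliable way to tell which is faster.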

If you have a very good hash function (no collisions on your dataset) and you just need to check whether an entry exists, you could try a bitset (say from http://bmagic.sourceforge.net/).
I believe it can reduce memory requirements, and it's very fast.
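A rough sketch of the bit-table idea, using `std::vector<bool>` from the standard library rather than the BitMagic API. The class name `BitTable` is made up, and the sketch leans entirely on the assumption stated in the answer: the hash must be collision-free on the data set, otherwise lookups can report false positives.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Bit table indexed by hash value: one bit per slot, no keys stored.
// Assumption: the hash, reduced modulo the table size, is collision-free
// on the data set. If two keys collide, contains() gives a false positive.
class BitTable {
public:
    // bits must be greater than zero.
    explicit BitTable(std::size_t bits) : bits_(bits, false) {}

    void insert(const std::string& key) { bits_[slot(key)] = true; }

    bool contains(const std::string& key) const { return bits_[slot(key)]; }

private:
    std::size_t slot(const std::string& key) const {
        return std::hash<std::string>{}(key) % bits_.size();
    }
    std::vector<bool> bits_;
};
```

Because the strings themselves are never stored, memory use depends only on the number of bits, but there is also no way to detect a collision after the fact.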

Related

Why is std::map a red-black tree and not a hash table?

This is very strange to me; I expected it to be a hash table.
I saw 3 reasons in the following answer (which may be correct, but I don't think they are the real reason).
Hash tables v self-balancing search trees
Although hashing might not be a trivial operation, I think that for most types it is pretty simple.
When you use a map you expect something that will give you amortized O(1) insert, delete, and find, not O(log n).
I agree that trees have better worst-case performance.
I think that there is a bigger reason for that, but I can't figure it out.
In C#, for example, Dictionary is a hash table.
It's largely a historical accident. The standard containers (along with iterators and algorithms) were one of the very last additions before the feature set of the standard was frozen. As it happened, they didn't have what they considered an adequate definition of a hash-based map at the time, and there wasn't time to add it before features were frozen, so the original specification included only a tree-based map.
C++ 11 added std::unordered_map (as well as std::unordered_set and multi versions of both), which is based on hashing though.
The reason is that map is explicitly called out as an ordered container. It keeps the elements sorted and allows you to iterate in sorted order in linear time. A hashtable couldn't fulfill those requirements.
In C++11 they added std::unordered_map which is a hashtable implementation.
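The ordering guarantee can be seen in a short sketch. The helper name `keys_in_order` is invented for this example.

```cpp
#include <map>
#include <string>
#include <vector>

// Collects the keys of a std::map in iteration order.
// Because std::map is an ordered container, the result is sorted by key
// regardless of the order in which the elements were inserted.
std::vector<std::string> keys_in_order(const std::map<std::string, int>& m) {
    std::vector<std::string> keys;
    keys.reserve(m.size());
    for (const auto& entry : m) {
        keys.push_back(entry.first);
    }
    return keys;
}
```

The same loop over a `std::unordered_map` would compile, but the order of the keys would be unspecified.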
A hash table requires an additional hash function. The current implementation of map, which uses a tree, can work without an extra hash function by using operator<. Additionally, the map allows sorted access to elements, which may be beneficial for some applications. With C++11 we now have the hash versions available in the form of unordered_set.
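To make the difference in key requirements concrete, here is a sketch with a hypothetical key type `Point`. The type, the hash functor `PointHash`, and the two aliases are all invented for illustration, and the hash combination is deliberately simple.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical key type used only for illustration.
struct Point {
    int x;
    int y;
};

// std::map only needs an ordering: operator< (or a custom comparator).
bool operator<(const Point& a, const Point& b) {
    return a.x != b.x ? a.x < b.x : a.y < b.y;
}

// std::unordered_map needs equality plus a hash function.
bool operator==(const Point& a, const Point& b) {
    return a.x == b.x && a.y == b.y;
}

struct PointHash {
    std::size_t operator()(const Point& p) const {
        // Simple illustrative combination; not tuned for hash quality.
        return std::hash<int>{}(p.x) * 31u + std::hash<int>{}(p.y);
    }
};

using OrderedPoints = std::map<Point, std::string>;
using HashedPoints = std::unordered_map<Point, std::string, PointHash>;
```

For built-in types and `std::string`, both `operator<` and `std::hash` already exist, so neither container needs extra code.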
Simple answer: because a hash table cannot satisfy the complexity requirements of iteration over a std::map.
Why does std::map hold these requirements? Unanswerable question. Historical factors contribute but, overall, that's just the way it is.
Hashes are available as std::unordered_map.
It doesn't really matter what the two are called, or what they're called in some other language.

Hash table in C++

Is the insertion/deletion/lookup time of a C++ std::map O(log n)? Is it possible to implement an O(1) hash table?
Is the insertion/deletion/lookup time of a C++ map O(log n)?
Yes.
Is it possible to implement an O(1) hash table?
Definitely. The standard library also provides one as std::unordered_map.
C++ has an unordered_map type. The STL also contains a hash_map type, although this is not in the C++ standard library.
Now, for a bit of algorithmic theory. It is possible to implement an O(1) hash table under perfect conditions, and technically, hash tables are O(1) insertion and lookup. The perfect conditions in this case are that the hash function must be perfect (i.e. collision free), and you have infinite storage.
In practise, let's take a dumb hash function. For any input key, it returns 1. In this case, when there is a collision (i.e. on the second and subsequent insertions), the table will have to chain further to find some free space. It can either go to the next storage location, or use a linked list for this.
In any case, in the best case, yes, hash tables are O(1) (until you have exhausted all of your hash values, of course, since it is impractical to have a hash function with an infinite amount of output). In the worst case (e.g. with my completely dumb hash function), hash tables are O(n), since you will have to traverse over the storage in order to find your actual value from the given hash, since the initial value is not the correct value.
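The worst case described above is easy to reproduce with `std::unordered_set` and a constant hash. The names `DumbHash` and `longest_chain` are made up for this sketch.

```cpp
#include <cstddef>
#include <unordered_set>

// Deliberately bad hash: every key hashes to 1, as in the example above.
struct DumbHash {
    std::size_t operator()(int) const { return 1; }
};

// Returns the size of the fullest bucket, i.e. the longest run of elements
// a single lookup may have to walk through.
template <typename Set>
std::size_t longest_chain(const Set& s) {
    std::size_t longest = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        if (s.bucket_size(b) > longest) longest = s.bucket_size(b);
    }
    return longest;
}
```

With `std::unordered_set<int, DumbHash>`, every element lands in the same bucket, so the longest chain equals the number of elements and a lookup degrades to a linear scan.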
The implementation of std::map is a tree. This is not directly specified in the standard, but as some good books say: "It is difficult to imagine that it can be anything else". This means that the insertion/deletion/lookup time for map is O(log n).
Classic hash tables have lookup time O(n/num_slots). Once the expected number of items in the table is comparable with the number of slots, you will have saturated O(1).
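The n/num_slots ratio is what the standard unordered containers report as their load factor. A tiny sketch, with the helper name `average_chain_length` invented here:

```cpp
#include <cstddef>
#include <unordered_set>

// Average number of elements per bucket: n / num_slots.
// This is the same quantity that load_factor() reports.
// Precondition: the container has at least one bucket.
template <typename Set>
double average_chain_length(const Set& s) {
    return static_cast<double>(s.size()) /
           static_cast<double>(s.bucket_count());
}
```

The container adds buckets as it grows so that this ratio stays at or below `max_load_factor()`, which is what keeps the expected lookup cost constant.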

O(1) lookup in C++

Is there a data structure in C++ with O(1) lookup?
A std::map has O(log(n)) lookup time (right?).
I'm looking for something in std preferably (so not Boost pls). Also, if there is, how does it work?
EDIT: Ok, I wasn't clear enough I guess. I want to associate values, kind of like in a map. So I want something like std::map<int,string>, and find and insert should take O(1).
Arrays have O(1) lookup.
Hash table (std::unordered_map) in C++11 has O(1) lookup. (On average, but more or less constant.)
I would also like to mention that tree based data structures like maps come with great advantages and are only log(n) which is more often than not sufficient.
Answer to your edit -> You can literally associate an index of an array with one of the values. Hash tables are also associative, but a perfect hash (each key maps to its own distinct slot) is really difficult to get.
One more thing worth mentioning: Arrays have great cache performance (due to locality, i.e. elements sit right next to each other, so they can be prefetched into cache by the prefetching engine). Trees, not so much. With a reasonable number of elements, cache performance can be more critical than asymptotic performance.
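The array-index association mentioned in this answer could be sketched as a direct-address table. The names `DirectTable` and `kMaxKey` are hypothetical, and the sketch only works under the assumption that keys are small non-negative integers.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Direct-address table: the key is the array index.
// Assumption: keys are small non-negative integers below kMaxKey
// (a bound chosen arbitrarily for this sketch).
constexpr std::size_t kMaxKey = 256;

class DirectTable {
public:
    // Stores value under key; returns false if the key is out of range.
    bool insert(std::size_t key, const std::string& value) {
        if (key >= kMaxKey) return false;
        values_[key] = value;
        present_[key] = true;
        return true;
    }

    // Returns a pointer to the stored value, or nullptr if absent.
    const std::string* find(std::size_t key) const {
        if (key >= kMaxKey || !present_[key]) return nullptr;
        return &values_[key];
    }

private:
    std::array<std::string, kMaxKey> values_{};
    std::array<bool, kMaxKey> present_{};
};
```

Both operations are a bounds check plus one array access, so they are O(1) in the worst case, at the price of reserving storage for every possible key.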
Data structures with O(1) lookup (ignoring the size of the key) include:
arrays
hash tables
For complex types, balanced trees will be fine at O(log n), or sometimes you can get away with a patricia trie at O(k).
For reference: complexity of search structures
An array has O(1) lookup.

When should I use unordered_map and not std::map

I'm wondering in which case I should use unordered_map instead of std::map.
Do I have to use unordered_map every time I don't care about the order of the elements in the map?
map
Usually implemented using red-black tree.
Elements are sorted.
Relatively small memory usage (doesn't need additional memory for the hash-table).
Relatively fast lookup: O(log N).
unordered_map
Usually implemented using hash-table.
Elements are not sorted.
Requires additional memory to keep the hash-table.
Fast lookup O(1), but constant time depends on the hash-function which could be relatively slow. Also keep in mind that you could meet with the Birthday problem.
Compare hash table (unordered_map) vs. binary tree (map), remember your CS classes and adjust accordingly.
The hash map usually has O(1) on lookups, the map has O(logN). It can be a real difference if you need many fast lookups.
The map keeps the order of the elements, which is also useful sometimes.
map allows you to iterate over the elements in sorted order, but unordered_map does not.
So use the std::map when you need to iterate across items in the map in sorted order.
The reason you'd choose one over the other is performance. Otherwise they'd only have created std::map, since it does more for you :)
Use std::map when you need elements to automatically be sorted. Use std::unordered_map other times.
See the SGI STL Complexity Specifications rationale.
unordered_map is O(1) but has quite high constant overhead for lookup, insertion, and deletion. map is O(log(n)), so pick the complexity that best suits your needs. In addition, not all keys can be placed into both kinds of map.

What is the difference between set and hashset in C++ STL?

When should I choose one over the other?
Are there any pointers that you would recommend for using the right STL containers?
hash_set is an extension that is not part of the C++ standard. Lookups should be O(1) rather than O(log n) for set, so it will be faster in most circumstances.
Another difference will be seen when you iterate through the containers. set will deliver the contents in sorted order, while hash_set will be essentially random (Thanks Lou Franco).
Edit: The C++11 update to the C++ standard introduced unordered_set which should be preferred instead of hash_set. The performance will be similar and is guaranteed by the standard. The "unordered" in the name stresses that iterating it will produce results in no particular order.
std::set is implemented as a binary search tree.
hash_set is implemented as a hash table.
The main issue here is that many people use std::set thinking it is a hash table with O(1) lookup, which it isn't. It really has O(log(n)) lookups. Other than that, read about binary trees vs hash tables to get a better idea of the data structures.
Another thing to keep in mind is that with hash_set you have to provide the hash function, whereas a set only requires a comparison function ('<') which is easier to define (and predefined for native types).
I don't think anyone has answered the other part of the question yet.
The reason to use hash_set or unordered_set is the usually O(1) lookup time. I say usually because every so often, depending on implementation, a hash may have to be copied to a larger hash array, or a hash bucket may end up containing thousands of entries.
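The occasional copy to a larger bucket array can be observed through `bucket_count()`. A small sketch, with the helper name `count_rehashes` invented for this example:

```cpp
#include <cstddef>
#include <unordered_set>

// Inserts 0..n-1 and counts how often the bucket array was replaced,
// observed as a change in bucket_count().
std::size_t count_rehashes(std::unordered_set<int>& s, int n) {
    std::size_t rehashes = 0;
    std::size_t buckets = s.bucket_count();
    for (int i = 0; i < n; ++i) {
        s.insert(i);
        if (s.bucket_count() != buckets) {
            ++rehashes;
            buckets = s.bucket_count();
        }
    }
    return rehashes;
}
```

If the final size is known in advance, calling `reserve(n)` before inserting sizes the bucket array once, so no rehash happens while the first n elements go in.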
The reason to use a set is if you often need the largest or smallest member of a set. A hash has no order so there is no quick way to find the smallest item. A tree has order, so largest or smallest is very quick. O(log n) for a simple tree, O(1) if it holds pointers to the ends.
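A short sketch of the smallest/largest point, using `std::set` and the standard `std::unordered_set` in place of `hash_set`. The function names are invented for illustration.

```cpp
#include <algorithm>
#include <set>
#include <unordered_set>

// Smallest element of an ordered set: simply the first element.
// Precondition: s is not empty.
int smallest(const std::set<int>& s) { return *s.begin(); }

// Largest element of an ordered set: simply the last element.
// Precondition: s is not empty.
int largest(const std::set<int>& s) { return *s.rbegin(); }

// A hash set has no order, so finding the minimum scans every element: O(n).
// Precondition: s is not empty.
int smallest(const std::unordered_set<int>& s) {
    return *std::min_element(s.begin(), s.end());
}
```

For `std::set`, `begin()` and `rbegin()` are constant-time operations, so neither end requires a search.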
A hash_set would be implemented by a hash table, which has mostly O(1) operations, whereas a set is implemented by a tree of some sort (AVL, red black, etc.) which have O(log n) operations, but are in sorted order.
Edit: I had written that trees are O(n). That's completely wrong.