I am just starting to learn hashtables, and so far, I know that you take the object you want to hash and put it through a hash function, then use the index it returns to get the corresponding object you want. There is something I don't understand though:
What structure do you use to store the objects in so you can quickly index them with the code returned by the hash function? The only thing I can think of is to use an array, but to handle all the keys, you'd have to allocate one that's 9999999999999 elements big or something ridiculous like that. Or is it as simple as iterating over a linked list or something and comparing the ID in each of the elements with the key from that hash function? And if so, that seems kind of inefficient doesn't it?
Normally, you use an array (or something similar like a vector). You pick a reasonable size (e.g., 20% larger than the number of items you expect) and some method of resolving collisions when/if two keys produce the same hash value (e.g., each of those locations is the head of a linked list of items that hashed to that value).
Yes, you usually use an array but then you do a couple of things:
1. You convert the hash code to an array index by taking the remainder of the hash code divided by the array size.
2. You make the size of the array a prime number, as that makes step #1 more effective (some hash algorithms need this to get a uniform distribution).
3. You come up with a design to handle hash collisions. @JerryCoffin's answer gives you more detail.
Generally it's an array. If the array size is N, then use a hash function that returns numbers in the range 0..(N-1). For example, apply modulo N to the hash function result.
Then use one of the collision resolution schemes described on Wikipedia.
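The array-plus-chaining layout described in the answers above can be sketched roughly as follows. This is only an illustration: the class name ChainedTable and its members are made up for this example, the keys are assumed to be strings hashed with std::hash, and the table never resizes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fixed-size chained hash table mapping string keys to int values.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);  // the modulo below needs at least one bucket
    }

    // Reduce the (potentially huge) hash code to a valid array index.
    std::size_t bucket_index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    // Insert a new key, or overwrite the value of an existing one.
    void insert(const std::string& key, int value) {
        auto& chain = buckets_[bucket_index(key)];
        for (auto& entry : chain) {
            if (entry.first == key) { entry.second = value; return; }
        }
        chain.emplace_back(key, value);
    }

    // Return a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const {
        const auto& chain = buckets_[bucket_index(key)];
        for (const auto& entry : chain) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

private:
    // One linked list per array slot; colliding keys share a list.
    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

Only the short list in one bucket is ever scanned, so the array can stay close to the number of stored items instead of covering every possible hash value.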
I am implementing a chained hash table using a vector of lists. I resized my vector to a prime number, let's say 5. To choose the key I am using universal hashing.
My question is, do I need to rehash my vector? I mean, this code will always generate a key in a range between 0 and 5 because it depends on the size of my hashtable. That causes collisions of course, but the new strings will all be added to the lists at each position in the vector, so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?
Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.
do I need to rehash my vector?
Your container could continue to function without rehashing, but searching, inserting and erasing will perform more and more like a plain list instead of a hash table. For example, if you've inserted 10,000 elements, you can expect each list in your vector to have roughly 2,000 elements, and you may have to search all 2,000 to see if a value you're considering inserting is a duplicate, to find a value to erase, or simply to return an iterator to it. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash table implementation. Your non-resizing implementation is still "O(N)".
Is this a mistake?
Yes, a fundamental one.
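A rough sketch of what growing and rehashing a vector of lists could look like. The names here (StringSet, index_for, rehash) are invented for the example, std::hash stands in for the asker's universal hashing, and both the growth rule (double plus one, rather than the next prime) and the load threshold of one element per bucket are arbitrary assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical chained hash set of strings that rehashes as it grows.
class StringSet {
public:
    StringSet() : buckets_(5) {}

    bool contains(const std::string& s) const {
        for (const auto& item : buckets_[index_for(s, buckets_.size())]) {
            if (item == s) return true;
        }
        return false;
    }

    // Returns false if the string was already present.
    bool insert(const std::string& s) {
        if (contains(s)) return false;
        // Grow before the average chain length would exceed 1 (assumed threshold).
        if (size_ + 1 > buckets_.size()) rehash(buckets_.size() * 2 + 1);
        buckets_[index_for(s, buckets_.size())].push_back(s);
        ++size_;
        return true;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

private:
    static std::size_t index_for(const std::string& s, std::size_t n) {
        return std::hash<std::string>{}(s) % n;
    }

    // An element's index depends on the bucket count, so every element has to
    // be moved to the bucket it belongs to under the new count.
    void rehash(std::size_t new_count) {
        assert(new_count > 0);
        std::vector<std::list<std::string>> fresh(new_count);
        for (auto& chain : buckets_) {
            for (auto& item : chain) {
                fresh[index_for(item, new_count)].push_back(std::move(item));
            }
        }
        buckets_.swap(fresh);
    }

    std::vector<std::list<std::string>> buckets_;
    std::size_t size_ = 0;
};
```

Resizing the vector without the redistribution loop in rehash would leave elements in buckets that no longer match their index, which is the "wrong hash bucket" problem described above.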
I am writing a numerical simulation that uses std::map to store key-value pairs. The map stores the states that evolve during the simulation. The key is an integer, and the value for a key tells how many copies of that key there are, i.e. std::map. At each step of the simulation, I need to find out how many copies there are for a given key, so I check with the following code
if (map[key]>0) {do something here with the number of copies}
However, I soon found that this code doesn't work: even if there is no such key in the map, calling map[key] creates a placeholder for that key with the value zero, so std::map.size() always overcounts the total number of keys. I later changed the code as follows to search for the key instead
if (map.find(key)!=map.end()) {...}
So is this the only, and the fastest, way to check whether a key exists in a map? I am going to run the simulation hundreds of millions of times, and it will call the code above very often to check the key. Will map.find() be too slow? Thanks.
The find member function is probably the fastest way to find whether a key is already in the map. That said, if you don't need to iterate over items in the map in order, you might get better performance with an std::unordered_map instead.
In a std::map or hashtable (std::unordered_map), the find function is very fast, as fast as using the [] subscripting operator. In fact, it's faster when the element is not found, because it doesn't have to insert one.
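To make the difference between find and operator[] concrete, here is a small sketch. It assumes the map is a std::map<int, int> (the question does not show the template arguments), and copies_of is a hypothetical helper name.

```cpp
#include <map>

// Read-only lookup: returns the stored count, or 0 if the key is absent.
// Unlike counts[key], this never inserts a placeholder entry, and it reuses
// the iterator from find() instead of looking the key up a second time.
int copies_of(const std::map<int, int>& counts, int key) {
    auto it = counts.find(key);
    return it != counts.end() ? it->second : 0;
}
```

Because the helper takes the map by const reference, operator[] would not even compile inside it, so size() cannot be inflated by accident.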
I don't think there is much difference in speed between the various ways to check for the existence of a key. On the other hand, if your keys are integers and the range is known, you might just use an array.
BTW:
I got interested in the speed of a simple array, vector, map and unordered map. I wrote a simple program that does 100000000 container[n]++ operations, where n is a random number in the range 0 to 10000. The results:
array: 1.27s
vector: 1.36s
unordered map: 2.6s
map: 11.6s
The overhead of loop + index calculation in this simple case is ~0.8s.
So it all depends on how much time is spent elsewhere. If it's considerably more (per 100000000 iterations) then it does not matter much what you use. But if it's not, it can be quite a difference.
You can use a hash map (hash_map in older libraries, std::unordered_map in standard C++); it is usually the fastest data structure for this kind of key-value lookup.
You can also use map, but it is typically slower than a hash map.
My question is, if I use uint32_t as the data type for the key in std::map, will it create a huge structure with indexes for each of the 2^32 combinations? Basically I want to generate a couple of 32-bit numbers, each of which should be unique. I have the numbers, but I am wondering what structure/technique to use to keep them in memory.
No, it will only create entries that you insert.
If you only have a small number (you mention a couple) it might be faster to put them in a vector and do a linear search. If it's more than a small number a map will be faster of course.
No, a map only allocates space for actual keys used, not the entire key space. That would be bad otherwise.
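As a small illustration, the sketch below keeps unique 32-bit numbers in a std::set, which is swapped in for std::map here because only the keys matter in this example; both containers allocate one node per inserted element. The helper name remember is hypothetical.

```cpp
#include <cstdint>
#include <set>

// Returns true if `value` was not seen before. The set grows by one node
// only in that case, regardless of how large the 32-bit value is.
bool remember(std::set<std::uint32_t>& seen, std::uint32_t value) {
    return seen.insert(value).second;
}
```

After remembering 4000000000 and 7, seen.size() is 2, not four billion.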
I have to store information about contents in a lookup table so that it can be accessed very quickly. I might need to refer to some of the elements in the lookup table recursively to get complete information about the contents. Which data structure would be better to use:
A map, with one parameter that is unique across all entries in the lookup table as the key and the rest of the information as the value
A static array for each unique entry, accessed when needed according to the key (the same key that would be used in the map).
I want my software to be robust, because any crash would be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly ( O(1) ). As a dumb example, for a substitution cipher, you could have a char cipher[256] and simply index with the ASCII code of a character to get the substitution character. If the keys are complex objects or simply too many, you're probably stuck with a map.
You might also consider a hashtable (see unordered_map).
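The directly indexed table from the substitution-cipher example could look something like this. The rotation rule and the names make_cipher and encode are assumptions made for illustration, and the code assumes an ASCII-compatible character set.

```cpp
#include <array>
#include <string>

// Build a 256-entry table for a hypothetical rotation cipher: lowercase
// letters are rotated by `shift`, all other byte values map to themselves.
std::array<unsigned char, 256> make_cipher(int shift) {
    std::array<unsigned char, 256> table{};
    for (int c = 0; c < 256; ++c) table[c] = static_cast<unsigned char>(c);
    for (int i = 0; i < 26; ++i) {
        int rotated = ((i + shift) % 26 + 26) % 26;  // also handles negative shifts
        table['a' + i] = static_cast<unsigned char>('a' + rotated);
    }
    return table;
}

// Each character is translated with a single array index: O(1) per lookup.
std::string encode(const std::string& text,
                   const std::array<unsigned char, 256>& table) {
    std::string out;
    out.reserve(text.size());
    for (unsigned char c : text) out.push_back(static_cast<char>(table[c]));
    return out;
}
```

This only works because the key space (one byte) is small enough to allocate a slot for every possible key.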
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If however your keys are themselves between, say, 0 and 9999, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of the elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, O(log n) means around 12 accesses; a hash table should need one or two accesses at most.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
Lookup in a std::map is O(log n), while a linear search in a static array is O(n) in the worst case.
I'd strongly opt for a std::map even if it has a larger memory footprint (which should not matter in most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;
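A short sketch of looking something up in such a nested map without inserting empty entries along the way. The key and value types (std::string and int) are stand-ins for MyKeyType and MyValueType, and find_nested is a hypothetical helper.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::map<std::string, int> > DoubleMap;

// Look up table[outer][inner] without inserting anything.
// Returns nullptr if either key is missing.
const int* find_nested(const DoubleMap& table,
                       const std::string& outer,
                       const std::string& inner) {
    auto row = table.find(outer);
    if (row == table.end()) return nullptr;
    auto cell = row->second.find(inner);
    return cell == row->second.end() ? nullptr : &cell->second;
}
```

Each level costs one O(log n) search, so a two-level lookup is still logarithmic in the sizes of the two maps.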
I've created a dll for gamemaker. dll's arrays were really slow, so after asking around a bit I learnt I could use maps in C++ and make a dll.
Anyway, I'll represent what I need to store as a 3d array:
information[id][number][number]
The id corresponds to an object's id. The first number field ranges from 0 to 3, and each number represents a different setting. The 2nd number field represents the value for the setting in number field 1.
so..
information[101][1][4];
information[101][2][4];
information[101][3][4];
This would translate to "object with id 101 has a value of 4 for settings 1, 2 and 3".
I did this to try and copy it with maps:
//declared as a class member
map<double, map<int, double>*> objIdMap;
///// lower down the page, in some function
map<int, double> objSettingsMap;
objSettingsMap[1] = 4;
objSettingsMap[2] = 4;
objSettingsMap[3] = 4;
map<int, double>* temp = &objSettingsMap;
objIdMap[id] = temp;
So the first map, objIdMap, stores the id as the key and a pointer to another map, which stores the number representing the setting as the key and the value of the setting as the value.
However, this is for a game, so new objects with their own ids and settings might need to be stored (sometimes a hundred or so new ones every few seconds), and the existing ones constantly need to retrieve the values for every step of the game. Are maps not able to handle this? I had a very similar thing going with Game Maker's arrays and it worked fine.
Do not use doubles as the key of a map.
If you need to compare two doubles, use a floating-point comparison function.
1) Your code is buggy: you store a pointer to a local object, objSettingsMap, which will be destroyed as soon as it goes out of scope. You must store a map object, not a pointer to it, so that the local map is copied into the container.
2) Maps can become arbitrarily large (I have maps with millions of entries). If you need speed, try hash maps (std::unordered_map in C++0x, but also available from other sources as hash_map), which are considerably faster. But adding a few hundred entries each second shouldn't be a problem. Before worrying about execution speed, you should always use a profiler.
3) I am not really sure that your nested structures MUST be maps. Depending on how many settings you have and what values they may take, a struct, a bitfield or a vector might be more appropriate.
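Putting points 1) and 3) together, one possible shape for the container is sketched below. It assumes object ids can be converted to int at the DLL boundary and that there are exactly four settings (0 to 3); ObjectSettings and its members are hypothetical names, not part of the asker's code.

```cpp
#include <array>
#include <map>

// Settings 0..3 for one object, stored inline rather than in a nested map.
typedef std::array<double, 4> Settings;

class ObjectSettings {
public:
    // The value is copied into the map; no pointer to a local object is kept.
    // A new id gets a zero-initialized Settings array on first use.
    void set(int id, int setting, double value) {
        table_[id].at(setting) = value;  // at() throws std::out_of_range on a bad index
    }

    // Returns the stored value, or `fallback` if the id is unknown.
    double get(int id, int setting, double fallback = 0.0) const {
        auto it = table_.find(id);
        return it != table_.end() ? it->second.at(setting) : fallback;
    }

private:
    std::map<int, Settings> table_;  // integer key instead of double
};
```

Calling set(101, 1, 4), set(101, 2, 4) and set(101, 3, 4) corresponds to the "object with id 101" example in the question.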
If you need really fast associative containers, try to learn about hashes. Maps are 'fast enough' but not brilliant for some cases.
Try to analyze the structure of the objects you need to store. If the fields are fixed, I'd recommend not using nested maps at all. Maps are usually intended for a moderate number of keys. For a small number, simple lists are more effective because their insert and erase operations are cheaper. For a large number of keys you really need to think about hashing.
Don't forget about memory. std::map allocates each element dynamically, so with small stored objects you lose a lot of memory to allocation overhead. Is that really what you expect? I was once involved in removing std::map usage from a project, which cut memory requirements roughly in half.
If you only need to fill the map at startup and then only search for elements (no need to change the structure), I'd recommend a simple std::vector, sorted after all the elements have been inserted. Then you can just use binary search (since the vector is sorted). Why? std::vector is much more predictable. The biggest advantage is its contiguous memory area.
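The fill-then-sort-then-search approach could be sketched like this. Entry, finalize and lookup are illustrative names, and the (int, double) record type is an assumption.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<int, double> Entry;  // hypothetical (key, value) record

// Sort once by key, after all elements have been inserted.
void finalize(std::vector<Entry>& table) {
    std::sort(table.begin(), table.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
}

// Binary search on the sorted vector; returns nullptr if the key is absent.
const double* lookup(const std::vector<Entry>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
                               [](const Entry& e, int k) { return e.first < k; });
    return (it != table.end() && it->first == key) ? &it->second : nullptr;
}
```

Lookups are O(log n) like std::map, but all entries sit in one contiguous block, so there is no per-element allocation. The vector must not be modified after finalize without sorting again.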