How to create a hash table - C++

I would like to mention before continuing that I have looked at other questions asking the same thing on this site as well as on other sites. I hope that I can get a good answer, because my goal is twofold:
Foremost, I would like to learn how to create a hash table.
Secondly, I find that a lot of answers on Stack Overflow tend to assume a certain level of knowledge on a subject that is often not there, especially for newer users. That being said, I hope to edit my main message to include a more in-depth explanation of the process once I figure it out myself.
Onto the main course:
As I understand it so far, a hash table is an array of lists (or a similar data structure) that aims, optimally, to have as few collisions as possible in order to preserve its lauded O(1) complexity. The following is my current process:
So my first step is to create an array of pointers:
Elem ** table;
table = new Elem*[size];//size is the desired size of the array
My second step is to create a hashing function (a very simple one).
int hashed = 0;
hashed = ( atoi( name.c_str() ) + id ) % size;
//name is a std string, and id is a large integer. Size is the size of the array.
My third step would be to create something to detect collisions, which is the part I'm currently at.
Here's some pseudo-code:
while( table[hashedValue] != empty )
hashedValue++
else
put in the list at that index.
It's relatively inelegant, but I am still at the "what is this" stage. Bear with me.
Is there anything else? Did I miss something or do something incorrectly?
Thanks

Handle finding no empty slots and resizing the table.

You're missing a definition for Elem. That's not trivial, as it depends on whether you want a chaining or a probing hash table.
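For example, if you go the chaining route, Elem might be nothing more than a linked-list node. A minimal sketch, with illustrative member names:
#include <cstddef>
#include <string>

// Sketch only: one possible Elem for a chaining table. Each table slot is the
// head of a singly linked list of entries that hashed to that slot.
struct Elem {
    std::string name;  // the key
    int id;            // whatever data you store alongside it
    Elem* next;        // next entry in this bucket's chain, or NULL
};

std::size_t size = 101;             // desired table size, as in the question
Elem** table = new Elem*[size]();   // trailing () zero-initializes every bucket head to NULL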

A hash function produces the same value for the same data. Your collision check, however, modifies that value, which means the hash value depends not only on the input but also on which other elements are already in the table. This is bad: you will almost never be able to access an element you inserted earlier through its name alone, only by iterating over the whole table.
Second, your collision check is vulnerable to overflow / range errors, since you keep increasing the index without wrapping it around the size of the table (though, as I said before, you shouldn't be mutating the hash value in the first place).
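If you do end up probing, the usual fix is to leave the hash value itself alone and probe the index instead, wrapping around the table and comparing stored keys on lookup. A minimal sketch, assuming string keys and int values (all names are illustrative):
#include <cstddef>
#include <string>
#include <vector>

struct Slot {
    bool occupied;
    std::string key;
    int value;
    Slot() : occupied(false), value(0) {}
};

// Insert with linear probing: start at the hashed index and walk forward,
// wrapping around, until an empty slot (or the same key) is found.
bool insert(std::vector<Slot>& table, std::size_t hashed, const std::string& key, int value) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        std::size_t idx = (hashed + i) % table.size();   // wrap instead of running off the end
        if (!table[idx].occupied || table[idx].key == key) {
            table[idx].occupied = true;
            table[idx].key = key;
            table[idx].value = value;
            return true;
        }
    }
    return false;  // table is full; the caller should resize and rehash
}

// Lookup probes the same way and compares keys, so it finds the element
// no matter how far it was displaced by collisions.
bool find(const std::vector<Slot>& table, std::size_t hashed, const std::string& key, int& value) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        std::size_t idx = (hashed + i) % table.size();
        if (!table[idx].occupied)
            return false;                                // hit an empty slot: the key is absent
        if (table[idx].key == key) {
            value = table[idx].value;
            return true;
        }
    }
    return false;
}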

Related

C++ Complicated look-up table

I have around 400,000 "items".
Each "item" consists of 16 double values.
At runtime I need to compare items with each other. Therefore I am multiplying their double values. This is quite time-consuming.
I have made some tests, and I found out that there are only 40,000 possible return values, no matter which items I compare with each other.
I would like to store these values in a look-up table so that I can easily retrieve them without doing any real calculation at runtime.
My question would be how to efficiently store the data in a look-up table.
The problem is that if I create a look-up table, it gets amazingly huge, for example like this:
item-id  item-id  compare return value
1        1        499483.49834
1        2        -0.0928
1        3        499483.49834
(...)
It would sum up to around 120 million combinations.
That just looks too big for a real-world application.
But I am not sure how to avoid that.
Can anybody please share some cool ideas?
Thank you very much!
Assuming I understand you correctly, you have two inputs with 400K possibilities each, so 400K * 400K = 160B entries... Assuming you have them indexed sequentially, and you stored your 40K possible results in a way that allowed 2 octets each, you're looking at a table size of roughly 300 GB... pretty sure that's beyond current every-day computing.
So you might instead research whether there is any correlation between the 400K "items", and if so, whether you can assign some kind of function to that correlation that gives you a clue (read: hash function) as to which of the 40K results might/could/should result. Clearly your hash function and lookup need to be shorter than just doing the multiplication in the first place.
Or maybe you can reduce the comparison time with some kind of intelligent reduction, like knowing the result under certain scenarios. Or perhaps some of your math can be optimized using integer math or boolean comparisons. Just a few thoughts...
To speed things up, you should probably compute all of the possible answers, and store the inputs to each answer.
Then, I would recommend making some sort of look-up table that uses the answer as the key (since the answers will all be unique), and storing all of the possible inputs that give that result.
To help visualize:
Say you had the table 'Table'. Inside Table you have keys, and associated with those keys are values. You make the keys have the type of whatever format your answers are in (the keys will be all of your answers). Now, give each of your 400k inputs a unique identifier. You then store the pair of identifiers for a multiplication as one value associated with that particular key. When you compute that same answer again, you just add it as another set of inputs that can produce that key.
Example:
Table<AnswerType, vector<Input>>
Define Input like:
struct Input { IDType one; IDType two; };
Where one 'Input' might have IDs 12384 and 128, meaning that the objects identified by 12384 and 128, when multiplied, will give the answer.
So, in your lookup, you'll have something that looks like:
AnswerType lookup(IDType first, IDType second)
{
    foreach(AnswerType k in table)
    {
        if( table[k].Contains(first, second) )
            return k;
    }
}
// Defined elsewhere
bool Contains(IDType first, IDType second)
{
    foreach(Input i in [the vector])
    {
        if( (i.one == first && i.two == second) ||
            (i.two == first && i.one == second) )
            return true;
    }
    return false;
}
I know this isn't real C++ code; it's just meant as pseudo-code, and it's a rough cut as-is, but it might be a place to start.
While the foreach is probably going to be limited to a linear search, you can make the 'Contains' method run a binary search by sorting how the inputs are stored.
In all, you're looking at a run-once application that will run in O(n^2) time, and a lookup that will run in O(n log n). I'm not entirely sure how the memory will look after all of that, though. Of course, I don't know much about the math behind it, so you might be able to speed up the linear search if you can somehow sort the keys as well.
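If it helps to see that idea in compilable form, here is a rough sketch using std::map; AnswerType, IDType, and the function names are just placeholders:
#include <cstddef>
#include <map>
#include <vector>

typedef double AnswerType;
typedef int IDType;

struct Input { IDType one; IDType two; };

typedef std::map<AnswerType, std::vector<Input> > Table;

// Record that multiplying items 'a' and 'b' produced 'answer'.
void record(Table& table, AnswerType answer, IDType a, IDType b) {
    Input in = { a, b };
    table[answer].push_back(in);
}

// Linear scan over all answers, as in the pseudo-code above.
bool lookup(const Table& table, IDType first, IDType second, AnswerType& answer) {
    for (Table::const_iterator it = table.begin(); it != table.end(); ++it) {
        const std::vector<Input>& inputs = it->second;
        for (std::size_t i = 0; i < inputs.size(); ++i) {
            if ((inputs[i].one == first && inputs[i].two == second) ||
                (inputs[i].two == first && inputs[i].one == second)) {
                answer = it->first;
                return true;
            }
        }
    }
    return false;  // this pair was never recorded
}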

How to find largest values in C++ Map

My teacher in my Data Structures class gave us an assignment to read in a book and count how many words there are. That's not all; we need to display the 100 most common words. My gut says to sort the map, but I only need 100 words from the map. After googling around, is there a "textbook answer" to sorting maps by the value and not the key?
I doubt there's a "Textbook Answer", and the answer is no: you can't sort maps by value.
You could always create another map using the values. However, this is not the most efficient solution. What I think would be better is for you to chuck the values into a priority_queue, and then pop the first 100 off.
Note that you don't need to store the words in the second data structure. You can store pointers or references to the word, or even a map::iterator.
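A minimal sketch of the priority_queue idea, assuming the counts were accumulated into a std::map<std::string, int> (all names here are illustrative); per the note above, only pointers to the words go into the queue:
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::string, int> CountMap;
typedef std::pair<int, const std::string*> Entry;  // (count, pointer to the word in the map)

// Order entries by count only, so the largest count ends up on top of the queue.
struct ByCount {
    bool operator()(const Entry& a, const Entry& b) const { return a.first < b.first; }
};

void printTop100(const CountMap& counts) {
    std::priority_queue<Entry, std::vector<Entry>, ByCount> queue;
    for (CountMap::const_iterator it = counts.begin(); it != counts.end(); ++it)
        queue.push(std::make_pair(it->second, &it->first));

    // Pop the 100 most frequent words (or fewer, if the book is small).
    for (int i = 0; i < 100 && !queue.empty(); ++i) {
        std::cout << *queue.top().second << ": " << queue.top().first << "\n";
        queue.pop();
    }
}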
Now, there's another approach you could consider. That is to maintain a running order of the top 100 candidates as you build your first map. That way there would be no need to do the second pass and build an extra structure which, as you pointed out, is wasteful.
To do this efficiently you would probably use a heap-like approach and do a bubble-up whenever you update a count. Since the word counts only ever increase, this suits a heap very nicely. However, you would have a maintenance issue on your hands: how to track the position of each value in the heap, and how to handle values that fall off the bottom.

C++ hashing: Open addressing and Chaining

For Chaining:
Can someone please explain this concept to me and provide me a theory example and a simple code one?
I get the idea of "Each table location points to a linked list (chain) of items that hash to this location", but I can't seem to illustrate what's actually going on.
Suppose we had h(x) (the hashing function) = x/10 mod 5. Now, to hash 12540, 51288, 90100, 41233, 54991, 45329, 14236, what would that look like?
And for open addressing (linear probing, quadratic probing, and probing for every R location), can someone explain that to me as well? I tried Googling around but I seem to get confused further.
Chaining is probably the most obvious form of hashing. The hash-table is actually an array of linked-lists that are initially empty. Items are inserted by adding a new node to the linked-list at the item's calculated table index. If a collision occurs then a new node is linked to the previous tail node of the linked-list. (Actually, an implementation may sort the items in the list but let's keep it simple). One advantage of this mode is that the hash-table can never become 'full', a disadvantage is that you jump around memory a lot and your CPU cache will hate you.
Open Addressing tries to take advantage of the fact that the hash-table is likely to be sparsely populated (large gaps between entries). The hash-table is an array of items. If a collision occurs, instead of chaining the new item onto the one already at that location, the algorithm searches for the next empty space in the hash-table. However, this means that you cannot rely on the hashcode alone to see if an item is present; you must also compare the contents if the hashcode matches.
The 'probing' is the strategy the algorithm follows when trying to find the next free slot.
One issue is that the table can become full, i.e. no more empty slots. In this case the table will need to be resized and the hash function changed to take into account the new size. All existing items in the table must be reinserted too as their hash codes will no longer have the same values once the hash function is changed. This may take a while.
Here's a Java animation of a hash table.
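To make that concrete, here is a minimal chaining sketch built around the question's h(x) = x/10 mod 5; the Node type and function name are just illustrative:
#include <cstddef>

const std::size_t TABLE_SIZE = 5;   // because of the "mod 5" in h(x)

struct Node {
    int key;
    Node* next;
};

Node* table[TABLE_SIZE] = { 0 };    // all five chains start out empty

void insert(int key) {
    std::size_t idx = (key / 10) % TABLE_SIZE;   // e.g. 51288 -> 5128 mod 5 -> slot 3
    Node* node = new Node;
    node->key = key;
    node->next = 0;
    if (table[idx] == 0) {                       // empty chain: the new node becomes the head
        table[idx] = node;
    } else {
        Node* tail = table[idx];                 // collision: append after the current tail
        while (tail->next != 0)
            tail = tail->next;
        tail->next = node;
    }
}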
Because you do mod 5, your table will have 5 locations.
location 0: 90100 (because 90100/10 mod 5 is 0)
For the same reason, you get:
location 1: (empty)
location 2: 45329
location 3: 51288 -> 41233 -> 14236
location 4: 12540 -> 54991
You can check out more info on Wikipedia.
In open addressing you have to store the elements in the table itself, so the load factor can be at most one.
In chaining, however, the table only stores the head pointers of linked lists, so the load factor can be greater than one.

How to map 13 bit value to 4 bit code?

I have a std::map for some packet processing program.
I hadn't noticed it before profiling, but unfortunately this map lookup alone consumes about 10% of the CPU time (it is called too many times).
Usually there are only about 10 keys at most in the input data, so I'm trying to implement a kind of key cache in front of the map.
The key is a 13-bit integer. I know there are only 8192 possible keys, and an array of 8192 items would give constant-time lookup, but I already feel ashamed and don't want to use such a naive approach :(
Now I'm just looking for some hashing method that yields a 4-bit code for a 13-bit integer very quickly.
Any cool idea?
Thanks in advance.
UPDATE
Besides my shame, I don't have total control over the source code, and it's almost prohibited to add a new array for this purpose.
The project manager (who ran the profiler) said a linked list showed a small performance gain and recommended using std::list instead of std::map.
UPDATE
The key values are random (no relationship) and don't have a good distribution.
Sample:
1) 0x100, 0x101, 0x10, 0x0, 0xffe
2) 0x400, 0x401, 0x402, 0x403, 0x404, 0x405, 0xff
Assuming your lookup table contains some basic type, it's almost no memory at all: even on 64-bit systems it's only 64 KB of memory. There is no shame in using a lookup table like that; it gives some of the best performance you can get.
You may want to go with a middle solution and an open-addressing technique: one array of size 256. The index into the array is some simple hash function, like the XOR of the key's two bytes. Each element of the array is a struct {key, value}. Collisions are handled by storing the collided element at the next available index. If you need to delete an element from the array, and if deletion is rare, then just recreate the array (create a temporary list from the remaining elements, and then create the array from this list).
If you pick your hash function smartly there would almost never be any collisions. For instance, from your two examples one such hash would be to XOR the low nibble of the high byte with the high nibble of the low byte (and do what you like with the remaining 13th bit).
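A sketch of that 256-entry cache with linear probing, assuming the key fits in an unsigned short and the value in an unsigned char; the byte-XOR hash and all names are illustrative:
struct Entry {
    bool used;
    unsigned short key;   // the 13-bit key
    unsigned char value;  // the 4-bit code
};

static Entry cache[256] = {};   // all 'used' flags start out false

// Simple hash, as suggested above: XOR the two bytes of the key.
static inline unsigned char hashKey(unsigned short key) {
    return (unsigned char)((key >> 8) ^ (key & 0xFF));
}

bool cachePut(unsigned short key, unsigned char value) {
    unsigned char start = hashKey(key);
    for (int i = 0; i < 256; ++i) {
        unsigned char idx = (unsigned char)(start + i);   // wraps around at 256 automatically
        if (!cache[idx].used || cache[idx].key == key) {
            cache[idx].used = true;
            cache[idx].key = key;
            cache[idx].value = value;
            return true;
        }
    }
    return false;  // cache is full
}

bool cacheGet(unsigned short key, unsigned char& value) {
    unsigned char start = hashKey(key);
    for (int i = 0; i < 256; ++i) {
        unsigned char idx = (unsigned char)(start + i);
        if (!cache[idx].used)
            return false;                                 // empty slot: key is not cached
        if (cache[idx].key == key) {
            value = cache[idx].value;
            return true;
        }
    }
    return false;
}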
Unless you're writing for some sort of embedded system where 8K is really significant, just use the array and move on. If you really insist on doing something else, you might consider a perfect hash generator (e.g., gperf).
If there are really only going to be something like 10 active entries in your table, you might seriously consider using an unsorted vector to hold this mapping. Something like this:
#include <cstddef>
#include <utility>
#include <vector>

typedef int key_type;
typedef int value_type;

std::vector<std::pair<key_type, value_type> > mapping;

inline void put(key_type key, value_type value) {
    for (std::size_t i = 0; i < mapping.size(); ++i) {
        if (mapping[i].first == key) {
            mapping[i].second = value;
            return;
        }
    }
    mapping.push_back(std::make_pair(key, value));
}

inline value_type get(key_type key) {
    for (std::size_t i = 0; i < mapping.size(); ++i) {
        if (mapping[i].first == key) {
            return mapping[i].second;
        }
    }
    // do something reasonable if not found?
    return value_type();
}
Now, the asymptotic speed of these algorithms (each O(n)) is much worse than you'd have with either a red-black tree (like std::map at O(log n)) or hash table (O(1)). But you're not talking about dealing with a large number of objects, so asymptotic estimates don't really buy you much.
Additionally, std::vector buys you both low overhead and locality of reference, which neither std::map nor std::list can offer. So it's more likely that a small std::vector will stay entirely within the L1 cache. As it's almost certainly the memory bottleneck that's causing your performance issues, using a std::vector with even my poor choice of algorithm will likely be faster than either a tree or linked list. Of course, only a few solid profiles will tell you for sure.
There are certainly algorithms that might be better choices: a sorted vector could potentially give even better performance; a well tuned small hash table might work as well. I suspect that you'll run into Amdahl's law pretty quickly trying to improve on a simple unsorted vector, however. Pretty soon you might find yourself running into function call overhead, or some other such concern, as a large contributor to your profile.
I agree with GWW, you don't use that much memory in the end...
But if you want, you could use an array of 11 or 13 linked lists and hash the keys with the % operator. If the number of keys is less than the array size, complexity still tends to be O(1).
When you always just have about ten keys, use a list (or array). Do some benchmarking to find out whether or not using a sorted list (or array) and binary search will improve performance.
You might first want to see if there are any unnecessary calls to the key lookup. You only want to do this once per packet ideally -- each time you call a function there is going to be some overhead, so getting rid of extra calls is good.
Map is generally pretty fast, but if there is any exploitable pattern in the way keys are mapped to items you could use that and potentially do better. Could you provide a bit more information about the keys and the associated 4-bit values? E.g. are they sequential, is there some sort of pattern?
Finally, as others have mentioned, a lookup table is very fast, 8192 values * 4 bits is only 4kb, a tiny amount of memory indeed.
I would use a lookup table. It's tiny unless you are using a microcontroller or something.
Otherwise I would do this -
Generate a table of say 30 elements.
For each lookup, calculate a hash value of (key % 30) and compare it with the stored key at that location in the table. If the key is there, you've found your value. If the slot is empty, add it. If the key is wrong, skip to the next cell and repeat.
With 30 cells and 10 keys, collisions should be rare, but if you get one it's fast to skip to the next cell, and normal lookups are simply a modulus and a compare operation, so they're fairly fast.

Searching fast through a sorted list of strings in C++

I have a list of about a hundred unique strings in C++, and I need to check if a value exists in this list, preferably lightning fast.
I am currently using a hash_set with std::strings (since I could not get it to work with const char*), like so:
stdext::hash_set<const std::string> _items;
_items.insert("LONG_NAME_A_WITH_SOMETHING");
_items.insert("LONG_NAME_A_WITH_SOMETHING_ELSE");
_items.insert("SHORTER_NAME");
_items.insert("SHORTER_NAME_SPECIAL");
stdext::hash_set<const std::string>::const_iterator it = _items.find( "SHORTER_NAME" );
if( it != _items.end() ) {
std::cout << "item exists" << std::endl;
}
Does anybody else have a good idea for a faster search method without building a complete hashtable myself?
The list is a fixed list of strings which will not change. It contains a list of names of elements which are affected by a certain bug and should be repaired on-the-fly when opened with a newer version.
I've built hashtables before using Aho-Corasick but I'm not really willing to add too much complexity.
I was amazed by the number of answers. I ended up testing a few methods for their performance and ended up using a combination of kirkus and Rob K.'s answers. I had tried a binary search before but I guess I had a small bug implementing it (how hard can it be...).
The results were shocking... I thought I had a fast implementation using a hash_set... well, turns out I did not. Here are some statistics (and the eventual code):
Random lookup of 5 existing keys and 1 non-existent key, 50,000 times:
My original algorithm took on average 18.62 seconds.
A linear search took on average 2.49 seconds.
A binary search took on average 0.92 seconds.
A search using a perfect hash table generated by gperf took on average 0.51 seconds.
Here's the code I use now:
bool searchWithBinaryLookup(const std::string& strKey) {
    static const char arrItems[NUM_ITEMS][MAX_ITEM_LEN] = { /* list of items */ };

    /* Binary lookup */
    int low, mid, high;
    low = 0;
    high = NUM_ITEMS;
    while( low < high ) {
        mid = (low + high) / 2;
        if(arrItems[mid] > strKey) {
            high = mid;
        }
        else if(arrItems[mid] < strKey) {
            low = mid + 1;
        }
        else {
            return true;
        }
    }
    return false;
}
NOTE: This is Microsoft VC++ so I'm not using the std::hash_set from SGI.
I did some tests this morning using gperf as VardhanDotNet suggested and this is quite a bit faster indeed.
If your list of strings are fixed at compile time, use gperf
http://www.gnu.org/software/gperf/
QUOTE:
gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
The output of gperf is not governed by the GPL or LGPL, AFAIK.
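For reference, the input to gperf can be as simple as the keyword list between %% markers; the file name below is made up, and the generated class/function names reflect gperf's usual defaults (Perfect_Hash::in_word_set for C++ output), so check your version's documentation:
%%
LONG_NAME_A_WITH_SOMETHING
LONG_NAME_A_WITH_SOMETHING_ELSE
SHORTER_NAME
SHORTER_NAME_SPECIAL
%%
Then something like:
gperf -L C++ keywords.gperf > keywords.hpp
produces the hash function and table; the generated in_word_set() returns a non-null pointer when the string is one of the keywords.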
You could try a PATRICIA Trie if none of the standard containers meet your needs.
Worst-case lookup is bounded by the length of the string you're looking up. Also, strings share common prefixes, so it is really easy on memory. So if you have lots of relatively short strings this could be beneficial.
Check it out here.
Note: PATRICIA = Practical Algorithm to Retrieve Information Coded in Alphanumeric
If it's a fixed list, sort the list and do a binary search? I can't imagine, with only a hundred or so strings on a modern CPU, you're really going to see any appreciable difference between algorithms, unless your application is doing nothing but searching said list 100% of the time.
What's wrong with std::vector? Load it, sort(v.begin(), v.end()) once and then use lower_bound() to see if the string is in the vector. lower_bound is guaranteed to be O(log2 N) on a sorted random access iterator. I can't understand the need for a hash if the values are fixed. A vector takes less room in memory than a hash and makes fewer allocations.
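A minimal sketch of that, assuming the strings are loaded into a std::vector<std::string> and sorted once (the function name is illustrative):
#include <algorithm>
#include <string>
#include <vector>

// Call std::sort(items.begin(), items.end()) once after loading, then:
bool contains(const std::vector<std::string>& sortedItems, const std::string& key) {
    std::vector<std::string>::const_iterator it =
        std::lower_bound(sortedItems.begin(), sortedItems.end(), key);
    return it != sortedItems.end() && *it == key;
}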
I doubt you'd come up with a better hashtable; if the list varies from time to time you've probably got the best way.
The fastest way would be to construct a finite state machine to scan the input. I'm not sure what the best modern tools are (it's been over ten years since I did anything like this in practice), but Lex/Flex was the standard Unix constructor.
A FSM has a table of states, and a list of accepting states. It starts in the beginning state, and does a character-by-character scan of the input. Each state has an entry for each possible input character. The entry could either be to go into another state, or to abort because the string isn't in the list. If the FSM gets to the end of the input string without aborting, it checks the final state it's in, which is either an accepting state (in which case you've matched the string) or it isn't (in which case you haven't).
Any book on compilers should have more detail, or you can doubtless find more information on the web.
If the set of strings to check against numbers in the hundreds, as you say, and this check happens while doing I/O (loading a file, which I assume comes from a disk), then I'd say: profile what you've got before looking for more exotic/complex solutions.
Of course, it could be that your "documents" contain hundreds of millions of these strings, in which case I guess it really starts to take time... Without more detail, it's hard to say for sure.
What I'm saying boils down to "consider the use-case and typical scenarios, before (over)optimizing", which I guess is just a specialization of that old thing about roots of evil ... :)
100 unique strings? If this isn't called frequently, and the list doesn't change dynamically, I'd probably use a straightforward const char array with a linear search. Unless you search it a lot, something that small just isn't worth the extra code. Something like this:
const char _items[][MAX_ITEM_LEN] = { ... };
int i = 0;
for (; i < NUM_ITEMS && strcmp( a, _items[i] ) > 0; ++i ); // assumes _items is sorted ascending
bool found = i < NUM_ITEMS && strcmp( a, _items[i] ) == 0;
For a list that small, I think the implementation and maintenance costs of anything more complex would probably outweigh the run-time costs, and you're not really going to get cheaper space costs than this. To gain a little more speed, you could build a small table mapping the first char to a list index, to set the initial value of i.
For a list this small, you probably won't get much faster.
You're using binary search, which is O(log(n)). You could look at interpolation search, which is not as good in the worst case, but whose average case is better: O(log(log(n))).
I don't know which kind of hashing function MS uses for strings, but maybe you could come up with something simpler (= faster) that works in your special case. The container should allow you to use a custom hashing class.
If it's an implementation issue of the container, you can also try whether Boost's unordered_set or std::tr1::unordered_set gives better results.
A hash table is a good solution, and by using a pre-existing implementation you are likely to get good performance. An alternative, though, is what I believe is called "indexing".
Keep some pointers around to convenient locations. E.g. if it's using letters for the sorting, keep a pointer to everything starting aa, ab, ac... ba, bb, bc... This is a few hundred pointers, but it means that you can skip to a part of the list which is quite near to the result before continuing. E.g. if an entry is "afunctionname" then you can binary search between the pointers for af and ag, which is much faster than searching the whole lot... If you have a million records in total, you will likely only have to binary search a list of a few thousand.
I re-invented this particular wheel, but there may be plenty of implementations out there already, which will save you the headache of implementing it and are likely faster than any code I could paste in here. :)
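A rough sketch of that indexing idea over a sorted std::vector<std::string>, bucketing on the first character only (all names are illustrative, and it assumes no empty strings):
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct PrefixIndex {
    std::vector<std::string> items;   // kept sorted by build()
    std::size_t start[257];           // start[c] = first index whose string begins with char c

    void build() {
        std::sort(items.begin(), items.end());
        std::size_t pos = 0;
        for (int c = 0; c < 256; ++c) {
            start[c] = pos;
            while (pos < items.size() && (unsigned char)items[pos][0] == c)
                ++pos;
        }
        start[256] = items.size();    // sentinel so every bucket has an end
    }

    bool contains(const std::string& key) const {
        if (key.empty())
            return false;
        unsigned char c = (unsigned char)key[0];
        // Binary search only between the "pointers" for this first character.
        return std::binary_search(items.begin() + start[c],
                                  items.begin() + start[c + 1],
                                  key);
    }
};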