Hashing large keys without collisions - c++

I am writing a checkers engine. I am aware of standard (Zobrist) hashing schemes, but because of the nature of my engine they are unsuitable: any collision of any sort will result in catastrophic errors.
To fix this, I would like to use the entire (unique) representation of the board state as the key. This should not really be a problem, since the state is determined by six 32-bit unsigned integers. This worked in Python without any problems besides speed. In C++, I'm using std::unordered_map.
Every way I've tried to implement this has failed. I've tried a std::pair of boost::uint_type128 as the key. Again, there needs to be a guarantee that there won't be collisions.
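
One way to set this up is sketched below (not from the original post; the BoardKey name, the int value type and the mixing constant are illustrative). The point is that std::unordered_map resolves hash collisions through the key's equality comparison, so as long as operator== compares the full six-word state, two distinct boards can never be confused with each other; a hash collision only costs a little speed.

#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical key type: the six 32-bit words that describe a board.
struct BoardKey {
    std::array<std::uint32_t, 6> words;
    bool operator==(const BoardKey& other) const { return words == other.words; }
};

// Fold the six words into one size_t. Collisions here are harmless:
// unordered_map falls back on operator== to distinguish keys.
struct BoardKeyHash {
    std::size_t operator()(const BoardKey& k) const {
        std::size_t h = 0;
        for (std::uint32_t w : k.words)
            h = h * 1000003u ^ w;   // simple multiplicative mixing; any decent mix works
        return h;
    }
};

using TranspositionTable = std::unordered_map<BoardKey, int, BoardKeyHash>;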

Related

Are there faster hash functions for unordered_map/set in C++?

The default hash function is std::hash. I wonder if there are better hash functions for saving computational time, for integer keys as well as string keys.
I tried City Hash from Google for both integer and string keys, but its performance is a little worse than std::hash.
The std::hash functions are already good in performance, but I think you should try open-source hash functions.
Check this out https://github.com/Cyan4973/xxHash. I quote from its description: "xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical on all platforms (little / big endian)."
Also this thread from another question on this site: Fast Cross-Platform C/C++ Hashing Library. FNV, Jenkins and MurmurHash are known to be fast.
You need to explain 'better' in what sense? The fastest hash function would be simply use the value, but that is useless. A more specific answer would depend on your memory constraints and what probabilities of collision are you willing to accept.
Also note that the built-in hash functions are implemented differently for different types, so I expect the hash functions for int and string to already be optimised, in a general sense, for time complexity and collision probability.
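
If you do want to experiment, the unordered containers take the hash as a template parameter, so a replacement is easy to plug in and benchmark. A sketch with a splitmix64-style finalizer for 64-bit integer keys (the MixHash name is made up; whether it beats std::hash depends entirely on your platform and workload):

#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Illustrative mixing hash for 64-bit integer keys (splitmix64-style finalizer).
struct MixHash {
    std::size_t operator()(std::uint64_t x) const {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return static_cast<std::size_t>(x);
    }
};

// Drop-in replacement for the default hash; measure both before deciding.
std::unordered_set<std::uint64_t, MixHash> keys;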

How can I implement Python sets in another language (maybe C++)?

I want to translate some Python code that I have already written to C++ or another fast language because Python isn't quite fast enough to do what I want to do. However the code in question abuses some of the impressive features of Python sets, specifically the average O(1) membership testing which I spam within performance critical loops, and I am unsure of how to implement Python sets in another language.
In Python's Time Complexity Wiki Page, it states that sets have O(1) membership testing on average and in worst-case O(n). I tested this personally using timeit and was astonished by how blazingly fast Python sets do membership testing, even with large N. I looked at this Stack Overflow answer to see how C++ sets compare when using find operations to see if an element is a member of a given set and it said that it is O(log(n)).
I hypothesize the time complexity for find is logarithmic in that C++ std library sets are implemented with some sort of binary tree. I think that because Python sets have average O(1) membership testing and worst case O(n), they are probably implemented with some sort of associative array with buckets which can just look up an element with ease and test it for some dummy value which indicates that the element is not part of the set.
The thing is, I don't want to slow down any part of my code by switching to another language (since that is the problem I'm trying to fix in the first place), so how could I implement my own version of Python sets (specifically just the fast membership testing) in another language? Does anybody know anything about how Python sets are implemented, and if not, could anyone give me any general hints to point me in the right direction?
I'm not looking for source code, just general ideas and links that will help me get started.
I have done a bit of research on Associative Arrays and I think I understand the basic idea behind their implementation but I'm unsure of their memory usage. If Python sets are indeed just really associative arrays, how can I implement them with a minimal use of memory?
Additional note: The sets in question that I want to use will have up to 50,000 elements and each element of the set will be in a large range (say [-999999999, 999999999]).
The theoretical difference between O(1) and O(log n) means very little in practice, especially when comparing two different languages. log n is small for most practical values of n. The constant factors of each implementation are easily more significant.
C++11 has unordered_set and unordered_map now. Even if you cannot use C++11, there are always the Boost version and the tr1 version (the latter is named hash_* instead of unordered_*).
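For the use case in the question (integer elements, up to about 50,000 of them), the pattern is straightforward; a minimal sketch, with reserve() thrown in as an optional step to avoid rehashing while filling:

#include <cstdint>
#include <unordered_set>

int main() {
    std::unordered_set<std::int64_t> members;
    members.reserve(50000);                 // optional: avoid rehashing while filling
    members.insert(-999999999);
    members.insert(42);

    bool hit = members.count(42) != 0;      // average O(1) membership test
    return hit ? 0 : 1;
}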
Several points: you have, as has been pointed out, std::set and std::unordered_set (the latter only in C++11, but most compilers have offered something similar as an extension for many years now). The first is implemented by some sort of balanced tree (usually a red-black tree), the second as a hash table. Which one is faster depends on the data type: the first requires some sort of ordering relationship (e.g. < if it is defined on the type, but you can define your own); the second an equivalence relationship (==, for example) and a hash function compatible with this equivalence relationship. The first is O(lg n), the second O(1), if you have a good hash function. Thus:
If comparison for order is significantly faster than hashing, std::set may actually be faster, at least for "smaller" data sets, where "smaller" depends on how large the difference is; for strings, for example, the comparison will often resolve after the first couple of characters, whereas the hash code will look at every character. In one experiment I did (many years back), with strings of 30-50 characters, I found the break-even point to be about 100000 elements.
For some data types, simply finding a good hash function which is compatible with the type may be difficult. Python uses a hash table for its set, and if you define a type with a __hash__ function that always returns 1, it will be very, very slow. Writing a good hash function isn't always obvious.
Finally, both are node-based containers, which means they use a lot more memory than e.g. std::vector, with very poor locality. If lookup is the predominant operation, you might want to consider std::vector, keeping it sorted and using std::lower_bound for the lookup. Depending on the type, this can result in a significant speed-up, and much less memory use.
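A sketch of that sorted-vector alternative (assuming plain int values; std::binary_search is used here instead of std::lower_bound since only membership is needed):

#include <algorithm>
#include <vector>

// Sorted-vector lookup: O(log n) like std::set, but contiguous storage,
// so far better cache locality and lower memory overhead.
bool contains(const std::vector<int>& sorted_values, int x) {
    return std::binary_search(sorted_values.begin(), sorted_values.end(), x);
}

int main() {
    std::vector<int> v = {9, 1, 5};
    std::sort(v.begin(), v.end());   // sort once, query many times
    return contains(v, 5) ? 0 : 1;
}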

Hash Table Implementation Using An Array of Linked Lists

This question has been bugging me for quite a long time and today I've read a detailed article related to hash tables. Without checking any implementation examples I wanted to give a shot for writing a Hash Table from scratch.
The separate chaining method gave me the idea for implementing the hash table. Anyone with experience in data structures might regard this question as a joke, but I'm a beginner, and before diving straight into the code I wanted to discuss the efficiency of my implementation. Would it be efficient, or should other fundamental ideas be preferred over this one?
I think for starters you can also peek into the source (or documentation) of the hash map implemented in the Boost libraries. It is called unordered_map (link is here).
If you don't know about these implementations, want a hash table, and are annoyed that it isn't in the STL, it is tempting to write your own fast data store.
But implementing hash maps yourself is largely beside the point now that C++11 has unordered_map in its standard library. You'll see there is plenty of more interesting stuff out there.
Note: separate chaining is also called bucket hashing. In fact, Boost uses bucket hashing, see this link. You might instead look up some performance comparisons; chances are that those who run the benchmarks also write good enough implementations.
Using closed addressing, another alternative is to use a self balancing binary search tree, e.g. red-black tree/std::map or heap tree, for the inner data structure, or even another hash map with different hashing algorithm.
Using open addressing, another alternative to linear probing are quadratic probing and double hashing; there are also less commonly used strategies such as cuckoo hashing, hopscotch hashing, etc.
The key points of implementing a hash table are choosing the right hashing algorithm, the resizing strategy (load factor), and the collision resolution strategy. The best strategy is highly dependent on the type of workload you're expecting, as there are trade-offs for each approach.
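For reference, a minimal separate-chaining (bucket) sketch along the lines the question describes, assuming int keys and a fixed bucket count for brevity; a real table would also track its load factor and rehash as it grows:

#include <cstddef>
#include <forward_list>
#include <vector>

// Minimal separate-chaining hash set for int keys: an array of buckets,
// each bucket a singly linked list of keys that hash to the same slot.
class ChainedSet {
    std::vector<std::forward_list<int>> buckets_;

    std::size_t slot(int key) const {
        return static_cast<std::size_t>(key) % buckets_.size();
    }

public:
    explicit ChainedSet(std::size_t bucket_count = 1024) : buckets_(bucket_count) {}

    bool contains(int key) const {
        for (int k : buckets_[slot(key)])
            if (k == key) return true;
        return false;
    }

    void insert(int key) {
        if (!contains(key))
            buckets_[slot(key)].push_front(key);
        // A production table would count elements and rehash once
        // size / bucket_count exceeds the chosen load factor.
    }
};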

c++ hashtable where keys are strings and values are vectors of strings

I have a large collection of unique strings (about 500k). Each string is associated with a vector of strings. I'm currently storing this data in a
map<string, vector<string> >
and it's working fine. However I'd like the look-up into the map to be faster than log(n). Under these constrained circumstances how can I create a hashtable that supports O(1) look-up? Seems like this should be possible since I know all the keys ahead of time... and all the keys are unique (so I don't have to account for collisions).
Cheers!
You can create a hashtable with boost::unordered_map, std::tr1::unordered_map or (on C++0x compilers) std::unordered_map. That takes almost zero effort. Google sparsehash may be faster still and tends to take less memory. (Deletion can be a pain, but it seems you won't need that.)
If the code is still not fast enough, you can exploit prior knowledge of the keys with a minimal perfect hash, as suggested by others, to obtain guaranteed O(1) performance. Whether the code-generation effort that takes is worth it is up to you; putting 500k keys into a tool like gperf may be asking a lot of a code generator.
You may also want to look at CMPH, which generates a perfect hash function at run-time, though through a C API.
I would look into creating a Perfect Hash Function for your table. This will guarantee no collisions, which are expensive to resolve. Perfect Hash Function Generators are also available.
What you're looking for is a Perfect Hash. gperf is often used to generate these, but I don't know how well it works with such a large collection of strings.
If you want no collisions for a known collection of keys you're looking for a perfect hash. The CMPH library (my apologies as it is for C rather than C++) is mature and can generate minimal perfect hashes for rather large data sets.
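Before reaching for a perfect hash, the straightforward replacement for the map in the question is usually enough; a sketch with reserve() used because the key count (around 500k) is known up front:

#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<std::string, std::vector<std::string>> table;
    table.reserve(500000);                 // key count known ahead of time

    table["alpha"] = {"beta", "gamma"};

    auto it = table.find("alpha");         // average O(1), vs O(log n) for std::map
    return (it != table.end() && it->second.size() == 2) ? 0 : 1;
}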

What is more efficient: a switch case or a std::map?

I'm thinking about the tokenizer here.
Each token calls a different function inside the parser.
What is more efficient:
A map of std::functions/boost::functions
A switch case
I would suggest reading switch() vs. lookup table? from Joel on Software. Particularly, this response is interesting:
" Prime example of people wasting time
trying to optimize the least
significant thing."
Yes and no. In a VM, you typically
call tiny functions that each do very
little. It's the not the call/return
that hurts you as much as the preamble
and clean-up routine for each function
often being a significant percentage
of the execution time. This has been
researched to death, especially by
people who've implemented threaded
interpreters.
In virtual machines, lookup tables storing computed addresses to call are usually preferred to switches. (direct threading, or "label as values". directly calls the label address stored in the lookup table) That's because it allows, under certain conditions, to reduce branch misprediction, which is extremely expensive in long-pipelined CPUs (it forces to flush the pipeline). It, however, makes the code less portable.
This issue has been discussed extensively in the VM community; I would suggest you look for scholarly papers in this field if you want to read more about it. Ertl & Gregg wrote a great article on this topic in 2001, The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures.
But as mentioned, I'm pretty sure these details are not relevant for your code. They are small details, and you should not focus too much on them. The Python interpreter uses switches because its developers think it makes the code more readable. Why don't you pick the usage you're most comfortable with? The performance impact will be rather small; you'd better focus on code readability for now ;)
Edit: If it matters, using a hash table will always be slower than a lookup table. For a lookup table, you use enum types for your "keys", and the value is retrieved using a single indirect jump. This is a single assembly operation, O(1). A hash table lookup first requires calculating a hash and then retrieving the value, which is far more expensive.
Using an array where the function addresses are stored, and accessed using values of an enum, is good. But using a hash table to do the same adds significant overhead.
To sum up, we have:
cost(Hash_table) >> cost(direct_lookup_table)
cost(direct_lookup_table) ~= cost(switch) if your compiler translates switches into lookup tables.
cost(switch) >> cost(direct_lookup_table) (O(N) vs O(1)) if your compiler does not translate switches and uses conditionals instead, but I can't think of any compiler doing this.
But inlined direct threading makes the code less readable.
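To make the contrast concrete, here is an illustrative sketch (the token names and handlers are made up, not from the question): an enum-indexed array of function pointers next to the equivalent switch.

// Hypothetical token kinds and handlers, purely to contrast the two dispatch styles.
enum Token { TOK_PLUS, TOK_MINUS, TOK_NUMBER, TOK_COUNT };

void handle_plus()   {}
void handle_minus()  {}
void handle_number() {}

// Direct lookup table: one indexed load plus an indirect call.
void (*const dispatch_table[TOK_COUNT])() = {
    handle_plus, handle_minus, handle_number
};

void dispatch_via_table(Token t) { dispatch_table[t](); }

// Switch: with dense case values most compilers emit an equivalent jump
// table; with sparse values they may fall back to chained comparisons.
void dispatch_via_switch(Token t) {
    switch (t) {
        case TOK_PLUS:   handle_plus();   break;
        case TOK_MINUS:  handle_minus();  break;
        case TOK_NUMBER: handle_number(); break;
        default:         break;
    }
}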
The std::map that comes with Visual Studio 2008 will give you O(log(n)) for each function call, since it hides a tree structure beneath.
With a modern compiler (depending on the implementation), a switch statement will give you O(1): the compiler translates it to some kind of lookup table.
So in general, switch is faster.
However, consider the following facts:
The difference between map and switch is that a map can be built dynamically while a switch can't, and a map can contain any arbitrary type as a key while a switch is limited to C++ primitive types (char, int, enum, etc.).
By the way, you can use a hash map to achieve nearly O(1) dispatching (though, depending on the hash table implementation, it can sometimes be O(n) in the worst case). Even then, switch will still be faster.
Edit
I am writing the following just for fun and for the sake of the discussion.
I can suggest a nice optimization for you, but it depends on the nature of your language and whether you can anticipate how your language will be used.
When you write the code:
You divide your tokens into two groups, one of very frequently used tokens and one of rarely used tokens. You also sort the frequently used tokens by frequency.
For the frequent tokens you write an if-else series, with the most frequently used coming first. For the rarely used tokens you write a switch statement (a sketch follows after this answer).
The idea is to use the CPU branch prediction in order to avoid another level of indirection (assuming the condition checking in the if statements is nearly costless).
In most cases the CPU will pick the correct branch without any level of indirection. There will be a few cases, however, where the branch goes to the wrong place.
Depending on the nature of your language, statistically it may give better performance.
Edit: Due to some comments below, I changed the sentence claiming that compilers will always translate a switch into a lookup table.
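A sketch of the split described above (the token values and handlers are made up; the frequency ordering is the point, not the specific tokens):

// Hypothetical token codes, split by expected frequency of occurrence.
enum Token { TOK_IDENT = 0, TOK_NUMBER = 1, TOK_LPAREN = 2,
             TOK_COLON = 40, TOK_AT = 41 };

void on_ident()      {}
void on_number()     {}
void on_lparen()     {}
void on_rare(Token)  {}

void dispatch(Token t) {
    // Most frequent tokens first: these branches are highly predictable,
    // so the CPU usually avoids any indirect jump at all.
    if (t == TOK_IDENT)       { on_ident();  return; }
    else if (t == TOK_NUMBER) { on_number(); return; }
    else if (t == TOK_LPAREN) { on_lparen(); return; }

    // Rarely used tokens fall through to an ordinary switch.
    switch (t) {
        case TOK_COLON:
        case TOK_AT:
            on_rare(t);
            break;
        default:
            break;
    }
}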
What is your definition of "efficient"? If you mean faster, then you probably should profile some test code for a definite answer. If you're after flexible and easy-to-extend code though, then do yourself a favor and use the map approach. Everything else is just premature optimization...
Like yossi1981 said, a switch could be optimized into a fast lookup table, but there is no guarantee; every compiler has its own algorithm for deciding whether to implement the switch as consecutive ifs, as a fast lookup table, or as a combination of both.
To get a fast switch, your values should meet the following rule:
they should be consecutive, e.g. 0, 1, 2, 3, 4. You can leave some values out, but things like 0, 1, 2, 34, 43 are extremely unlikely to be optimized.
The question really is: is the performance of such significance in your application?
And wouldn't a map which loads its values dynamically from a file be more readable and maintainable instead of a huge statement which spans multiple pages of code?
You don't say what type your tokens are. If they are not integers, you don't have a choice - switches only work with integer types.
The C++ standard says nothing about the performance of its requirements, only that the functionality should be there.
These sort of questions about which is better or faster or more efficient are meaningless unless you state which implementation you're talking about. For example, the string handling in a certain version of a certain implementation of JavaScript was atrocious, but you can't extrapolate that to being a feature of the relevant standard.
I would even go so far as to say it doesn't matter regardless of the implementation since the functionality provided by switch and std::map is different (although there's overlap).
These sort of micro-optimizations are almost never necessary, in my opinion.