A good repartition algorithm - C++

I am implementing a memcached client library. I want it to support several servers and so I wish to add some load-balancing system.
Basically, you can do two operations on a server:
Store a value given its key.
Get a value given its key.
Let us say I have N servers (numbered 0 to N - 1); I'd like a repartition function which, given a key and the server count N, returns an index in the [0, N) range.
unsigned int getServerIndex(const std::string& key, unsigned int serverCount);
The function should be as fast and simple as possible and must respect the following constraint:
getServerIndex(key, N) == getServerIndex(key, N); // i.e. no random return; the result must be deterministic.
I wish I could do this without using an external library (like OpenSSL and its hashing functions). What are my options here?
Side notes:
Obviously, the basic implementation:
unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
    return 0;
}
is not a valid answer, as it is not exactly a good repartition function :D
Additional information:
Keys will usually be any possible string within the ANSI charset (mostly [a-zA-Z0-9_-]). The length may be anything from a single character to whatever size you want.
A good repartition algorithm is one for which the probability of returning index a is equal to (or not too far from) the probability of returning index b, for any two different keys. The number of servers might change (rarely, though), and if it does, it is acceptable for the returned index of a given key to change as well.

You're probably looking for something that implements consistent hashing. The easiest way to do this is to assign a random ID to each memcache server, and allocate each item to the memcache server which has the closest ID to the item's hash, by some metric.
A common choice for this - and the one taken by distributed systems such as Kademlia - would be to use the SHA1 hash function (though the particular hash function is not critical), and to compare distances by XORing the hash of the item with the hash of the server and interpreting the result as a magnitude. All you need, then, is a way of making each client aware of the list of memcache servers and their IDs.
When a memcache server joins or leaves, it need only generate its own random ID, then ask its new neighbours to send it any items that are closer to its hash than to their own.
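For illustration, here is a rough sketch of that XOR-metric selection in C++. It is only a sketch under assumptions: std::hash<std::string> stands in for SHA1, and the Server struct and pickServer function are made-up names, not part of any library.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct Server {
    std::string id; // random ID assigned when the server joins
};

std::size_t pickServer(const std::string& key, const std::vector<Server>& servers)
{
    std::hash<std::string> h;
    const std::uint64_t keyHash = h(key);

    std::size_t best = 0;
    std::uint64_t bestDistance = ~std::uint64_t(0);
    for (std::size_t i = 0; i < servers.size(); ++i) {
        // XOR distance: smaller means the server ID is "closer" to the key.
        const std::uint64_t distance = keyHash ^ h(servers[i].id);
        if (distance < bestDistance) {
            bestDistance = distance;
            best = i;
        }
    }
    return best; // index of the closest server; caller must ensure servers is non-empty
}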

I think the hashing approach is the right idea. There are many simplistic hashing algorithms out there.
With the upcoming C++0x and its newly standardized unordered_map, hashing strings is becoming a standard operation. Many compilers already ship with a version of the STL that features a hash_map, and thus already have a pre-implemented hash function.
I would start with those... but it would be better if we had more information on your strings: are they somehow constrained to a limited charset, or is it likely that there will be many similar strings?
The problem is that a "standard" hash might not produce a uniform distribution if the input is not uniformly distributed to begin with...
EDIT:
Given the information, I think the hash function already shipped with most STL implementations should work, since you do not seem to have a highly concentrated key space. However, I am by no means an expert in probabilities, so take it with a grain of salt (and experiment).

What about something very simple like
hash(key) % serverCount
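As a minimal sketch, assuming a C++0x/C++11 standard library (std::hash<std::string>), this could look like the following; note that std::hash is only guaranteed to be stable within one implementation, so all clients would have to use the same build:

#include <functional>
#include <string>

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
    // Deterministic for a given key, so the "no random return" constraint holds.
    const std::size_t h = std::hash<std::string>()(key);
    return static_cast<unsigned int>(h % serverCount); // assumes serverCount > 0
}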

Related

C++ OpenSSL: md5-based 64-bit hash

I know the original md5 algorithm produces a 128-bit hash.
Following Mark Adler's comments here, I'm interested in getting a good 64-bit hash.
Is there a way to create an md5-based 64-bit hash using OpenSSL? (md5 looks good enough for my needs.)
If not, is there another algorithm implemented in the OpenSSL library that can get this job done with quality no less than md5's (except for the length, of course)?
I'd claim that 'hash quality' is strongly related to the hash length.
AFAIK, OpenSSL does not have 64-bit hash algos, so the first idea I had is simple and most probably worthless:
halfMD5 = md5.hiQuadWord ^ md5.lowQuadWord
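As a sketch of that folding idea, assuming OpenSSL's one-shot MD5() function (declared in <openssl/md5.h>, deprecated since OpenSSL 3.0 but still present); halfMd5 is just an illustrative name:

#include <openssl/md5.h>
#include <cstdint>
#include <cstring>
#include <string>

std::uint64_t halfMd5(const std::string& data)
{
    unsigned char digest[MD5_DIGEST_LENGTH]; // 16 bytes
    MD5(reinterpret_cast<const unsigned char*>(data.data()), data.size(), digest);

    std::uint64_t hi, lo;
    std::memcpy(&hi, digest, sizeof hi);     // high quad word
    std::memcpy(&lo, digest + 8, sizeof lo); // low quad word
    return hi ^ lo;                          // fold 128 bits down to 64
}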
Finally, I'd simply use an algorithm with appropriate output, like crc64.
Some crc64 sources to verify:
http://www.backplane.com/matt/crc64.html
http://bioinfadmin.cs.ucl.ac.uk/downloads/crc64/
http://en.wikipedia.org/wiki/Computation_of_CRC
http://www.pathcom.com/~vadco/crc.html
Edit
At first glance, Jenkins looks perfect; however, I'm trying to find a friendly C++ implementation of it, without luck so far. BTW, I'm wondering: since this is such a good hash for databases' duplication checking, how come none of the common open-source libraries, like OpenSSL, provides an API for it? – Subway
This might simply be due to the fact that OpenSSL is a crypto library first and foremost, using large hash values with the appropriate cryptographic characteristics.
Hash algos for data structures have other primary goals, e.g. good distribution characteristics for hash tables, where small hash values are used as an index into a list of buckets containing zero, one, or multiple (colliding) elements.
So the point is whether, how, and where collisions are handled.
In a typical DBMS, an index on a column will handle them itself.
Corresponding containers (maps or sets):
C++: std::size_t (32 or 64bits) for std::unordered_multimap and std::unordered_multiset
In java, one would make a mapping with lists as buckets: HashMap<K,List<V>>
The unique constraint would additionally prohibit insertion of equal field contents:
C++: std::size_t (32 or 64bits) for std::unordered_map and std::unordered_set
Java: int (32bits) for HashMap and HashSet
For example, say we have a table with file contents (plaintext, non-crypto application) and a checksum or hash value for mapping or consistency checks. We want to insert a new file. For that, we precompute the hash value or checksum and query for existing files with an equal hash value or checksum, respectively. If none exists, there won't be a collision and insertion is safe. If there are one or more existing records, there is a high probability of an exact match and a lower probability of a 'real' collision.
If collisions can be ignored, one could add a unique constraint to the hash column and reuse existing records, accepting the possibility of mismatching/colliding contents. Here, you'd want a database-friendly hash algo like Jenkins.
If collisions need to be handled, one could add a unique constraint to the plaintext column. Less database-friendly checksum algos like crc won't have an influence on collisions among records and can be chosen according to certain types of corruption to be detected, or other requirements. It is even possible to use the XOR'ed quad words of an md5 as mentioned at the beginning.
Some other thoughts:
If an index/constraint on plaintext columns does the mapping, any hash value can be used to do reasonably fast lookups to find potential matches.
No one will stop you from adding both, a mapping friendly hash and a checksum.
Unique constraints will also add an index, which is basically like the hash tables mentioned above.
In short, it greatly depends on what exactly you want to achieve with a 64bit hash algo.

What's the correct way to generate random strings without duplicates

I'm thinking about generating random strings without producing any duplicates.
My first thought was to use a binary tree: create each string and look for a duplicate in the tree.
But this may not be very efficient.
My second thought was to use an MD5-like hash method that creates values based only on time, but this may introduce another problem: different machines have different time accuracy.
And on a modern processor, more than one string could be created within a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
Beware of MD5: there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc. Two solutions off the top of my head:
1. Generate UUIDs, then turn them into Strings with a binary representation or a base 64 algorithm.
2. Simply generate random Strings and put them in a searchable structure (HashMap) so that you can find very quickly (O(1)-O(log n)) whether a generated String already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient, especially for insertions - as it will have to constantly re-balance itself (somewhat of an "expensive" operation).
I'd recommend using a HashSet type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time. Insert all your Strings into the Set. If you create a new String, check to see if it already exists in the Set.
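Since the surrounding questions are C++-flavoured, here is a small sketch of that set-as-duplicate-filter idea using std::unordered_set and C++11 <random>; the alphabet, length and function name are arbitrary choices for illustration:

#include <random>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> generateUniqueStrings(std::size_t count, std::size_t length)
{
    static const std::string alphabet =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);

    std::unordered_set<std::string> seen; // average O(1) duplicate check
    std::vector<std::string> result;
    while (result.size() < count) {
        std::string candidate;
        for (std::size_t i = 0; i < length; ++i)
            candidate += alphabet[pick(rng)];
        if (seen.insert(candidate).second) // insert succeeds only for new strings
            result.push_back(candidate);   // otherwise discard and retry
    }
    return result;
}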
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify in what programming language you're coding. For instance, in Java this will work nicely: UUID.randomUUID().toString(). UUID identifiers are unique in practice, as is stated in Wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but with a high but not quite 100% certainty that a member is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature shouldn't hurt a bit.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/

C++ (Hashmap style) Data Structure Ideal For This Scenario?

People have asked similar questions about the efficiency of various data structures but none I have read are totally applicable to my scenario so I wondered if people had suggestions for one that was tailored to satisfy the following criteria efficiently:
Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: *The key is a 32-bit uint.*
The elements are all unique and therefore can be thought of as a set.
The only operations required are adding and getting, not deletion. These need to be quick, as they will be used several hundred thousand times in a typical run!
The order in which elements are kept is irrelevant.
Speed is more important than memory consumption... though it can't be too greedy!
I am developing for a company that will use the program commercially, so any third-party data structures should come without restrictive licensing, but if the STL has a data structure that will do the job efficiently then that would be perfect.
I know there are countless Hashmap/Dictionary style C++ data structures with implementations that are built to satisfy different criteria so if someone can suggest one ideal for this situation then that would be greatly appreciated.
Many thanks
Edit:
I found this passage on SO that seems to suggest unordered_map would be good?
hash_map and unordered_map are generally implemented with hash tables. Thus the order is not maintained. unordered_map insert/delete/query will be O(1) (constant time) where map will be O(log n), where n is the number of items in the data structure. So unordered_map is faster, and if you don't care about the order of the items it should be preferred over map. Sometimes you want to maintain order (ordered by the key) and for that map would be the choice.
Looks like a prefix tree (with the element at each node end) also fits this scenario. It's damn fast, even faster than a hash map, because no hash value calculation is done and getting a value is purely O(n), where n is the key length. It's a bit memory hungry, but common prefixes of keys are shared in the same node path.
EDIT: I assumed the keys were strings, not simple values like integers.
As for built-in solutions, I'd recommend google::dense_hash_map. It is really fast, especially for numeric keys. You'll have to decide on a specific key value that will be reserved as the "empty key". Moreover, here is a really nice comparison of different hash-map implementations.
An excerpt
Library          Linux-intCPU (sec)   Linux-strCPU (sec)   Linux PeakMem (MB)
glib             3.490                4.720                24.968
ghthash          3.260                3.460                61.232
CC’s hashtable   3.040                4.050                129.020
TR1              1.750                3.300                28.648
STL hash_set     2.070                3.430                25.764
google-sparse    2.560                6.930                5.42/8.54
google-dense     0.550                2.820                24.7/49.3
khash (C++)      1.100                2.900                6.88/13.1
khash (C)        1.140                2.940                6.91/13.1
STL set (RB)     7.840                18.620               29.388
kbtree (C)       4.260                17.620               4.86/9.59
NP’s splaytree   11.180               27.610               19.024
However, when setting a "deleted_key", this map can also perform deletions. So maybe it'll be possible to create a custom solution that is even more efficient. But apart from that minor point, any hash map should suit your needs exactly (note that std::map is an ordered tree map and thus slower).
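A minimal usage sketch of the dense_hash_map suggestion, assuming Google's sparsehash library is installed (the header path differs between releases; newer ones ship <sparsehash/dense_hash_map>). The reserved key values below are arbitrary and must never occur as real keys:

#include <cstdint>
#include <sparsehash/dense_hash_map>

int main()
{
    google::dense_hash_map<std::uint32_t, int> table;
    table.set_empty_key(0xFFFFFFFFu);   // required before the first insert
    table.set_deleted_key(0xFFFFFFFEu); // only needed if you ever erase

    table[12345u] = 7;                                    // add
    const bool found = table.find(12345u) != table.end(); // get
    return found ? 0 : 1;
}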
What you need definitely sounds like a hash set; C++ has this as either std::tr1::unordered_set or Boost.Unordered.
P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.
It sounds like std::unordered_set would fit the bill, but without knowing more about the key, it's difficult to say. I'm curious about how you can guarantee that there will be no possibility of collisions: this implies a small (less than the size of the table), finite set of keys. If this is the case, it may be more efficient to map the keys to a small int and use std::vector (with empty slots for the entries not present).
What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that- also in Boost/TR1/C++0x.
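For the 32-bit uint keys described in the question, a minimal sketch with the C++0x/C++11 spelling (the TR1 and Boost versions differ only in header and namespace) might look like this:

#include <cstdint>
#include <unordered_set>

int main()
{
    std::unordered_set<std::uint32_t> keys;
    keys.reserve(1000000); // pre-size to avoid rehashes during the run

    keys.insert(42u);                          // add: amortized O(1)
    const bool present = keys.count(42u) != 0; // get: average O(1)
    return present ? 0 : 1;
}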

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
    //Return if iTest appears in a set of numbers.
}
The number list is known at app load-up (but is not always the same between two instances of the same application) and will not change (or be added to) throughout the lifetime of the program. The integers themselves may be large and have a large range, so it is not efficient to have a vector<bool>. Performance is an issue, as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.
P.S. I'd ideally like the solution not to be a third-party library, because I can't use one here. Something simple enough to be understood and implemented by hand would be great, if possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure specialized in answering the question "is this element part of the set?", and it does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special. Is this element part of the set?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such, a Bloom Filter is ideal as a doorman for a binary tree or a hash map. Carefully tuned, it will only let very few false positives pass. For example, gcc uses one.
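To make the doorman idea concrete, here is a minimal Bloom filter sketch; the bit count, the mixing constants and the two-salt double hashing are illustrative choices, not tuned values:

#include <bitset>
#include <cstddef>
#include <cstdint>

class BloomFilter {
public:
    void add(std::uint64_t value)
    {
        bits_.set(index(value, 0));
        bits_.set(index(value, 1));
    }

    // false -> definitely not in the set
    // true  -> maybe in the set (then consult the exact structure, e.g. std::map)
    bool maybeContains(std::uint64_t value) const
    {
        return bits_.test(index(value, 0)) && bits_.test(index(value, 1));
    }

private:
    static const std::size_t kBits = 1 << 20; // 1 Mi bits, about 128 KB

    static std::size_t index(std::uint64_t value, std::uint64_t salt)
    {
        // Cheap integer mixing; two different salts act as two hash functions.
        std::uint64_t x = value + salt * 0x9E3779B97F4A7C15ULL;
        x ^= x >> 33;
        x *= 0xFF51AFD7ED558CCDULL;
        x ^= x >> 33;
        return static_cast<std::size_t>(x % kBits);
    }

    std::bitset<kBits> bits_;
};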
What comes to my mind is gperf. However, it is based on strings, not numbers, though part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
At 54:30 the professor draws a picture of a standard way of doing perfect hashing. Minimal perfect hashing is beyond this lecture. (Good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot sees the same iTest value(s) multiple times, combine the map lookup with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
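As a sketch of that suggestion with only the standard library, assuming the list can be sorted once at load-up (g_numbers and LoadList are hypothetical names):

#include <algorithm>
#include <vector>

static std::vector<int> g_numbers; // filled once at app load-up

void LoadList(std::vector<int> numbers)
{
    std::sort(numbers.begin(), numbers.end());
    g_numbers.swap(numbers);
}

bool IsInList(int iTest)
{
    return std::binary_search(g_numbers.begin(), g_numbers.end(), iTest);
}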
It's not necessary or practical to aim for mapping N distinct, randomly dispersed integers to N contiguous buckets - i.e. a minimal perfect hash - the important thing is to identify an acceptable ratio. To do this at run time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary a fast-to-calculate hash algorithm (e.g. by changing the prime numbers used) to see how easily you can meet increasingly difficult ratios. For the worst-acceptable ratio you don't time out, or you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then allow a second or ten (configurable) for each X% better, until you can't succeed at that ratio or you reach the no-point-being-better ratio.
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptable to simply say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique mapping - a perfect hash.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In your case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position... binary search, or map, or something like that, so that you can guarantee that the container will work in all cases.
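A rough sketch of that modulus search, purely for illustration (findPerfectModulus is a made-up helper; what to do when it returns 0 is left to the fallback mentioned above):

#include <cstddef>
#include <set>
#include <vector>

// Returns a modulus m such that (value % m) is distinct for every input,
// or 0 if none was found in the searched range.
unsigned int findPerfectModulus(const std::vector<unsigned int>& values,
                                unsigned int maxModulus)
{
    unsigned int best = 0;
    for (unsigned int m = maxModulus; m > 0 && m >= values.size(); --m) {
        std::set<unsigned int> buckets;
        bool unique = true;
        for (std::size_t i = 0; i < values.size() && unique; ++i)
            unique = buckets.insert(values[i] % m).second;
        if (unique)
            best = m; // remember it and keep looking for a smaller range
    }
    return best;
}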
A trie, or perhaps a van Emde Boas tree, might be a better bet for creating a space-efficient set of integers with lookup time constant with respect to the number of objects in the data structure, assuming that even std::bitset would be too large.

How to check whether my custom hashing is good in hash_map?

I've written a custom hash function for my custom key in stdext::hash_map and would like to check whether the hasher is good. I'm using the STL supplied with VS 2008. A typical check, as I know, is to check the uniformity of the distribution among buckets.
How should I organize such a check correctly? A solution that comes to my mind is to modify the STL sources to add a method to hash_map that walks through the buckets and does the job. Are there any better ways?
Maybe derive from hash_map and add such a method there?
Your best bet might be to just run your hashing algorithm against an array of int counters and count the number of times each hash bucket is hit, given real-world data. (I'm suggesting taking the STL out of the equation here, really.)
If you end up seeing high deviation in your counts with large sets of real-world data, your hashing algorithm is generating lots of collisions when there are plenty of empty (or emptier) buckets available.
Note that 'high deviation' is a relative term. A good hash algorithm is a deterministic random process and any random process has a chance of generating strange results, so test often, test well, and wherever possible, use your actual problem domain as a source of your tests and your controls.
I'd run one (large) dataset through the hash_map. Once done, I'd collect the results for all buckets using the following method.
From hash_map:
size_type elems_in_bucket (size_type __n) const;
Finally, I would compute the standard deviation (SD) of the element-to-bucket distribution.
I'd do the above for different hash functions. Whichever hash function results in minimum SD is the winner (for this dataset).
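As a sketch of that measurement with a standard container (C++11 std::unordered_map exposes bucket_count() and bucket_size() directly; the VS2008 stdext::hash_map spelling, such as the elems_in_bucket member quoted above, differs):

#include <cmath>
#include <cstddef>
#include <unordered_map>

template <typename Map>
double bucketStdDev(const Map& m)
{
    const std::size_t buckets = m.bucket_count();
    const double mean = static_cast<double>(m.size()) / buckets;

    double sumSq = 0.0;
    for (std::size_t b = 0; b < buckets; ++b) {
        const double diff = static_cast<double>(m.bucket_size(b)) - mean;
        sumSq += diff * diff;
    }
    return std::sqrt(sumSq / buckets); // lower SD means a more uniform spread
}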