MurmurHash3_x86_32() expects a seed parameter. What value should I use and what does it do?
The seed parameter is a means for you to randomize the hash function. You should provide the same seed value for every call to the hash function within one use of it (e.g., one hash table), since equal inputs must always hash to equal values. However, each invocation of your application (assuming it is creating a new hash table) can use a different seed, e.g., a random value.
Why is it provided?
One reason is that attackers may exploit the properties of a hash function to mount a denial-of-service attack. They could do this by feeding your hash function strings that all hash to the same value, destroying the performance of your hash table. But if you use a different seed for each run of your program, the set of strings the attackers must use changes.
See: Effective DoS on web application platforms
There's also a Twitter hashtag, #hashDoS
The value named seed here acts as a salt. Provide any random but private (to your app) data for it, so the hash function will give different results for the same data. This feature is used, for example, to make a digest of your data in order to detect modification of the original data by third parties. They can hardly replicate the valid hash value unless they know the salt you used.
A salt (or seed) is also used to prevent hash collisions between different data. For example, your data blocks A and B might produce the same hash: h(A) == h(B). But you can avoid this conflict if you provide some additional data. Collisions are quite rare, but sometimes a salt is a way to avoid them for a concrete set of data.
Related
I'm going to use a Redis cache where the key is a Clojure map (serialized into bytes by nippy).
Can I use the hash of the Clojure map as a key in the Redis cache?
In other words, does the Clojure map hash depend only on the data structure's value, and not on any memory allocation?
Investigating:
I navigated through the code and found the IHashEq interface, which is implemented by the Clojure data structures.
Ultimately, the IHashEq implementation ends with a call to Object.hashCode, which has the following contract:
Whenever it is invoked on the same object more than once during
an execution of a Java application, the {@code hashCode} method
must consistently return the same integer, provided no information
used in {@code equals} comparisons on the object is modified.
This integer need not remain consistent from one execution of an
application to another execution of the same application.
Well, I just want to confirm that I cannot use the hash as a persistent ID in another process, because:
two equal values give two equal hash codes, but not vice versa, so there is a chance of collision;
there is no guarantee that a Clojure map's hash will be the same for the same value in different JVM processes.
Please confirm.
Re your two points:
Two equal values will yield the same hash code, but two unequal values may also give the same hash code. This chance of collision makes it a bad choice of key.
Different JVMs should generate the same hash code for a given value, given the same versions of Java and Clojure (and very probably across different versions, although this is not guaranteed).
You can use a secure hash library (like this one) to address your concerns (as in a blockchain), although you have to pay its performance penalty.
So I constructed my unordered_set passing 512 as min buckets, i.e. the n parameter.
My hash function will always return a value in the range [0, 511].
My question is: may I still get a collision between two values whose hashes are not the same? To be clear, I can tolerate any collision between values with the same hash, but I must not get collisions between values with different hashes.
Any sensible implementation would implement bucket(k) as hasher(k) % bucket_count(), in which case you won't get collisions from values with different hashes if the hash range is no larger than bucket_count().
However, there's no guarantee of this; only that equal hash values map to the same bucket. A bad implementation could (for example) ignore one of the buckets and still meet the container requirements; in which case, you would get collisions.
If your program's correctness relies on different hash values ending up in different buckets, then you'll have to either check the particular implementation you're using (perhaps writing a test for this behaviour), or implement your own container satisfying your requirements.
Since you don't have an infinite number of buckets and/or a perfect hash function, you will eventually get collisions (i.e. hashes referring to the same location) if you keep inserting keys (or even with far fewer keys; take a look at the birthday paradox).
The key to minimizing them is to tune your load factor and (as I suppose the STL does internally) deal with collisions. As for the bucket count, choose it so as to avoid rehashing.
I am new to hashing in general and also to the STL world, and I've seen the new std::unordered_set and the SGI hash_set, both of which use the hasher hash. I understand that to get a good load factor you might need to write your own hash function, and I have been able to write one.
However, I am trying to dig into how the original default hash functions are written.
My questions are:
1) How is the original default HashFcn written; more concretely, how is the hash generated?
Is it based on some pseudo-random number? Can anyone point me to a header file (I am a bit lost with the documentation) where I can look up how the hasher hash is implemented?
2) How does it guarantee that each time you will get the same key?
Please let me know if I can make my questions clearer in any way.
In the version of gcc that I happen to have installed here, the required hash functions are in /usr/lib/gcc/i686-pc-cygwin/4.7.3/include/c++/bits/functional_hash.h
The hashers for integer types are defined using the macro _Cxx_hashtable_define_trivial_hash. As you might expect from the name, this just casts the input value to size_t.
This is how gcc does it. If you're using gcc then you should have a similarly-named file somewhere. If you're using a different compiler then the source will be somewhere else. It is not required that every implementation uses a trivial hash for integer types, but I suspect that it is very common.
It's not based on a random number generator, and hopefully it's now pretty obvious to you how this function guarantees to return the same key for the same input every time! The reason for using a trivial hash is that it's as fast as it gets. If it gives a bad distribution for your data (because your values tend to collide modulo the number of buckets) then you can either use a different, slower hash function or a different number of buckets (std::unordered_set doesn't let you specify the exact number of buckets, but it does let you set a minimum). Since library implementers don't know anything about your data, I think they will tend not to introduce slower hash functions as the default.
A hash function must be deterministic -- i.e., the same input must always produce the same result.
Generally speaking, you want the hash function to produce all outputs with about equal probability for arbitrary inputs (while desirable, this is not mandatory -- and for any given hash function, there will always be arbitrarily many inputs that produce identical outputs).
Generally speaking, you want the hashing function to be fast, and to depend (to at least some degree) on the entirety of the input.
A fairly frequently seen pattern is: start with some semi-random initial value. Combine one byte of input with the current value. Do something that moves the bits around (multiplication, rotation, etc.). Repeat for all bytes of the input.
I know the original MD5 algorithm produces a 128-bit hash.
Following Mark Adler's comments here, I'm interested in getting a good 64-bit hash.
Is there a way to create an MD5-based 64-bit hash using OpenSSL? (MD5 looks good enough for my needs.)
If not, is there another algorithm implemented in the OpenSSL library that can get this job done with quality not less than MD5's (except for the length, of course)?
I'd claim that 'hash quality' is strongly related to the hash length.
AFAIK, OpenSSL does not have 64-bit hash algorithms, so the first idea I had is simple and most probably worthless:
halfMD5 = md5.hiQuadWord ^ md5.lowQuadWord
Finally, I'd simply use an algorithm with appropriate output, like crc64.
Some crc64 sources to verify:
http://www.backplane.com/matt/crc64.html
http://bioinfadmin.cs.ucl.ac.uk/downloads/crc64/
http://en.wikipedia.org/wiki/Computation_of_CRC
http://www.pathcom.com/~vadco/crc.html
Edit
At first glance, Jenkins looks perfect; however, I've had no luck so far finding a friendly C++ implementation of it. BTW, I'm wondering: since this is such a good hash for databases' duplication checking, how come none of the common open-source libraries, like OpenSSL, provide an API for it? – Subway
This might simply be due to the fact that OpenSSL is a crypto library first and foremost, using large hash values with appropriate cryptographic characteristics.
Hash algos for data structures have some other primary goals, e.g. good distribution characteristics for hash tables, where small hash values are used as an index into a list of buckets containing zero, one or multiple (colliding) element(s).
So the point is, whether, how and where collisions are handled.
In a typical DBMS, an index on a column will handle them itself.
Corresponding containers (maps or sets):
C++: std::size_t (32 or 64 bits) for std::unordered_multimap and std::unordered_multiset
Java: one would make a map with lists as buckets: HashMap<K, List<V>>
A unique constraint would additionally prohibit insertion of equal field contents:
C++: std::size_t (32 or 64 bits) for std::unordered_map and std::unordered_set
Java: int (32 bits) for HashMap and HashSet
For example, say we have a table with file contents (plaintext, non-crypto application) and a checksum or hash value for mapping or consistency checks. We want to insert a new file. For that, we precompute the hash value or checksum and query for existing files with an equal hash value or checksum, respectively. If none exists, there won't be a collision and insertion is safe. If there are one or more existing records, there is a high probability of an exact match and a lower probability of a 'real' collision.
In case collisions should be avoided, one could add a unique constraint to the hash column and reuse existing records, accepting the possibility of mismatching/colliding contents. Here, you'd want a database-friendly hash algorithm like Jenkins.
In case collisions need to be handled, one could add a unique constraint to the plaintext column. Less database-friendly checksum algorithms like CRC won't influence collisions among records and can be chosen according to the types of corruption to be detected or other requirements. It is even possible to use the XOR'ed quad words of an MD5, as mentioned at the beginning.
Some other thoughts:
If an index/constraint on the plaintext columns does the mapping, any hash value can be used to do reasonably fast lookups for potential matches.
No one will stop you from adding both a mapping-friendly hash and a checksum.
A unique constraint will also add an index, which is basically a hash table like those mentioned above.
In short, it greatly depends on what exactly you want to achieve with a 64-bit hash algorithm.
I would like to store v8::Persistent<v8::Object> handles in a hash-type container (more precisely, Google's dense_hash_set). Do I need to implement my own hasher function for this? Can I rely on the v8::Object::GetIdentityHash method for the hash value? Poking at the code, I can see that it basically just generates a random 32-bit number for the object and caches it. Is that enough to avoid hash collisions?
My answer is, yes, it can be used as a hash key, but...
According to this, int v8::Object::GetIdentityHash():
Returns the identity hash for this object.
The current implementation uses a hidden property on the object to
store the identity hash.
The return value will never be 0. Also, it is not guaranteed to be
unique.
It may generate the same hash for different objects, so you may have collisions. However, that's not enough reason to abandon this function.
The problem is keeping collision rates low, and that depends on the distribution of GetIdentityHash and the size of the hash table.
You can test it, count the collisions, and check whether it damages your performance or not.