I would like to store v8::Persistent<v8::Object> handles in a hash-based container (more precisely, Google's dense_hash_set). Do I need to implement my own hasher function for this? Can I rely on the v8::Object::GetIdentityHash method for the hash value? Poking at the code, I can see that it basically just generates a random 32-bit number for the object and caches it. Is that enough to avoid hash collisions?
My answer is, yes, it can be used as a hash key, but...
According to this, int v8::Object::GetIdentityHash():
Returns the identity hash for this object.
The current implementation uses a hidden property on the object to
store the identity hash.
The return value will never be 0. Also, it is not guaranteed to be
unique.
It may generate the same hash for different objects, so you may get collisions. However, that is not reason enough to abandon this function.
The real problem is keeping the collision rate low, and that depends on the distribution of GetIdentityHash values and the size of the hash table.
You can test it: count the collisions and check whether they actually hurt your performance.
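A minimal sketch of such a test follows. It does not call V8 at all: simulate_identity_hashes is a hypothetical stand-in that assumes the identity hash behaves like a uniformly random non-zero 32-bit value, and the bucket mapping (hash % bucket_count) is likewise an assumption, not how dense_hash_set necessarily maps hashes to slots.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Stand-in for v8::Object::GetIdentityHash(): one random non-zero 32-bit
// value per object. This models the assumed distribution, it is not V8 code.
std::vector<std::uint32_t> simulate_identity_hashes(std::size_t count,
                                                    std::uint64_t rng_seed) {
    std::mt19937_64 rng(rng_seed);
    std::uniform_int_distribution<std::uint32_t> dist(1, UINT32_MAX);
    std::vector<std::uint32_t> hashes(count);
    for (auto& h : hashes) h = dist(rng);
    return hashes;
}

// Number of objects whose hash value was already produced by an earlier object.
std::size_t count_duplicate_hashes(const std::vector<std::uint32_t>& hashes) {
    std::unordered_set<std::uint32_t> seen;
    std::size_t duplicates = 0;
    for (std::uint32_t h : hashes) {
        if (!seen.insert(h).second) ++duplicates;
    }
    return duplicates;
}

// Number of objects that land in an already occupied bucket, assuming the
// table maps a hash to bucket (hash % bucket_count).
std::size_t count_bucket_collisions(const std::vector<std::uint32_t>& hashes,
                                    std::size_t bucket_count) {
    assert(bucket_count > 0);
    std::vector<bool> occupied(bucket_count, false);
    std::size_t collisions = 0;
    for (std::uint32_t h : hashes) {
        std::size_t b = h % bucket_count;
        if (occupied[b]) {
            ++collisions;
        } else {
            occupied[b] = true;
        }
    }
    return collisions;
}
```

In a real test you would feed the values returned by GetIdentityHash for your own objects into the two counting functions and compare the result for a few table sizes.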
I'm going to use a Redis cache where the key is a Clojure map (serialized into bytes by nippy).
Can I use the hash of the Clojure map as a key in the Redis cache?
In other words, does a Clojure map's hash depend only on the value of the data structure, and not on memory allocation?
Investigating:
I navigated through the code and found the IHashEq interface, which is implemented by Clojure's data structures.
In the end, the IHashEq implementation ends up calling Object.hashCode, which has the following contract:
Whenever it is invoked on the same object more than once during
an execution of a Java application, the {@code hashCode} method
must consistently return the same integer, provided no information
used in {@code equals} comparisons on the object is modified.
This integer need not remain consistent from one execution of an
application to another execution of the same application.
Well, I just want to confirm that I cannot use the hash as a persistent ID in another process because:
two equal values give two equal hash codes, but not vice versa, so there is a chance of collision
there is no guarantee that a Clojure map's hash will be the same for the same values in different JVM processes
Please confirm.
Re your two points:
Two equal values will yield the same hash code, but two unequal values may also give the same hash code. So the chance of collision makes this a bad choice of key.
Different JVMs should generate the same hash code for a given value, given the same version of Java and Clojure (and very probably for different versions too, although this is not guaranteed).
You can use a secure hash library (like this one) to address your concerns (as in a blockchain), although you pay a performance penalty for it.
From some source^, I get a hash buffer of length 20 (SHA-1) for a particular piece of data (say, a file or a block of bytes). If this given hash (consider it a string, not a hash) is not found in the map, I pull more information and insert that information under this hash. To make it clear:
unordered_map<Hash_of_20_Bytes, Information>
This is my map. The key is a 20-byte buffer, and Information is some structure containing detailed information. So, if the source^ gives me some hash, I look that hash up in this map and use or generate the information as appropriate.
The point is, in my case, the given 20-byte hash is guaranteed not to have any collisions. However, unordered_map would still calculate the (FNV) hash for the key (the key itself being a hash!). Can't I instruct the collection class not to generate the hash, and instead use the key as the unique key itself (to ensure O(1))?
I am not sure whether unordered_map also computes a hash for integers (i.e. whether it skips the additional computation there).
One approach is to use a vector of pair<20-byte, Info> and do a binary search. However, that avoids the penalty of the hash computation (by the hash container) only to incur the greater penalty of keeping the vector sorted.
A hasher for std::unordered_map must satisfy the Hash concept. So it must return a std::size_t, which is extremely unlikely to be as large as 20 bytes.
Therefore it is not possible to provide an identity hasher for this 20-byte hash, and so even if no collision is guaranteed for the 20-byte hash, unless it can be reliably reduced to a 32-bit space (or rather a sizeof(std::size_t) space) without collision, collisions will be unavoidable for this case and this container.
You cannot use the hash as-is anyway, since unordered_map expects a size_t as hash, not a 20 bytes buffer.
Now, what you can do is provide an extremely simple custom hash function: since the input is already a good hash, you can just take the first sizeof(size_t) bytes and brutally memcpy them into a size_t, discarding all the others. I don't think you'll get incredible speedups, but it doesn't cost much to try this out.
Can't I instruct the collection class not to generate the hash, instead, use the key as unique-key itself (to ensure O(1))?
The underlying assumption here is flawed. Yes, your key is already a good, well-behaved hash, so you don't need to apply a complex hash function to it to get the expected hash properties, and you won't get collisions of the type "different data map to the same hash". But in general, if you have a decent hash function, most collisions don't come from the hash function mapping different keys to the same hash; they come from the current size of the hash table, i.e. from the fact that multiple hash values are mapped to the same bucket. So, again, you aren't going to gain much.
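Such a truncating hasher could look roughly like this. It is only a sketch: Sha1Digest, DigestPrefixHash, Information and InfoMap are names made up for illustration, and the key is assumed to be stored as a std::array of 20 bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_map>

// Hypothetical key type: a raw 20-byte SHA-1 digest.
using Sha1Digest = std::array<unsigned char, 20>;

// Hasher that reuses the first sizeof(std::size_t) bytes of the digest
// and ignores the rest.
struct DigestPrefixHash {
    std::size_t operator()(const Sha1Digest& d) const noexcept {
        static_assert(sizeof(std::size_t) <= sizeof(Sha1Digest),
                      "digest is shorter than size_t");
        std::size_t h;
        std::memcpy(&h, d.data(), sizeof h);
        return h;
    }
};

// Hypothetical payload standing in for the question's Information struct.
struct Information {
    std::string description;
};

// Key equality still compares all 20 bytes (std::array's operator==),
// so two digests that share a prefix remain distinct keys.
using InfoMap = std::unordered_map<Sha1Digest, Information, DigestPrefixHash>;
```

Two digests that differ only after the first sizeof(size_t) bytes hash to the same value here, but the map still tells them apart through the full key comparison.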
So I constructed my unordered_set passing 512 as min buckets, i.e. the n parameter.
My hash function will always return a value in the range [0,511].
My question is: may I still get a collision between two values whose hashes are different? To be clear, I tolerate collisions between values with the same hash, but I must not get collisions between values with different hashes.
Any sensible implementation would implement bucket(k) as hasher(k) % bucket_count(), in which case you won't get collisions from values with different hashes if the hash range is no larger than bucket_count().
However, there's no guarantee of this; only that equal hash values map to the same bucket. A bad implementation could (for example) ignore one of the buckets and still meet the container requirements; in which case, you would get collisions.
If your program's correctness relies on different hash values ending up in different buckets, then you'll have to either check the particular implementation you're using (perhaps writing a test for this behaviour), or implement your own container satisfying your requirements.
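A test for this behaviour might be sketched as follows. SmallRangeHash mimics the hasher described in the question (values in [0, 511]); the names are made up for illustration, and the outcome for keys with different hashes depends on the standard library you build against.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hasher modelled on the question: always returns a value in [0, 511].
struct SmallRangeHash {
    std::size_t operator()(int v) const noexcept {
        return static_cast<std::size_t>(v) % 512;
    }
};

using SmallSet = std::unordered_set<int, SmallRangeHash>;

// Returns true if no bucket of `s` holds two elements with different hashes.
bool distinct_hashes_get_distinct_buckets(const SmallSet& s) {
    const SmallRangeHash hash = s.hash_function();
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        auto first = s.begin(b);
        for (auto it = s.begin(b); it != s.end(b); ++it) {
            // Any hash mismatch inside one bucket is the unwanted collision.
            if (hash(*it) != hash(*first)) return false;
        }
    }
    return true;
}
```

You would construct the set with SmallSet s(512), insert your values, and call distinct_hashes_get_distinct_buckets(s). A false result on your implementation means different hashes do share buckets there.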
Since you don't have an infinite number of buckets and/or a perfect hash function, you would surely eventually get collisions (i.e. hashes referring to the same location) if you continue inserting keys (or even with fewer keys, take a look at the birthday paradox).
The key to minimizing them is to tune your load factor and (as I suppose the STL does internally) to handle the collisions that remain. As for the bucket count, choose it so as to avoid rehashing.
MurmurHash3_x86_32() expects a seed parameter. What value should I use and what does it do?
The seed parameter is a means for you to randomize the hash function. You should provide the same seed value for all calls to the hashing function within the same use of it (for example, for all keys of one hash table). However, each invocation of your application (assuming it is creating a new hash table) can use a different seed, e.g., a random value.
Why is it provided?
One reason is that attackers may use the properties of a hash function to construct a denial of service attack. They could do this by providing strings to your hash function that all hash to the same value destroying the performance of your hash table. But if you use a different seed for each run of your program, the set of strings the attackers must use changes.
See: Effective DoS on web application platform
There's also a Twitter tag for #hashDoS
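The mechanics can be sketched like this. To keep the example self-contained it does not call MurmurHash3_x86_32; seeded_fnv1a is a small seeded FNV-1a used purely as a stand-in, and make_process_seed is a hypothetical helper. The sketch shows how a per-run seed is used and makes no claim about how well either hash resists an attack.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <string>

// Seeded 64-bit FNV-1a, a stand-in for any hash that takes a seed argument.
std::uint64_t seeded_fnv1a(const std::string& key, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;  // offset basis mixed with the seed
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;  // 64-bit FNV prime
    }
    return h;
}

// Draws one seed per program run from the system's random device.
// The same seed must then be passed to every hash call for a given table.
std::uint64_t make_process_seed() {
    std::random_device rd;
    std::uint64_t high = rd();
    std::uint64_t low = rd();
    return (high << 32) | low;
}
```

Typical use is to call make_process_seed() once at startup, store the result, and pass it to every hash computation for the lifetime of the table.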
The value named seed here acts as a salt. Provide any random but private (to your app) data for it, so that the hash function gives different results for the same data. This feature is used, for example, to make a digest of your data in order to detect modification of the original data by third parties. They can hardly reproduce a valid hash value unless they know the salt you used.
A salt (or seed) is also used to work around hash collisions between different data. For example, your data blocks A and B might produce the same hash: h(A) == h(B). But you can avoid this conflict if you provide some additional data. Collisions are quite rare, but sometimes a salt is a way to avoid them for a concrete set of data.
Is there a way to write a simple hashtable with strings as keys and frequencies as values, so that there are NO collisions? There will be no removal from the hashtable, and if the object already exists in the hashtable, its frequency is just updated (the two are added together).
I was thinking there might be an algorithm that can compute a unique number from the string, which would be used as the index.
Yes, I am avoiding the use of all STL constructs, including unordered_map.
You can use any perfect hash generator like gperf
See here for a list: http://en.wikipedia.org/wiki/Perfect_hash_function
PS. You'd still possibly want to use a map instead of flat array/vector in case the mapped domain gets too big/sparse
It really depends on what you mean by 'simple'.
The std::map is a fairly simple class. Still, it uses a red-black tree with all of the insertion, deletion, and balancing nicely hidden away, and it is templated to handle any orderable type as a key and any type as the value. Most map classes use a similar implementation, and avoid any sort of hashing functionality.
Hashing without collisions is not a trivial matter whatsoever. Perhaps the simplest method is Pearson Hashing.
It seems like you have 3 choices:
Implement your own perfect hashing class. This would be a pretty good sized class with a lot of functionality and some decently complex algorithms. I don't think this is simple.
Download and use a perfect hashing library that is already out there. Of course, you have to worry about deployability.
Use STL's map class. It's embedded, well-documented, easy to use, type-flexible, and completely cross-platform. This seems like the 'simplest' solution.
If I may ask, why are you avoiding STL?
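For comparison, the std::map version of the frequency table is only a few lines. This is a sketch with made-up names (FrequencyTable, add_frequency), and it assumes frequencies are non-negative.

```cpp
#include <cassert>
#include <map>
#include <string>

using FrequencyTable = std::map<std::string, long>;

// Adds `count` to the stored frequency of `word`, creating the entry if needed.
void add_frequency(FrequencyTable& table, const std::string& word, long count) {
    assert(count >= 0);
    table[word] += count;  // operator[] value-initializes a missing entry to 0
}
```

Since the map is a tree keyed on the strings themselves, there are no hash collisions to think about, at the price of O(log n) lookups.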
If the set of possible strings is known beforehand, you can use a perfect hash function generator to do this. But otherwise, what you ask is impossible.
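To illustrate the known-set case, here is a toy brute-force search for a seed under which a fixed list of keys has no collisions. It is not what gperf or any real generator does; fnv1a_with_seed and find_perfect_seed are names made up for this sketch, and the search can fail if max_seed is too small for the chosen table size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Seeded 64-bit FNV-1a, used here as an arbitrary seedable hash.
std::uint64_t fnv1a_with_seed(const std::string& key, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Tries seeds 0 .. max_seed-1. On success, stores in `seed_out` the first seed
// for which every key gets its own slot in a table of `table_size` slots.
bool find_perfect_seed(const std::vector<std::string>& keys,
                       std::size_t table_size,
                       std::uint64_t max_seed,
                       std::uint64_t& seed_out) {
    assert(table_size >= keys.size());
    for (std::uint64_t seed = 0; seed < max_seed; ++seed) {
        std::vector<bool> used(table_size, false);
        bool collision = false;
        for (const std::string& k : keys) {
            std::size_t slot = fnv1a_with_seed(k, seed) % table_size;
            if (used[slot]) {
                collision = true;
                break;
            }
            used[slot] = true;
        }
        if (!collision) {
            seed_out = seed;
            return true;
        }
    }
    return false;  // no collision-free seed in the searched range
}
```

Once a seed is found, fnv1a_with_seed(key, seed) % table_size can index a plain array of counters directly, but only for keys from the original list.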
Now, it IS possible to make the likelihood of collisions extremely low by using a good hash function and making sure your table is huge. You basically need a big enough table to make the likelihood of invoking the Birthday Paradox low enough to suit you. Then you just use n bits of output from SHA-1, and 2^n will be your table size.
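To put numbers on "low enough", the birthday bound can be computed directly. The sketch below assumes the hash values are uniformly distributed over the table; collision_probability is a name made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Probability that at least two of `keys` uniformly random hash values
// fall into the same slot of a table with `slots` slots.
double collision_probability(std::size_t keys, std::size_t slots) {
    assert(slots > 0);
    if (keys > slots) return 1.0;  // pigeonhole: a collision is certain
    double p_no_collision = 1.0;
    for (std::size_t i = 1; i < keys; ++i) {
        // The (i+1)-th key must avoid the i slots already taken.
        p_no_collision *= 1.0 - static_cast<double>(i) / static_cast<double>(slots);
    }
    return 1.0 - p_no_collision;
}
```

The classic case collision_probability(23, 365) comes out at just over one half, which is why the table has to be far larger than the number of keys before collisions become unlikely.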
I'm also wondering if maybe you could use a Bloom filter and have an actual counter instead of bits. Keep a list of all the words you've stuffed into the bloom filter and what entries they've incremented (which will be the same each time) and you have yourself a gigantic linear function that you might be able to solve to get all the individual counts back out again.