Using the hash of a map as a key in a durable cache - clojure

I'm going to use a Redis cache where the key is a Clojure map (serialized into bytes by nippy).
Can I use the hash of the Clojure map as a key in the Redis cache?
In other words, does a Clojure map's hash depend only on the value of the data structure, and not on any memory allocation?
Investigating:
I navigated through the code and found the IHashEq interface, which is implemented by the Clojure data structures.
Ultimately, the IHashEq implementation ends up calling Object.hashCode, which has the following contract:
Whenever it is invoked on the same object more than once during
an execution of a Java application, the {@code hashCode} method
must consistently return the same integer, provided no information
used in {@code equals} comparisons on the object is modified.
This integer need not remain consistent from one execution of an
application to another execution of the same application.
Well, I just want to clarify that I cannot use the hash as an identifier persisted to another process, because:
two equal values give equal hash codes, but not vice versa, so there is a chance of collision;
there is no guarantee that a Clojure map's hash will be the same for the same value in different JVM processes.
Please confirm.

Re your two points:
Two equal values will yield the same hash code, but two unequal values may also give the same hash code. So the chance of collision makes this a bad choice of key.
Different JVMs should generate the same hash code for a given value, given the same version of Java & Clojure (& very probably for different versions, although this is not guaranteed).

You can use a secure hash library (like this one) to address your concerns (as in a blockchain), although you have to pay its performance penalty.
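For instance, a minimal C++ sketch of that idea, assuming OpenSSL's SHA-256 as the secure hash (the same approach applies in Clojure by digesting the nippy byte array). One caveat: equal maps only yield equal digests if they serialize to identical bytes, so the serializer must be deterministic.

#include <openssl/sha.h>  // link with -lcrypto
#include <cstdint>
#include <string>
#include <vector>

// Derive the cache key from the serialized bytes, not from the in-memory
// hash code: the digest depends only on byte content, so it is identical
// across processes, JVMs, and machines, and collisions are negligible.
std::string cacheKey(const std::vector<uint8_t>& serializedValue) {
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(serializedValue.data(), serializedValue.size(), digest);
    static const char hex[] = "0123456789abcdef";
    std::string key;
    key.reserve(2 * SHA256_DIGEST_LENGTH);
    for (size_t i = 0; i < SHA256_DIGEST_LENGTH; ++i) {
        key.push_back(hex[digest[i] >> 4]);
        key.push_back(hex[digest[i] & 0x0f]);
    }
    return key;
}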

Related

Understanding the differences between Google's sparse_hash_map/dense_hash_map/flat_hash_map?

After some survey, I know of these three hash data structures, sparse_hash/dense_hash/flat_hash, from CppCon 2017.
I know that they share some common techniques that make them faster than std::unordered_map:
key/value metadata kept within one cacheline, e.g. dense_hash_map/flat_hash_map arrange the metadata used to locate a key/value so that it fits in a single cacheline, which speeds up probing;
collision handling, e.g. dense_hash_map uses quadratic probing while flat_hash uses robin-hood hashing (see the sketch below)?
But if they use the same common mechanisms, they should have almost the same performance.
Are there any other algorithmic details that differentiate them?
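For reference, the two collision-handling schemes named above differ mainly in how they walk the table after hitting a full slot. A rough, illustrative C++ sketch (not the actual library code; the power-of-two capacity is an assumption):

#include <cstddef>
#include <cstdint>

// dense_hash_map-style triangular/quadratic probing: after the i-th
// collision, jump i slots further, visiting h, h+1, h+3, h+6, ...
size_t quadraticProbe(uint64_t hash, size_t attempt, size_t capacity) {
    // capacity is assumed to be a power of two, so the mask is cheap
    return (hash + attempt * (attempt + 1) / 2) & (capacity - 1);
}

// Robin-hood schemes probe linearly but remember how far each entry sits
// from its home slot; an inserting entry that is further from home evicts
// a closer ("richer") resident, and the displaced entry keeps probing.
size_t linearProbe(uint64_t hash, size_t attempt, size_t capacity) {
    return (hash + attempt) & (capacity - 1);
}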

Storing persistent v8 object handles in a hashed container

I would like to store v8::Persistent<v8::Object> handles in a hash-type container (more precisely, Google's dense_hash_set). Do I need to implement my own hasher function for this? Can I rely on the v8::Object::GetIdentityHash method for the hash value? Poking at the code, I can see that they basically just generate a random 32-bit number for the object and cache it. Is that enough to avoid hash collisions?
My answer is, yes, it can be used as a hash key, but...
According to this, int v8::Object::GetIdentityHash():
Returns the identity hash for this object.
The current implementation uses a hidden property on the object to
store the identity hash.
The return value will never be 0. Also, it is not guaranteed to be
unique.
It may generate the same key for different objects, so you may have collisions. However, that is not enough reason to abandon this function.
The real problem is keeping the collision rate low, and that depends on the distribution of GetIdentityHash and on the size of the hash table. For random 32-bit hashes, the birthday bound puts the expected number of colliding pairs at roughly n(n-1)/2^33, which stays small for tables of modest size.
You can test it: count the collisions and check whether they damage your performance or not.
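If you go this route, the container needs a hasher functor. A sketch against the older V8 API (where a Persistent handle is directly dereferenceable; the header and namespace of dense_hash_set vary by version, and the functor names here are illustrative):

#include <v8.h>
#include <google/dense_hash_set>

struct IdentityHasher {
    size_t operator()(const v8::Persistent<v8::Object>& obj) const {
        // Stable for the object's lifetime, never 0, but not unique.
        return static_cast<size_t>(obj->GetIdentityHash());
    }
};

struct IdentityEqual {
    bool operator()(const v8::Persistent<v8::Object>& a,
                    const v8::Persistent<v8::Object>& b) const {
        return a == b;  // handle equality: same underlying heap object
    }
};

typedef google::dense_hash_set<v8::Persistent<v8::Object>,
                               IdentityHasher,
                               IdentityEqual> ObjectSet;

Remember that dense_hash_set also requires you to reserve an "empty key" value via set_empty_key() before use.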

What is MurmurHash3 seed parameter?

MurmurHash3_x86_32() expects a seed parameter. What value should I use and what does it do?
The seed parameter is a means for you to randomize the hash function. You should provide the same seed value for all calls to the hashing function within the same use of it (for example, the same hash table). However, each run of your application (assuming it creates a new hash table) can use a different seed, e.g., a random value.
Why is it provided?
One reason is that attackers may use the properties of a hash function to construct a denial of service attack. They could do this by providing strings to your hash function that all hash to the same value destroying the performance of your hash table. But if you use a different seed for each run of your program, the set of strings the attackers must use changes.
See: Effective DoS on web application platform
There's also a Twitter tag for #hashDoS
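For reference, a minimal sketch of seeding the reference smhasher implementation, following the per-process random-seed pattern described above (the header name matches the reference distribution):

#include <cstdint>
#include <random>
#include <string>

#include "MurmurHash3.h"  // void MurmurHash3_x86_32(const void*, int, uint32_t, void*);

uint32_t hashKey(const std::string& key, uint32_t seed) {
    uint32_t out = 0;
    MurmurHash3_x86_32(key.data(), static_cast<int>(key.size()), seed, &out);
    return out;
}

int main() {
    // Pick one random seed per process and reuse it for every key that
    // goes into the same hash table.
    const uint32_t seed = std::random_device{}();
    const uint32_t h = hashKey("example", seed);
    (void)h;
    return 0;
}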
The value named seed here acts as a salt. Provide any random but private (to your app) data for it, and the hash function will give different results for the same data. This feature is used, for example, to make a digest of your data in order to detect modification of the original data by third parties; they can hardly replicate a valid hash value unless they know the salt you used.
A salt (or seed) is also used to reduce hash collisions for different data. For example, your data blocks A and B might produce the same hash: h(A) == h(B). But you may be able to avoid this conflict by providing some additional data. Collisions are quite rare, but sometimes a salt is a way to avoid them for a concrete set of data.

What's the correct way to generate random strings without duplicates

I'm thinking about generating random strings without creating any duplicates.
My first thought was to use a binary tree: create each string and search the tree for a duplicate, if any. But this may not be very efficient.
My second thought was to use an MD5-like hash method that creates messages based only on time, but this introduces another problem: different machines have clocks of different accuracy, and on a modern processor more than one string could be created within a single timestamp.
Is there a better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
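A minimal C++ sketch of that scheme (the prefix format is illustrative):

#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Uniqueness holds by construction (the strings are sequential), so no
// duplicate check is needed; the shuffle only randomizes the output order.
std::vector<std::string> uniqueRandomStrings(size_t n,
                                             const std::string& generatorId) {
    std::vector<std::string> out;
    out.reserve(n);
    for (size_t i = 0; i < n; ++i)
        out.push_back(generatorId + "-" + std::to_string(i));
    std::mt19937 rng(std::random_device{}());
    std::shuffle(out.begin(), out.end(), rng);
    return out;
}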
Beware of MD5: there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc... Two solutions off the top of my head:
1. Generate UUIDs, then turn them into Strings with a binary representation or a base 64 algorithm.
2. Simply generate random Strings and put them in a searchable structure (HashMap) so that you can find very quickly (O(1)-O(log n)) whether a generated String already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient choice, especially for insertions, as it will have to constantly re-balance itself (somewhat of an "expensive" operation).
I'd recommend using a HashSet-type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are expected constant time. Insert all your Strings into the Set. If you create a new String, check whether it already exists in the Set.
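A sketch of that check-against-a-set approach, with std::unordered_set playing the role of the HashSet (the alphabet and length are illustrative):

#include <random>
#include <string>
#include <unordered_set>

// Generate candidates and let the set reject duplicates; lookup and
// insertion are each expected O(1).
std::string freshRandomString(std::unordered_set<std::string>& seen,
                              std::mt19937& rng, size_t length) {
    static const std::string alphabet =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    std::uniform_int_distribution<size_t> pick(0, alphabet.size() - 1);
    for (;;) {
        std::string candidate;
        for (size_t i = 0; i < length; ++i)
            candidate += alphabet[pick(rng)];
        if (seen.insert(candidate).second)  // .second is true if newly added
            return candidate;
    }
}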
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify what programming language you're coding in. For instance, in Java this will work nicely: UUID.randomUUID().toString(). UUID identifiers are unique in practice, as is stated on Wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set-membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but only with high (though not quite 100%) certainty that a member is in the set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature isn't a problem.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
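A toy C++ Bloom filter to make the idea concrete (two derived hash probes; a real implementation would use k independent hashes sized to the expected element count):

#include <functional>
#include <string>
#include <vector>

// "Definitely not present" answers are exact; "maybe present" answers can
// be false positives. Deriving the second probe by appending a suffix is a
// shortcut for the sketch, not a recommended hash family.
class BloomFilter {
public:
    explicit BloomFilter(size_t bits) : bits_(bits, false) {}
    void add(const std::string& s) {
        bits_[h1(s)] = true;
        bits_[h2(s)] = true;
    }
    bool maybeContains(const std::string& s) const {
        return bits_[h1(s)] && bits_[h2(s)];
    }
private:
    std::vector<bool> bits_;
    size_t h1(const std::string& s) const {
        return std::hash<std::string>{}(s) % bits_.size();
    }
    size_t h2(const std::string& s) const {
        return std::hash<std::string>{}(s + "#") % bits_.size();
    }
};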
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/

A good repartition algorithm

I am implementing a memcached client library. I want it to support several servers and so I wish to add some load-balancing system.
Basically, you can do two operations on a server:
Store a value given its key.
Get a value given its key.
Let us say I have N servers (from 0 to N - 1); I'd like to have a repartition function which, given a key and the server count N, would give me an index in the [0, N[ range.
unsigned int getServerIndex(const std::string& key, unsigned int serverCount);
The function should be as fast and simple as possible and must respect the following constraint:
getServerIndex(key, N) == getServerIndex(key, N); // i.e., no random return.
I wish I could do this without using an external library (like OpenSSL and its hashing functions). What are my options here?
Side notes:
Obviously, the basic implementation:
unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
    return 0;
}
Is not a valid answer as this is not exactly a good repartition function :D
Additional information:
Keys will usually be any possible string within the ANSI charset (mostly [a-zA-Z0-9_-]). The size may be anything from a one-char key to whatever-size-you-want.
A good repartition algorithm is one for which the probability of returning a is equal (or not too far from) the probability of returning b, for two different keys. The number of servers might change (rarely, though), and if it does, it is acceptable for the returned index for a given key to change as well.
You're probably looking for something that implements consistent hashing. The easiest way to do this is to assign a random ID to each memcache server, and allocate each item to the memcache server which has the closest ID to the item's hash, by some metric.
A common choice for this - and the one taken by distributed systems such as Kademlia - would be to use the SHA1 hash function (though the exact hash function is not important), and to compare distances by XORing the hash of the item with the hash of the server and interpreting the result as a magnitude. All you need, then, is a way of making each client aware of the list of memcache servers and their IDs.
When a memcache server joins or leaves, it need only generate its own random ID, then ask its new neighbours to send it any items that are closer to its hash than to their own.
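A minimal C++ sketch of the ring variant of consistent hashing (the XOR-distance metric described above works the same way in spirit). std::hash stands in for SHA1 here and is not stable across processes, so a real client library would pin one specific hash:

#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Servers sit on a ring at the hash of their ID; a key lives on the first
// server at or after the key's own hash, wrapping around at the end.
// Assumes at least one server has been added.
class HashRing {
public:
    void addServer(const std::string& id) { ring_[h(id)] = id; }
    void removeServer(const std::string& id) { ring_.erase(h(id)); }
    const std::string& serverFor(const std::string& key) const {
        std::map<uint64_t, std::string>::const_iterator it =
            ring_.lower_bound(h(key));
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }
private:
    static uint64_t h(const std::string& s) {
        return std::hash<std::string>()(s);
    }
    std::map<uint64_t, std::string> ring_;
};

When a server joins or leaves, only the keys between it and its ring neighbour move, which is the property that a plain hash(key) % serverCount scheme lacks.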
I think the hashing approach is the right idea. There are many simplistic hashing algorithms out there.
With the upcoming C++0x and the newly standard unordered_map, the hash of strings is becoming a standard operation. Many compilers are already delivered with a version of the STL which features a hash_map and thus already have a pre-implemented hash function.
I would start with those... but it would be better if we had more information on your strings: are they somehow constrained to a limited charset, or is it likely that there will be many similar strings?
The problem is that a "standard" hash might not produce a uniform distribution if the input is not uniformly distributed to begin with...
EDIT:
Given the information, I think the hash function already shipped with most STL implementations should work, since your keys do not seem to be concentrated in a narrow area. However, I am by no means an expert in probabilities, so take it with a grain of salt (and experiment).
What about something very simple like
hash(key) % serverCount
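Spelled out with the standard library hash as a stand-in (note that std::hash is not specified to be stable across processes or library versions, so in practice every client of the pool must agree on one fixed hash, e.g. FNV-1a):

#include <functional>
#include <string>

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
    // hash(key) % serverCount: fast and evenly spread for hashes with good
    // avalanche behaviour, but most keys remap when serverCount changes
    // (unlike consistent hashing).
    return static_cast<unsigned int>(
        std::hash<std::string>()(key) % serverCount);
}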