What does `locality-sensitive` stand for in locality-sensitive hashing? - data-mining

What does locality-sensitive stand for in locality-sensitive hashing? Is there a formal definition of this term?

LSH maps high-dimensional vectors to buckets and tries to ensure that vectors that are "near" each other are mapped to the same bucket. The definition of "near" is just the neighborhood with respect to some distance function (e.g. Euclidean).
"Locality" refers to a region in space, and "sensitive" means that nearby locations are mapped to the same bucket. In other words, the output of the hashing function depends on (is sensitive to) the location in space (the locality).
This is my understanding; I am sure the theoretical folks have a more formal definition. Hope this helps.
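For reference, the formal definition used in the LSH literature (Indyk and Motwani): a family H of hash functions is called (r, cr, p1, p2)-sensitive for a distance d if, for any two points x and y and a function h drawn uniformly from H,

    d(x, y) <= r    =>   Pr[h(x) = h(y)] >= p1
    d(x, y) >= cr   =>   Pr[h(x) = h(y)] <= p2

with c > 1 and p1 > p2, i.e. nearby points collide with noticeably higher probability than distant ones.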

Usually, hash functions are designed to separate nearby values, to reduce the risk of collisions. Think of cryptographic hashes: you do want every single character change to completely change the hash code.
This does not hold for the hash functions as used in LSH. Well, technically it holds for the hash functions, but not for the step just before hashing: the data is put into buckets, a lossy operation, which usually will put nearby points into the same bucket. After that, only the bucket numbers are actually hashed (IIRC), so you don't get millions of buckets, but only as many as desired.
If you use several independent functions for projecting and binning, their buckets will likely overlap, so that you can find all true neighbors in at least one of the collision buckets the query point falls into.
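To make the "mapping and binning" step concrete, here is a minimal sketch of one classic LSH family, random-hyperplane (SimHash-style) hashing for cosine similarity; the class and parameter names are just illustrative, not taken from any particular library:

    // Minimal random-hyperplane LSH sketch: each of k random hyperplanes
    // contributes one bit, and the k-bit signature is the bucket key, so
    // vectors at a small angle tend to share a bucket. Assumes k <= 64.
    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    class HyperplaneLSH {
    public:
        HyperplaneLSH(std::size_t dim, std::size_t k, std::uint64_t seed = 42)
            : planes_(k, std::vector<double>(dim)) {
            std::mt19937_64 rng(seed);
            std::normal_distribution<double> gauss(0.0, 1.0);
            for (auto& plane : planes_)
                for (auto& coord : plane)
                    coord = gauss(rng);            // Gaussian hyperplane normals
        }

        // Bucket key: one bit per hyperplane, set when v lies on its positive side.
        std::uint64_t bucket(const std::vector<double>& v) const {
            std::uint64_t sig = 0;
            for (std::size_t i = 0; i < planes_.size(); ++i) {
                double dot = 0.0;
                for (std::size_t d = 0; d < v.size(); ++d)
                    dot += planes_[i][d] * v[d];
                if (dot >= 0.0)
                    sig |= (std::uint64_t{1} << i);
            }
            return sig;
        }

    private:
        std::vector<std::vector<double>> planes_;
    };

In practice you would build several such tables with independently drawn hyperplanes and take the union of the candidate buckets, which is exactly the overlap effect described above.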

Related

How does std::unordered_set allocate bucket to provide O(1) lookup?

The unordered_set should provide O(1) lookup time in case of a good hash function.
Each bucket contains items with the same hash value.
Let's assume that our hash function is ideal and gives no collisions at all.
Hash values can vary from 0 to max(std::size_t).
How to organize unordered_set without allocating a contiguous memory interval for the buckets and still provide O(1) lookup?
We cannot allocate a contiguous memory interval, because if we do, we use a lot of memory for only a few hash values - 0 and 1000000, for example. Values in the middle are not used at all, but we have allocated memory for them.
Each bucket contains items with the same hash value.
Wrong. Each bucket "contains" items with the same value of:
       hash(key) % current number of buckets.
Let's assume that our hash function is ideal and gives no collisions at all.
Not having collisions in pre-mod space isn't necessarily ideal: what matters is collisions after modding (or masking if there's a power-of-2 bucket count, e.g. Visual C++) into the number of buckets.
And even then, having no collisions is not (normally) the design goal for a hash function used with a hash table. That's known as perfect hashing, and it's usually only practical when the keys are all known up front, pre-inserted, and fast lookup is wanted for the rest of the application's lifetime. In the more common scenario, the aim is that, when inserting a new value, colliding with already-inserted values is roughly as likely as it would be if you picked a bucket at random. That's because a high-quality hash function effectively produces a random-but-repeatable value. Hash functions work well because it's acceptable to have a certain level of collisions, which is statistically related to the current load factor (and not to the size() of the container).
Hash values can vary from 0 to max(std::size_t).
How to organize unordered_set without allocating a contiguous memory interval for the buckets and still provide O(1) lookup?
You typically have somewhere between 1 and ~2 buckets per element at the point when the table reaches its largest size(). You can customise this somewhat by calling max_load_factor(float) to determine when the table resizes, but you can't customise by how much it resizes - that's left to the implementation. GCC will usually pick a prime a bit larger than twice the current size; Visual C++ will usually double the number of buckets.
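If you want to observe your implementation's policy, a small probe like the following (standard API only; the exact numbers printed are implementation-defined) shows the bucket count at each rehash:

    // Prints how an unordered_set grows its bucket count as elements are
    // inserted; expect primes with libstdc++ and powers of two with MSVC.
    #include <cstddef>
    #include <iostream>
    #include <unordered_set>

    int main() {
        std::unordered_set<int> s;
        std::size_t last = s.bucket_count();
        std::cout << "initial buckets: " << last
                  << ", max_load_factor: " << s.max_load_factor() << '\n';
        for (int i = 0; i < 100000; ++i) {
            s.insert(i);
            if (s.bucket_count() != last) {
                last = s.bucket_count();
                std::cout << "size " << s.size() << " -> " << last
                          << " buckets (load factor " << s.load_factor() << ")\n";
            }
        }
    }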
We cannot allocate a contiguous memory interval, because if we do, we use a lot of memory for only a few hash values - 0 and 1000000, for example.
Values in the middle are not used at all, but we have allocated memory for them.
This ignores the modding of the hash value into the bucket count. (It also ignores the viability of sparse arrays, which can be practical because virtual address space can be massively larger than the physical RAM backing it, but that's not the point of hash tables.)

Storing filepath and size in C++

I'm processing a large number of image files (tens of millions) and I need to return the number of pixels for each file.
I have a function that uses an std::map<string, unsigned int> to keep track of files already processed. If a path is found in the map, then the value is returned, otherwise the file is processed and inserted into the map. I do not delete entries from the map.
The problem is as the number of entries grow, the time for lookup is killing the performance. This portion of my application is single threaded.
I wanted to know whether unordered_map is the solution to this, or whether using std::string as keys is going to affect the hashing and require too many rehashes as the number of keys increases, thus once again killing the performance.
One other item to note is that the paths are expected (but not guaranteed) to share a common prefix, for example /common/until/here/now_different/. So all strings will likely have the same first N characters. I could potentially store them relative to the common directory. How likely is that to help performance?
unordered_map will probably be better in this case. It is typically implemented as a hash table with O(1) average lookup time, while map is usually a binary tree with O(log n) lookups. It doesn't sound like your application cares about the order of items in the map; it's just a simple lookup table.
In both cases, removing the common prefix should help, as it means less time spent needlessly iterating over that part of the strings. For unordered_map the string has to be traversed twice: once to hash it and once to compare against the keys in the table. Some hash functions also limit the amount of a string they hash, to prevent O(n) hash performance - if the common prefix is longer than this limit, you'll end up with a worst-case hash table (everything in one bucket).
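For what it's worth, the lookup-or-process function from the question might look roughly like this with unordered_map; processImage here is just a hypothetical stand-in for the real image-decoding work:

    #include <string>
    #include <unordered_map>

    // Hypothetical stand-in for the real image-decoding routine.
    unsigned processImage(const std::string& path) {
        return static_cast<unsigned>(path.size());   // dummy value for the sketch
    }

    // Process each file at most once; cache the pixel count keyed by path.
    unsigned pixelCount(const std::string& path) {
        static std::unordered_map<std::string, unsigned> cache;
        auto it = cache.find(path);
        if (it != cache.end())
            return it->second;                       // already processed
        unsigned pixels = processImage(path);        // the expensive part
        cache.emplace(path, pixels);
        return pixels;
    }

Stripping the expected common prefix before using the path as a key is then a one-line change on the caller's side (e.g. path.substr(prefix.size())).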
I really like Galik's suggestion of using inodes if you can, but if not...
I will emphasise a point already made in the comments: if you have reason to care, always implement the alternatives and measure. The more reason, the more effort it's worth expending on that.
So, another option is to use a 128-bit cryptographic-strength hash function on your filepaths, then trust that statistically it's extremely unlikely to produce a collision. A rule of thumb is that if you have 2^n distinct keys, you want significantly more than a 2n-bit hash. For ~100M keys n is about 27, so you'd want well over 54 bits; you could probably get away with a 64-bit hash, but it's a little too close for comfort and leaves no headroom if the number of images grows over the years. Then use a vector to back a hash table of just the hashes and file sizes, with, say, quadratic probing. Your caller would ideally pre-calculate the hash of an incoming file path in a different thread, passing your lookup API only the hash.
The above avoids the dynamic memory allocation, indirection, and of course memory usage when storing variable-length strings in the hash table and utilises the cache much better. Relying on hashes not colliding may make you uncomfortable, but past a point the odds of a meteor destroying the computer, or lightning frying it, will be higher than the odds of a collision in the hash space (i.e. before mapping to hash table bucket), so there's really no point fixating on that. Cryptographic hashing is relatively slow, hence the suggestion to let clients do it in other threads.
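To make that concrete, here is a rough sketch of such a table: it stores only a precomputed 128-bit path hash plus the pixel count, uses open addressing with triangular quadratic probing, and never erases. The Hash128 type and the 50% growth threshold are illustrative choices, not tied to any particular library:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // 128-bit digest of the file path, computed elsewhere by whatever strong
    // hash you trust (placeholder type).
    struct Hash128 {
        std::uint64_t hi = 0, lo = 0;
    };
    inline bool operator==(const Hash128& a, const Hash128& b) {
        return a.hi == b.hi && a.lo == b.lo;
    }

    // Open-addressed table keyed only by the digest, storing the pixel count.
    // Triangular probing visits every slot when the capacity is a power of two;
    // the table grows at ~50% load and has no erase (not needed here).
    class PixelCountTable {
        struct Slot { Hash128 key; unsigned value = 0; bool used = false; };
        std::vector<Slot> slots_ = std::vector<Slot>(1024);
        std::size_t count_ = 0;

        std::size_t probe(const Hash128& k, std::size_t i) const {
            return (static_cast<std::size_t>(k.lo) + (i * i + i) / 2) & (slots_.size() - 1);
        }

    public:
        std::optional<unsigned> find(const Hash128& k) const {
            for (std::size_t i = 0; ; ++i) {
                const Slot& s = slots_[probe(k, i)];
                if (!s.used) return std::nullopt;    // empty slot: not present
                if (s.key == k) return s.value;
            }
        }

        void insert(const Hash128& k, unsigned pixels) {
            if (2 * (count_ + 1) > slots_.size()) grow();
            for (std::size_t i = 0; ; ++i) {
                Slot& s = slots_[probe(k, i)];
                if (!s.used) { s.key = k; s.value = pixels; s.used = true; ++count_; return; }
                if (s.key == k) { s.value = pixels; return; }
            }
        }

    private:
        void grow() {                                // double capacity, reinsert
            std::vector<Slot> old;
            old.swap(slots_);
            slots_.resize(old.size() * 2);
            count_ = 0;
            for (const Slot& s : old)
                if (s.used) insert(s.key, s.value);
        }
    };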
(I have worked with a proprietary distributed database based on exactly this principle for path-like keys.)
Aside: beware Visual C++'s string hashing - they pick 10 characters spaced along your string to incorporate in the hash value, which would be extremely collision prone for you, especially if several of those were taken from the common prefix. The C++ Standard leaves implementations the freedom to provide whatever hashes they like, so re-measure such things if you ever need to port your system.

QHash: Any weak spots performance-wise, besides rehashing? Why not?

This is more of a theoretical question. In addition, I have to admit, that I did some rather sophisticated performance tests about this some time ago, but I cannot find the source code anymore.
Some words about the type of application: I am focussing on scenarios with really big numbers of entries, from 50,000 up to some million, while memory consumption does not really matter.
I fully understand the basic concept of a hash data structure, why it generally has constant access times, and why rehashing is required at some point.
Any possible key is mapped to a certain slot in the hash structure. Of course, collisions may occur, resulting in multiple keys being mapped to the same slot. This is where implementation details come into play. As far as I know, there is some kind of logic using the "next" slot if the initially assigned slot is occupied.
My feeling is, there has to be a weak spot somewhere, performance-wise. Given a really big QHash, filled up just below its capacity, from which keys are then removed randomly while new keys are added (without ever increasing the total number of stored keys, making sure it does not rehash): I would think this has to lead to severe performance degradation in the long term.
Filling up a QHash just below its capacity with random values should result in a considerable amount of key collisions. Looking up a key affected by collisions requires multiple slots to be inspected, resulting in performance penalties. Removing keys and adding new random keys instead should make things even worse: contiguous sequences of colliding keys will be fragmented. Collisions occupy 'foreign' slots, forcing a key actually mapped to such a slot to be stored somewhere else. That slot might still be freed later...
To make it short, I would expect, for the given scenario (performing deletes and inserts on a QHash which is always kept just below its capacity), performance to degrade in the long run, either because lookup times increase due to increasing disorder, or because of periodic reordering.
However, I had taken some effort to show this performance degradation once (as I said, I cannot find this project anymore, so I stand here barehanded, I'm afraid), and I could not find any.
Is there a special magic in place regarding QHash handling collisions I am not aware of?
tl;dr;
Is there a special magic in place regarding QHash handling collisions I am not aware of?
There is no magic. You just misunderstood how hash maps work.
(*) I will treat the concept of a hash map, not the specific implementation QHash. And although there are several approaches to handling collisions, what I am describing here is the most common pattern. It is used, among others, by C++ std::unordered_map, Qt QHash, and Java HashMap.
Your terminology with "slots" is a bit confusing. I first thought that by "slot" you meant "bucket", but I now think you mean an element of a bucket.
So in a hash, colliding keys are stored in a bucket. This can be any container, from vector to list.
Finding a bucket is O(1) and finding a key inside a bucket is O(k), where k is the bucket's length. So key access is constant in best case and linear in worst case.
You seem to assume that the number of buckets somehow increases when the hash map fills its capacity. Well, there is no such thing as a capacity for a hash map (like there is for a vector, for instance). So the situation that you describe, "having a hash with a capacity of, let's say, 100, where in the worst case all 100 elements collide and are stored in the same bucket", can't happen.
For a hash map you have a measure called the "load factor", which is the average number of elements per bucket (size / bucket_count). The hash will increase the number of buckets (and recompute the hash of every element, repositioning them) when the load factor exceeds a threshold, the max load factor. Performance is assured first and foremost by the quality of the hash function, which must ensure that keys are uniformly spread across all buckets. But no matter how good the hash function is, you can still have situations where some buckets are considerably larger than the rest. The fail-safe for that is the aforementioned max load factor.
By consequence, the max load factor achieves two purposes:
it acts as a "logical capacity" if you will: it makes the hash naturally increase the buckets count in the scenario that elements are added to the hash uniformly, making all the buckets too large.
it acts as a "fail safe" for the hash function. It increases the buckets count in the (rare) scenario that you have multiple keys colliding on a small subset of buckets.

Strategy to set the number of initial buckets in `std::unordered_set` in C++

If we know that we're going to hash between m and n items, where m and n are relatively large, what's a reasonable strategy for setting the number of initial buckets for std::unordered_set? If it helps, in my case m=n/2. In general, I'd like to optimize for speed, but can't afford an unreasonable amount of memory. Thanks in advance.
tl;dr There's no simple answer. Either measure, or let the container manage the bucket size automatically.
As I tried to say in the comments, there are too many variables, and you don't seem to realise how vague you're being. It took an hour for you to even say which implementation you're interested in.
m and n are "relatively large" ... relative to what?
"These are the only two operations and I want them to be fast." Define fast? What's fast enough? What's too slow? Have you measured?
If you want to minimize the load factor, so that there is on average no more than one element per bucket (and so no iteration through a bucket is needed once the right bucket is known), then you'll need at least n buckets. But that doesn't guarantee one bucket per element, because the function used to determine the bucket from a hash code might return the same value for every pointer you put in the container. Knowing whether that's likely depends on the hash function being used, the function that maps hash codes to buckets, and the pointer values themselves.
For GCC the hash function for pointers is the identity function. For the default unordered_map implementation the mapping to buckets is hash_function(x) % bucket_count() and the bucket count is always a prime number, to reduce the likelihood of collisions. If the addresses you're storing in the hash map tend to be separated by multiples of the bucket count then they're going to end up in the same bucket. Knowing how likely that is depends on the number of buckets used for n (which you haven't stated) and the distribution of pointer values you're using (which you haven't stated).
If you use a custom hash function that has knowledge of the pointer values you expect to store then you could use a perfect hash function that uniformly distributes between [0, n) and then set the bucket_count() to n and ensure no collisions.
But it's not obvious that ensuring only a single element per bucket is worth it, because it uses more memory. Iterating through a bucket containing two or three elements is not going to be a bottleneck in most programs. Maybe it will be in yours, it's impossible to know because you haven't said what you want except it has to be fast. Which is so vague it's meaningless.
The only way to answer these questions is for you to measure the real world performance, nobody can give you a magic number that will make your code faster based on your vague requirements. If there was an easy answer that always makes things faster for "relatively large" number of elements then the standard library implementation should already be doing that and so you'd just be wasting your time doing the same thing by hand.
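That said, if after measuring you do decide to pre-size the table, the portable knobs are max_load_factor() and reserve(n), which ask the implementation for enough buckets to hold n elements without rehashing; the value of n below is just an example:

    #include <cstddef>
    #include <iostream>
    #include <unordered_set>

    int main() {
        const std::size_t n = 1000000;   // expected upper bound (example value)

        std::unordered_set<const void*> s;
        s.max_load_factor(1.0f);         // aim for ~1 element per bucket on average
        s.reserve(n);                    // enough buckets for n elements up front
        std::cout << "bucket_count(): " << s.bucket_count() << '\n';
    }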
As an alternative, if you can live with logarithmic performance (usually not a problem), use a std::map instead. Then you have guaranteed lookup complexity 100% of the time, no re-hashing. A very useful property e.g. in hard real-time systems.

Will I be guaranteed not to get collisions between different hash values in `unordered_set` if I specify the minimum bucket count in the constructor?

So I constructed my unordered_set passing 512 as min buckets, i.e. the n parameter.
My hash function will always return a value in the range [0,511].
My question is: may I still get a collision between two values whose hashes are not the same? Just to make it clearer, I can tolerate any collision between values with the same hash, but I must not get collisions between values with different hash values.
Any sensible implementation would implement bucket(k) as hasher(k) % bucket_count(), in which case you won't get collisions from values with different hashes if the hash range is no larger than bucket_count().
However, there's no guarantee of this; only that equal hash values map to the same bucket. A bad implementation could (for example) ignore one of the buckets and still meet the container requirements; in which case, you would get collisions.
If your program's correctness relies on different hash values ending up in different buckets, then you'll have to either check the particular implementation you're using (perhaps writing a test for this behaviour), or implement your own container satisfying your requirements.
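Such a test can be fairly short: construct the set asking for 512 buckets, insert one value per hash value, and check that no two distinct hash values share a bucket. NarrowHash below is a stand-in for the question's [0,511] hasher:

    #include <cstddef>
    #include <iostream>
    #include <unordered_set>

    struct NarrowHash {                  // stand-in for the question's hash function
        std::size_t operator()(int x) const { return static_cast<std::size_t>(x) % 512; }
    };

    int main() {
        std::unordered_set<int, NarrowHash> s(512);   // request at least 512 buckets
        for (int i = 0; i < 512; ++i) s.insert(i);    // hash values 0..511, all distinct

        bool ok = s.bucket_count() >= 512;
        for (int i = 0; ok && i < 512; ++i)
            for (int j = i + 1; ok && j < 512; ++j)
                if (s.bucket(i) == s.bucket(j))       // different hashes, same bucket?
                    ok = false;

        std::cout << (ok ? "distinct hashes landed in distinct buckets\n"
                         : "two different hash values share a bucket\n");
    }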
Since you don't have an infinite number of buckets and/or a perfect hash function, you would surely eventually get collisions (i.e. hashes referring to the same location) if you continue inserting keys (or even with fewer keys, take a look at the birthday paradox).
The key to minimizing them is to tune your load factor and (as I suppose the STL does internally) deal with collisions. Regarding the bucket count, you should choose it so as to avoid rehashing.