unordered_map of a hash - c++

From some source^, I get a 20-byte hash buffer (SHA-1) for a particular piece of data (say a file or a block of bytes). If this given hash (consider it a string, not a hash) is not found in the map, then I pull more information and insert that information under this hash. To make it clear:
unordered_map<Hash_of_20_Bytes, Information>
This is my map. The key is a 20-byte buffer, and Information is some structure containing detailed information. So, if the source^ gives me some hash, I look that hash up in this Information map and use/generate the information as appropriate.
The point is, in my case, the given 20-byte hash is guaranteed to have no collisions. However, unordered_map would still calculate a (FNV) hash for the key (the key itself being a hash!). Can't I instruct the container not to compute a hash, and instead use the key itself as the unique key (to ensure O(1))?
I am not sure whether unordered_map computes a hash for integer keys too (i.e. whether that additional computation is avoided there).
One approach is to use a vector of pair<20-byte, Info> and do a binary search. However, while that avoids the hash-computation penalty of the hash container, it incurs the greater penalty of keeping the vector sorted.

A hasher for std::unordered_map must satisfy the Hash concept, so it must return a std::size_t, which is extremely unlikely to be as large as 20 bytes.
Therefore it is not possible to provide an identity hasher for this 20-byte hash. Even though the 20-byte hash itself is guaranteed to be collision-free, unless it can be reliably reduced to a sizeof(std::size_t) space (typically 32 or 64 bits) without collision, collisions will be unavoidable for this container.

You cannot use the hash as-is anyway, since unordered_map expects a size_t as the hash, not a 20-byte buffer.
Now, what you can do is provide an extremely simple custom hash function: since the input is already a good hash, you can just take the first sizeof(size_t) bytes and brutally memcpy them into a size_t, discarding all the others. I don't think you'll get incredible performance speedups, but it doesn't cost much to try this out.
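For instance, a minimal sketch of such a hasher, assuming the 20-byte digest is held in a std::array<unsigned char, 20> (the names Hash20, Sha1PrefixHasher and Information are made up for the example):

#include <array>
#include <cstddef>
#include <cstring>
#include <unordered_map>

using Hash20 = std::array<unsigned char, 20>;   // the 20-byte SHA-1 buffer

struct Sha1PrefixHasher
{
    std::size_t operator()(const Hash20& h) const noexcept
    {
        // The key is already uniformly distributed, so just reuse its first bytes.
        std::size_t v = 0;
        std::memcpy(&v, h.data(), sizeof v);
        return v;
    }
};

struct Information { /* detailed information for one digest */ };

using InfoMap = std::unordered_map<Hash20, Information, Sha1PrefixHasher>;

std::array already provides operator==, so no custom key-equality predicate is needed.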
Can't I instruct the collection class not to generate the hash, and instead use the key itself as the unique key (to ensure O(1))?
The underlying assumption here is flawed. Yes, your key is already a good, well-behaved hash, so you don't need to apply a complex hash function over it to get the expected hash properties, and you won't get collisions of the type "different data map to the same hash". But in general, if you have a decent hash function, most collisions don't come from the hash function mapping different keys to the same hash; they come from the current size of the hash table - i.e. from the fact that multiple hash values are mapped to the same bucket. So, again, you aren't going to gain much.


Unordered map of unordered set in C++ 11

I wanted to implement something that maps an unordered set of integers to an integer value - some kind of C++ equivalent of a Python dict with sets as keys and ints as values.
So far I used std::map<std::set<int>, int> set_lookup; but from what I understood this is unnecessarily slow as it uses trees. I don't care about the ordering, only speed is important.
From what I understand, the desired structure is std::unordered_map<std::unordered_set<int>, int, hash> set_lookup; which needs a hash function to work.
Is this the right approach? And what would a minimal running example look like? I couldn't find out what the hash part should look like.
It isn't clear whether you are asking about the syntax for defining a hash function, or about how to define a mathematically good hash for a set of ints.
Anyway - in case it is the former, here is how you should technically define a hash function for your case:
#include <cstddef>
#include <unordered_map>
#include <unordered_set>

namespace std
{
    template <>
    struct hash<std::unordered_set<int>>
    {
        std::size_t operator()(const std::unordered_set<int>& k) const
        {
            // ...
            // Here you should compute and return a meaningful hash value for k:
            return 5;
        }
    };
}

int main()
{
    std::unordered_map<std::unordered_set<int>, int> m;
}
Having written that, I join the other comments in questioning whether this is a good direction for solving your problem.
You haven't described your problem, so I cannot answer that.
I understood [std::map<std::set<int>, int> set_lookup;] is unnecessarily slow as it uses trees.
Is [std::unordered_map<std::unordered_set<int>, int, hash>] the right approach?
It depends. If your keys are created and then not changed, and you want to be able to do a lot of lookups very fast, then a hash-table-based approach would indeed be good, but you'll need two things for that:
to be able to hash keys
to be able to compare keys
To hash keys, deciding on a good hash function is a bit of an art form. A rarely bad - but sometimes slower than necessary - approach is to use boost hash_combine (which is short enough that you can copy it into your code - see here for the implementation). If your integer values are already quite random across most of their bits, simply XORing them together would produce a great hash. If you're not sure, use hash_combine or a better hash (e.g. MURMUR32). The time taken to hash will depend on the time taken to traverse the set, and traversing an unordered_set typically involves a linked-list traversal (which tends to jump around in memory pages and is CPU-cache unfriendly). The best way to store the values for fast traversal is in contiguous memory - i.e. a std::vector<>, or a std::array<> if the size is known at compile time.
The other thing you need to do is compare keys for equality: that also works fastest when elements in the key are contiguous in memory, and consistently ordered. Again, a sorted std::vector<> or std::array<> would be best.
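As a sketch of those two requirements, assuming the keys are kept as sorted std::vector<int> (the hash_combine below is the usual boost formula copied in, and SortedVecHash is a made-up name):

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// boost-style hash_combine, copied in so the snippet is self-contained.
inline void hash_combine(std::size_t& seed, int v)
{
    seed ^= std::hash<int>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct SortedVecHash
{
    std::size_t operator()(const std::vector<int>& key) const
    {
        std::size_t seed = 0;
        for (int element : key)        // keys are kept sorted, so equal sets hash equally
            hash_combine(seed, element);
        return seed;
    }
};

// std::vector<int> already compares element-wise with ==, so it can serve
// as the key type directly, with the default std::equal_to for equality.
using SetLookup = std::unordered_map<std::vector<int>, int, SortedVecHash>;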
That said, if the sets for your keys are large and you can accept a statistical (rather than absolute) guarantee of key equality, you could use e.g. a 256-bit hash and code as if hash collisions always correspond to key equality. That's often not an acceptable risk, but if your hash is not collision-prone and is e.g. 256 bits wide, a CPU could run flat out for millennia hashing distinct keys and still be unlikely to produce the same hash even once, so it's an approach I've seen even financial firms use in their core in-house database products, as it can save so much time.
If you're tempted by that compromise, you'd want std::unordered_map<HashValue256, std::pair<int, std::vector<int>>>. To find the int associated with a set of integers, you'd hash the set first, then do a lookup. It's easy to write a hash function that produces the same output for a set or a sorted vector<> or array<>: just present the elements to something like hash_combine in the same sorted order during traversal (i.e. size_t seed = 0; for (auto& element : any_sorted_container) hash_combine(seed, element);). Storing the vector<int> means you can traverse the unordered_map later if you want to find all the key "sets". If you don't need to do that (e.g. you're only ever looking up the ints by keys known to the code at the time, and you're comfortable with the statistical improbability of a good hash colliding), you don't even need to store the keys/vectors: std::unordered_map<HashValue256, int>.
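If you do go that route, the container might look roughly like this (HashValue256 and FirstWordHasher are illustrative stand-ins; any decent 256-bit hash of the sorted elements would do):

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <utility>
#include <vector>

using HashValue256 = std::array<std::uint8_t, 32>;   // stand-in for a 256-bit digest

struct FirstWordHasher
{
    std::size_t operator()(const HashValue256& h) const noexcept
    {
        std::size_t v = 0;
        std::memcpy(&v, h.data(), sizeof v);   // the digest is already well mixed
        return v;
    }
};

// The value keeps both the looked-up int and the original (sorted) key set,
// so the map can still be enumerated later.
using HashKeyedLookup =
    std::unordered_map<HashValue256, std::pair<int, std::vector<int>>, FirstWordHasher>;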

how to specialize std::hash call operator for return types other than size_t?

This is the first time I've tried using std::unordered_map, c++17. I'm trying to build a quick LRU where I'm mapping sha1 digests to blocks of data. My sha1 object is fully comparable etc, but when I try to instantiate the map, I get this error:
/usr/include/c++/7/bits/hashtable_policy.h:87:34: error: no match for call to ‘(const std::hash<
kbs::crypto::basic_digest<20,kbs::crypto::openssl_sha1_digest_engine_policy> >) (const kbs::crypto::basic_digest<20, kbs::crypto::openssl_sha1_digest_engine_policy>&)’
noexcept(declval<const _Hash&>()(declval<const _Key&>()))>
So it looks like I can specialize std::hash for my user-defined type. But it always returns size_t, which is bad: going from 20 bytes to 8 bytes kinda defeats the purpose of using sha1 for a hash key.
Is there a way to work around this? Alternative containers? It's a waste to have to write my own hashmap. I guess I could use std::set...
Unordered map does not assume the hashes (the size_ts) are unique. It assumes keys are. And it performs well if the hashes are good.
Unordered map uses the size_t to determine which bucket to put the data into. It handles collisions in the size_t space fine.
Map your sha hash to a size_t however you want, and use the sha hash as your key. In the unlikely event that you get a size_t hashing collision (50/50 odds once you have roughly 4 billion elements in the unordered map, assuming good hashing - see the "birthday paradox" for the math; collisions happen more often with smaller hash tables, but the table grows dynamically), it will fall back on equality of your sha hash keys to avoid "real" collisions.
There are multiple kinds of collisions.
Sha-hash collisions: bad, means the same key for different data.
size_t hash collisions: meh, means the two elements will always be in the same bucket.
internal hash table collisions: common, means that at this specific size, the two elements are in the same bucket.
Unless most of your data maps to the same size_t, that map being lossy is perfectly fine. You just worry about the first kind of collision, and provide == on your sha-hashes.
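As a sketch of that advice, with a made-up Sha1Key standing in for the real digest type:

#include <array>
#include <cstddef>
#include <cstring>
#include <unordered_map>

// Hypothetical stand-in for the real 20-byte digest type.
struct Sha1Key
{
    std::array<unsigned char, 20> bytes;

    friend bool operator==(const Sha1Key& a, const Sha1Key& b)
    {
        return a.bytes == b.bytes;   // "real" collisions are resolved by key equality
    }
};

namespace std
{
    template <>
    struct hash<Sha1Key>
    {
        std::size_t operator()(const Sha1Key& k) const noexcept
        {
            // Lossy on purpose: the size_t only selects a bucket.
            std::size_t v = 0;
            std::memcpy(&v, k.bytes.data(), sizeof v);
            return v;
        }
    };
}

// std::unordered_map<Sha1Key, Block> now works as the LRU's backing store.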

What are some ways to scramble a matrix based on a 256 bit key?

Looking to build an algorithm that will scramble a matrix based on a 256-bit key. Given two m*n matrices A and B and key K, I would like A and B to be scrambled in the same way. So informally, if A==B, scramble(A,K)==scramble(B,K).
What I'm trying to do seems to have similarities to encryption, but I'm wholly unfamiliar with the field. I feel like there must be some things I can leverage from encryption algorithms to make the process fast and computationally efficient.
To clarify, the main purpose of the scrambling is to obfuscate the matrix content while still allowing for comparisons to be made.
It sounds like you might need a cryptographic hash. Feeding your matrix/image into one generates an (almost) unique hash value for it. This hash value is convenient, as it's constant-size and usually much smaller than the source data. It's practically impossible to go from the hash value back to the original data, and hashing the same image data again yields the same hash value.
If you want to add a secret key into this, you can concatenate the image data and the key, and compute the hash over that. With the same data and key you'll receive the same hash value, and if you change either, the hash value changes.
(Almost unique: by the pigeonhole principle, when turning a large input into a smaller hash value there must be multiple inputs that generate the same hash value. In practice, this is rarely a concern.)
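A rough sketch of that keyed-hash idea, using OpenSSL's SHA-256 purely as an example (any cryptographic hash library would do; keyed_fingerprint is a made-up name):

#include <openssl/sha.h>   // assumes OpenSSL is available

#include <array>
#include <vector>

using Key256 = std::array<unsigned char, 32>;
using Digest = std::array<unsigned char, SHA256_DIGEST_LENGTH>;

// Hash of (key || matrix bytes): the same matrix and key always give the same
// digest, so scrambled values stay comparable while the content is obfuscated.
Digest keyed_fingerprint(const Key256& key, const std::vector<unsigned char>& matrix_bytes)
{
    std::vector<unsigned char> buf;
    buf.reserve(key.size() + matrix_bytes.size());
    buf.insert(buf.end(), key.begin(), key.end());
    buf.insert(buf.end(), matrix_bytes.begin(), matrix_bytes.end());

    Digest out{};
    SHA256(buf.data(), buf.size(), out.data());
    return out;
}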

Will I be guaranteed not to get collisions between different hash values in `unordered_set` if I specify the minimum bucket count in the constructor?

So I constructed my unordered_set passing 512 as min buckets, i.e. the n parameter.
My hash function will always return a value in the range [0,511].
My question is, may I still get collisions between two values whose hashes are not the same? To make it clearer: I can tolerate any collision between values with the same hash, but I must not get collisions between values with different hashes.
Any sensible implementation would implement bucket(k) as hasher(k) % bucket_count(), in which case you won't get collisions from values with different hashes if the hash range is no larger than bucket_count().
However, there's no guarantee of this; only that equal hash values map to the same bucket. A bad implementation could (for example) ignore one of the buckets and still meet the container requirements; in which case, you would get collisions.
If your program's correctness relies on different hash values ending up in different buckets, then you'll have to either check the particular implementation you're using (perhaps writing a test for this behaviour), or implement your own container satisfying your requirements.
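A quick implementation-specific check along those lines might look like this (MyHash is a made-up hasher whose values stay in the range [0, 511]):

#include <cassert>
#include <cstddef>
#include <unordered_set>

struct MyHash
{
    std::size_t operator()(int v) const noexcept { return v % 512; }   // hash range [0, 511]
};

int main()
{
    std::unordered_set<int, MyHash> s(512);   // request at least 512 buckets

    // Verify that, on this implementation, keys with different hash values
    // never share a bucket.
    for (int a = 0; a < 512; ++a)
        for (int b = a + 1; b < 512; ++b)
            assert(s.bucket(a) != s.bucket(b));
}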
Since you don't have an infinite number of buckets and/or a perfect hash function, you will eventually get collisions (i.e. hashes referring to the same location) if you keep inserting keys (or even with far fewer keys - take a look at the birthday paradox).
The key to minimizing them is to tune your load factor and (as the STL does internally) deal with collisions. Regarding the bucket count, you should choose it so as to avoid rehashing.

STL Map versus Static Array

I have to store information about contents in a lookup table such that it can be accessed very quickly. I might need to refer to some of the elements in the lookup table recursively to get complete information about the contents. Which would be the better data structure to use:
A map, with one of the parameters (which will be unique across all entries in the lookup table) as the key and the rest of the information as the value
A static array with an entry for each unique key, accessed when needed via the same key as used in the map
I want my software to be robust, as any crash would be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly ( O(1) ). As a dumb example, for a substitution cipher, you could have a char cipher[256] and simply index with the ASCII code of a character to get the substitution character. If the keys are complex objects or simply too many, you're probably stuck with a map.
You might also consider a hashtable (see unordered_map).
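As a tiny sketch of the directly indexed case (the substitution cipher mentioned above):

#include <array>

// Dumb substitution cipher: the character code *is* the index, so lookup is one O(1) access.
std::array<char, 256> cipher{};   // filled with the substitution alphabet at startup

char substitute(unsigned char c)
{
    return cipher[c];
}

int main()
{
    cipher['a'] = 'x';        // example mapping
    return substitute('a');   // single array access
}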
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If however your keys are themselves between say 0..10000, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of your elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, O(log n) would mean around ~12 accesses; a hash table should be pretty much one or two accesses at most.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
Lookup in a std::map is O(log n), while a linear search in a static array is O(n) in the worst case.
I'd strongly opt for a std::map even if it has a larger memory footprint (which should not matter in most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;
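A small usage sketch, with std::string keys and int values standing in for MyKeyType and MyValueType:

#include <map>
#include <string>

typedef std::map<std::string, std::map<std::string, int> > MyDoubleMapType;

int main()
{
    MyDoubleMapType lookup;
    lookup["group"]["item"] = 42;     // operator[] creates both levels as needed

    auto outer = lookup.find("group");
    if (outer != lookup.end())
    {
        auto inner = outer->second.find("item");
        if (inner != outer->second.end())
            return inner->second;     // found the nested value
    }
    return -1;
}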