OLE Variants, as used by older versions of Visual Basic and pervasively in COM Automation, can store lots of different types: basic types like integers and floats, more complicated types like strings and arrays, and all the way up to IDispatch implementations and pointers in the form of ByRef variants.
Variants are also weakly typed: operations convert values between types without warning, depending on the operator applied and the current types of its operands. For example, comparing two variants for equality, one containing the integer 1 and the other containing the string "1", will return True.
So assuming that I'm working with variants at the underlying data level (e.g. VARIANT in C++ or TVarData in Delphi - i.e. the big union of different possible values), how should I hash variants consistently so that they obey the right rules?
Rules:
Variants that hash unequally should compare as unequal, both in sorting and direct equality
Variants that compare as equal for both sorting and direct equality should hash as equal
It's OK if I have to use different sorting and direct comparison rules in order to make the hashing fit.
The way I'm currently working is this: I normalize the variants to strings (if they fit) and treat them as strings; otherwise I work with the variant data as if it were an opaque blob, hashing and comparing its raw bytes. That has some limitations, of course: numbers 1..10 sort as [1, 10, 2, ... 9], etc. This is mildly annoying, but it is consistent and it is very little work. However, I do wonder if there is an accepted practice for this problem.
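For concreteness, here is a minimal sketch of that baseline on Windows, using the standard VariantChangeType coercion to normalize to a string where possible. The function name is illustrative, and error handling and the opaque-blob fallback are elided:

#include <windows.h>
#include <oleauto.h>
#include <string>

// Normalize a VARIANT to a wide string using OLE's own coercion rules.
// Returns false when the value cannot be represented as a string, in which
// case the caller falls back to hashing/comparing the raw variant bytes.
bool NormalizeVariant(const VARIANT& src, std::wstring& out)
{
    VARIANT dst;
    VariantInit(&dst);
    if (SUCCEEDED(VariantChangeType(&dst, &src, 0, VT_BSTR)))
    {
        // SysStringLen(NULL) returns 0, so a null BSTR yields an empty string.
        out.assign(dst.bstrVal ? dst.bstrVal : L"", SysStringLen(dst.bstrVal));
        VariantClear(&dst);
        return true;
    }
    VariantClear(&dst);
    return false;
}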
There's a built-in tension in your question between the use of a hash function and the stated requirements, which effectively ask that the hash be validated against its input. I'd suggest we keep in mind a few properties of hashes in general: information is lost during the hashing process, and hash collisions are to be expected. It is possible to construct a perfect hash without collisions, but it would be problematic (or impossible?) to construct a perfect hash function if the domain of the function is any possible OLE Variant. On the other hand, if we're not talking about a perfect hash, then your first rule is violated.
I don't know the larger context of what you're trying to accomplish, but I must push back on one of your assumptions: is a hash function really what you want? Your requirements could be met in a fairly straightforward way if you develop a system that encodes, not hashes, all of the possible OLE Variant attributes so that they can be recalled later and compared against other Variant images.
Your baseline implementation of converting the Variant to a string representation is moving in this direction. As you are no doubt aware, a Variant can contain pointers, double pointers, and arrays, so you'll have to develop a consistent string representation for each of these data types. I question whether this approach could really be classified as a hash. Aren't you just persisting data attributes?
Hash codes of VARIANTS that are equal should be equal.
Without knowing the equality and coercion rules that are used for testing equality, it is hard to come up with a proper implementation.
So, in summary: to make things comparable, you first stream them to a common format, string or blob.
How do you handle localisation, e.g. the formatting of reals? A real compared to a string containing the same real created in another locale will fail, as will a real written to a string with a different precision setting.
It sounds to me the definition of equal() is the problem, not the hashing. If "equal" values can be serialized to string (or blob) differently, hashing will fail.
I know that the STL std::hash class in C++ returns the hash only in the form of a number, even for strings. But I want a hash function that returns the hash as a mixture of letters and digits when passed an integer in C++, with few collisions. Is there any standard library function that I can use?
I want something like this:
H(12345) = a44f81ji234kop
with least number of collisions and a good distribution.
You can pick any normal hash function you like, then do the conversion to "a44f81ji234kop"-style text as a second step (discussed below). The Standard Library doesn't attempt to provide any guarantees on hash function quality, so as you seem to want those, you're better off picking a third party library, e.g. https://github.com/stbrumme/hash-library
Once you have a number, you can use base-36 encoding to convert it to the kind of numbers-plus-text representation you prefer. You can specify the base when converting:
int->text using std::to_chars,
text->int (i.e. if you want to get the numeric hash value back from the base-36 text) using std::stoull (std::stoi returns an int, which is too small for a typical 64-bit hash value).
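A short sketch of that two-step approach (hash, then base-36 encode), assuming a 64-bit std::size_t:

#include <charconv>
#include <functional>
#include <iostream>
#include <string>

int main()
{
    std::size_t h = std::hash<int>{}(12345);  // step 1: any hash you like
    char buf[32];
    auto res = std::to_chars(buf, buf + sizeof buf, h, 36); // step 2: base-36
    std::string text(buf, res.ptr);
    std::cout << text << '\n';                // exact output varies by implementation
    std::size_t back = std::stoull(text, nullptr, 36); // round-trip to the number
    std::cout << std::boolalpha << (back == h) << '\n'; // true
}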
The C++ standard library lacks functionality to combine hashes or to hash multiple objects of different types into one hash, unfortunately.
One good way is to use the hashing infrastructure from Types Don't Know #:
The problem solved herein is how to support the hashing of N different types of keys using M different hashing algorithms, using an amount of source code that is proportional to N+M, as opposed to the current system based on std::hash<T> which requires an amount of source code proportional to N*M. And consequently in practice today M==1, and the single hashing algorithm is supplied only by the std::lib implementor. As it is too difficult and error prone for the client to supply alternative algorithms for all of the built-in scalar types (int, long, double, etc.). Indeed, it has even been too difficult for the committee to supply hashing support for all of the types our clients might reasonably want to use as keys: pair, tuple, vector, complex, duration, forward_list etc.
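The core idea, sketched very roughly below: each type says once, via hash_append, how to expose its bytes, and any algorithm can consume them. This is a simplified illustration of the pattern, not the proposal's actual code; FNV-1a stands in for whichever algorithm you plug in (the uhash adaptor name follows the proposal):

#include <cstddef>
#include <string>

// M side: one hashing algorithm (FNV-1a here, purely for brevity).
struct fnv1a
{
    std::size_t state = 14695981039346656037ull;
    void operator()(const void* key, std::size_t len) noexcept
    {
        const unsigned char* p = static_cast<const unsigned char*>(key);
        for (std::size_t i = 0; i < len; ++i)
            state = (state ^ p[i]) * 1099511628211ull;
    }
    explicit operator std::size_t() const noexcept { return state; }
};

// N side: one hash_append overload per type, usable with any algorithm.
template <class Hasher>
void hash_append(Hasher& h, int i) { h(&i, sizeof i); }

template <class Hasher>
void hash_append(Hasher& h, const std::string& s)
{
    h(s.data(), s.size());
    hash_append(h, static_cast<int>(s.size())); // include length to avoid concatenation collisions
}

// Adaptor with the std::hash-style API for use in unordered containers.
template <class T, class Hasher = fnv1a>
struct uhash
{
    std::size_t operator()(const T& t) const noexcept
    {
        Hasher h;
        hash_append(h, t);
        return static_cast<std::size_t>(h);
    }
};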
The link below mentions the chance of collisions, but I am trying to use it to find duplicate entries:
http://www.cplusplus.com/reference/functional/hash/
I am using std::hash<std::string> and storing the return value in an std::unordered_set. If emplace fails, I mark the string as a duplicate.
Hashes are generally functions from a large space of values into a small space of values, e.g. from the space of all strings to 64-bit integers. There are a lot more strings than 64-bit integers, so obviously multiple strings can have the same hash. A good hash function is such that there's no simple rule relating strings with the same hash value.
So, when we want to use hashes to find duplicate strings (or duplicate anything), it's always a two-phase process (at least):
Look for strings with identical hash (i.e. locate the "hash bucket" for your string)
Do a character-by-character comparison of your string with other strings having the same hash.
std::unordered_set does this - and never mind the specifics. It does this for you, so it's redundant to hash the strings yourself and store the resulting hash values in an std::unordered_set; worse, two different strings with the same hash would then be wrongly flagged as duplicates. Store the strings themselves and let the container do the hashing.
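A minimal example of letting the container hash and compare:

#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    std::unordered_set<std::string> seen;
    for (const char* s : {"apple", "banana", "apple"})
    {
        // emplace hashes the string AND falls back to a real comparison
        // on collision, so duplicates are detected exactly.
        if (!seen.emplace(s).second)
            std::cout << "duplicate: " << s << '\n';
    }
}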
Finally, note that there are other features one could use for initial duplicate screening - or for searching among the same-hash values. For example, string length: Before comparing two strings character-by-character, you check their lengths (which you should be able to access without actually iterating the strings); different lengths -> non-equal strings.
Yes, it is possible that two different strings will share the same hash. Simply put, let's imagine you have a hash stored in an 8-bit type (unsigned char).
That is 2^8 = 256 possible values. That means you can only have 256 unique hashes of arbitrary inputs.
Since you can definitely create more than 256 different strings, there is no way the hash would be unique for all possible strings.
std::size_t is typically a 64-bit type, so if you used it as storage for the hash value you'd have 2^64 possible hashes - vastly more than 256, but still not enough to differentiate between all the possible strings you can create.
You just can't store an entire book in only 64 bits.
Yes it can return the same result for different strings. This is a natural consequence of reducing an infinite range of possibilities to a single 64-bit number.
There exist things called "perfect hash functions" which produce a hash function that will return unique results. However, this is only guaranteed for a known set of inputs. An unknown input from outside might produce a matching hash number. That possibility can be reduced by using a bloom filter.
However, at some point with all these hash calculations the program would have been better off doing simple string comparisons in an unsorted linear array. Who cares if the operation is O(1)+C if C is ridiculously big.
Yes, std::hash can return the same result for different std::strings.
How the buckets are created differs between compilers/standard-library implementations.
A compiler-specific implementation is described here:
hashing and rehashing for std::unordered_set
I've studied std::hash's references and found that it can't hash serialized data, like a char* buffer. Is that correct/normal? How can I hash a serialized buffer?
The idea with std::hash is to provide a general hashing algorithm for fixed-size data that is good enough for most uses, so users don't need to roll their own every time. Hashing variable-length input is a much more complex problem, often depending on characteristics of the data itself, which is too much to ask the standard library to solve with one algorithm; the implementation is therefore punted to the developer. For example, a hash algorithm that works great for ASCII strings might not work so well for data containing mostly zeros, and a good algorithm for the latter might give too many collisions for strings. (There are also speed tradeoffs; some hashing algorithms might work great for everything but be too slow.)
IIRC, an old, old hashing algorithm for ASCII strings is to simply multiply every character's ASCII value together. Needless to say, this is really fast and only works because there are no zeros.
So instead of using std::hash, you're supposed to write your own hashing class with the same API (i.e. it must define size_t operator()(Key)) and pass that class as the Hash template parameter to hash-using templates like std::unordered_set.
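For example, a hasher for a raw buffer wrapped in a small key type might look like the sketch below. The Record type is hypothetical, and FNV-1a is chosen arbitrarily; tune the algorithm to your data:

#include <cstddef>
#include <cstring>
#include <unordered_set>

// Hypothetical fixed-size serialized record used as a key.
struct Record
{
    unsigned char bytes[16];
    bool operator==(const Record& o) const
    {
        return std::memcmp(bytes, o.bytes, sizeof bytes) == 0;
    }
};

// Hasher with the std::hash API: size_t operator()(Key) const.
// FNV-1a over the raw bytes, purely for illustration.
struct RecordHash
{
    std::size_t operator()(const Record& r) const noexcept
    {
        std::size_t h = 14695981039346656037ull;
        for (unsigned char c : r.bytes)
            h = (h ^ c) * 1099511628211ull;
        return h;
    }
};

int main()
{
    std::unordered_set<Record, RecordHash> set;
    Record r{};
    std::memcpy(r.bytes, "hello", 6);
    set.insert(r);
}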
I am new to hashing in general and also to the STL world, and I've seen the new std::unordered_set and the SGI hash_set, both of which use a hasher. I understand that to get a good load factor you might need to write your own hash function, and I have been able to write one.
However, I am trying to dig into how the original default hash functions are written.
My questions are:
1) How is the original default HashFcn written; more concretely, how is the hash generated? Is it based on some pseudo-random number? Can anyone point me to a header file (I am a bit lost with the documentation) where I can look up how the hasher is implemented?
2) How does it guarantee that each time you will get the same key?
Please let me know if I can make my questions clearer in any way.
In the version of gcc that I happen to have installed here, the required hash functions are in /usr/lib/gcc/i686-pc-cygwin/4.7.3/include/c++/bits/functional_hash.h
The hashers for integer types are defined using the macro _Cxx_hashtable_define_trivial_hash. As you might expect from the name, this just casts the input value to size_t.
This is how gcc does it. If you're using gcc then you should have a similarly-named file somewhere. If you're using a different compiler then the source will be somewhere else. It is not required that every implementation uses a trivial hash for integer types, but I suspect that it is very common.
It's not based on a random number generator, and hopefully it's now pretty obvious to you how this function guarantees to return the same key for the same input every time! The reason for using a trivial hash is that it's as fast as it gets. If it gives a bad distribution for your data (because your values tend to collide modulo the number of buckets) then you can either use a different, slower hash function or a different number of buckets (std::unordered_set doesn't let you specify the exact number of buckets, but it does let you set a minimum). Since library implementers don't know anything about your data, I think they will tend not to introduce slower hash functions as the default.
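Paraphrased, the trivial integer hash that the macro defines boils down to something like this (the shape of it, not the exact libstdc++ source):

#include <cstddef>

// The "hash" of an integer is just the value itself, cast to size_t:
// deterministic, as fast as possible, and obviously the same every time.
template <class Integer>
struct trivial_hash
{
    std::size_t operator()(Integer value) const noexcept
    {
        return static_cast<std::size_t>(value);
    }
};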
A hash function must be deterministic -- i.e., the same input must always produce the same result.
Generally speaking, you want the hash function to produce all outputs with about equal probability for arbitrary inputs (but while desirable, this is not mandatory - and for any given hash function, there will always be an arbitrary number of inputs that produce identical outputs).
Generally speaking, you want the hashing function to be fast, and to depend (to at least some degree) on the entirety of the input.
A fairly frequently seen pattern is: start with some semi-random initial value; combine one byte of input with the current value; do something that will move the bits around (multiplication, rotation, etc.); repeat for all bytes of the input.
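A minimal instance of that pattern (djb2-style constants, used only to illustrate the shape):

#include <cstddef>

std::size_t mix_hash(const unsigned char* data, std::size_t len)
{
    std::size_t h = 5381;            // semi-random starting value
    for (std::size_t i = 0; i < len; ++i)
        h = h * 33 + data[i];        // combine a byte, then move the bits around
    return h;
}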
Usually, entities and components or other parts of the game code in data-driven design will have names that get checked if you want to find out which object you're dealing with exactly.
void Player::Interact(Entity *myEntity)
{
    if (myEntity->isNearEnough(this) && myEntity->GetFamilyName() == "guard")
    {
        static_cast<Guard*>(myEntity)->Say("No mention of arrows and knees here");
    }
}
If you ignore the possibility that this might be premature optimization, it's pretty clear that looking up entities would be a lot faster if their "name" was a simple 32 bit value instead of an actual string.
Computing hashes of the string names is one possible option. I haven't actually tried it, but with a 32-bit range and a good hashing function the risk of collision should be minimal.
The question is this: Obviously we need some way to convert in-code (or in some kind of external file) string-names to those integers, since the person working on these named objects will still want to refer to the object as "guard" instead of "0x2315f21a".
Assuming we're using C++ and want to replace all strings that appear in the code, can this even be achieved with language-built in features or do we have to build an external tool that manually looks through all files and exchanges the values?
Jason Gregory wrote this in his book:
At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune.
So you may want to look into that.
And about the build step you mentioned, he also talked about it. They basically encapsulate the strings that need to be hashed in something like:
_ID("string literal")
And use an external tool at build time to hash all the occurrences. This way you avoid any runtime costs.
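With modern C++ you can get a similar effect without an external tool by computing the hash at compile time. A sketch, with FNV-1a standing in for the CRC-32 variant mentioned in the quote, and _ID reused as the macro name from the book's example:

#include <cstdint>

// constexpr FNV-1a: the compiler folds _ID("...") to a 32-bit constant,
// so no hashing happens at runtime for literal strings.
constexpr std::uint32_t fnv1a_32(const char* s,
                                 std::uint32_t h = 2166136261u)
{
    return *s ? fnv1a_32(s + 1, (h ^ static_cast<unsigned char>(*s)) * 16777619u)
              : h;
}

#define _ID(str) fnv1a_32(str)

static_assert(_ID("guard") == _ID("guard"), "same string, same id");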
This is what enums are for. I wouldn't dare to decide which resource is best for the topic, but there are plenty to choose from: https://www.google.com/search?q=c%2B%2B+enum
I'd say go with enums!
But if you have a lot of code already using strings, well, either just keep it that way (simple and usually fast enough on a PC anyway) or hash it using some kind of CRC or MD5 into an integer.
This is basically solved by adding an indirection on top of a hash map.
Say you want to convert strings to integers:
Write a class that wraps both an array and a hash map. I call these classes dictionaries.
The array contains the strings.
The hash map's key is the string (shared pointers, or raw pointers into stable storage, work as well).
The hash map's value is the index into the array where the string is located, which is also the opaque handle returned to calling code.
When adding a new string to the system, first search the hash map for an existing entry and return its handle if present.
If it is not present, append the string to the array; its index is the new handle.
Store the string and the handle in the map, and return the handle (a sketch follows the notes below).
Notes/Caveats:
This strategy makes getting the string back from the handle run in constant time (it is merely an array dereference).
Handle identifiers are first come, first served, but if you serialize the strings instead of the handle values it won't matter.
Operator[] overloads for both the key and the value are fairly simple (registering new strings, or getting the string back), but wrapping the handle in a user-defined class (wrapping an integer) adds much-needed type safety, and also avoids ambiguity if you want the keys and values to be the same type (the overloaded operator[]s won't compile, etc.).
You have to store the strings in RAM, which can be a problem.
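A compact sketch of that dictionary. Names are illustrative; std::string keys duplicate the array's storage, which the pointer tricks in the notes above would avoid:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class StringDictionary
{
public:
    using Handle = std::size_t;

    Handle Intern(const std::string& s)
    {
        auto it = map_.find(s);
        if (it != map_.end())
            return it->second;          // already registered
        Handle h = strings_.size();     // index is the handle
        strings_.push_back(s);
        map_.emplace(s, h);
        return h;
    }

    // Constant time: just an array dereference.
    const std::string& Lookup(Handle h) const { return strings_[h]; }

private:
    std::vector<std::string> strings_;
    std::unordered_map<std::string, Handle> map_;
};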