Collate Hash Function - C++

In the locale object there is a collate facet.
The collate facet has a hash method that returns a long.
http://www.cplusplus.com/reference/std/locale/collate/hash/
Two questions:
Does anybody know what hashing method is used?
I need a 32-bit value.
If my long is longer than 32 bits, does anybody know of techniques for folding the hash into a shorter version? I can see that, if done incorrectly, folding could generate lots of clashes (and though I can cope with clashes, since I need to take them into account anyway, I would prefer that they be minimized).
Note:
I can't use C++0x features
Boost may be OK.

No, nobody really knows -- it can vary from one implementation to another. The primary requirements are (N3092, §20.8.15):
For all object types Key for which there exists a specialization hash<Key>, the instantiation hash<Key> shall:
satisfy the Hash requirements (20.2.4), with Key as the function call argument type, the DefaultConstructible requirements (33), the CopyAssignable requirements (37),
be swappable (20.2.2) for lvalues,
provide two nested types result_type and argument_type which shall be synonyms for size_t and Key, respectively,
satisfy the requirement that if k1 == k2 is true, h(k1) == h(k2) is also true, where h is an object of type hash<Key> and k1 and k2 are objects of type Key.
and (N3092, §20.2.4):
A type H meets the Hash requirements if:
it is a function object type (20.8),
it satisfies the requirements of CopyConstructible and Destructible (20.2.1),
the expressions shown in the following table are valid and have the indicated semantics, and
it satisfies all other requirements in this subclause.
§20.8.15 covers the requirements on the result of hashing, §20.2.4 on the hash itself. As you can see, however, both are pretty general. The table that's mentioned basically covers three more requirements:
A hash function must be "pure" (i.e., the result depends only on the input, not any context, history, etc.)
The function must not modify the argument that's passed to it, and
It must not throw any exceptions.
Exact algorithms definitely are not specified though -- and despite the length, most of the requirements above are really just stating requirements that (at least to me) seem pretty obvious. In short, the implementation is free to implement hashing nearly any way it wants to.
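For concreteness, a minimal user specialization that satisfies everything quoted above might look like the following (Key and the shift-xor mixing are placeholders of mine; any pure, non-throwing, equality-consistent function would do):

#include <cstddef>
#include <functional>

struct Key { int a; int b; };

namespace std {
template <>
struct hash<Key> {
    typedef size_t result_type;  // nested typedefs required by §20.8.15
    typedef Key argument_type;
    // Pure, non-modifying, non-throwing; equal keys yield equal hashes.
    size_t operator()(const Key& k) const noexcept {
        return hash<int>()(k.a) ^ (hash<int>()(k.b) << 1);
    }
};
}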

If the implementation uses a reasonable hash function, there should be no bits in the hash value that have any special correlation with the input. So if the hash function gives you 64 "random" bits, but you only want 32 of them, you can just take the first/last/... 32 bits of the value as you please. Which ones you take doesn't matter since every bit is as random as the next one (that's what makes a good hash function).
So the simplest and yet completely reasonable way to get a 32-bit hash value would be:
int32_t value = static_cast<int32_t>(hash(...));
(Of course this collapses groups of 4 billion values down to one, which looks like a lot, but that can't be avoided if there are four billion times as many source values as target values.)
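If you would rather fold than truncate, XOR-ing the high half into the low half keeps information from every input bit. A minimal sketch, assuming a 64-bit long (I use <cstdint> for brevity; Boost's cstdint provides the same typedefs without C++0x):

#include <cstdint>
#include <locale>
#include <string>

std::uint32_t collate_hash32(const std::string& s, const std::locale& loc)
{
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    unsigned long h = static_cast<unsigned long>(
        coll.hash(s.data(), s.data() + s.size()));
    if (sizeof(h) > 4)
        h = (h >> 16 >> 16) ^ (h & 0xffffffffUL); // fold high half into low
    return static_cast<std::uint32_t>(h);
}

(The double shift avoids undefined behaviour on platforms where long is only 32 bits wide.)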

Related

Is there a C++ hash function that returns hash as a mixture of letters and strings?

I know that the STL std::hash class in C++ returns the hash only in the form of numbers, even for strings. But I want a hash function which returns the hash as a mixture of letters and digits on passing an integer in C++, with few collisions. Is there any standard library function that I can use?
I want something like this:
H(12345) = a44f81ji234kop
with the least number of collisions and a good distribution.
You can pick any normal hash function you like, then do the conversion to "a44f81ji234kop"-style text as a second step (discussed below). The Standard Library doesn't attempt to provide any guarantees on hash function quality, so as you seem to want those, you're better off picking a third party library, e.g. https://github.com/stbrumme/hash-library
Once you have a number, you can use base-36 encoding to convert it to the kind of numbers-plus-text representation you prefer. You can specify the base when converting:
int -> text using std::to_chars,
text -> int (i.e. if you want to get the numeric hash value back from the base-36 text) using std::stoi (or std::stoull for values that don't fit in an int).
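Putting the two together, a minimal sketch (C++17 for std::to_chars; the exact digits you get depend on your implementation's std::hash):

#include <charconv>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

int main()
{
    std::size_t h = std::hash<int>()(12345);       // numeric hash value

    char buf[32];
    auto res = std::to_chars(buf, buf + sizeof buf, h, 36);
    std::string text(buf, res.ptr);                // base-36: digits + letters
    std::cout << text << '\n';

    std::size_t back = std::stoull(text, nullptr, 36);
    std::cout << (back == h) << '\n';              // prints 1: round-trip works
}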
The C++ standard library unfortunately lacks functionality to combine hashes or to hash multiple objects of different types into one hash.
One good way is to use the hashing infrastructure from Types Don't Know #:
The problem solved herein is how to support the hashing of N different types of keys using M different hashing algorithms, using an amount of source code that is proportional to N+M, as opposed to the current system based on std::hash<T> which requires an amount of source code proportional to N*M. And consequently in practice today M==1, and the single hashing algorithm is supplied only by the std::lib implementor. As it is too difficult and error prone for the client to supply alternative algorithms for all of the built-in scalar types (int, long, double, etc.). Indeed, it has even been too difficult for the committee to supply hashing support for all of the types our clients might reasonably want to use as keys: pair, tuple, vector, complex, duration, forward_list etc.
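If you can't adopt that infrastructure, the common stopgap for combining hashes is a Boost-style hash_combine (this is boost::hash_combine's classic mixing step, not anything from the standard library):

#include <cstddef>
#include <functional>

template <class T>
void hash_combine(std::size_t& seed, const T& v)
{
    // 0x9e3779b9 is derived from the golden ratio; the shifts spread the bits.
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Usage:
//   std::size_t seed = 0;
//   hash_combine(seed, p.first);
//   hash_combine(seed, p.second);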

std::tuple sizeof, is it a missed optimization?

I've checked all major compilers, and sizeof(std::tuple<int, char, int, char>) is 16 for all of them. Presumably they just put elements in order into the tuple, so some space is wasted because of alignment.
If tuple stored elements internally like: int, int, char, char, then its sizeof could be 12.
Is it possible for an implementation to do this, or is it forbidden by some rule in the standard?
std::tuple sizeof, is it a missed optimization?
Yep.
Is it possible for an implementation to do this[?]
Yep.
[Is] it forbidden by some rule in the standard?
Nope!
Reading through [tuple], there is no constraint placed upon the implementation to store the members in template-argument order.
In fact, every passage I can find seems to go to lengths to avoid making any reference to member-declaration order at all: get<N>() is used in the description of operational semantics. Other wording is stated in terms of "elements" rather than "members", which seems like quite a deliberate abstraction.
In fact, some implementations do apparently store the members in reverse order, at least, probably simply due to the way they use inheritance recursively to unpack the template arguments (and because, as above, they're permitted to).
Speaking specifically about your hypothetical optimisation, though, I'm not aware of any implementation that doesn't store elements in [some trivial function of] the user-given order; I'm guessing that it would be "hard" to come up with such an order and to provide the machinery for std::get, at least as compared to the amount of gain you'd get from doing so. If you are really concerned about padding, you may of course choose your element order carefully to avoid it (on some given platform), much as you would with a class (without delving into the world of "packed" attributes). (A "packed" tuple could be an interesting proposal…)
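To illustrate the manual reordering (the sizes are typical for common ABIs, not guaranteed by the standard):

#include <iostream>
#include <tuple>

int main()
{
    // Interleaving char/int forces padding after each char on typical ABIs.
    std::cout << sizeof(std::tuple<int, char, int, char>) << '\n'; // commonly 16
    // Grouping by alignment lets the two chars share one padded tail.
    std::cout << sizeof(std::tuple<int, int, char, char>) << '\n'; // commonly 12
}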
Yes, it's possible and has been (mostly) done by R. Martinho Fernandes. He used to have a blog called Flaming Danger Zone, which is now down for some reason, but its sources are still available on github.
Here are all four parts of the Size Matters series on this exact topic: 1, 2, 3, 4.
You might wish to view them raw, since GitHub doesn't understand the C++ highlighting markup used and renders the code snippets as unreadable one-liners.
He essentially computes a permutation of the tuple indices via a C++11 template metaprogram that sorts the elements by alignment in non-ascending order, stores the elements according to that permutation, and then applies it to the index on every access.
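A hand-rolled illustration of that idea for one fixed element list (the series derives the permutation generically; packed_tuple_cici and perm here are made up for the example):

#include <cstddef>
#include <iostream>
#include <tuple>

// Logical order is <char, int, char, int>; storage is permuted to
// <int, int, char, char>, which is alignment-descending and padding-free
// between elements.
struct packed_tuple_cici
{
    std::tuple<int, int, char, char> storage;
};

// Compile-time permutation: logical index -> storage index.
constexpr std::size_t perm[4] = {2, 0, 3, 1};

template <std::size_t I>
auto get(packed_tuple_cici& t) -> decltype(std::get<perm[I]>(t.storage))
{
    return std::get<perm[I]>(t.storage);
}

int main()
{
    packed_tuple_cici p;
    get<1>(p) = 42;                  // logical element 1 (the first int) -> slot 0
    std::cout << get<1>(p) << '\n';  // 42
    std::cout << sizeof(p.storage) << " vs "
              << sizeof(std::tuple<char, int, char, int>) << '\n'; // commonly 12 vs 16
}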
They could. One possible reason they don't: some architectures, including x86, have an indexing mode that can address base + size × index in a single instruction, but only when size is a power of 2. Or it might be slightly faster to do a load or store aligned to a 16-byte boundary. This could make code that addresses arrays of std::tuple slightly faster and more compact if the four padding bytes are kept.

How to handle mixing of different libraries (eg. stl and eigen3) that use different types for indices (size_t, int, ...)

I have the following problem. I have some code that uses Eigen3. Eigen3 uses int or long int for indices. At some points in the code I have to store values from the Eigen arrays in a std::vector.
Here is some example:
std::vector<double> myStdVector;
Eigen::VectorXd myEigen;
// ...
for (size_t i = 0; i < myStdVector.size(); i++)
{
    myStdVector[i] = myEigen(i);
}
Here I get the compiler warning:
warning: implicit conversion loses integer precision: 'const size_t'
(aka 'const unsigned long') to 'int'
So of course I could add a static_cast<int>(i) in all the places where such a scenario occurs, but I wonder if there is a better way to deal with such things. I guess this comes up when mixing many other libraries too.
In this specific case, I would suggest using the smaller container's index type; this would be Eigen's index type, as determined by your Eigen::VectorXd. Ideally, it would be used as Eigen::Index, for forwards-compatibility.
It might also be worth looking into how Eigen defines its index type. In particular, you are allowed to redefine it if necessary by #defining the symbol EIGEN_DEFAULT_DENSE_INDEX_TYPE; it defaults to std::ptrdiff_t.
[Note, however, that in my own code, I generally prefer to use the larger index (in this case, size_t), but do range checks as if using the smaller of the index types if applicable (in this case, Eigen::Index). This is just a personal preference, however, and not necessarily what I consider to be the best option.]
Generally, when trying to choose the best index type, I would suggest that you look at their available ranges. First, if one or more of the potential types is signed, and negative values are actually possible*, you'll want to eliminate any unsigned types, especially ones that are larger than the largest signed type. Then look at your use case, eliminate any types that aren't viable for your intended purpose, and choose the best fit from the remaining candidates.
In your case specifically, you want to store values from an Eigen3 container in an STL container, where the Eigen3 container is indexed with ptrdiff_t and (as mentioned in your comment) to your knowledge only uses non-negative index values. In this case, either is a viable option; the range of non-negative index values provided by ptrdiff_t fits nicely inside size_t's range, and the loop condition will be determined by your VectorXd (and thus is also guaranteed to fit inside the Eigen3 container's index type). Thus, both potential types are viable choices. As the additional range provided by size_t is currently unnecessary, I would consider the index type provided by your Eigen setup to be slightly better suited to the task at hand.
*: While it's typically safe to assume that index values will always be positive due to how indexing works, I can see a few cases where allowing negatives would be beneficial. These are typically rare, though.
Note that I assumed the loop condition i<myStdVector.size() in your example code was a typo, due to not lining up with the initial description or the operation performed inside the loop body. If I was incorrect, then this decision becomes more complex.
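A sketch of the first suggestion above (loop with Eigen's index type and cast once at the std::vector boundary; copy_to_vector is a name of mine):

#include <cstddef>
#include <vector>
#include <Eigen/Dense>

void copy_to_vector(const Eigen::VectorXd& src, std::vector<double>& dst)
{
    dst.resize(static_cast<std::size_t>(src.size()));
    // Eigen::Index (std::ptrdiff_t by default) matches src.size(), so the
    // loop itself needs no conversion; the single cast sits at the boundary.
    for (Eigen::Index i = 0; i < src.size(); ++i)
        dst[static_cast<std::size_t>(i)] = src(i);
}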

C++ hash function, how is the original hasher, i.e. hash<int Key>, implemented

I am new to hashing in general and also to the STL world, and saw the new std::unordered_set and the SGI hash_set, both of which use the hasher hash<Key>. I understand that to get a good load factor you might need to write your own hash function, and I have been able to write one.
However, I am trying to go deeper into how the original default hash functions are written.
My questions are:
1) How is the original default HashFcn written; more concretely, how is the hash generated? Is it based on some pseudo-random number? Can anyone point me to some header file (I am a bit lost with the documentation) where I can look up how the hasher hash<Key> is implemented?
2) How does it guarantee that each time you will be able to get the same key?
Please let me know if I can make my questions any clearer.
In the version of gcc that I happen to have installed here, the required hash functions are in /usr/lib/gcc/i686-pc-cygwin/4.7.3/include/c++/bits/functional_hash.h
The hashers for integer types are defined using the macro _Cxx_hashtable_define_trivial_hash. As you might expect from the name, this just casts the input value to size_t.
This is how gcc does it. If you're using gcc then you should have a similarly-named file somewhere. If you're using a different compiler then the source will be somewhere else. It is not required that every implementation uses a trivial hash for integer types, but I suspect that it is very common.
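For illustration, the trivial hash described above boils down to something like this (a sketch of the macro's effect, not the verbatim libstdc++ source):

#include <cstddef>

struct trivial_int_hash
{
    std::size_t operator()(int x) const
    {
        return static_cast<std::size_t>(x); // identity cast, nothing more
    }
};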
It's not based on a random number generator, and hopefully it's now pretty obvious to you how this function guarantees to return the same key for the same input every time! The reason for using a trivial hash is that it's as fast as it gets. If it gives a bad distribution for your data (because your values tend to collide modulo the number of buckets) then you can either use a different, slower hash function or a different number of buckets (std::unordered_set doesn't let you specify the exact number of buckets, but it does let you set a minimum). Since library implementers don't know anything about your data, I think they will tend not to introduce slower hash functions as the default.
A hash function must be deterministic -- i.e., the same input must always produce the same result.
Generally speaking, you want the hash function to produce all outputs with about equal probability for arbitrary inputs (but while desirable, this is not mandatory -- and for any given hash function, there will always be some set of inputs that produce identical outputs).
Generally speaking, you want the hashing function to be fast, and to depend (to at least some degree) on the entirety of the input.
A fairly frequently seen pattern is: start with some semi-random value; combine one byte of input with the current value; do something that will move the bits around (multiplication, rotation, etc.); repeat for all bytes of the input.
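One concrete instance of that pattern is FNV-1a, shown here as an example of the shape (not as any implementation's actual default):

#include <cstdint>
#include <string>

std::uint64_t fnv1a(const std::string& s)
{
    std::uint64_t h = 14695981039346656037ULL;  // semi-random start (offset basis)
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        h ^= static_cast<unsigned char>(s[i]);  // combine one byte of input
        h *= 1099511628211ULL;                  // multiply to move the bits around
    }
    return h;
}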

What's the recommended implementation for hashing OLE Variants?

OLE Variants, as used by older versions of Visual Basic and pervasively in COM Automation, can store lots of different types: basic types like integers and floats, more complicated types like strings and arrays, and all the way up to IDispatch implementations and pointers in the form of ByRef variants.
Variants are also weakly typed: they convert the value to another type without warning depending on which operator you apply and what the current types are of the values passed to the operator. For example, comparing two variants, one containing the integer 1 and another containing the string "1", for equality will return True.
So assuming that I'm working with variants at the underlying data level (e.g. VARIANT in C++ or TVarData in Delphi - i.e. the big union of different possible values), how should I hash variants consistently so that they obey the right rules?
Rules:
Variants that hash unequally should compare as unequal, both in sorting and direct equality
Variants that compare as equal for both sorting and direct equality should hash as equal
It's OK if I have to use different sorting and direct comparison rules in order to make the hashing fit.
The way I'm currently working is I'm normalizing the variants to strings (if they fit), and treating them as strings, otherwise I'm working with the variant data as if it was an opaque blob, and hashing and comparing its raw bytes. That has some limitations, of course: numbers 1..10 sort as [1, 10, 2, ... 9] etc. This is mildly annoying, but it is consistent and it is very little work. However, I do wonder if there is an accepted practice for this problem.
There's a built-in tension in your question between the use of a hash function and the stated requirements, which are to be validated against the input of the hash. I'd suggest we keep in mind a few properties of hashes in general: information is lost during the hashing process, and hash collisions are to be expected. It is possible to construct a perfect hash without collisions, but it would be problematic (or impossible?) to construct a perfect hash function if the domain of the function is any possible OLE Variant. On the other hand, if we're not talking about a perfect hash, then your first rule is violated.
I don't know the larger context of what you're trying to accomplish, but I must push back on one of your assumptions: is a hash function really what you want? Your requirements could be met in a fairly straightforward way if you develop a system that encodes, not hashes, all of the possible OLE Variant attributes so that they can be recalled later and compared against other Variant images.
Your baseline implementation of converting the Variant to a string representation is moving in this direction. As you are no doubt aware, a Variant can contain pointers, double pointers, and arrays, so you'll have to develop consistent string representation of these data types. I question whether this approach could really be classified as a hash. Aren't you just persisting data attributes?
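For what it's worth, a sketch of the "consistent string representation" direction using VariantChangeType (error handling, arrays, and IDispatch values are glossed over; hash_variant is a made-up name):

#include <windows.h>
#include <oleauto.h>
#include <cstddef>
#include <functional>
#include <string>

std::size_t hash_variant(VARIANT& v) // non-const: older SDK headers lack const here
{
    VARIANT tmp;
    VariantInit(&tmp);
    // Coerce to a string so that values that compare equal after coercion
    // (e.g. the integer 1 and the string "1") normalize to the same text.
    if (SUCCEEDED(VariantChangeType(&tmp, &v, 0, VT_BSTR))) {
        std::wstring s = tmp.bstrVal
            ? std::wstring(tmp.bstrVal, SysStringLen(tmp.bstrVal))
            : std::wstring();
        VariantClear(&tmp);
        return std::hash<std::wstring>()(s);
    }
    VariantClear(&tmp);
    // Fallback: opaque-blob hashing of the raw bytes, as in the question
    // (this also hashes padding and pointer values, with the caveats above).
    const char* p = reinterpret_cast<const char*>(&v);
    return std::hash<std::string>()(std::string(p, sizeof(VARIANT)));
}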
Hash codes of VARIANTS that are equal should be equal.
Without knowing the equality and coercion rules that are used for testing equality, it is hard to come up with a proper implementation.
So in summary, to make stuff comparable you first stream to a common format, string or blob.
How do you handle e.g. localisation, such as the formatting of reals? A real compared to a string containing the same real created in another locale will fail. Or a real written to a string with a different precision setting.
It sounds to me like the definition of equal() is the problem, not the hashing. If "equal" values can be serialized to string (or blob) differently, hashing will fail.