What hash function should be used in a dictionary (hash table)? - c++

I'm writing an interpreter for a language.
Here is the problem: I want to create a dictionary type in which a value of any type can be stored under a key that is itself a value of any type, simple (int, float, string) or complex (list, array, dictionary, built from simple types or from other complex types, and so on). This is the same as in Python.
Which hash function algorithm should I use?
For strings there are many example hashes. The simplest: the sum of all characters multiplied by 31, reduced modulo HASH_SIZE, which is a prime number.
But for DIFFERENT TYPES, I think, the algorithm must be more complicated.
I found SHA256, but I don't know how to use its "unsigned char[32]" result for addressing in a hash table: that range is far larger than the RAM in a computer.
Thank you.

There are hash tables in C++11, newest C++ standard - std::unordered_map, std::unordered_set.
EDIT:
Since every type has different distribution, usually every type has its own hash function. This is how it's done in Java (.hashCode() method inherited from Object), C#, C++11 and many other implementations.
EDIT2:
Typical hash function does two things:
1.) Create object representation in a natural number. (this is what .hashCode() in Java does)
For example - string "CAT" can be transformed to:
67 * 256^2 + 65 * 256^1 + 84 = 4407636
2.) Map this number to position in array.
One way to do this is:
integer_part(fractional_part(k*4407636)*m)
Where k is a constant (Donald Knuth, in The Art of Computer Programming, recommends the golden ratio; for an integer key, (sqrt(5)+1)/2 and (sqrt(5)-1)/2 give the same fractional part), m is the size of your hash table, and fractional_part and integer_part compute the fractional part and the integer part of a real number.
In your hash table implementation you need to handle collisions, especially when there are many more possible keys than slots in your hash table.
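The two steps above could be sketched in C++ roughly as follows. The function names `string_to_number` and `multiplicative_index` are made up for illustration, and the constant used is (sqrt(5)-1)/2, which for integer keys has the same fractional-part behaviour as the (sqrt(5)+1)/2 mentioned above. This is only a sketch for short keys, since a double cannot represent large products exactly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <string>

// Step 1: turn the key into a natural number (base-256 digits, as in "CAT").
// Hypothetical helper; longer keys overflow 64 bits.
std::uint64_t string_to_number(const std::string& s) {
    std::uint64_t n = 0;
    for (unsigned char c : s) {
        n = n * 256 + c;
    }
    return n;
}

// Step 2: map the number to a slot in a table of size m:
// integer_part(fractional_part(k * n) * m)
std::size_t multiplicative_index(std::uint64_t n, std::size_t m) {
    assert(m > 0);
    const double k = (std::sqrt(5.0) - 1.0) / 2.0;      // about 0.618
    const double product = k * static_cast<double>(n);
    const double frac = product - std::floor(product);  // fractional part, in [0, 1)
    return static_cast<std::size_t>(frac * static_cast<double>(m));
}
```

A call such as `multiplicative_index(string_to_number("CAT"), 1024)` then yields a slot number below 1024; collisions still have to be handled by the table itself.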
EDIT3:
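I read more on the subject, and it looks like
67 * 256^2 + 65 * 256^1 + 84 = 4407636
is a really bad way to compute a hash code. Once the number no longer fits in a fixed-width integer, the leading characters overflow away and only the last few characters count, so "somethingAAAAAABC" and "AAAAAABC" give exactly the same hash code (with a 64-bit integer, only the last 8 characters matter).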
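A small sketch of why the two strings end up with the same code, assuming the number is kept in a 64-bit unsigned integer (the function name `base256_code` is made up). Each multiplication by 256 shifts the value left by 8 bits, so after 8 more characters the earlier ones have been pushed out of the integer entirely.

```cpp
#include <cstdint>
#include <string>

// Base-256 "hash code" held in 64 bits; unsigned arithmetic wraps modulo 2^64,
// so only the last 8 characters of the string influence the result.
std::uint64_t base256_code(const std::string& s) {
    std::uint64_t n = 0;
    for (unsigned char c : s) {
        n = n * 256 + c;
    }
    return n;
}
```

With this function, `base256_code("somethingAAAAAABC")` and `base256_code("AAAAAABC")` compare equal, because both strings end in the same 8 characters.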

Well, a common approach is to define the hash function as a method belonging to the type.
That way you can call different algorithms for different types through a common API.
That, of course, entails defining wrapper classes for every basic "C type" that you want to use in your interpreter.
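As a rough illustration of that idea, the sketch below gives each wrapper class its own `hash()` method behind a common interface. The class names (`Value`, `IntValue`, `StringValue`) and the helper `bucket_for` are hypothetical, and delegating to `std::hash` is just one possible choice for the per-type algorithm.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical base class for every value the interpreter can store.
struct Value {
    virtual ~Value() = default;
    virtual std::size_t hash() const = 0;  // each type picks its own algorithm
};

struct IntValue : Value {
    std::int64_t v;
    explicit IntValue(std::int64_t x) : v(x) {}
    std::size_t hash() const override { return std::hash<std::int64_t>{}(v); }
};

struct StringValue : Value {
    std::string v;
    explicit StringValue(std::string s) : v(std::move(s)) {}
    std::size_t hash() const override { return std::hash<std::string>{}(v); }
};

// The table only needs the common interface to pick a bucket.
std::size_t bucket_for(const Value& key, std::size_t bucket_count) {
    return key.hash() % bucket_count;
}
```

A complex type such as a list could implement `hash()` by combining the hashes of its elements, and the dictionary would also need an equality method on `Value` to resolve collisions.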

Related

c++ hash<string> is there a way to get the same value in linux and windows

I try to find a way to get the same result when I hash a given string in windows and in linux.
but for example if I run the following code:
hash<string> h;
cout << h("hello");
it will return 3305111549 in windows and 2762169579135187400 in linux.
If it is not possible to get the same return value across these 2 platforms, is there any other decent hash function that would return the same value on both systems?
No. As per std::hash reference, emphasis mine:
The actual hash functions are implementation-dependent and are not
required to fulfill any other quality criteria except those specified
above.
More specifically you are using the std::hash<std::string> template specialization whose hashes:
equal the hashes of corresponding std::basic_string_view classes
which are also implementation dependent. So no, you can not expect the same std::hash results with different implementations. Furthermore since C++14:
Hash functions are only required to produce the same result for the
same input within a single execution of a program;
Not only can you not rely on hash values being the same across platforms; the standard doesn't even guarantee that the hash value will be the same across different runs of the same program. It only guarantees that the value will be the same within a single run.
This is the only requirement the C++14 standard poses for the returned value (besides requiring that its type be std::size_t) (17.6.3.4):
The value returned shall depend only on the argument k for the
duration of the program. [ Note: Thus all evaluations of the
expression h(k) with the same value for k yield the same result for a
given execution of the program. — end note ]
[ Note: For two different values t1 and t2, the probability that h(t1) and h(t2) compare equal should be very small, approaching 1.0 /
numeric_limits<size_t>::max(). — end note ]
(where h is a hash functor, k is the key)
If you want to have the same value, then use a well-known hash algorithm, like MurmurHash3.
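MurmurHash3 is a little long to reproduce here, so as a stand-in the sketch below uses 32-bit FNV-1a, a different but also well-known algorithm with fixed, published constants. Because it only does fixed-width integer arithmetic on the bytes of the string, it returns the same value on any platform, provided the bytes of the string are the same. The function name `fnv1a_32` is an assumption for this example.

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV-1a: xor in each byte, then multiply by the FNV prime.
std::uint32_t fnv1a_32(const std::string& s) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;             // FNV prime; wraps modulo 2^32
    }
    return h;
}
```

Unlike `std::hash<std::string>`, the result here is fixed by the algorithm rather than by the standard library implementation.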
It won’t work with std::hash:
The actual hash functions are implementation-dependent and are not required to fulfill any other quality criteria except those specified above. Notably, some implementations use trivial (identity) hash functions which map an integer to itself. In other words, these hash functions are designed to work with unordered associative containers, but not as cryptographic hashes, for example.
http://en.cppreference.com/w/cpp/utility/hash
I try to find a way to get the same result when I hash a given string
in windows and in linux. but for example if I run the following code:
hash<string> h;
cout << h("hello");
it will return 3305111549 in windows and 2762169579135187400 in linux.
The results are correct. As mentioned in other answers, the C++ standard doesn't even guarantee that the values will be the same between different executions of the same program.
If it is not possible to get the same return value across these 2
platforms, is there any other decent hash function that would return
the same value on both systems?
Yes! You may want to check out Best hashing algorithms for speed and uniqueness for a list of good hash functions to implement.
However, after you select the one you want to use, you need one more guarantee: that the underlying representations of characters are the same on the two platforms. That is, the numerical representation of 'a' on platform 1 must be the same as that of 'a' on platform 2. If one platform uses ASCII and the other uses a different encoding scheme, you aren't likely to get the same results.
Again, the standard library already provides the specialization std::hash<std::string>. So, beyond what your standard library provides, there's nothing you can do to enforce a particular result for std::hash<std::string>()("hello"). Your options are to use:
a custom hash function object, e.g. myNAMESPACE::hash<std::string>()("hello"), or
a custom string type for which you specialize std::hash, e.g. std::hash<MyString>()("hello")
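The second option might look roughly like the sketch below. `MyString` here is a hypothetical aggregate wrapper (so it is constructed as `MyString{"hello"}` rather than from a bare literal), and the djb2-style loop is just one simple algorithm chosen for illustration. Note that the 64-bit intermediate value is the same everywhere, but the final conversion to `std::size_t` truncates it on platforms where `std::size_t` is 32 bits wide.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical wrapper type; the name MyString follows the example above.
struct MyString {
    std::string value;
    bool operator==(const MyString& other) const { return value == other.value; }
};

namespace std {
template <>
struct hash<MyString> {
    std::size_t operator()(const MyString& s) const noexcept {
        // djb2-style hash in 64-bit unsigned arithmetic: h = h * 33 + byte
        std::uint64_t h = 5381;
        for (unsigned char c : s.value) {
            h = h * 33 + c;
        }
        return static_cast<std::size_t>(h);
    }
};
}  // namespace std

// Containers pick up the specialization automatically.
using MyStringMap = std::unordered_map<MyString, int>;
```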

What are some checksum implementations that allow for incremental computation?

In my program I have a set of sets that are stored in a proprietary hash table. As with all hash tables, I need two functions for each element. First, I need the hash value to use for insertion. Second, I need a compare function for when there are conflicts. It occurs to me that a checksum function would be perfect for this. I could use the value in both functions. There's no shortage of checksum functions, but I would like to know if there are any commonly available ones that I wouldn't need to bring in a library for (my company is a PIA when it comes to that). A system library would be ok.
But I have an additional, more complicated requirement. I need for the checksum to be incrementally calculable. That is, if a set contains A B C D E F and I subtract D from the set, it should be able to return a new checksum value without iterating over all the elements in the set again. The reason for this is to prevent non-linearity in my code. Ideally, I'd like for the checksum to be order independent but I can sort them first if needed. Does such an algorithm exist?
Simply store a dictionary of items in your set, and their corresponding hash value. The hash value of the set is the hash value of the concatenated, sorted hashes of the items. In Python:
import hashlib
items = [b"A", b"B", b"C"]  # example items; hashlib needs bytes
# dictionary of hashes in string representation
hashes = {item: hashlib.sha384(item).hexdigest() for item in items}
sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join(sorted_hashes)
hash_of_the_set = hashlib.sha384(concatenated_hashes.encode('ascii')).hexdigest()
As hash function I would use sha384, but you might want to try Keccak-384.
Because there are (of course) no cryptographic hash functions with a length of only 32 bits, you have to use a checksum instead, like Adler-32 or CRC32. The idea remains the same. It is best to use Adler-32 on the items and CRC32 on the concatenated hashes:
import zlib
hashes = {item: zlib.adler32(item) for item in items}  # items are bytes objects
sorted_hashes = sorted(hashes.values())
# adler32 returns an int, so pack each value into 4 bytes before joining
concatenated_hashes = b''.join(h.to_bytes(4, 'big') for h in sorted_hashes)
hash_of_the_set = zlib.crc32(concatenated_hashes)
In C++ you can use Adler-32 and CRC-32 of Botan.
A CRC is a set of bits that are calculated from an input.
If your input is the same size (or less) as the CRC (in your case - 32 bits), you can find the input that created this CRC - in effect reversing it.
If your input is larger than 32 bits, but you know all the input except for 32 bits, you can still reverse the CRC to find the missing bits.
If, however, the unknown part of the input is larger than 32 bits, you can't find it as there is more than one solution.
Why am I telling you this? Imagine you have the CRC of the set
{A,B,C}
Say you know what B is, and you can now calculate easily the CRC of the set
{A,C}
(by "easily" I mean - without going over the entire A and C inputs - like you wanted)
Now you have 64 bits describing A and C! And since we didn't have to go over the entirety of A and C to do it - it means we can do it even if we're missing information about A and C.
So it looks like IF such a method exists, we can magically fix more than 32 unknown bits from an input if we have the CRC of it.
This obviously is wrong. Does that mean there's no way to do what you want? Of course not. But it does give us constraints on how it can be done:
Option 1: we don't gain more information from CRC({A,C}) that we didn't have in CRC({A,B,C}). That means that the (relative) effect of A and C on the CRC doesn't change with the removal of B. Basically - it means that when calculating the CRC we use some "order not important" function when adding new elements:
we can use, for example:
CRC({A,B,C}) = CRC(A) ^ CRC(B) ^ CRC(C) (not very good: if A appears twice, the CRC is the same as if it never appeared at all)
CRC({A,B,C}) = CRC(A) + CRC(B) + CRC(C)
CRC({A,B,C}) = CRC(A) * CRC(B) * CRC(C) (make sure CRC(X) is odd, so it's actually just 31 bits of CRC)
CRC({A,B,C}) = g^CRC(A) * g^CRC(B) * g^CRC(C) (where ^ is power; useful if you want it cryptographically secure)
and so on.
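A minimal sketch of the additive variant of Option 1, assuming 64-bit FNV-1a as a stand-in for the per-element CRC (the class name `SetHash` and its methods are made up). The set hash is the sum of the element hashes modulo 2^64, so adding or removing an element is a single addition or subtraction and the order of operations does not matter.

```cpp
#include <cstdint>
#include <string>

// Order-independent set hash: sum (modulo 2^64) of the element hashes.
// The caller is responsible for only removing elements that are in the set.
class SetHash {
public:
    void add(const std::string& element)    { sum_ += element_hash(element); }
    void remove(const std::string& element) { sum_ -= element_hash(element); }
    std::uint64_t value() const { return sum_; }

private:
    // 64-bit FNV-1a, used here in place of a CRC of the element.
    static std::uint64_t element_hash(const std::string& s) {
        std::uint64_t h = 14695981039346656037ull;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ull;
        }
        return h;
    }

    std::uint64_t sum_ = 0;
};
```

Building {A, B, C, D, E, F} and then calling `remove("D")` gives the same value as building {A, B, C, E, F} directly, in any insertion order.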
Option 2: we do need all of A and C to calculate CRC({A,C}), but we have a data structure that makes it less than linear in time to do so if we already calculated CRC({A,B,C}).
This is useful if you specifically want CRC32, and don't mind remembering more information in addition to the CRC after the calculation (the CRC is still 32 bits, but you keep a data structure of size O(len(A,B,C)) that you will later use to calculate CRC({A,C}) more efficiently)
How will that work? Many CRCs are just the application of a polynomial on the input.
Basically, if you divide the input into n chunks of 32 bit each - X_1...X_n - there is a matrix M such that
CRC(X_1...X_n) = M^n * X_1 + ... + M^1 * X_n
(where ^ here is power)
How does that help? This sum can be calculated in a tree-like fashion:
CRC(X_1...X_n) = M^(n/2) * CRC(X_1...X_(n/2)) + CRC(X_(n/2+1)...X_n)
So you begin with all the X_i on the leaves of the tree, start by calculating the CRC of each consecutive pair, then combine them in pairs until you get the combined CRC of all your input.
If you remember all the partial CRCs on the nodes, you can then easily remove (or add) an item anywhere in the list by doing just O(log(n)) calculations!
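The combining step can be illustrated with a simpler stand-in for the CRC matrix: the sketch below replaces M with multiplication by a constant B modulo 2^64 (a polynomial hash, not a real CRC). All names (`Partial`, `leaf`, `combine`, `fold`) and the choice of B are assumptions. Each partial result carries its hash and the power of B for its length, which is exactly what a tree node would need to store.

```cpp
#include <cstdint>
#include <vector>

// Summary of a range x_1..x_len of 32-bit chunks.
struct Partial {
    std::uint64_t hash;   // B^len * x_1 + ... + B^1 * x_len (mod 2^64)
    std::uint64_t power;  // B^len (mod 2^64)
};

constexpr std::uint64_t B = 1099511628211ull;  // arbitrary odd multiplier (assumption)

Partial leaf(std::uint32_t chunk) { return {B * chunk, B}; }

// Combine the summaries of two adjacent ranges: left followed by right.
// Mirrors CRC(left ++ right) = M^len(right) * CRC(left) + CRC(right).
Partial combine(const Partial& left, const Partial& right) {
    return {left.hash * right.power + right.hash, left.power * right.power};
}

// Reference implementation: fold over the whole sequence from left to right.
Partial fold(const std::vector<std::uint32_t>& chunks) {
    Partial acc{0, 1};  // summary of the empty sequence
    for (std::uint32_t c : chunks) {
        acc = combine(acc, leaf(c));
    }
    return acc;
}
```

Since `combine` is associative, the partial results can be stored in the nodes of a balanced tree; removing or inserting a chunk then only requires recomputing the nodes on one root-to-leaf path.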
So there - as far as I can tell, those are your two options. I hope this wasn't too much of a mess :)
I'd personally go with option 1, as it's just simpler... but the resulting CRC isn't standard, and is less... good. Less "CRC"-like.
Cheers!

Is there a library that would produce a string that would hash (SHA1) to a given input?

I'm wondering if it's possible to find a block of text that would hash to a known value. In particular, I'm looking for a function CreateDataFromHash() that could be called as follows:
unsigned char myHash[] = "da39a3ee5e6b4b0d3255bfef95601890afd80709";
unsigned int length = 10000;
CreateDataFromHash(myHash, length);
Here CreateDataFromHash would return the string of the length 10000 containing arbitrary data, which would hash to myHash using SHA1.
Thanks.
There's no known easy or even moderately difficult way to do this in general.
The entire point of hashes (or so-called one-way functions), is that it's easy to compute them, but next to impossible to reverse their computation (find input values based on output). That said, for some hash functions, there are known methods that may allow computing inputs for a given hash value in reasonable time.
For example, this MD5 sum technique will find collisions (but not input for a given output) in about 8 hours on a 1.6GHz computer.
For SHA-1 in particular you may be interested in reading this.
One of the purposes of SHA1 is that this should be very hard to do.
Hashing is a one-way function. You can't get the input from the output.
This would be a "preimage attack". No such thing is publicly known against SHA-1.
The only attack known against SHA-1 is a collision attack. That means I find two inputs that produce the same result, but neither of them is pre-ordained, so to speak. Even so, this attack isn't really feasible for most people -- based on the amount of computation involved, the closest I can figure is that you'd have to spend somewhere in the range of a few million dollars to build a machine that would give you about one colliding pair of keys per week (assuming it ran, doing nothing else 24/7).
You have to brute force it. See
PHP brute force password generator
Get string, do hash, compare, repeat

Two-way "Hashing" of string

I want to generate int from a string and be able to generate it back.
Something like hash function but two-way function.
I want to use ints as ID in my application, but want to be able to convert it back in case of logging or debugging.
Like:
int id = IDProvider::getHash("NameOfMyObject");
object * a = createObject(id);
...
if(error)
{
LOG(IDProvider::getOriginalString(a->getId()), "some message");
}
I have heard of slightly modified CRC32 to be fast and 100% reversible, but I can not find it and I am not able to write it by myself.
Any hints what should I use?
Thank you!
edit
I have just found the source where I got the whole CRC32 idea from:
Jason Gregory : Game Engine Architecture
quotation:
"As with any hashing system, collisions are a possibility (i.e., two different strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all reasonable input strings we might use in our game. After all, a 32-bit hash code represents more than four billion possible values. So if our hash function does a good job of distributing strings evenly throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune."
Reducing an arbitrary-length string to a fixed-size int is mathematically impossible to reverse. See the pigeonhole principle. There are infinitely many strings, but only 2^32 32-bit integers.
32-bit hashes (assuming your int is 32 bits) can collide quite easily, so a hash is not a good unique ID either.
There are hash functions that allow you to create a message with a predefined hash, but it most likely won't be the original message. This is called a pre-image.
For your problem it looks like the best idea is creating a dictionary that maps integer-ids to strings and back.
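One possible shape for such a dictionary is sketched below. The class name `IdRegistry` and its methods are hypothetical: it hands out consecutive ints, so IDs never collide, and keeps the names in a vector for the reverse lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical ID provider: maps names to ints and ints back to names.
class IdRegistry {
public:
    // Returns the existing id for a known name, otherwise assigns the next free one.
    int getId(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) {
            return it->second;
        }
        const int id = static_cast<int>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Reverse lookup, e.g. for logging. The reference stays valid only until
    // the next getId call that registers a new name.
    const std::string& getName(int id) const {
        assert(id >= 0 && static_cast<std::size_t>(id) < names_.size());
        return names_[static_cast<std::size_t>(id)];
    }

private:
    std::unordered_map<std::string, int> ids_;
    std::vector<std::string> names_;
};
```

The trade-off compared with a hash is that the IDs depend on registration order, so they are not stable across runs unless the names are always registered in the same order.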
To estimate the likelihood of a collision when you hash n strings, check out the birthday paradox. The most important property in this context is that collisions become likely once the number of hashed messages approaches the square root of the number of available hash values. So with a 32-bit integer, collisions become likely once you hash around 65,000 strings. But if you're unlucky it can happen much earlier.
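The square-root rule can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where N is the number of possible hash values. The function name below is made up, and the formula assumes the hash distributes strings uniformly.

```cpp
#include <cmath>

// Approximate probability of at least one collision when hashing n distinct
// strings into `buckets` equally likely values.
double collision_probability(double n, double buckets) {
    return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * buckets));
}
```

For n = 65536 and 2^32 possible values this gives a probability of roughly 0.39, and it reaches about one half at around 77,000 strings.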
I have exactly what you need. It is called a "pointer". In this system, the "pointer" is always unique, and can always be used to recover the string. It can "point" to any string of any length. As a bonus, it also has the same size as your int. You can obtain a "pointer" to a string by using the & operator, as shown in my example code:
#include <iostream>
#include <string>
int main() {
    std::string s = "Hai!";
    std::string* ptr = &s; // this is a pointer
    std::string copy = *ptr; // this retrieves the original string
    std::cout << copy; // prints "Hai!"
}
What you need is encryption. Hashing is by definition one way. You might try simple XOR Encryption with some addition/subtraction of values.
Reversible hash function?
How come MD5 hash values are not reversible?
checksum/hash function with reversible property
http://groups.google.com/group/sci.crypt.research/browse_thread/thread/ffca2f5ac3093255
... and many more via google search...
You could look at perfect hashing
http://en.wikipedia.org/wiki/Perfect_hash_function
It only works when all the potential strings are known up front. In practice what you enable by this, is to create a limited-range 'hash' mapping that you can reverse-lookup.
In general, the [hash code + hash algorithm] are never enough to get the original value back. However, with a perfect hash, collisions are by definition ruled out, so if the source domain (list of values) is known, you can get the source value back.
gperf is a well-known, age old program to generate perfect hashes in c/c++ code. Many more do exist (see the Wikipedia page)
It is not possible. Hashing is a non-reversible function by definition.
As everyone mentioned, it is not possible to have a "reversible hash". However, there are alternatives (like encryption).
Another one is to zip/unzip your string using any lossless algorithm.
That's a simple, fully reversible method, with no possible collision.

Fast 64 bit comparison

I'm working on a GUI framework, where I want all the elements to be identified by ASCII strings of up to 8 characters (or 7 would be ok).
Every time an event is triggered (some are just clicks, but some are continuous), the framework would callback to the client code with the id and its value.
I could use actual strings and strcmp(), but I want this to be really fast (for mobile devices), so I was thinking to use char constants (e.g. int id = 'BTN1';) so you'd be doing a single int comparison to test for the id. However, 4 chars isn't readable enough.
I tried an experiment, something like-
long int id = L'abcdefg';
... but it looks as if char constants can only hold 4 characters, and the only thing making a long int char constant gives you is the ability for your 4 characters to be twice as wide, not have twice the amount of characters. Am I missing something here?
I want to make it easy for the person writing the client code. The gui is stored in xml, so the id's are loaded in from strings, but there would be constants written in the client code to compare these against.
So, the long and the short of it is, I'm looking for a cross-platform way to do quick 7-8 character comparison, any ideas?
Are you sure this is not premature optimisation? Have you profiled another GUI framework that is slow purely from string comparisons? Why are you so sure string comparisons will be too slow? Surely you're not doing that many string compares. Also, consider strcmp should have a near optimal implementation, possibly written in assembly tailored for the CPU you're compiling for.
Anyway, other frameworks just use named integers, for example:
static const int MY_BUTTON_ID = 1;
You could consider that instead, avoiding the string issue completely. Alternatively, you could simply write a helper function to convert a const char[9] into a 64-bit integer. This should accept a null-terminated string "like so" of up to 8 characters (assuming you intend to throw away the null character). Then your program is passing around 64-bit integers, but the programmer is dealing with strings.
Edit: here's a quick function that turns a string into a number:
#include <cstdint>
#include <cstring>
std::int64_t makeid(const char* str)
{
    std::int64_t ret = 0;  // __int64 is MSVC-specific; std::int64_t is portable
    // Copies at most 8 bytes; strncpy zero-pads when str is shorter.
    std::strncpy(reinterpret_cast<char*>(&ret), str, sizeof(ret));
    return ret;
}
One possibility is to define your IDs as a union of a 64-bit integer and an 8-character string:
union ID {
Int64 id; // Assuming Int64 is an appropriate typedef somewhere
char name[8];
};
Now you can do things like:
ID id;
strncpy(id.name, "Button1", 8);
if (anotherId.id == id.id) ...
The concept of string interning can be useful for this problem, turning string compares into pointer compares.
Easy to get pre-rolled Components
binary search tree for the win -- you get a red-black tree from most STL implementations of set and map, so you might want to consider that.
Intrusive versions of the STL containers perform MUCH better when you move the container nodes around a lot (in the general case) -- however they have quite a few caveats.
Specific Opinion -- First Alternative
If I were you, I'd stick to a 64-bit integer type, bundle it in an intrusive container, and use the library provided by Boost. However, if you are new to this sort of thing, use std::map: it is conceptually simpler to grasp, and there is less chance of leaking resources, since there is more literature and guidance out there on these containers and the best practices.
Alternative 2
The problem you are trying to solve, I believe, is to have a global naming scheme that maps to handles. You can create a mapping of names to handles so that you can use the names to retrieve handles:
// WidgetHandle is a polymorphic base class (i.e., it has a virtual method),
// and foo::Luv implement WidgetHandle's interface (public inheritance)
foo::WidgetHandle * LuvComponent =
Factory.CreateComponent<foo::Luv>( "meLuvYouLongTime");
....
.... // in different function
foo::WidgetHandle * LuvComponent =
Factory.RetrieveComponent<foo::Luv>("meLuvYouLongTime");
Alternative 2 is a common idiom for IPC: you create an IPC object, say a pipe, in one process, and you can ask the kernel to retrieve the other end of the pipe by name.
I see a distinction between easily read identifiers in your code, and the representation being passed around.
Could you use an enumerated type (or a large header file of constants) to represent the identifier? The names of the enumerated types could then be as long and meaningful as you wish, and still fit in (I am guessing) a couple of bytes.
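In C++11 you can use user-defined literals, so you could add something like "7chars.."_id (a user-defined suffix has to start with an underscore; other suffixes are reserved for the standard library):
template <char...> constexpr unsigned long long operator ""_id();
constexpr unsigned long long operator ""_id(const char *, size_t);
The template form only applies to numeric literals, so for text you need the second, string form. That one can be declared constexpr as well.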
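A sketch of what the string form could look like, assuming C++14 or later (for the loop inside a constexpr function). The suffix `_id` and the helper `pack_id` are made-up names; the literal packs up to 8 characters into a 64-bit integer at compile time, and the same helper can be called at run time for IDs loaded from XML.

```cpp
#include <cstddef>
#include <cstdint>

// Packs up to 8 characters into a 64-bit integer, first character in the
// lowest byte. Characters beyond the eighth are ignored.
constexpr std::uint64_t pack_id(const char* s, std::size_t n) {
    std::uint64_t id = 0;
    for (std::size_t i = 0; i < n && i < 8; ++i) {
        id |= static_cast<std::uint64_t>(static_cast<unsigned char>(s[i])) << (8 * i);
    }
    return id;
}

// User-defined string literal: "BTN1"_id is evaluated by pack_id.
constexpr std::uint64_t operator""_id(const char* s, std::size_t n) {
    return pack_id(s, n);
}

// 'B' = 0x42, 'T' = 0x54, 'N' = 0x4E, '1' = 0x31
static_assert("BTN1"_id == 0x314E5442ull, "literal is usable at compile time");
```

Because the literal is a constant expression, client code can compare an incoming ID with a single integer comparison or use it as a `case` label, for example `case "Button1"_id:`.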