Unique property of strings to build an efficient hash table - C++

What is the unique property of strings in C++? Why can they be compared by relational operators (e.g. when trying to sort an array of strings alphabetically)? I am trying to capitalize on this "property" in order to build a fine hashing function for a table with no collisions for every possible string. Also, what data structure would work for this? I'm thinking a vector because I will have to go through a document without knowing how many unique words are in it, and I want to go through the document just once.

C++ standard strings are essentially vectors of characters. Comparing strings therefore means comparing them character by character from the beginning.
I'm not sure what you mean by 'unique property', but for your use case any reasonable hashing algorithm should do.
If I understand your use case correctly, you might want to use a std::set<YourHashType> or std::map. That way you wouldn't have to find out yourself whether a word was already added.

The simplest algorithm that calculates a hash key for a null-terminated C-style string is the following (the classic "times 33" style hash):
unsigned int HashKey(const char* key)
{
    unsigned int nHash = 0;
    while (*key)
        nHash = (nHash << 5) + nHash + *key++;  // nHash * 33 + next character
    return nHash;
}

I am trying to capitalize on this "property" in order to build a fine hashing function for a table with no collisions for every possible string.
As an example of the pigeonhole principle, you can't have a collision-free hash function for all possible strings. Strings sort uniquely when you compare them lexicographically (i.e. character by character) using a function like std::strcmp, but that only gives you a unique ordering under comparison, not an intrinsic unique numeric property of a string.
If you have a finite set of keys, you can design a collision free hash function though, which is referred to as perfect hashing.

Related

Can std::hash<std::string> return the same value for different strings?

The link below mentions the chance of collisions, but I am trying to use the hash for finding duplicate entries:
http://www.cplusplus.com/reference/functional/hash/
I am using std::hash<std::string> and storing the return value in a std::unordered_set. If emplace fails, I mark the string as a duplicate.
Hashes are generally functions from a large space of values into a small space of values, e.g. from the space of all strings to 64-bit integers. There are a lot more strings than 64-bit integers, so obviously multiple strings can have the same hash. A good hash function is such that there's no simple rule relating strings with the same hash value.
So, when we want to use hashes to find duplicate strings (or duplicate anything), it's always a two-phase process (at least):
Look for strings with identical hash (i.e. locate the "hash bucket" for your string)
Do a character-by-character comparison of your string with other strings having the same hash.
std::unordered_set does this - and never mind the specifics. Note that it does this for you, so it's redundant for you to hash yourself, then store the result in an std::unordered_set.
Finally, note that there are other features one could use for initial duplicate screening - or for searching among the same-hash values. For example, string length: Before comparing two strings character-by-character, you check their lengths (which you should be able to access without actually iterating the strings); different lengths -> non-equal strings.
Yes, it is possible that two different strings will share the same hash. Simply put, let's imagine you have a hash stored in an 8bit type (unsigned char).
That is 2^8 = 256 possible values. That means you can only have 256 unique hashes of arbitrary inputs.
Since you can definitely create more than 256 different strings, there is no way the hash would be unique for all possible strings.
std::size_t is typically a 64-bit type, so if you used it as storage for the hash value, you'd have 2^64 possible hashes, which is vastly more than 256 possible unique hashes, but still not enough to differentiate between all the possible strings you can create.
You just can't store an entire book in only 64 bits.
Yes it can return the same result for different strings. This is a natural consequence of reducing an infinite range of possibilities to a single 64-bit number.
There exist things called "perfect hash functions" which produce a hash function that will return unique results. However, this is only guaranteed for a known set of inputs. An unknown input from outside might produce a matching hash number. That possibility can be reduced by using a Bloom filter.
However, at some point with all these hash calculations the program would have been better off doing simple string comparisons in an unsorted linear array. Who cares if the operation is O(1)+C if C is ridiculously big.
Yes, std::hash can return the same result for different std::strings.
Bucket creation differs between compiler/library implementations.
An implementation-specific discussion can be found at this link:
hashing and rehashing for std::unordered_set

Hash function for String Data

I'm working on a hash table in C++ and I need a hash function for string data. One hash function I have tried is to add the ASCII codes and take them modulo 100 (% 100).
My actual requirement is to find the words which exactly match, or start with, a given pattern.
Ex: the given pattern is "comp". Then I want to get all the words starting with comp (ex: company, computer, comp, etc.). Can I do this using a hash? The hash function I tried can find only exact matches.
So can anyone suggest a hash function suitable for this requirement?
Prefix matching is better handled with a trie.
Basically this is a tree structure that holds one character from the key in each node. Concatenating the characters along the path from the root to a given node produces the key for that node.
Searching is a matter of descending the trie, comparing each character of the searched key with the child nodes. Once you have consumed all the characters, the remaining subtree contains all the keys that have the searched key as a prefix.
Sounds like what you really need is lexicographical sorting. You can do that by using a sorted data structure, like a std::set or std::map, or by using a std::vector and the std::sort algorithm. Note that C++ std::sort is typically faster than the C standard library's qsort.

Inserting strings into an AVL tree in C++?

I understand how an AVL tree works with integers.. but I'm having a hard time figuring out a way to insert strings into one instead. How would the strings be compared?
I've thought of just using the total ASCII value and sorting that way.. but in that situation, inserting two words with identical ASCII totals (such as "tied" and "diet") would seem to cause an error.
How do you get around this? Am I thinking about it in the wrong way, and need a different way to sort the nodes?
And no they don't need to be alphabetical or anything... just in an AVL tree so I can search for them quickly.
When working with strings, you normally use a lexical comparison -- i.e., you start with the first character of each string. If one is less than the other (e.g., with "diet" vs. "tied", "d" is less than "t") the comparison is based on that letter. If and only if the first letters are equal, you go to the second letter, and so on. The two are equal only if every character (in order) from beginning to end of the strings are equal.
Well, since an AVL tree is an ordered structure, the int string::compare(const string&) const routine should be able to give you an indication of how to order the strings.
If order of the items is actually irrelevant, you'll get better performance out of an unordered structure that can take better advantage of what you're trying to do: a hash table.
The mapping of something like a string to a fixed-size key is called a hash function, and the phenomenon where multiple keys map to the same value is called a collision. Collisions are expected to happen occasionally when hashing, so a basic data structure needs to be extended to handle them, perhaps by making each node a "bucket" (linked list, vector, array, what have you) of all the items with colliding hash values, which is then searched linearly.

What are some good methods to replace string names with integer hashes

Usually, entities and components or other parts of the game code in data-driven design will have names that get checked if you want to find out which object you're dealing with exactly.
void Player::Interact(Entity* myEntity)
{
    if (myEntity->isNearEnough(this) && myEntity->GetFamilyName() == "guard")
    {
        static_cast<Guard*>(myEntity)->Say("No mention of arrows and knees here");
    }
}
If you ignore the possibility that this might be premature optimization, it's pretty clear that looking up entities would be a lot faster if their "name" was a simple 32 bit value instead of an actual string.
Computing hashes out of the string names is one possible option. I haven't actually tried it, but with a 32-bit range and a good hashing function the risk of collision should be minimal.
The question is this: Obviously we need some way to convert in-code (or in some kind of external file) string-names to those integers, since the person working on these named objects will still want to refer to the object as "guard" instead of "0x2315f21a".
Assuming we're using C++ and want to replace all strings that appear in the code, can this even be achieved with language-built in features or do we have to build an external tool that manually looks through all files and exchanges the values?
Jason Gregory wrote this in his book:
At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune.
So you may want to look into that.
And about the build step you mentioned, he also talked about it. They basically encapsulate the strings that need to be hashed in something like:
_ID("string literal")
And use an external tool at build time to hash all the occurrences. This way you avoid any runtime costs.
This is what enums are for. I wouldn't dare decide which resource is best for the topic, but there are plenty to choose from: https://www.google.com/search?q=c%2B%2B+enum
I'd say go with enums!
But if you already have a lot of code using strings, well, either just keep it that way (simple, and usually fast enough on a PC anyway) or hash the strings into an integer using some kind of CRC or MD5.
This is basically solved by adding an indirection on top of a hash map.
Say you want to convert strings to integers:
Write a class that wraps both an array and a hash map. I call these classes dictionaries.
The array contains the strings.
The hash map's key is the string (shared pointers, or stable arrays where raw pointers are safe, work as well).
The hash map's value is the index into the array where the string is located, which is also the opaque handle it returns to calling code.
When adding a new string to the system, it is first looked up in the hash map, and the existing handle is returned if present.
If the handle is not present, add the string to the array; the index is the handle.
Store the string and the handle in the map, and return the handle.
Notes/Caveats:
This strategy makes getting the string back from the handle run in constant time (it is merely an array dereference).
Handle identifiers are first come, first served, but if you serialize the strings instead of the values it won't matter.
Operator[] overloads for both the key and the value are fairly simple (registering new strings, or getting the string back), but wrapping the handle in a user-defined class (wrapping an integer) adds much-needed type safety, and also avoids ambiguity if you want the key and the value to be the same type (overloaded []'s won't compile, etc.)
You have to store the strings in RAM, which can be a problem.

Given 200 strings, what is a good way to key a LUT of relationship values

I've got 200 strings. Each string has a relationship (measured by a float between 0 and 1) with every other string. This relationship is two-way; that is, relationship A/B == relationship B/A. This yields n(n-1)/2 relationships, or 19,900.
What I want to do is store these relationships in a lookup table so that given any two words I can quickly find the relationship value.
I'm using c++ so I'd probably use a std::map to store the LUT. The question is, what's the best key to use for this purpose.
The key needs to be unique and needs to be able to be calculated quickly from both words.
My approach is going to be to create a unique identifier for each word pair. For example given the words "apple" and "orange" then I combine them together as "appleorange" (alphabetical order, smallest first) and use that as the key value.
Is this a good solution or can someone suggest something more cleverer? :)
Basically you are describing a function of two parameters with the added property that order of parameters is not significant.
Your approach will work if there is no ambiguity between word pairs when concatenating (e.g. "abc"+"d" and "ab"+"cd" both concatenate to "abcd"; I would suggest putting a comma or similar separator between the two words to remove such ambiguities). Any 2D array would also work.
I would probably convert each keyword to some unique identifier (using a simple map) before trying to find the relationship value, but it does not change much from what you are proposing.
If Boost/TR1 is acceptable, I would go for an unordered_map with the pair of strings as the key. The main question is then: what about the order of the strings? This could be handled by the hash function, which would start with the lexicographically smaller string.
Remark: this is just a suggestion after reading the design-issue, not a study.
How "quickly" is quickly? Given you don't care about the order of the two words, you could try a map like this:
std::map<std::set<std::string>, double> lut;
Here the key is a set of the two words, so if you insert "apple" and "orange", then the order is the same as "orange" "apple", and given set supports the less than operator, it can function as a key in a map. NOTE: I intentionally did not use a pair for a key, given the order matters there...
I'd start with something fairly basic like this, profile and see how fast/slow the lookups etc. are before seeing if you need to do anything smarter...
If you create a sorted array with the 200 strings, then you can binary search it to find the matching indices of the two strings, then use those two indices in a 2D array to find the relationship value.
If your 200 strings are in an array, your 20,100 similarity values can be in a one dimensional array too. It's all down to how you index into that array. Say x and y are the indexes of the strings you want the similarity for. Swap x and y if necessary so that y>=x, then look at entry i= x + y(y+1)/2 in the large array.
(x,y) of (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),(0,3),(1,3)... will take you to entry 0,1,2,3,4,5,6,7...
So this uses space optimally and it gives faster look up than a map would. I'm assuming efficiency is at least mildly important to you since you are using C++!
[if you're not interested in self similarity values where y=x, then use i = x + y(y-1)/2 instead].