Hash function for String Data - c++

I'm working on a hash table in C++ and I need a hash function for string data. One hash function I have tried adds up the ASCII codes of the characters and takes the sum modulo 100.
My actual requirement is to find the words that either exactly match or start with a given pattern.
Ex: Given the pattern "comp", I want to get all the words starting with comp (e.g. company, computer, comp). Can I do this with a hash table? The hash function I tried can only find exact matches.
Can anyone suggest a hash function suitable for this requirement?

Prefix matching is better handled with a trie.
Basically, this is a tree structure in which each node holds one character of the key. Concatenating the characters of the nodes on the path from the root to a given node produces the key for that node.
Searching is a matter of descending the trie, comparing each character of the searched key with the child nodes. Once you have consumed all the characters, the remaining subtree holds all the keys that have the searched key as a prefix.
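The description above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the names TrieNode, Trie, insert and withPrefix are made up for this example, and std::map is used for the children only so that results come back in sorted order.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical minimal trie; all names here are illustrative.
struct TrieNode {
    bool isWord = false;                                 // true if a key ends at this node
    std::map<char, std::unique_ptr<TrieNode>> children;  // one child per next character
};

class Trie {
public:
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->isWord = true;
    }

    // Returns every stored word that starts with `prefix` (an exact match included).
    std::vector<std::string> withPrefix(const std::string& prefix) const {
        std::vector<std::string> out;
        const TrieNode* node = &root_;
        for (char c : prefix) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return out;  // no key has this prefix
            node = it->second.get();
        }
        std::string current = prefix;
        collect(*node, current, out);
        return out;
    }

private:
    // Depth-first walk of the subtree below the prefix node.
    static void collect(const TrieNode& node, std::string& current,
                        std::vector<std::string>& out) {
        if (node.isWord) out.push_back(current);
        for (const auto& kv : node.children) {
            current.push_back(kv.first);
            collect(*kv.second, current, out);
            current.pop_back();
        }
    }

    TrieNode root_;
};
```

After inserting company, computer, comp and cat, calling withPrefix("comp") on this sketch returns comp, company and computer, in that order.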

Sounds like what you really need is lexicographical sorting. You can do that by using a sorted data structure, like a std::set or std::map, or by using a std::vector and the std::sort algorithm. Note that std::sort is typically faster than C's qsort, because the comparison can be inlined.
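As a rough sketch of the sorted-vector idea: once the words are sorted, all words sharing a prefix sit next to each other, so a binary search finds the first one and a short scan collects the rest. The function name wordsWithPrefix is an assumption made for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Assumes `sortedWords` has already been sorted with std::sort (default operator<).
std::vector<std::string> wordsWithPrefix(const std::vector<std::string>& sortedWords,
                                         const std::string& prefix) {
    std::vector<std::string> out;
    // First word that is not less than the prefix.
    auto it = std::lower_bound(sortedWords.begin(), sortedWords.end(), prefix);
    // Words sharing the prefix are contiguous in sorted order.
    for (; it != sortedWords.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        out.push_back(*it);
    return out;
}
```

The lookup costs O(log n) for the binary search plus the number of matches returned.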

Related

Unique Property of Strings to build an efficient Hash Table

What is the unique property of strings in C++? Why can they be compared by relational operators (e.g. when trying to sort an array of strings alphabetically)? I am trying to capitalize on this "property" in order to build a fine hashing function for a table with no collisions for every possible string. Also, what data structure would work for this? I'm thinking a vector because I will have to go through a document without knowing how many unique words are in it, and I want to go through the document just once.
C++ standard strings are essentially vectors of characters. Comparing strings thus means comparing them character by character from the beginning.
I'm not sure what you mean by 'unique property', but for your use case any hashing algorithm should do.
If I understand your use case correctly, you might want to use a std::set< YourHashType > or std::map. That way you wouldn't have to take care of finding out whether a word was already added or not.
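A minimal sketch of the one-pass approach with std::set, assuming words are separated by whitespace. The function name uniqueWords is invented for this example, and the strings themselves are stored rather than a hash type.

```cpp
#include <set>
#include <sstream>
#include <string>

// Collects each distinct whitespace-separated word in a single pass.
// std::set silently ignores a word that is already present.
std::set<std::string> uniqueWords(const std::string& text) {
    std::set<std::string> words;
    std::istringstream in(text);
    std::string word;
    while (in >> word) words.insert(word);
    return words;
}
```

The set grows as needed, so the number of unique words does not have to be known in advance.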
The simplest algorithm that calculates the hash key for a null-terminated C-style string is the following:
    // Computes hash = hash * 33 + c for each character c of the string.
    unsigned int HashKey(const char* key)
    {
        unsigned int nHash = 0;
        while (*key)
            nHash = (nHash << 5) + nHash + static_cast<unsigned char>(*key++);
        return nHash;
    }
I am trying to capitalize on this "property" in order to build a fine hashing function for a table with no collisions for every possible string.
By the pigeonhole principle, you can't have a collision-free hash function for every possible string: there are more possible strings than hash values, so at least two strings must share a value. Strings sort uniquely when you compare them lexically (i.e. letter by letter) using a function like std::strcmp, but that only gives you a unique ordering through comparison, not an intrinsic unique property of a string.
If you have a finite set of keys, you can design a collision free hash function though, which is referred to as perfect hashing.

Instant sort when put new value in array C++

I have a dynamically allocated array containing structs that each hold a key-value pair. I need to write an update(key,value) function that puts a new struct into the array or, if a struct with the same key is already in the array, updates its value. Insert and update are combined in one function.
The problem is:
Before adding a struct I need to check whether a struct with this key already exists.
I can go through all elements of the array and compare keys (very slow)
Or I can use binary search, but (!) the array must be sorted.
So I tried to sort the array with each update (sloooow) or to sort it when calling the binary search function.....which happens on each update
Finally, I thought that there must be a way of inserting a struct into array so it would be placed in a right place and be always sorted.
However, I couldn't think of an algorithm like that so I came here to ask for some help because google refuses to read my mind.
I need to make my code faster because my array accepts more than 50 000 structs and I'm using bubble sort (because I'm dumb).
Take a look at Red Black Trees: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree
They keep the data sorted at all times, and inserts have a complexity of O(log n).
A binary heap will not suffice, as a binary heap does not have guaranteed sort order, your only guarantee is that the top element is either min or max.
One possible approach is to use a different data structure. As there is no genuine need to keep the structs ordered, only a need to detect whether a struct with the same key exists, the costs of maintaining order in a balanced tree (for instance by using std::map) are excessive. A more suitable data structure would be a hash table. C++11 provides one in the standard library under the somewhat obscure name std::unordered_map (http://en.cppreference.com/w/cpp/container/unordered_map).
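A small sketch of what update(key,value) could look like on top of std::unordered_map. The Table class, the std::string key type and the int value type are assumptions for illustration, since the question does not say what the struct holds.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical wrapper; key and value types are assumed.
class Table {
public:
    // Inserts the pair, or overwrites the value if the key is already present.
    void update(const std::string& key, int value) { data_[key] = value; }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const {
        auto it = data_.find(key);
        return it == data_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::unordered_map<std::string, int> data_;
};
```

Both operations take constant time on average, with no sorting step at all.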
If you insist on using an array, a possible approach might be to combine these algorithms:
Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter)
Partial sort (http://en.cppreference.com/w/cpp/algorithm/partial_sort)
Binary search
Maintain two ranges in the array -- first goes a range that is already sorted, then goes a range that is not yet. When you insert a struct, first check with the bloom filter if a matching struct might already exist. If the bloom filter gives a negative answer, then just insert the struct at the end of the array. After that the sorted range does not change, the unsorted range grows by one.
If the bloom filter gives a positive answer, then apply partial sort algorithm to make the entire array sorted and then use binary search to check if such an object actually exists. If so, replace this element. After that the sorted range is the entire array, and the unsorted range is empty.
If the binary search has shown that the bloom filter was wrong, and the matching struct is not there, then you just put the new struct at the end of the array. After that the sorted range is entire array minus one, and the unsorted range is the last element in the array.
Each time you insert an element, binary search to find if it exists. If it doesn't exist, the binary search will give you the index at which you can insert it.
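A sketch of that binary-search insert, using std::lower_bound on a std::vector instead of a raw array. The Entry struct and its field types are assumptions made for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical key-value struct; the field types are assumed.
struct Entry {
    std::string key;
    int value;
};

// Keeps `sorted` ordered by key: updates in place if the key exists,
// otherwise inserts at the position the binary search reports.
void update(std::vector<Entry>& sorted, const std::string& key, int value) {
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), key,
        [](const Entry& e, const std::string& k) { return e.key < k; });
    if (it != sorted.end() && it->key == key)
        it->value = value;
    else
        sorted.insert(it, Entry{key, value});
}
```

The search is O(log n), but inserting into the middle of a vector still shifts the following elements, so a new key costs O(n) in the worst case.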
You could use std::set, which does not allow duplicate elements and places elements in sorted position. This assumes that you are storing the key and value in a struct, and not separately. In order for the sorting to work properly, you will need to define a comparison function for the structs.

Inserting strings into an AVL tree in C++?

I understand how an AVL tree works with integers.. but I'm having a hard time figuring out a way to insert strings into one instead. How would the strings be compared?
I've thought of just using the ASCII total value and sorting that way.. but in that situation, inserting two words with identical ASCII totals (such as "tied" and "diet") would seem to return an error.
How do you get around this? Am I thinking about it in the wrong way, and need a different way to sort the nodes?
And no they don't need to be alphabetical or anything... just in an AVL tree so I can search for them quickly.
When working with strings, you normally use a lexical comparison -- i.e., you start with the first character of each string. If one is less than the other (e.g., with "diet" vs. "tied", "d" is less than "t") the comparison is based on that letter. If and only if the first letters are equal, you go to the second letter, and so on. If one string runs out first, the shorter string is the lesser one. The two are equal only if they have the same length and every character (in order) from beginning to end is equal.
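To make the rule concrete, here is a hand-written version of the comparison. It is only an illustration: std::string's own operator< already behaves this way, and the name lexLess is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Lexical "less than", written out to show the rule step by step.
bool lexLess(const std::string& a, const std::string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i])  // the first differing character decides
            return static_cast<unsigned char>(a[i]) < static_cast<unsigned char>(b[i]);
    }
    return a.size() < b.size();  // a proper prefix sorts before the longer string
}
```

With this ordering "diet" and "tied" are distinct keys, so both can live in the same AVL tree.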
Well, since an AVL tree is an ordered structure, the int string::compare(const string&) const routine should be able to give you an indication of how to order the strings.
If order of the items is actually irrelevant, you'll get better performance out of an unordered structure that can take better advantage of what you're trying to do: a hash table.
The mapping of something like a string to a fixed-size key is called a hash function, and the phenomenon where multiple keys are mapped to the same value is called a collision. Collisions are expected to happen occasionally when hashing, and a basic data structure needs to be extended to handle them, perhaps by making each node a "bucket" (linked list, vector, array, what have you) of all the items with colliding hash values, which is then searched linearly.
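The bucket idea can be sketched like this. It is a bare-bones illustration with a fixed bucket count and no resizing; the class name ChainedTable, the int value type and the default of 101 buckets are all assumptions for the example.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical hash table using separate chaining: each bucket is a small
// vector of key-value pairs that is searched linearly.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucketCount = 101) : buckets_(bucketCount) {}

    void put(const std::string& key, int value) {
        auto& bucket = buckets_[indexOf(key)];
        for (auto& kv : bucket) {
            if (kv.first == key) { kv.second = value; return; }  // key exists: overwrite
        }
        bucket.emplace_back(key, value);  // new key, possibly colliding with others
    }

    // Returns a pointer to the value, or nullptr if the key is absent.
    const int* get(const std::string& key) const {
        const auto& bucket = buckets_[indexOf(key)];
        for (const auto& kv : bucket) {
            if (kv.first == key) return &kv.second;
        }
        return nullptr;
    }

private:
    std::size_t indexOf(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::vector<std::pair<std::string, int>>> buckets_;
};
```

Constructing the table with a single bucket forces every key to collide, which is a quick way to check that colliding keys such as "tied" and "diet" are still stored and found separately.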

finding the longest prefix contained in a dictionary

I'm implementing Lempel-Ziv compression and a question springs to mind.
Given a 'dictionary' and a string of characters, I want to be able to compute the longest prefix of the string that is contained in the dictionary.
That is given strings:
0 : AABB
1 : ABA
2 : AAAB
and the query string 'AABBABA', I would like to be able to do a lookup that returns '0'. This should be done in time linear in the length of the prefix.
Next, I would like to be able to add the new prefix 'AABBAB' to the dictionary in constant time.
Is there a standard, and easy way/algorithm for doing this?
My original idea was to build a standard n-way tree with a list of pointers and just search this?
You are describing a simple trie lookup, except that you would return a leaf node even when there are excess characters.
Not sure what you're thinking of with an n-way tree, but most likely it's exactly the same, since it's the obvious solution :v) . If you want to be more efficient, you can look into different kinds of tries.
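A rough sketch of such a lookup, assuming each dictionary entry carries an integer index as in the question. The class name PrefixDictionary and its methods are invented for this example, and the add shown here walks the whole entry rather than achieving the constant-time extension the question asks for.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical trie that remembers the dictionary index at the end of each entry.
class PrefixDictionary {
public:
    void add(const std::string& entry, int index) {
        Node* node = &root_;
        for (char c : entry) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->index = index;
    }

    // Returns the index of the longest entry that is a prefix of `text`,
    // or -1 if no entry matches.
    int longestPrefix(const std::string& text) const {
        const Node* node = &root_;
        int best = -1;
        for (char c : text) {
            auto it = node->children.find(c);
            if (it == node->children.end()) break;  // excess characters: stop here
            node = it->second.get();
            if (node->index >= 0) best = node->index;
        }
        return best;
    }

private:
    struct Node {
        int index = -1;  // -1 means no entry ends at this node
        std::map<char, std::unique_ptr<Node>> children;
    };

    Node root_;
};
```

With the three entries from the question, longestPrefix("AABBABA") returns 0. One way to get closer to constant-time insertion would be to keep the node where the lookup stopped and attach the new child there, which this sketch leaves out.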

C++ Boggle Solver: Finding Prefixes in a Set

This is for a homework assignment, so I don't want the exact code, but would appreciate any ideas that can help point me in the right direction.
The assignment is to write a boggle solving program. I've got the recursive part down I feel, but I need some insight on how to compare the current sequence of characters to the dictionary.
I'm required to store the dictionary in either a set or sorted list. I've been trying to find a way to implement this using a set. In order to make the program run faster and not follow dead end paths, I need to check and see if the current sequence of characters exists as a prefix to anything in the set (dictionary).
I've found that set.find() operation only returns true if the string is an exact match. In the lab requirements, the professor mentioned that:
"If the dictionary is stored in a Set, many data structure libraries provide a way to find the string in the Set that is closest to the one you are searching for. Such an operation could be used to quickly find a word with a given prefix."
I've been searching today for what the professor is describing. I've found a lot of information on tries, but since I'm required to use a list or set, I don't think that will work.
I've also tried looking up algorithms for autocomplete functions, but the ones that I've found seem extremely complicated for what I'm trying to accomplish here.
I also was thinking of using strncmp() to compare the current sequence to a word from the dictionary set, but again, I don't know how exactly that would function in this situation, if at all.
Is it worth it to continue investigating how this would work in a set or should I just try using a sorted list to store my dictionary?
Thanks
As @Raymond Hettinger mentions in his answer, a trie would be extremely useful here. However, if you either are uncomfortable writing a trie or would prefer to use off-the-shelf components, you can use a cute property of how words are ordered alphabetically to check in O(log n) time whether a given prefix exists. The idea is as follows - suppose for example that you are checking for the prefix "thr." If you'll note, every word that begins with the prefix "thr" must be sandwiched between the strings "thr" and "ths." For example, thr ≤ through < ths, and thr ≤ throat < ths. If you are storing your words in a giant sorted array, you can use a modified version of binary search to find the first word that is alphabetically at or after the prefix of your choice and the first word that is alphabetically at or after the next prefix (formed by taking the last letter of the prefix and incrementing it). If these are the same word, then nothing is between them and the prefix doesn't exist. If they're not, then something is between them and the prefix does exist.
Since you're using C++, you can potentially do this with a std::vector and the std::lower_bound algorithm. You could also throw all the words into a std::set and use the set's version of lower_bound. For example:
    std::set<std::string> dictionary;
    std::string prefix = /* ... */;  /* assumed to be non-empty */
    /* Get the next prefix by incrementing the last character. */
    std::string nextPrefix = prefix;
    nextPrefix[nextPrefix.length() - 1]++;
    /* Check whether there is something with the prefix. */
    if (dictionary.lower_bound(prefix) != dictionary.lower_bound(nextPrefix)) {
        /* ... something has that prefix ... */
    } else {
        /* ... no word has that prefix ... */
    }
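A slightly different way to run the same check, sketched here as a standalone function, is to do a single lower_bound and then test whether the word it lands on actually begins with the prefix. This is a variation on the idea above rather than the code from the answer, and the name anyWordHasPrefix is made up for the example.

```cpp
#include <set>
#include <string>

// Returns true if at least one word in `dictionary` starts with `prefix`.
bool anyWordHasPrefix(const std::set<std::string>& dictionary,
                      const std::string& prefix) {
    // The first word not less than the prefix is the only candidate that
    // needs checking, because words sharing a prefix are contiguous.
    auto it = dictionary.lower_bound(prefix);
    return it != dictionary.end() && it->compare(0, prefix.size(), prefix) == 0;
}
```

This variant avoids building the incremented prefix, so an empty prefix needs no special handling.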
That said, the trie is probably a better structure here. If you're interested, there is another data structure called a DAWG (Directed Acyclic Word Graph) that is similar to the trie but uses substantially less memory; in the Stanford introductory CS courses (where Boggle is an assignment), students actually are provided a DAWG containing all the words in the language. There is also another data structure called a ternary search tree that is somewhere in-between a binary search tree and a trie that may be useful here, if you'd like to look into it.
Hope this helps!
The trie is the data structure of choice for this problem.
If you're limited to sets and dictionaries, I would choose a dictionary that maps prefixes to an array of possible matches:
asp -> aspberger aspire
bal -> balloon balance bale baleen ...
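In C++ terms, the mapping above could be built along these lines. This is only a sketch: the function name buildPrefixIndex and the fixed prefix length of three letters are assumptions, and words shorter than the prefix length are simply skipped.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Maps each fixed-length prefix to the list of words that start with it.
std::map<std::string, std::vector<std::string>>
buildPrefixIndex(const std::vector<std::string>& words, std::size_t prefixLength = 3) {
    std::map<std::string, std::vector<std::string>> index;
    for (const auto& word : words) {
        if (word.size() >= prefixLength)
            index[word.substr(0, prefixLength)].push_back(word);
    }
    return index;
}
```

A lookup with index.count(prefix) then tells whether any word starts with that prefix, at the cost of storing each word once per indexed prefix length.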