String completion and matching algorithm - c++

You have two sets: S1={B,C,D,T,M,...} and S2={with each other letter of alphabet not present in S1}.
Now, I have some string constructed with consonants from S1 (ie. BBWRD) which I want to be transformed into words/sentences based of provided dictionary (ie. dict from spelling mechanism).
Algorithm can fill spaces between each of letter in 'base word' with any amount of letters from S2. Order can't be changed and letters/consonants from S1 can't be used.
The only thing that came to my mind is usage of regexp. Can you propose any other, better approach? Or at least give name to this kind of algorithms, so I could search further.

I'd think about creating a search tree. Each node would have |S1| subnodes and leaves would contain a list of possible words, which may be constructed from a given acronym (such that - for example = D on the path W->R->D would contain "Word". Searching in such tree would be pretty fast, though it would require a noticeable amount of memory to be stored for quick access.

Related

The fastest C++ algorithm for string testing against a list of predefined seeds (case insensitive)

I have list of seed strings, about 100 predefined strings. All strings contain only ASCII characters.
std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};
My app constantly receives a lot of strings which can contain any characters. I need check each received line and decide whether it contain any of seeds or not. Comparison must be case insensitive.
I need the fastest possible algorithm to test received string.
Right now my app uses this algo:
std::wstring testedStr;
for (auto & seed : seeds)
{
if (boost::icontains(testedStr, seed))
{
return true;
}
}
return false;
It works well, but I'm not sure that this is the most efficient way.
How is it possible to implement the algorithm in order to achieve better performance?
This is a Windows app. App receives valid std::wstring strings.
Update
For this task I implemented Aho-Corasick algo. If someone could review my code it would be great - I do not have big experience with such algorithms. Link to implementation: gist.github.com
If there are a finite amount of matching strings, this means that you can construct a tree such that, read from root to leaves, similar strings will occupy similar branches.
This is also known as a trie, or Radix Tree.
For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:
A search for one of the words in the tree will search the tree, starting from a root. Making it to a leaf would correspond to a match to a seed. Regardless, each character in the string should match to one of their children. If it does not, you can terminate the search (e.g. you would not consider any words starting with "g" or any words beginning with "cu").
There are various algorithms for constructing the tree as well as searching it as well as modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one since I don't know of the best algorithm for it.
Conceptually, an algorithm you might use to search the tree would be related to the idea behind radix sort of a fixed amount of categories or values that a character in a string might take on at a given point in time.
This lets you check one word against your word-list. Since you're looking for this word-list as sub-strings of your input string, there's going to be more to it than this.
Edit: As other answers have mentioned, the Aho-Corasick algorithm for string matching is a sophisticated algorithm for performing string matching, consisting of a trie with additional links for taking "shortcuts" through the tree and having a different search pattern to accompany this. (As an interesting note, Alfred Aho is also a contributor to the the popular compiler textbook, Compilers: Principles, Techniques, and Tools as well as the algorithms textbook, The Design And Analysis Of Computer Algorithms. He is also a former member of Bell Labs. Margaret J. Corasick does not seem to have too much public information on herself.)
You can use Aho–Corasick algorithm
It builds trie/automaton where some vertices marked as terminal which would mean string has seeds.
It's built in O(sum of dictionary word lengths) and gives the answer in O(test string length)
Advantages:
It's specifically works with several dictionary words and check time doesn't depend on number of words (If we not consider cases where it doesn't fit to memory etc)
The algorithm is not hard to implement (comparing to suffix structures at least)
You may make it case insensitive by lowering each symbol if it's ASCII (non ASCII chars don't match anyway)
You should try a pre-existing regex utility, it may be slower than your hand-rolled algorithm but regex is about matching multiple possibilities, so it is likely it will be already several times faster than a hashmap or a simple comparison to all strings. I believe regex implementations may already use the Aho–Corasick algorithm mentioned by RiaD, so basically you will have at your disposal a well tested and fast implementation.
If you have C++11 you already have a standard regex library
#include <string>
#include <regex>
int main(){
std::regex self_regex("google|yahoo|stackoverflow");
regex_match(input_string ,self_regex);
}
I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)
One of the faster ways is to use suffix tree https://en.wikipedia.org/wiki/Suffix_tree, but this approach has huge disadvantage - it is difficult data structure with difficult constructing. This algorithm allows to build tree from string in linear complexity https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm
Edit: As Matthieu M. pointed out, the OP asked if a string contains a keyword. My answer only works when the string equals the keyword or if you can split the string e.g. by the space character.
Especially with a high number of possible candidates and knowing them at compile time using a perfect hash function with a tool like gperf is worth a try. The main principle is, that you seed a generator with your seed and it generates a function that contains a hash function which has no collisions for all seed values. At runtime you give the function a string and it calculates the hash and then it checks if it is the only possible candidate corresponding to the hashvalue.
The runtime cost is hashing the string and then comparing against the only possible candidate (O(1) for seed size and O(1) for string length).
To make the comparison case insensitive you just use tolower on the seed and on your string.
Because number of string is not big (~100), you can use next algo:
Calculate max length of word you have. Let it be N.
Create int checks[N]; array of checksum.
Let's checksum will be sum of all characters in searching phrase. So, you can calculate such checksum for each word from your list (that is known at compile time), and create std::map<int, std::vector<std::wstring>>, where int is checksum of string, and vector should contain all your strings with that checksum.
Create array of such maps for each length (up to N), it can be done at compile time also.
Now move over big string by pointer. When pointer points to X character, you should add value of X char to all checks integers, and for each of them (numbers from 1 to N) remove value of (X-K) character, where K is number of integer in checks array. So, you will always have correct checksum for all length stored in checks array.
After that search on map does there exists strings with such pair (length & checksum), and if exists - compare it.
It should give false-positive result (when checksum & length is equal, but phrase is not) very rare.
So, let's say R is length of big string. Then looping over it will take O(R).
Each step you will perform N operations with "+" small number (char value), N operations with "-" small number (char value), that is very fast. Each step you will have to search for counter in checks array, and that is O(1), because it's one memory block.
Also each step you will have to find map in map's array, that will also be O(1), because it's also is one memory block.
And inside map you will have to search for string with correct checksum for log(F), where F is size of map, and it will usually contain no more then 2-3 strings, so we can in general pretend that it is also O(1).
Also you can check, and if there is no strings with same checksum (that should happens with high chance with just 100 words), you can discard map at all, storing pairs instead of map.
So, finally that should give O(R), with quite small O.
This way of calculating checksum can be changed, but it's quite simple and completely fast, with really rare false-positive reactions.
As a variant on DarioOO’s answer, you could get a possibly faster implementation of a regular expression match, by coding a lex parser for your strings. Though normally used together with yacc, this is a case where lex on its own does the job, and lex parsers are usually very efficient.
This approach might fall down if all your strings are long, as then an algorithm such as Aho-Corasick, Commentz-Walter or Rabin-Karp would probably offer significant improvements, and I doubt that lex implementations use any such algorithm.
This approach is harder if you have to be able to configure the strings without reconfiguration, but since flex is open source you could cannibalise its code.

When do we actually use a Trie?

I am starting to read about Trie. I got also references from friends here in: Tutorials on Trie
I am not clear on the following:
It seems that to go on and use a Trie one assumes that all the input strings that will be the search space and used to build the Trie are separated in distinct word boundaries.
E.g. all the example tutorials I have seen use input such as:
S={ball, bid, byte, car, cat, mac, map etc...}
Then we build the trie from S and do our searches (really fast)
My question is: How did we end up with S to begin with?
I mean before starting to read about tries I imagined that S would be an arbitrarily long text e.g. A Shakespeare passage.
Then using a Trie we could find things really fast.
But it seems this is not the case.
Is the assumption here that the input passage (of Shakespeare for example) is pre-processed first extracting all the words to get S?
So if one wants to search for patterns (same way as you do when you Google and see all pages having also spaces in your search query) a Trie is not appropriate?
When can we know if a Trie is the data structure that we can actually use?
Tries are useful where you have a fixed dictionary you want to look up quickly. Compared to a hashtable it may require less storage for a large dictionary but may well take longer to look up. One example place I have used it is for mapping URLs to operations on a web server were there may be inheritance of functionality based on the prefix. Here recursing down a trie enables appropriate lookup of all of the methods that need to be called for a particular url. It would also be efficient for storing a dictionary.
For doing text searches you would typically represent documents using a token vector of leximes with weights (perhaps based on occurance frequency), and then search against that to get a ranking of documents against a particular search vector. There a number of standard libraries to do this which I would suggest using rather than writing your own - particularly for removing stopwords, dealing with synonyms and stemming.
We can use tries for sub string searching in linear time, without pre processing the string every time. You can get a best tutorial on suffix tree generation #
Ukkonen's suffix tree algorithm in plain English?
As the other examples have said, a trie is useful because it provides fast string look-ups (or, more generally, look-ups for any sequence). Some examples of where I've used tries:
My answer to this question uses a (slightly modified) trie for matching sentences: it is a trie based on a sequence of words, rather than a sequence of characters. (The other answers to that question probably demonstrate the trie in action more clearly.)
I've also used a trie in a game which had a large number of rooms with names (the total number and the names were defined at run time), each of these names has to be unique and one had to be able to search for a room with a given name quickly. A hash table could also have been used, but in some ways a trie is simpler to implement and faster when using strings. (My trie implementation ended up being ~50 lines of C.)
The trie tag probably has many more examples.
There are multiple ways to use tries. The typical example is a lookup such as the one you have presented. However Tries can also be used to fully index a complete text. Either you use the Ukkonen suffix tree algorithm, to produce a suffix trie, or you explicetly construct the suffix trie by storing suffixes (much slower than Ukkonens algorithm, but also much simpler). As this is preprocessing, which needs to be done only once speed is not that crucial.
For this you would just take your text, insert the full text, then chop of the first letter, insert the resulting text, chop of second letter, insert...
So if we have the text "The Text" we would insert the following set:
{"The Text", "he Text", "e Text", " Text", "Text", "ext", "xt", "t"}
In the resulting suffix trie we can easily search for any kind of prefix. Also this is space efficient, because we do not need to store the whole string, since common prefixes are stored only once.
If you need to store much longer strings space efficiently it is best not only to store prefixes together but also suffixes. In that case you could build up a directed acyclic word graph (DAWG), which is very similar to a trie in conception.
So a trie in that sense allows finding arbitrary substrings, including partial words. If you are only interested in storing words, a different data structure should be used, for example a inverted list (if order is important) or a vector space based retrieval algorithm (in case word order does not matter).

C++ Boggle Solver: Finding Prefixes in a Set

This is for a homework assignment, so I don't want the exact code, but would appreciate any ideas that can help point me in the right direction.
The assignment is to write a boggle solving program. I've got the recursive part down I feel, but I need some insight on how to compare the current sequence of characters to the dictionary.
I'm required to store the dictionary in either a set or sorted list. I've been trying a way to implement this using a set. In order to make the program run faster and not follow dead end paths, I need to check and see if the current sequence of characters exists as a prefix to anything in the set (dictionary).
I've found that set.find() operation only returns true if the string is an exact match. In the lab requirements, the professor mentioned that:
"If the dictionary is stored in a Set, many data structure libraries provide a way to find the string in the Set that is closest to the one you are searching for. Such an operation could be used to quickly find a word with a given prefix."
I've been searching today for a what the professor is describing. I've found a lot of information on tries, but since I'm required to use a list or set, I don't think that will work.
I've also tried looking up algorithms for autocomplete functions, but the ones that I've found seem extremely complicated for what I'm trying to accomplish here.
I also was thinking of using strncmp() to compare the current sequence to a word from the dictionary set, but again, I don't know how exactly that would function in this situation, if at all.
Is it worth it to continue investigating how this would work in a set or should I just try using a sorted list to store my dictionary?
Thanks
As #Raymond Hettinger mentions in his answer, a trie would be extremely useful here. However, if you either are uncomfortable writing a trie or would prefer to use off-the-shelf components, you can use a cute property of how words are ordered alphabetically to check in O(log n) time whether a given prefix exists. The idea is as follows - suppose for example that you are checking for the prefix "thr." If you'll note, every word that begins with the prefix "thr" must be sandwiched between the strings "thr" and "ths." For example, thr ≤ through < ths, and thr ≤ throat < ths. If you are storing your words in a giant sorted array, you can use a modified version of binary search to find the first word alphabetically at least the prefix of your choice and the first word alphabetically at least the next prefix (formed by taking the last letter of the prefix and incrementing it). If these are the same word, then nothing is between them and the prefix doesn't exist. If they're not, then something is between them and the prefix does it.
Since you're using C++, you can potentially do with a std::vector and the std::lower_bound algorithm. You could also throw all the words into a std::set and use the set's version of lower_bound. For example:
std::set<std::string> dictionary;
std::string prefix = /* ... */
/* Get the next prefix. */
std::string nextPrefix = prefix;
nextPrefix[nextPrefix.length() - 1]++;
/* Check whether there is something with the prefix. */
if (dictionary.lower_bound(prefix) != dictionary.lower_bound(nextPrefix)) {
/* ... something has that prefix ... */
} else {
/* ... no word has that prefix ... */
}
That said, the trie is probably a better structure here. If you're interested, there is another data structure called a DAWG (Directed Acyclic Word Graph) that is similar to the trie but uses substantially less memory; in the Stanford introductory CS courses (where Boggle is an assignment), students actually are provided a DAWG containing all the words in the language. There is also another data structure called a ternary search tree that is somewhere in-between a binary search tree and a trie that may be useful here, if you'd like to look into it.
Hope this helps!
The trie is the preferred data structure of choice for this problem.
If you're limited to sets and dictionaries, I would choose a dictionary that maps prefixes to an array of possible matches:
asp -> aspberger aspire
bal -> balloon balance bale baleen ...

C++ - How to efficiently find out if any string in a vector can be assembled from a set of letters

I am implementing a text-based version of Scrabble for a college project.
I have a vector containing around 400K strings (my dictionary), and, at some point in every turn, I'm going to have to check if there's still a word in the dictionary which can be formed with the pieces in the player's hand. I'm checking if the player has any move left... If not, it's game over for the player in question...
My only solution to this is iterating through the string, one by one, and using a sub-routine I have to check if the string in question can be formed from the player's pieces. I'll implement a quickfail checking if the user has any vowels, but it'll still be woefully inefficient.
The text-file containing the dictionary is already alphabetically ordered, so the vector is sorted.
Any suggestions?
A problem was presented in the comments below: Any suggestion on how do I take the letters already on the board into account?
Without giving you any specific code (since this is homework after all), one general approach to consider is to map from the sorted letters in the word to the actual legal words.
That is to say, if your dictionary file had only the words ape, gum, and mug, your data structure would look like:
aep -> ape
gmu -> gum, mug
Then you can simply go through permutations of the player's letters, and quickly identify if that key exists in the map.
You pay a little bit of processing time setting up the dictionary at startup, but then you only have to perform a few quick lookups rather than iterating through the whole list every time.
Sounds like a variation of the subset sum problem: http://en.wikipedia.org/wiki/Subset_sum_problem
Maybe some of the described algorithms would help you.
There have been numerous papers and questions on Scrabble on this site.
There are many strategies available too.
The representation of your dictionary is inadequate, there are much clever methods available. For example, check what a Trie is on wikipedia.
Using this you can implement a backtracking algorithm to quickly determine which words you can form.
{'as', 'ape', 'gum'}
Trie:
void -a-> (n) -p-> (n) -e-> (y)
-s-> (y)
-g-> (n) -u-> (n) -m-> (y)
Where 'n' means that it does not form a word and y means that it does.
Now, you just have to walk the Trie, keeping in mind what letters are available.
Say that you have {'a', 'p', 'g', 'm', 'u'}:
1. I have a 'a' (but 'a' is not a word)
2. I have a 'p' (but 'ap' is not a word)
3. I don't have any 'e' so I can't go further, let's backtrack
4. I don't have any 's' so...
5. I have a 'g', but it's not a word
6. I have a 'u', but 'gu' is not a word
7. I have a 'm' and 'gum' is a word, I store it somewhere, I can't go further
The point is to maintain a set of the available letters, when you take the -a-> branch, you remove 'a' from this set, then when you take -a-> in reverse (while backtracking) you add it back in the set.
This structure is much more space efficient, it actually models a Finite Automaton which recognize the language of your dictionary instead of blindly saving all words
The runtime should be much faster as well, since you'll never go deep in the tree-structure (you only have 7 letters available)
It's certainly not what I'd do, since it does not take the board into account :p
' ' letters mean you can take any of the available branches. You don't need to use a blank if you have the required letter.
You could also store the strings with characters sorted in ASCIIbetical order into a std::set, then sort the player's letters into the same order and search the map for each substring of the player's letters.
How about keeping the pairs {word from the dictionary, string consisting of the same letters but in ascending order (sorted)}
Then sort the vector of those pairs based on the second string and compare using binary search with a string consisting of sorted letters from players hand.
There are some good answers here already, and I think a trie is probably the right way to go, but this is an interesting problem so I'll toss in my two cents' worth...
The naive approach would be to generate all permutations of the available letters and of all distinct subsets, then search for each potential word in the dictionary. The problem is that, while it's not hard to do this, there is a surprisingly large number of potential words, and most of them are invalid.
On the positive side, checking the dictionary can be sped up with a binary search or something similar. On the negative side, you'd be doing this so many times that the program would grind to a halt for long lists of letters.
We definitely need to preprocess the dictionary to make it more useful, and what we really need is to have a way to quickly rule out most of the potential matches, even if the method has occasional false positives.
One way to do this would be to represent which letters a word uses in a bit map. In other words, precalculate a 32-bit number for each word in the dictionary, where each bit is set if the corresponding letter of the alphabet is used in the word at least once. This would allow you to find all potential words by doing a linear scan of the dictionary and keeping only the ones that use only letters you have available. I suspect that, with a bit of cleverness and indexing, you can do better than linear.
Of the candidates you find, some will require more instances of a letter than you have available, so these will be false positives. That means you need to do a final check on all of the candidates you generated to eliminate the almost-hits. There are many ways to do this, but one of the simplest is to go through your list of letters and replace the first occurrence of that letter in the potential word with a dash. When you're done, if the potential word has anything but dashes, it's a failure. A more elegant solution, though not necessarily faster, would be to generate an array of letter frequencies and compare them.
Again, I think tries are probably the way to go, but I hope these ideas are useful to you.
edit
Let me toss out an example of how you could do better than a full linear search on the initial search: use the radix. Keep a simple index that lets you look up the first word that starts with a given letter. Then, when doing the search, skip over all words that start with a letter that you don't have. This is not a gigantic speedup, but it's an improvement.

Prefix search in a radix tree/patricia trie

I'm currently implementing a radix tree/patricia trie (whatever you want to call it). I want to use it for prefix searches in a dictionary on a severely underpowered piece of hardware. It's supposed to work more or less like auto-completion, i. e. showing a list of words that the typed prefix matches.
My implementation is based on this article, but the code therein doesn't include prefix searches, though the author says:
[...] Say you want to enumerate all the nodes that have keys with a common prefix "AB". You can perform a depth first search starting at that root, stopping whenever you encounter back edges.
But I don't see how that is supposed to work. For example, if I build a radix tree from these words:
illness
imaginary
imagination
imagine
imitation
immediate
immediately
immense
in
I will get the exact same "best match" for the prefixes "i" and "in" so that it seems difficult to me to gather all matching words just by traversing the tree from that best match.
Additionally, there is a radix tree implementation in Java that has an implemented prefix search in RadixTreeImpl.java. That code explicitly checks all nodes (starting from a certain node) for a prefix match - it actually compares bytes.
Can anyone point me to a detailed description on implementing a prefix search on radix trees? Is the algorithm used in the Java implementation the only way to do it?
Think about what your trie encodes. At each node, you have the path that lead you to that node, so in your example, you start at Λ (that's a capital Lambda, this greek font kind of sucks) the root node corresponding to an empty string. Λ has children for each letter used, so in your data set, you have one branch, for "i".
Λ
Λ→"i"
At the "i" node, there are two children, one for "m" and one for "n". The next letter is "n", so you take that,
Λ→"i"→"n"
and since the only word that starts "i","n" in your data set is "in", there are no children from "n". That's a match.
Now, let's say the data set, instead of having "in", had "infindibulum". (What SF I'm referencing is left as an exercise.) You'd still get to the "n" node the same way, but then if the next letter you get is "q", you know the word doesn't appear in your data set at all, because there's no "q" branch. At that point, you say "okay, no match." (Maybe you then start adding the word, maybe not, depending on the application.)
But if the next letter is "f", you can keep going. You can short circuit that with a little craft, though: once you reach a node that represents a unique path, you can hang the whole string off that node. When you get to that node, you know that the rest of the string must be "findibulum", so you've used the prefix to match the whole string, and return it.
How your you use that? in a lot of non-UNIX command interpreters, like the old VAX DCL, you could use any unique prefix of a command. So, the equivalent of ls(1) was DIRECTORY, but no other command started with DIR, so you could type DIR and that was as good as doing the whole word. If you couldn't remember the correct command, you could type just 'D', and hit (I think) ESC; the DCL CLI would return you all the commands that started with D, which it could search extremely fast.
It turns out the GNU extensions for the standard c++ lib includes a Patricia trie implementation. It's found under the policy-based data-structures extension. See http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/trie_based_containers.html
An alternative algorithm: Keep It Simple Stupid!
Just make a sorted list of your keywords. When you have a prefix, binary search to find where that prefix would be located in the list. All of your possible completions will be found starting at that index, ready to be accessed in place.
This algorithm will will require only 5% of the code of a Patricia trie and will be easy to maintain, understand, and update. It is almost certain this simple list search will be more efficient as well.
The only downside is if you have huge numbers of long keywords with similar prefixes, a trie can save some storage since it doesn't need to keep the full prefix for every entry. In practice, if you have less than a few million words, this is not a savings because the pointer overhead of the tree will dominate. This savings is more for applications like searching databases of DNA strings with millions of characters, not text keywords.
Another alternative algo is a ternary search tree (more memory efficient) https://github.com/varunpant/TernaryTree/tree/master/TernaryTree