Check possible English words in long random string (C++)

Given a random string:
KUHPVIBQKVOSHWHXBPOFUXHRPVLLDDAPPLEWPREDDVVIDWQRBHBGLLBBPKQUNRVOHQEIRLWOKKRDD
How do I check whether the random string contains English words embedded in it?
What's the most efficient way of searching for all possible English words embedded in this string?
I have already downloaded an English dictionary text file, and I would like to compare the string against it to find the possible words.
Can anyone give some hints on how to do this?

I recommend the brute force approach. After getting this method working, you can optimize later.
The brute force algorithm:
For each word in the dictionary,
search the string for that word.
Other methods may take longer to get working. You will have to ask yourself, "Is spending time making this algorithm more efficient worthwhile?"
For infrequent use, the answer would be no. As an answer to an online judge problem, maybe you will need to improve the efficiency. If you have a lot of strings like this, then maybe you should optimize the algorithm.
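For concreteness, here is a minimal brute-force sketch; the dictionary file name and its one-word-per-line layout are assumptions for illustration:
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    const std::string haystack =
        "KUHPVIBQKVOSHWHXBPOFUXHRPVLLDDAPPLEWPREDDVVIDWQRBHBGLLBBPKQUNRVOHQEIRLWOKKRDD";
    std::ifstream dict("english_dictionary.txt");  // assumed: one word per line
    std::string word;
    while (dict >> word)
    {
        // Normalize to upper case so the comparison matches the haystack.
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
        if (haystack.find(word) != std::string::npos)
            std::cout << word << '\n';
    }
}
Even with nothing cleverer than string::find in the inner loop, this is often fast enough for a one-off check; the fancier structures discussed below pay off when you have many strings or a huge dictionary.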

You can build a DAWG (a directed acyclic word graph) from the words in your dictionary and use it to search for hits. For example, if your dictionary contains the words
auto
autobahn
austria
This would lead to a graph like this:
a -> u -> t -> o -> 'hit'
     |         |
     |         -> b -> a -> h -> n -> 'hit'
     |
     -> s -> t -> r -> i -> a -> 'hit'
Based on this data structure (here is a library for this: dawgdic, used in the example below) you can start feeding it letters, beginning from each position in your random string, until there is no edge to follow or until you obtain a hit.
Since the DAG is not updated, this can be done in parallel by starting at different positions in your random string.
Here is how to build such a search structure:
// Inserts keys into a simple dawg (dawgdic expects keys in
// lexicographical order).
dawgdic::DawgBuilder dawg_builder;
dawg_builder.Insert("austria");
dawg_builder.Insert("auto");
dawg_builder.Insert("autobahn");

// Finishes building a simple dawg.
dawgdic::Dawg dawg;
dawg_builder.Finish(&dawg);

// Builds a dictionary from a simple dawg.
dawgdic::Dictionary dic;
dawgdic::DictionaryBuilder::Build(dawg, &dic);

// Checks if a key exists or not.
if (dic.Contains("auto"))
    std::cout << "auto: found" << std::endl;

// Finds a key and gets its associated record.
if (dic.Find("august") < 0)
    std::cout << "august: not found" << std::endl;
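To scan the random string with the resulting dictionary, a loop along these lines could work. It is a sketch that assumes dawgdic's root()/Follow()/has_value() traversal interface; check the library's headers for the exact signatures:
#include <cstddef>
#include <iostream>
#include <string>
#include <dawgdic/dictionary.h>

void FindEmbeddedWords(const dawgdic::Dictionary &dic, const std::string &s)
{
    for (std::size_t start = 0; start < s.size(); ++start)
    {
        dawgdic::BaseType index = dic.root();
        // Feed letters from this start position until no edge can be followed.
        for (std::size_t pos = start; pos < s.size(); ++pos)
        {
            if (!dic.Follow(s[pos], &index))
                break;
            if (dic.has_value(index))  // a dictionary word ends at pos
                std::cout << s.substr(start, pos - start + 1) << '\n';
        }
    }
}
Since the dictionary is read-only during the scan, the outer loop over start positions can be split across threads, as noted above.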

Related

How to group by elements of tuple in Scala and map to list of new elements?

I am going through some Scala exercises in order to better understand higher-order functions. I have the following problem, which I can't work out how to solve. I have the following list:
val input = List((1,"a"), (1,"b"), (1,"c"), (2,"d"), (2,"y"), (2,"e"), (3, "u"), (3,"a"), (3,"d"))
I want to create a Map from this list which maps each letter encountered on the input list to a list of all the numbers which were previously in the same tuple as that letter. Example output:
Map(a -> List(1,3), b -> List(1), c -> List(1), d -> List(2,3), y -> List(2), e -> List(2), u -> List(3))
What I have tried so far is:
val output = input.groupBy(_._2)
However, this gives the following output:
Map(e -> List((2,e)), y -> List((2,y)), u -> List((3,u)), a -> List((1,a), (3,a)), b -> List((1,b)), c -> List((1,c)), d -> List((2,d), (3,d)))
Could someone help me understand how I could go about solving this? I appreciate any help, as I am new to functional programming.
As @sinanspd said:
input
  .groupBy {
    case (number, letter) => letter
  } map {
    case (key, list) => key -> list.map {
      case (number, letter) => number
    }
  }
// Or equivalently
input.groupBy(_._2).view.mapValues(list => list.map(_._1)).toMap
// Or even simpler if you have access to Scala 2.13
input.groupMap(_._2)(_._1)
Every time you want to transform some collection of values where the transformation is one to one, you just need a map. In this case, you want to transform the values of the Map returned by groupBy, which are Lists, and then you want to transform every element of those Lists to keep just the first component of the tuple. So it is a map inside another map.
Also, the Scaladoc is your friend. There are usually many useful methods, like mapValues or groupMap.
Here is a more detailed explanation of my comment. This is what you need to do:
val output = input.groupBy(_._2).mapValues(_.map(_._1))
It is important to remember that in functional programming we love pure functions, meaning we will not have shady side effects, and we like to stick to the one-function-one-purpose principle.
This allows us to chain these pure functions and reason about them linearly. Why am I telling you this? While your instinct might be to look for or implement a single function to do this, don't.
In this case, we first groupBy the key, and then map each value, where we take the value List and map it to extract the _._1 of each tuple.
This is the best I could do to make it sorted too, as I see the other responses are not sorted; maybe someone can improve on this:
import scala.collection.immutable.ListMap
ListMap(input.groupBy(_._2).mapValues(_.map(_._1)).toSeq.sortBy(_._1):_*)

What's a better way of doing this in Visual C++?

I don't normally work in Visual C++, but I was wondering what I could do to speed up this logic... and whether there's a better way of doing this.
I have a map<wstring, wstring> with contents like this:
\Device\CdRom0\, E:\
\Device\CdRom1\, F:\
\Device\HarddiskVolume1\,
\Device\HarddiskVolume4\, C:\
\Device\HarddiskVolume5\, D:\
And I have a huge list of strings that have the following format:
L"\\Device\\HarddiskVolume4\\Users\\User\\Documents\\Visual Studio 2013\\Projects\\FileLocker\\FileLocker\\bin\\Debug\\Test.txt";
My whole purpose is to take strings in the above format, use the map as a type of lookup, and convert these strings into the following format (example converting the above string to a drive path):
L"C:\\Users\\User\\Documents\\Visual Studio 2013\\Projects\\FileLocker\\FileLocker\\bin\\Debug\\Test.txt";
The way I am doing it currently is as follows (for each string):
std::wstring test = ...
for (map<wstring, wstring>::iterator i = volumes.begin(); i != volumes.end(); ++i)
{
    if (test.find((*i).first.c_str()) == 0)
    {
        test = test.replace(0, wcslen((*i).first.c_str()), (*i).second.c_str());
    }
}
But there's a lot of strings here, and performance can really take a hit! What are some better ways of performing this lookup and assigning to the string at hand?
If you know there are always exactly two backslash-separated terms to match, extract just that part of the string, then search for that in the map - or try a hash map.
If you want to stick to the map and same style of logic, you could replace...
if (test.find((*i).first.c_str()) == 0)
...with test.compare(0, i->first.size(), i->first) == 0 (compare returns 0 on a match), so it doesn't try to match the prefix at every position along the string.
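As a sketch, the tweaked loop might look like this (the break assumes at most one device prefix can match a given path):
for (std::map<std::wstring, std::wstring>::iterator i = volumes.begin();
     i != volumes.end(); ++i)
{
    // compare() anchors the test at position 0 instead of scanning the
    // whole string the way find() does.
    if (test.compare(0, i->first.size(), i->first) == 0)
    {
        test.replace(0, i->first.size(), i->second);
        break;  // at most one device prefix matches, so stop looking
    }
}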
You could also build a tree of resolution steps:
\Device\ ---> Cdrom ---> 0
         |               1
         |
         ---> HardDiskVolume ---> 1
                                  4
                                  5
The C++ Standard Library doesn't provide a convenient container type for modelling this, though - if the depth is always 3 you can hardcode a few maps (the last, numeric level could even be an array); otherwise there's e.g. Boost.Graph.
After a successful replace, use break; to exit the for loop. That will roughly double the performance by eliminating attempts to match the remaining drives. If the relative frequency of the drive prefixes is roughly known, ordering the lookup by that frequency (e.g. iterating a vector of pairs sorted by frequency, since a std::map always iterates in key order) will add to the effectiveness of the break.

Checking if a string contains an English sentence

As of right now, I take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing everything from that newline to the next newline, then I call string.find() to see whether that English word is somewhere in the string being checked. This takes a VERY long time; each word takes about a quarter to half a second to verify.
It works perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (multithreading), but it still only checks about 10 per second. (I need thousands.)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not reach the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't simply check for impossible combinations of letters, because a valid string would then be thrown out - for example, the one above, because of the 'tm' in 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You need to do this only once. Your searches for individual words will then proceed at a faster pace, because the "false starts" are eliminated.
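A compact sketch of that approach: a standard KMP failure table, built once per dictionary word and then reused to scan every string (it assumes non-empty words):
#include <string>
#include <vector>

// Builds the KMP failure table: fail[i] is the length of the longest proper
// prefix of w that is also a suffix of w[0..i].
std::vector<int> BuildTable(const std::string &w)
{
    std::vector<int> fail(w.size(), 0);
    int k = 0;
    for (std::size_t i = 1; i < w.size(); ++i)
    {
        while (k > 0 && w[i] != w[k]) k = fail[k - 1];
        if (w[i] == w[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Scans text for w without ever re-examining matched characters.
bool KmpContains(const std::string &text, const std::string &w,
                 const std::vector<int> &fail)
{
    std::size_t k = 0;  // number of characters of w currently matched
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        while (k > 0 && text[i] != w[k]) k = fail[k - 1];
        if (text[i] == w[k]) ++k;
        if (k == w.size()) return true;
    }
    return false;
}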
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter occurs more than once, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search only the array element indicated by its first letter. This limits the amount of text that has to be searched. Plus, you can't ever find a word beginning with, say, 'r', anywhere before the first 'r' in the string. And some words won't trigger a search at all if their first letter isn't in the string.
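A sketch of this idea, assuming the text and words are lowercase a-z (the function name is illustrative):
#include <iostream>
#include <string>
#include <vector>

void SearchByFirstLetter(const std::string &text,
                         const std::vector<std::string> &dictionary)
{
    // suffix[c] is the part of text starting at the first occurrence of
    // letter c; walking backwards means the earliest (longest) one wins.
    std::string suffix[26];
    for (std::size_t i = text.size(); i-- > 0; )
        suffix[text[i] - 'a'] = text.substr(i);

    for (const std::string &word : dictionary)
    {
        if (word.empty()) continue;
        const std::string &s = suffix[word[0] - 'a'];  // empty if letter absent
        if (!s.empty() && s.find(word) != std::string::npos)
            std::cout << word << '\n';
    }
}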
Idea 2
Expand upon that idea by noting the length of the longest word in the dictionary and trimming the stored strings: letters farther from the starting letter than that length can never take part in a match.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the starting letter is present several times, you have to keep more letters. So this one has to keep the whole string, because the 'e' keeps showing up less than 5 letters apart:
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with ideas 1 and 2. It's an idea you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is also possible to write out the regular expression itself and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |         |
|    |         y -> *
|    |
|    o -> b -> *
|         |
|         d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like /(arun)|(b(ill(y?)|o(b|dy)))|(jose)/. This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives, and if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that wrote code to run the algorithm directly, instead of having code interpret the binary tree data structure.
Think of each set of vertical-bar options as a switch statement against a particular character column, and each arrow as a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
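For illustration, here is a hand-written fragment of the kind of generated code described, for the tiny dictionary above. It is hypothetical and merely reports whether some dictionary word starts at position i:
#include <cstddef>
#include <string>

// Returns true if some dictionary word starts at position i of s. A match
// on "bill" also covers "billy", since one hit per position is enough.
bool MatchAt(const std::string &s, std::size_t i)
{
    if (i >= s.size()) return false;
    switch (s[i])
    {
    case 'a':  // arun
        return s.compare(i + 1, 3, "run") == 0;
    case 'b':
        if (i + 1 >= s.size()) return false;
        switch (s[i + 1])
        {
        case 'i':  // bill, billy
            return s.compare(i + 2, 2, "ll") == 0;
        case 'o':  // bob, body
            return s.compare(i + 2, 1, "b") == 0 ||
                   s.compare(i + 2, 2, "dy") == 0;
        }
        return false;
    case 'j':  // jose
        return s.compare(i + 1, 3, "ose") == 0;
    }
    return false;
}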
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
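A minimal Bloom filter sketch; the bit-array size, probe count, and double-hashing scheme are illustrative choices rather than tuned values:
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class BloomFilter
{
    static const std::size_t kBits = 1 << 20;  // bit-array size; tune to your set
    static const std::size_t kProbes = 7;      // number of hash probes per key
    std::bitset<kBits> bits_;                  // large: heap-allocate the filter

    // Double hashing (h1 + i*h2) is a common way to derive many hash
    // values from two; the "#" salt is a crude but serviceable second hash.
    static std::size_t Probe(const std::string &key, std::size_t i)
    {
        const std::size_t h1 = std::hash<std::string>()(key);
        const std::size_t h2 = std::hash<std::string>()(key + "#");
        return (h1 + i * h2) % kBits;
    }

public:
    void Add(const std::string &key)
    {
        for (std::size_t i = 0; i < kProbes; ++i)
            bits_.set(Probe(key, i));
    }
    bool MayContain(const std::string &key) const
    {
        for (std::size_t i = 0; i < kProbes; ++i)
            if (!bits_.test(Probe(key, i)))
                return false;  // definitely not in the set
        return true;           // possibly in the set (false positives happen)
    }
};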
This code was modified from How to split text without spaces into list of words?:
from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i, k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k, c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k

    return costsum
Using the same dictionary as that answer and testing your string outputs:
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would have to split everything into single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So you can make it ignore spaces by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: training may take a few minutes (assuming you have already compiled gold-standard "sentences" and "random scrambled strings" texts). You only need to train once; you can save the trained model to disk and reuse it for subsequent runs by loading it, which takes a few seconds. Scoring a string takes a trivially small number of floating-point multiplications to get a probability, so after you finish training it should be very fast.
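A sketch of such a scorer using character bigrams; it assumes the training text has already been lowercased and stripped of everything but a-z, and all names are illustrative:
#include <cmath>
#include <cstddef>
#include <string>

class BigramModel
{
    int counts_[26][26] = {};
    double logp_[26][26] = {};

public:
    // Train on lowercase a-z text with spaces already stripped.
    void Train(const std::string &text)
    {
        for (std::size_t i = 0; i + 1 < text.size(); ++i)
            ++counts_[text[i] - 'a'][text[i + 1] - 'a'];
        for (int a = 0; a < 26; ++a)
        {
            int total = 0;
            for (int b = 0; b < 26; ++b) total += counts_[a][b];
            for (int b = 0; b < 26; ++b)  // add-one smoothing avoids log(0)
                logp_[a][b] = std::log((counts_[a][b] + 1.0) / (total + 26.0));
        }
    }

    // Average per-transition log-probability; higher means more English-like.
    double Score(const std::string &s) const
    {
        if (s.size() < 2) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i + 1 < s.size(); ++i)
            sum += logp_[s[i] - 'a'][s[i + 1] - 'a'];
        return sum / static_cast<double>(s.size() - 1);
    }
};
You would train one model on English text and, optionally, a second on random strings, then classify by whichever model gives the higher score (or by a threshold found by experiment).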

Given a string, find all its permutations that are a word in dictionary

This is an interview question:
Given a string, find all its permutations that are a word in dictionary.
My solution:
Put all words of the dictionary into a suffix tree and then search each permutation of the string in the tree.
The search time is O(n), where n is the size of the string. But the string may have n! permutations.
How do I improve the efficiency?
Your general approach isn't bad.
However, you can avoid having to search for each permutation by rearranging your word so that all its characters are in alphabetical order, then searching a dictionary in which each word has similarly been rearranged into alphabetical order and mapped to the original word.
I realise that might be a little hard to grasp as is, so here's an example. Say your word is leap. Rearrange this to aelp.
Now in your dictionary you might have the words plea and pale. Having done as suggested, your dictionary will (among other things) contain the following mappings:
...
aelp -> pale
aelp -> plea
...
So now, to find your anagrams you need only find entries for aelp (using, for example, a suffix-tree approach as suggested), rather than for all 4! = 24 permutations of leap.
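A sketch of that preprocessing, here with a hash map instead of a tree (names are illustrative; any associative container works):
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, std::vector<std::string>> anagrams;

// Index every word under its letters sorted alphabetically, so all
// anagrams share one key ("pale" and "plea" both land under "aelp").
void AddWord(const std::string &word)
{
    std::string key = word;
    std::sort(key.begin(), key.end());
    anagrams[key].push_back(word);
}

std::vector<std::string> Lookup(const std::string &query)
{
    std::string key = query;
    std::sort(key.begin(), key.end());
    const auto it = anagrams.find(key);
    return it == anagrams.end() ? std::vector<std::string>() : it->second;
}
Building the map costs one sort per dictionary word, paid once; each query is then a single sort plus one lookup.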
A quick alternative solution - all depends on the sizes of data structures in question.
If the dictionary is reasonably small and the string is reasonably long, you can go over each entry in the dictionary and figure out whether it is a permutation of the string. You can be smarter, too: you can sort the dictionary and skip certain entries.
You can build a map from a sorted list of characters to a list of words.
For example, given these:
Array (him, hip, his, hit, hob, hoc, hod, hoe, hog, hon, hop, hos, hot)
you would sort them internally:
Array (him, hip, his, hit, bho, cho, dho, eho, gho, hno, hop, hos, hot)
sort the result:
Array (bho, cho, dho, eho, gho, him, hip, his, hit, hno, hop, hos, hot)
In this small sample there is no match, but for a particular word you would sort it internally and, with the result as the key, look into your map.
Why don't you use a hash map to store the dictionary words? That way you get O(1) lookup time. And if your input is English, you can build another table of all the letters that occur in your dictionary; using this table, you can reject some inputs right at the start. The following is an example, rendered here as C++ for concreteness (letter_table and dictionary_hash_table are assumed to be prebuilt):
// Needs <algorithm>, <string>, <unordered_set>, <vector>.
std::vector<std::string> result_list;
// Early exit: if any input letter never occurs in the dictionary,
// no permutation of the input can possibly be a word.
for (char c : input)
{
    if (letter_table.count(c) == 0)
    {
        return result_list;
    }
}
// Check every permutation with an O(1) hash-table lookup.
std::string entry = input;
std::sort(entry.begin(), entry.end());  // next_permutation needs a sorted start
do
{
    if (dictionary_hash_table.count(entry) != 0)
    {
        result_list.push_back(entry);
    }
} while (std::next_permutation(entry.begin(), entry.end()));
return result_list;
You should put the words into a trie. Then you can look up the words as you generate the permutations, and you can skip whole blocks of permutations whose first part is not in the trie.
http://en.wikipedia.org/wiki/Trie
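A sketch of that pruning (names are illustrative; duplicate letters can produce repeated hits, which a set would remove):
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct TrieNode
{
    std::map<char, TrieNode*> next;
    bool is_word = false;
};

// Extends prefix one letter at a time, abandoning a branch as soon as the
// trie has no edge for it. Call with the full string in remaining and an
// empty prefix.
void Search(const TrieNode *node, std::string &remaining, std::string &prefix,
            std::vector<std::string> &hits)
{
    if (remaining.empty())
    {
        if (node->is_word) hits.push_back(prefix);  // used every letter
        return;
    }
    for (std::size_t i = 0; i < remaining.size(); ++i)
    {
        std::map<char, TrieNode*>::const_iterator it = node->next.find(remaining[i]);
        if (it == node->next.end()) continue;  // prune this whole block of permutations
        prefix.push_back(remaining[i]);
        remaining.erase(i, 1);
        Search(it->second, remaining, prefix, hits);
        remaining.insert(i, 1, prefix.back());  // undo for the next branch
        prefix.pop_back();
    }
}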
Another simple solution could be the algorithm below:
1) Use next_permutation to generate each unique permutation.
2) Use find/find_if to look it up in the dictionary.

Autocompletion library in C++

I need an auto-completion routine or library in C++ for 1 million words. I guess I could find a routine on the net, like Rabin–Karp. Do you know a library that does this? I don't see one in Boost.
Also, is it a crazy idea to use a MySQL LIKE query to do that?
Thank you
EDIT: It is true that it is more suggestions than auto-completion that I need (propose ten words when the user has typed the first two letters). I also have expressions like "Nikon digital camera". But for a first version, I only need suggestions on the "Ni" of "Nikon", not on "digital camera".
You don't have to use any crazy algorithm if you begin by preparing an index.
A simple trie or binary search tree structure that keeps the words ordered alphabetically would allow efficient prefix searches.
In C++, for example, the std::map class has the lower_bound member, which would point in O(log N) to the first element that could possibly extend your word.
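A sketch of the lower_bound approach; the mapped type is arbitrary here, and the function name is illustrative:
#include <iostream>
#include <map>
#include <string>

void Suggest(const std::map<std::string, int> &dict,
             const std::string &prefix, int limit)
{
    // All keys starting with prefix form a contiguous range that begins
    // at lower_bound(prefix); stop when a key no longer has the prefix.
    for (std::map<std::string, int>::const_iterator it = dict.lower_bound(prefix);
         it != dict.end() && limit > 0 &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it, --limit)
    {
        std::cout << it->first << '\n';
    }
}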
Hmmm, if you're thinking about using LIKE, it means that most probably you want classical autocompletion (the beginning of the word matches).
What about organising your data (nicely) into a 26-ary tree (one entry per letter, or, if you support more than letters, a well-chosen x-ary tree)? That way, you organize your data once, and then you get quick results by tree traversal. If you want to limit the number of results proposed in your autocompletion, you can adapt your traversal algorithm. It seems simple and efficient: a LIKE in SQL has to compare all the items in your table each time, whereas this solution is much quicker once the data is correctly set up.
Another solution: you can peek at Qt's implementation of QCompleter (it might be overkill to depend on Qt in your code, I don't know).
I worked on a project once that did something like this using CLucene. It worked fine.
You can use a trie (prefix tree) to store your words.
#include <map>
#include <string>

struct trie
{
    std::map<char, trie*> next;
    bool is_word = false;  // initialized so lookups on fresh nodes are safe

    void insert(const std::string &w)
    {
        trie *n = this;
        for (std::size_t i = 0; i < w.size(); ++i)
        {
            if (n->next.find(w[i]) == n->next.end())
            {
                n->next[w[i]] = new trie();  // never freed in this sketch
            }
            n = n->next[w[i]];
        }
        n->is_word = true;
    }
};
Then you can easily get prefix matches iterating on subtrees.
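For example, a prefix-match walk over that trie could look like this sketch (collect and complete are hypothetical helper names; add <vector> to the includes above):
#include <vector>

// Depth-first collection of every word stored below node n.
void collect(const trie *n, std::string &word, std::vector<std::string> &out)
{
    if (n->is_word) out.push_back(word);
    for (std::map<char, trie*>::const_iterator it = n->next.begin();
         it != n->next.end(); ++it)
    {
        word.push_back(it->first);
        collect(it->second, word, out);
        word.pop_back();
    }
}

// Descend along the prefix, then collect the whole subtree.
std::vector<std::string> complete(const trie *root, std::string prefix)
{
    const trie *n = root;
    for (std::size_t i = 0; i < prefix.size(); ++i)
    {
        std::map<char, trie*>::const_iterator it = n->next.find(prefix[i]);
        if (it == n->next.end()) return std::vector<std::string>();
        n = it->second;
    }
    std::vector<std::string> out;
    collect(n, prefix, out);
    return out;
}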
You could also write your own simple auto-completion function using Damerau-Levenshtein distance.