SQL Hash table for words - c++

I'm trying to solve the "find all possible words for a set of letters" problems. There are some good answers out there, but I still can't figure it out.
In my first test, I put the whole dictionary in an array and then looped through each letter. This is super fast, but it takes forever to load the dictionary in the array, and requires huge amount of memory.
So I need to store the dictionary (750,000) letter is a sql database.
I guess there are two solutions to find all the possible words:
Make an advance query that returns all the possible words
Make a simple query that return a fraction of the database with words that might be possible, and then quickly loop through that array and valide the words.
The problem?:
It must be super fast. An iPhone 4 need to be able to get all possible words in under 5-6 seconds so it doesn't hinder the game.
Here's a similar questions:
IOS: Sqlite. Find record fast
Sulthans answer seems like a good idea. Create a hash table, and then:
Bitmask for ASCII letters (ignoring any non-ASCII alphabets). Bit at
position 0 means the word contains "a", at position 1 contains "b"
etc. If we create the same bitmask for our letters, we can select
words such as (wordMask & ~lettersMask) == 0
How do you make the bitmask, hash table, and how do you construct the sql query?
Thanks

sql is probably not the best option. The traditional data structure for storing a collection of words is called a Trie. I'm sure there implementations out there you can find. Someone else will have an answer to that.
The algorithm I envision is to permute the letters you are given, and check each permutation to see if it is in the Trie.

Related

Backend challenge in Python

For the past few days I have taken a challenge to write an algorithm in Python 2.7, to find a passphrase given a few hints.
Specifically:
An anagram of the passphrase is: "poultry outwits ants"
The MD5 hash of the secret phrase is "4624d200580677270a54ccff86b9610e"
A Wordlist
What approaches would you use to find the password? I know that brute-forcing all the possible combinations is the sure way but also the one taking way too long to complete (never, i think it is around 20^20 possible combinations).
What I have come up with is to filter the wordlist based on the letters that exist in the anagram and the words. Meaning, if I find a word that contains a letter that is not in the passphrase, I discard it. In addition, I wanted to take into consideration the frequencies of the characters. So I removed from the wordlist any word that has characters with frequency higher than the character frequency of the passphrase. The above was performed on word-level, meaning I checked each word individual. Eventually, from 90k+ words from the wordlist I narrowed them down to 1700 unique words that can be used for the passphrase.
Question 1: Is there anything more that can be done in order to reduce the number of possibilities? Is there a more clever way to take into account the frequencies of the letters?
Next, I thought that since I narrowed it down to 1.7k words maybe I could try permutations (itertools.permutations()) of these words that could possible match the md5 hash of passphrase. Since the anagram of the passphrase contains 2 space characters, I assumed that the passphrase is also three words and not just scrambled characters (maybe I am wrong, but at least I should try it first). As a result, I tried checking permutations of three words from the filtered wordlist. As it turned out, neither that kind of approach is fast enough for my laptop to get any results. The program reaches the memory limit, and the computer freezes.
I also thought about taking into consideration pairs of words instead of triplets and somehow match the letter frequencies in order to filter out some possibilities but I did not come up with a way to do it yet.
Question 2: Is there a way that I can get any more information about the passphrase without checking all the permutations, since this task is prohibitive for my laptop (at least for 1.7k words and above).
I tried using hashcat but I found it too complicated. I ran a couple mask attacks on the md5 hash but with no success. I tried brute-forcing it too, since it can use the GPU but it was still impossible. The main reason was that I could not understand the kind of arguments I needed to give. I know there is an extensive wiki, but as someone with close to no background about hash cracking it was not really helpful. In addition, I would prefer if there was a way to do it on my own and use other programs as less as possible.
If you have any suggestions about solving this please let me know. I am doing this for education purposes, so any input on this will be greatly appreciated.
Thanks

Clojure dictionary of words

I want a dictionary of English words available, to pick random english words. I have a dictionary text file that I downloaded form the internet which has almost 1 million words, what's the best way to go about using this list in Clojure, given that most of the time I'll only need 1 randomly selected word?
Edit:
To answer the comments, this is for some tests which I may turn into load tests which is why I want a decent number of random words and I guess access speed is the most important thing. I do not want to use a database for this. I originally thought of a dictionary just because that's the first thing that popped into my mind but I think a random sequence of letters and numbers would be good enough, perhaps I will just use a UUID as a string.
Read all the words into a Vector and then call rand-nth , e.g.
(rand-nth all-words)
rand-nth uses the nth function on the underlying data structure and Clojure Vectors have log32N performance for index based retrieval.
Edit: This is assuming that it is for a test environment as you described in your question. A more memory efficient method would be to use RandomAccessFile and seek to a random location in the file of words, read until you find the first word delimiter (e.g. comma, EOL) and then read the following bytes until the next delimiter which will give you a random word.

How to use environments for lookups

My question builds upon the topic of matching a string against multiple patterns. One solution discussed here is to use sapply(keywords, grepl, strings, ignore.case=TRUE) which yields a two-dimensional matrix.
However, I run into significant speed issues, when applying this approach to 5K+ keywords and 60K+ strings..(I cancelled the process after 12hrs).
One idea is to use hash tables, or environments in R. However, I don't get how "translate/convert" my strings into an environment while keeping the numerical index?
I have strings[1]... till strings[60000]
e <- new.env(hash=TRUE)
for (i in 1:length(strings)) {
assign(x=i, value=strings, envir=e)
}
As x in assign must be a character, I can't use it like this, but I hope you get my idea..I want to be able to index the environment with the same numbers like in my string[...] vector
Thanks for your help!
R environments are not used as much as perl hashes are, I think
just because there are not widely understood 'idioms' for doing
so. In your case the key question is, do you really want the
numerical index? If so it should be the value. The key is your
string, that's the whole point of the exercise.
e <- new.env(hash=T)
strings <- as.character(chickwts$feed) # note! not unique
sapply(1:length(strings), function(i)assign(strings[i], i, e))
e$horsebean # returns 10
In this example only the last index associated with each string
is kept, but you can assign anything that might be useful to each
key, such as a vector of indices.
You can then lookup your data in a number of ways. You can regex search
for keys using ls, for example, and retrieve the values using mget():
# find all keys containing 'beans'
ls(e, patt='bean')
# retrieve bean data
mget(ls(e, pat='bean'),e)

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about 1/2-1/4 a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contained complete garbage, just random letters.
I can't check for impossible compinations of letters, because that string would be thrown out because of the 'tm', in between 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word, and build a search table for it. You need to do it only once. Now your search for individual words will proceed at faster pace, because the "false starts" will be eliminated.
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice store the longer substring.
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
Idea 2
Expand upon that idea by noting the longest word in the dictionary and get rid of letters from those strings in the arrays that are longer than that distance away.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with 1 and 2. Its an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (Its a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
| | |
| o -> b -> * y -> *
| |
| d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y+))|(o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking for the alternatives and if one matches, more forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note I built one of these some time back by writing a program that wrote the code that ran the algorithm directly instead of having code looking at the binary tree data structure.
Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970 is a
space-efficient probabilistic data structure that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not; i.e. a query returns either
"inside set (may be wrong)" or "definitely not in set". Elements can
be added to the set, but not removed (though this can be addressed
with a "counting" filter). The more elements that are added to the
set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
This code was modified from How to split text without spaces into list of words?:
from math import log
words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
costsum = 0
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
costsum += c
i -= k
return costsum
Using the same dictionary of that answer and testing your string outputs
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.

Given a string, find all its permutations that are a word in dictionary

This is an interview question:
Given a string, find all its permutations that are a word in dictionary.
My solution:
Put all words of the dictionary into a suffix tree and then search each permutation of the string in the tree.
The search time is O(n), where n is the size of the string. But the string may have n! permutations.
How do I improve the efficiency?
Your general approach isn't bad.
However, you can prevent having to search for each permutation by rearranging your word so that all it's characters are in alphabetical order, then searching on a dictionary where each word is similarly re-arranged into alphabetical order and mapped to the original word.
I realise that might be a little hard to grasp as is, so here's an example. Say your word is leap. Rearrange this to aelp.
Now in your dictionary you might have the words plea and pale. Having done as suggested, your dictionary will (among other things) contain the following mappings:
...
aelp -> pale
aelp -> plea
...
So now, to find your anagrams you need only find entries for aelp (using, for example, a suffix-tree approach as suggested), rather than for all 4! = 24 permutations of leap.
A quick alternative solution - all depends on the sizes of data structures in question.
If the dictionary is reasonable small and the string is reasonably long, you can go over each entry in the dictionary and figure out if they are a permutation of the string. You can be smarter - you can sort the dictionary and skip certain entries.
You can build a map from a sorted list of characters to a list of words.
For example, given these:
Array (him, hip, his, hit, hob, hoc, hod, hoe, hog, hon, hop, hos, hot)
you would sort them internally:
Array (him, hip, his, hit, bho, cho, dho, eho, gho, hno, hop, hos, hot)
sort the result:
Array (bho, cho, dho, eho, gho, him, hip, his, hit, hno, hop, hos, hot)
In this small sample, we don't have a match, but for a particular word, you would sort it internally, and with this as key look into your map.
Why don't you use a hash map to store the dictionary words? So you get O(1) lookup time. And if your input is in english, you can build another table to tell all the possible letters in your dictionary, using this table, you can filter some inputs at the beginning. Following is an example:
result_list = empty;
for(char in input)
{
if(char not in letter_table)
{
return result_list;
}
}
for(entry in permutations of input)
{
if(entry in dictionary_hash_table)
{
result_list->add_entry();
}
}
return result_list
You should put the words into a trie. Then you can look up the word as you generate the permutations. You can skip over whole blocks of permutations with the first part is not in the trie.
http://en.wikipedia.org/wiki/Trie
Another simple solution could be as algorithm below,
1) Use "next_permutation" to find a unique permutation.
2) Use "find/find_if" to find it against a dictionary.