Recently I was looking through some interview questions and found an interesting one:
You are given a list of words. Find whether two of the words can be joined together to form a palindrome. E.g., consider the list {bat, tab, cat}; then bat and tab can be joined together to form a palindrome.
An O(nk) solution is expected, where n is the number of words and k is the word length.
There can be multiple pairs; just return true if one is found.
Also, in the comments one of the approaches was this:
1) Add the first word to the trie (A B)
2) Take the second word (D E E D B A) and reverse it (A B D E E D)
3) See how many letters of the reversed word you can match in the trie (the first 2)
4) Take the rest of the string (D E E D) and see if it is a palindrome; if it is, you are done: return true
5) Add the second word to the trie (D E E D B A)
6) Go back to step 2 with the next word
7) When out of words, return false
But in my opinion this is not an O(nk) solution.
Can anyone suggest a solution? Or explain why the algorithm described above is O(nk)?
The algorithm is correct, or at least it gets quite close; there are only minor technical issues. In step 4 one should save the candidate solution if it's better than the current one, and in step 7 return it, or report that it was impossible to make a palindrome.
The main idea is to process words into cores and prefixes. If a core is a palindrome, then we need to match the prefix with another word. The trie serves as a "database" for the processed strings, so with each new word one can check all possible extensions. If the words were kept separately, one would need to compare prefixes against each word individually.
(Edit: I think there is still a small loophole: if there are two words in the trie that start the same, the incoming word might make a palindrome with the shorter one but not the longer one. I won't go into details; handling it would complicate the algorithm but wouldn't affect the complexity.)
It also is O(n*k). Adding a word to a trie, or checking a prefix against it, takes a number of steps proportional to the number of characters, so in this case it is bounded by k, just as tree operations are O(h) where h is the height of the tree. So, in conclusion:
Step 1 takes k steps.
Step 2 takes k steps.
Step 3 also takes at most k steps.
Step 4 also takes less than k steps, but we can bound it by k.
Step 5 also takes k steps.
Steps 2 to 5 are done n-1 times.
Of course each step has a different dominant operation, so it is hard to specify the exact constant, but all of them are bounded by k, so the complexity is O(c*(n-1)*k), which essentially is O(n*k).
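For concreteness, here is a minimal C++ sketch of the seven steps above, assuming lowercase a-z words. It checks the "rest is a palindrome" condition at every complete earlier word encountered along the walk (which covers part of the loophole mentioned in the edit), but it is still only a sketch of the commented approach, not a fully general palindrome-pair solver: it only finds pairs of the form (earlier word) + (current word).

#include <array>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;
    std::array<std::unique_ptr<TrieNode>, 26> child{};
};

void insert(TrieNode* root, const std::string& w) {
    TrieNode* cur = root;
    for (char c : w) {
        auto& next = cur->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        cur = next.get();
    }
    cur->isWord = true;
}

bool isPalindrome(const std::string& s, std::size_t from, std::size_t to) {  // half-open [from, to)
    while (from + 1 < to) {
        if (s[from] != s[to - 1]) return false;
        ++from;
        --to;
    }
    return true;
}

// Steps 2-4: reverse the current word, walk it through the trie, and whenever a
// complete earlier word x ends here, check that the rest of the reversed word is
// a palindrome (then x + word is a palindrome).
bool pairsWithEarlierWord(const TrieNode* root, const std::string& word) {
    std::string rev(word.rbegin(), word.rend());
    const TrieNode* cur = root;
    for (std::size_t i = 0; i < rev.size(); ++i) {
        const TrieNode* next = cur->child[rev[i] - 'a'].get();
        if (!next) return false;
        cur = next;
        if (cur->isWord && isPalindrome(rev, i + 1, rev.size())) return true;
    }
    return false;
}

bool hasPalindromePair(const std::vector<std::string>& words) {
    if (words.empty()) return false;
    TrieNode root;
    insert(&root, words[0]);                                   // step 1
    for (std::size_t i = 1; i < words.size(); ++i) {           // steps 2-6
        if (pairsWithEarlierWord(&root, words[i])) return true;
        insert(&root, words[i]);                               // step 5
    }
    return false;                                              // step 7
}

int main() {
    std::cout << std::boolalpha
              << hasPalindromePair({"bat", "tab", "cat"}) << '\n';  // true
}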
There's a really interesting discussion of this in an article from Dr. Dobb's, way back in 2004. The full explanation is a little long, but the general idea is:
Suppose you start with Lion, where the pivot is left of the actual word. I can calculate the center of the string, which is position two. The pivot is at zero, so the string is too heavy on the right, but at the moment, Lion qualifies as a partial palindrome. The "dot" at the pivot point matches the dot at the pivot point, so there is at least one correct character, albeit the same character. You now wish to prepend words that end with noil, attempting to convert the string to noil.Lion. I use to mean any string of characters. If you're successful, then you need to locate words starting with so that they can be appended to the string.
Note that he defines a partial palindrome as:
A string is a partial palindrome if, working from the pivot point outwards, either the left or right end of the string is encountered before a mismatch occurs.
L = {words such that the substring 'bb' is not present in them}
Given that the alphabet is A = {a,b}, is this language regular? If so, is there a regular expression that represents it?
Yes, this language is regular. Since this looks like homework, here's a hint: if the string bb isn't present, then the string consists of lots of blocks of strings of the form a* or a*b. Try seeing how to assemble the solution from this starting point.
EDIT: If this isn't a homework problem, here's one possible solution:
(a*(ba+)*b?)?
The idea is to decompose the string into long runs of a's with single b's interspersed between them. The first block of a's is at the front. Then we repeatedly place a b, at least one a, and then any number of additional a's. Finally, we may optionally have one b at the end. As an alternative, the whole thing could be the empty string, so the entire expression is guarded by a ?.
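As a quick sanity check (the test strings are my own, not part of the answer), the pattern can be run through std::regex, which accepts this ECMAScript-style expression as-is:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex noDoubleB("(a*(ba+)*b?)?");
    std::cout << std::boolalpha;
    for (const std::string s : {"", "a", "b", "ab", "aba", "baab", "bb", "abba", "babb"}) {
        bool expected = s.find("bb") == std::string::npos;   // in the language iff no "bb"
        std::cout << '"' << s << "\": regex=" << std::regex_match(s, noDoubleB)
                  << " expected=" << expected << '\n';
    }
}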
Hope this helps!
I'm teaching myself formal languages (Aho, Hopcroft), but I'm having a hard time with regular expressions.
I've been able to tackle simple tasks, but this one has posed a challenge, at least for me. I don't see how to solve it when you can't count arbitrarily far; I'm not used to this type of computation.
There must be some property that lets me generalize the answer enough to express it as a regular expression.
So far I've worked out that there may be 2 or 3 cases:
sum mod 3 = 0 if sum = 3k
sum mod 3 = 1 if sum = 3k + 1
sum mod 3 = 2 if sum = 3k + 2.
But I've come to realize that there are many combinations that can produce a given sum, so I can't find the pattern the regular expression must follow.
For example, the string {122211}0 (the braces are just for readability) has the zero at the end because it satisfies {sum = 3k}0; if the sum is 10, as in the string {1222111}1, the case is {sum = 3k+1}, so a one has to go at the end, and so on.
This may or may not be the right track for tackling the problem, but I'm open to any suggestions; any help is very much appreciated.
Here's a hint: think of what distinct final states you could possibly be in. You certainly have at least 3 states, since the sum can take three different values mod three. Also, you need a distinct start state, since the empty string cannot be accepted. Do you need more states?
Hint2: I think you can easily do this with a DFA using a start state and nine other states, of which exactly three will be accepting.
EDIT: Once you have a DFA, you can use Kleene's Theorem to construct an equivalent regular expression. If you'd rather go straight for a regular expression, here's another hint: if you're looking at any string of length 3k, you can append: 0; any string of length 1, followed by 1; any string of length 2, followed by 2. So if you can write regular expressions for strings of lengths 3k, 1, and 2, you're practically done.
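Under one possible reading of the question (the alphabet is the digits 0-2, and a string is accepted when its final digit equals the sum of the preceding digits mod 3; this is only my interpretation of the examples above), the finite-state information needed while scanning is just the running sum mod 3 plus the last digit read, which matches the 1 + 9 states of Hint 2. A rough C++ sketch:

#include <iostream>
#include <string>

// Accepts a string of digits 0-2 whose final digit equals the sum of the
// preceding digits mod 3 (my reading of the examples in the question).
// The only state needed is (sum of everything read so far mod 3, last digit
// read) plus a start state: 1 + 3*3 states, 3 of them accepting.
bool accepts(const std::string& s) {
    if (s.empty()) return false;              // the start state is not accepting
    int total = 0, last = -1;
    for (char c : s) {
        if (c < '0' || c > '2') return false; // outside the assumed alphabet
        last = c - '0';
        total = (total + last) % 3;
    }
    // Accept when last == (sum of the prefix) mod 3, i.e. last == (total - last) mod 3.
    return last == ((total - last) % 3 + 3) % 3;
}

int main() {
    std::cout << std::boolalpha
              << accepts("1222110") << ' '    // prefix sum 9  = 3k,   ends in 0: true
              << accepts("12221111") << ' '   // prefix sum 10 = 3k+1, ends in 1: true
              << accepts("110") << '\n';      // prefix sum 2, but ends in 0: false
}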
I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or removed to make the word the same as a word in the dictionary). For example, 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory, though, because there would surely be thousands of permutations for each word. It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words were large.
I was thinking that maybe I could cut down the time a bit by only looking at the keys that are similar to the word I'm looking up. But then again, if they're similar in some way, the key will probably already be a suggestion, meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while removing a letter creates only a few "bad words", addition or substitution creates a great many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein distance.
The solution is described in incremental steps; the search speed should improve with each idea, and I have tried to present the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein distance algorithm
Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would perform better)
Key points:
The Levenshtein distance computation uses a matrix; at each step the next row is computed solely from the previous row
The minimum value in a row is always greater than or equal to the minimum in the previous row
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (the threshold), then whenever the minimum of the current row exceeds 2 you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
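A minimal sketch of the short-circuited computation described above, keeping only two rows and abandoning as soon as every value in the current row exceeds the threshold (the function name is mine):

#include <algorithm>
#include <string>
#include <vector>

// Edit distance between a and b, or threshold + 1 as soon as it is certain to
// exceed the threshold (the row minimum never decreases, so we can stop early).
int levenshtein_limited(const std::string& a, const std::string& b, int threshold) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        int rowMin = curr[0];
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution / match
            rowMin = std::min(rowMin, curr[j]);
        }
        if (rowMin > threshold) return threshold + 1; // short-circuit
        std::swap(prev, curr);
    }
    return prev[b.size()];
}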
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the (short-circuited) distance and we keep the words that achieve the smallest distance so far.
It works very well on smallish dictionaries.
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
By using the pair (length, word) as the key instead of just the word, you can restrict your search to the length range [length - edit, length + edit] and greatly reduce the search space.
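A sketch of that idea, assuming a std::set keyed by (length, word) and reusing levenshtein_limited from the sketch in step 0 (declared here so the snippet stands alone); the names are mine:

#include <set>
#include <string>
#include <utility>
#include <vector>

int levenshtein_limited(const std::string& a, const std::string& b, int threshold); // from step 0

// Returns the dictionary words closest to query, looking only at entries whose
// length is within maxEdit of the query's length.
std::vector<std::string> suggestions(
        const std::set<std::pair<std::size_t, std::string>>& dictionary,
        const std::string& query, int maxEdit) {
    std::vector<std::string> best;
    int bestDist = maxEdit + 1;
    std::size_t lo = query.size() > static_cast<std::size_t>(maxEdit)
                         ? query.size() - maxEdit : 0;
    std::size_t hi = query.size() + maxEdit;

    for (auto it = dictionary.lower_bound({lo, std::string()});
         it != dictionary.end() && it->first <= hi; ++it) {
        int d = levenshtein_limited(query, it->second, bestDist); // threshold tightens as we go
        if (d < bestDist) { bestDist = d; best.clear(); }
        if (d == bestDist && d <= maxEdit) best.push_back(it->second);
    }
    return best;
}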
3. Prefixes and pruning
To improve on this, note that when we build the distance matrix row by row, one word is scanned in full (the word we are looking for) but the other (the referent) is not: we only use one letter of it per row.
This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words sharing the same prefix are adjacent.
Suppose you are checking your word against cartoon and at car you realize it cannot work (the distance is already too large); then any word beginning with car won't work either, so you can skip words for as long as they begin with car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long "long" is depends on your dictionary, and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
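A sketch of the binary-search skip mentioned a few lines up, assuming a sorted std::vector<std::string> of lowercase words and a non-empty hopeless prefix (the helper name is mine):

#include <algorithm>
#include <string>
#include <vector>

// 'it' points at a word whose first prefixLen letters have already pushed the
// distance past the threshold; return the first position whose word does not
// start with that prefix.
std::vector<std::string>::const_iterator
skip_prefix(const std::vector<std::string>& dict,
            std::vector<std::string>::const_iterator it,
            std::size_t prefixLen) {
    std::string bound = it->substr(0, prefixLen);
    ++bound.back();  // smallest string greater than everything starting with the prefix
    return std::lower_bound(it, dict.cend(), bound);
}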
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then, when reading carwash, you need to determine the length of the common prefix (here car), and you can keep the first 4 rows of the matrix (corresponding to the empty prefix, c, a, r).
Therefore, when you begin computing carwash, you in fact start iterating at w.
To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest referent (you should know the largest word length in your dictionary).
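A sketch of the row re-use, with the threshold short-circuit left out for brevity; it assumes the dictionary is sorted so that shared prefixes are adjacent:

#include <algorithm>
#include <string>
#include <vector>

// Smallest edit distance between query and any word in sortedDict, recomputing
// only the rows that are not shared with the previous word's prefix.
int best_distance(const std::vector<std::string>& sortedDict, const std::string& query) {
    std::size_t maxLen = 0;
    for (const auto& w : sortedDict) maxLen = std::max(maxLen, w.size());

    // rows[i][j] = distance between the first i letters of the current word
    // and the first j letters of the query.
    std::vector<std::vector<int>> rows(maxLen + 1, std::vector<int>(query.size() + 1));
    for (std::size_t j = 0; j <= query.size(); ++j) rows[0][j] = static_cast<int>(j);

    int best = static_cast<int>(maxLen + query.size());
    std::string previous;
    for (const auto& word : sortedDict) {
        // Rows for the prefix shared with the previous word are still valid.
        std::size_t common = 0;
        while (common < word.size() && common < previous.size() &&
               word[common] == previous[common]) ++common;

        for (std::size_t i = common + 1; i <= word.size(); ++i) {
            rows[i][0] = static_cast<int>(i);
            for (std::size_t j = 1; j <= query.size(); ++j) {
                int cost = (word[i - 1] == query[j - 1]) ? 0 : 1;
                rows[i][j] = std::min({rows[i - 1][j] + 1,
                                       rows[i][j - 1] + 1,
                                       rows[i - 1][j - 1] + cost});
            }
        }
        best = std::min(best, rows[word.size()][query.size()]);
        previous = word;
    }
    return best;
}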
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a trie or a Patricia tree to store the dictionary. However, it's not an STL data structure, and you would need to augment it to store, in each subtree, the range of word lengths stored there, so you'd have to make your own implementation. It's not as easy as it seems, because there are memory-explosion issues that can kill locality.
This is a last resort option. It's costly to implement.
You should have a look at Peter Norvig's explanation of how to write a spelling corrector.
How to write a spelling corrector
Everything is explained well in that article; as an example, the Python code for the spell checker looks like this:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig's website.
For a spell checker, many data structures would be useful, for example a BK-tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.
My earlier code link may be misleading; this one is correct for the spelling corrector.
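For reference, a rough C++ sketch of a BK-tree (my own, not the linked implementation): each child is keyed by its edit distance to its parent, so a query within tolerance t only has to descend into children whose key lies in [d - t, d + t].

#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <vector>

int edit_distance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j)
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (a[i - 1] != b[j - 1])});
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

struct BKNode {
    std::string word;
    std::map<int, std::unique_ptr<BKNode>> children;  // keyed by distance to this word
};

void bk_insert(std::unique_ptr<BKNode>& node, const std::string& word) {
    if (!node) { node = std::make_unique<BKNode>(); node->word = word; return; }
    int d = edit_distance(word, node->word);
    if (d == 0) return;                 // already present
    bk_insert(node->children[d], word);
}

void bk_query(const BKNode* node, const std::string& word, int tolerance,
              std::vector<std::string>& out) {
    if (!node) return;
    int d = edit_distance(word, node->word);
    if (d <= tolerance) out.push_back(node->word);
    for (const auto& [key, child] : node->children)
        if (key >= d - tolerance && key <= d + tolerance)
            bk_query(child.get(), word, tolerance, out);
}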
Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.
Say I have 100 keywords (that can include spaces) and I need to find out how many times they occur in a big piece of text. What would be the fastest way to accomplish this?
My current idea is as follows:
turn the keywords into a suffix tree
walk through the text following the nodes, and whenever a character does not occur in the tree (i.e. the corresponding child pointer is NULL), skip to the next word and search again
The suffix tree struct would look something like this:
struct node {
    int count;  /* number of occurrences (only used at leaf nodes) */
    /* for each lower-case char, a pointer to either NULL or the next node */
    struct node *children[26];
};
I am sure there is a faster way to do this, but what is it? Space efficiency is not really a big deal for this case (hence the children array for faster lookup), but time efficiency really is. Any suggestions?
The problem with the suffix tree approach is that you have to start the suffix search for each letter of the text to be searched. I think the best way to go would be to arrange a search for each keyword in the text, but using some fast search method with precomputed values, such as Boyer-Moore.
EDIT:
OK, you may well be right that the trie is faster. Boyer-Moore is very fast in the average case. Consider, for example, that the keywords have a mean length of m. BM can be as fast as O(n/m) for "normal" strings, which would make 100*O(n/m) in total. The trie would be O(n*m) on average (though it is true it can be much faster in real life), so if 100 >> m then the trie would win.
Now for some random ideas on optimization. In some compression algorithms that have to do backward searches, I've seen partial hash tables indexed by two characters of the string. That is, if the string to check is the sequence of characters c1, c2, and c3, you can check whether:
if (hash_table[c1 * 256 + c2] == true) check_strings_beginning_with(c1, c2);
then for c2 and c3, and so on. It is surprising how many cases you avoid with this simple check, as the hash will be true only about 100/65536 of the time (roughly 0.15%).
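A small sketch of that two-character pre-filter (the names are mine): a 256x256 bitmap of the first two bytes of every keyword lets most text positions be rejected with a single lookup before any real matching starts.

#include <bitset>
#include <string>

struct TwoCharFilter {
    std::bitset<256 * 256> seen;  // one bit per (first byte, second byte) pair of a keyword

    void add(const std::string& keyword) {
        if (keyword.size() >= 2)
            seen.set(static_cast<unsigned char>(keyword[0]) * 256 +
                     static_cast<unsigned char>(keyword[1]));
    }
    // Cheap test before running the expensive match at position pos.
    bool maybe_match(const std::string& text, std::size_t pos) const {
        return pos + 1 < text.size() &&
               seen.test(static_cast<unsigned char>(text[pos]) * 256 +
                         static_cast<unsigned char>(text[pos + 1]));
    }
};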
This is what I would do.
Put all your keywords in a hash table of key-value pairs, with the number of occurrences of the keyword as the value and the keyword as the (you guessed it) key.
Check each word in the text blob against the hash table. If the word is in the hash table, increment the occurrence count associated with it.
This is a good way because a hash table lookup is (or should be) amortized O(1) time. The whole algorithm has linear complexity :).
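A minimal sketch of this for single-word keywords (the tokenization is deliberately simplistic, lowercase letters only; keywords containing spaces are handled in the edit below):

#include <cctype>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Keyword -> occurrence count (keywords assumed to be lowercase single words).
    std::unordered_map<std::string, int> counts = {{"stack", 0}, {"overflow", 0}};
    std::string text = "Stack Overflow: where the stack never overflows.";

    std::string word;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        if (i < text.size() && std::isalpha(static_cast<unsigned char>(text[i]))) {
            word += static_cast<char>(std::tolower(static_cast<unsigned char>(text[i])));
        } else if (!word.empty()) {
            auto it = counts.find(word);
            if (it != counts.end()) ++it->second;  // only count known keywords
            word.clear();
        }
    }
    for (const auto& [keyword, count] : counts)
        std::cout << keyword << ": " << count << '\n';   // stack: 2, overflow: 1
}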
EDIT: If your keywords can contain spaces, you would need to build a sort of DFA. Scan the file until you find a word that one of your key "phrases" starts with. If the second (and however many subsequent) words are part of the "key phrase", then increment the occurrence count.
You seem to be groping your way towards http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
To quote:
The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).
If it is an industrial application, use Boost.Regex.
It is tested, fast and chances are that it will save you a lot of pain.
I am stumped by this practice problem (not for marks):
{w is an element of {a,b}* : the number of a's is even and the number of b's is even }
I can't seem to figure this one out.
In this case 0 is considered even.
A few acceptable strings: {}, {aa}, {bb}, {aabb}, {abab}, {bbaa}, {babaabba}, and so on
I've done similar examples where all the a's must come first, where the answer would be:
(aa)*(bb)*
but in this case they can be in any order.
Kleene stars (*), unions (U), intersects (&), and concatenation may be used.
Edit: I also have trouble with this one:
{w is an element of {0,1}* : w = 1^r 0 1^s 0 for some r,s >= 1}
This is kind of ugly, but it should work:
ε U ( (aa) U (bb) U ((ab) U (ba))((aa) U (bb))*((ab) U (ba)) )*
For the second one:
11*011*0
Generally I would write 1+ instead of 11* here, i.e. 1+01+0.
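As a quick check of both answers (with U written as | so std::regex accepts them, and using the even-a's/even-b's expression in the form given above), on test strings of my own:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex evenAB("(aa|bb|(ab|ba)(aa|bb)*(ab|ba))*");  // even number of a's and of b's
    std::regex twoBlocks("11*011*0");                      // 1^r 0 1^s 0 with r, s >= 1

    std::cout << std::boolalpha;
    for (const std::string s : {"", "aa", "abab", "babaabba", "abaaba", "a", "ab"})
        std::cout << '"' << s << "\" even a's and b's: " << std::regex_match(s, evenAB) << '\n';
    for (const std::string s : {"10110", "110110", "010"})
        std::cout << '"' << s << "\" matches 1^r 0 1^s 0: " << std::regex_match(s, twoBlocks) << '\n';
}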
Edit: Undeleted re: the comments in NullUserException's answer.
1) I personally think this one is easier to conceptualize if you first construct a DFA that can accept the strings. I haven't written it down, but off the top of my head I think you can do this with 4 states and one accept state. From there you can create an equivalent regex by removing states one at a time using an algorithm such as this one. This is possible because DFAs and regexes are provably equivalent.
2) Consider the fact that the Kleene star only applies to the nearest regular expression. Hence, if you have two individual ungrouped atoms (an atom itself is a regex!), it only applies to the second one (so ab* would match a single a followed by any number of b's, including zero). You can use this to your advantage when you want something to exist but you're not sure how many of it there are.