Word lexical families - c++

I am given a set of N words and an integer K. Two words are in the same group if they have exactly the first K letters and the last K letters identical. If they have more than K letters identical or fewer than K letters identical, then the words are not in the same group. For example:
For k=3.
"abcdefg" and "abczefg" are in the same group
"abcddefg" and "abcdzefg" are not in the same group (the first k+1 letters are identical)
"abc" and "abc" are in the same group
A word can be in more than one group. For example (k=3):
"abczefg" and "abcefg" form a group
"abczaefg" and "abcefg" form a group
"abczaefg" and "abczefg" are not in the same group (the first k+1 letters are identical)
The problem asks me to find the number of groups which contain the maximum number of words.
I thought about using a Trie (or Prefix Tree), and I assume this is the right data structure for this problem, but I don't know how to adapt it here, because the part where two words with more than k identical letters are not in the same group confuses me. My idea has complexity O(N*N*K), and considering that N<=10,000 and K<=100, I don't think it is fast enough. I would like to explain my idea to you, but it is not clear yet even to me and I don't even know if it is correct, so I will skip this part.
My question is whether there is a faster algorithm to solve this problem, and if there is such an algorithm, I kindly ask you to explain it a little bit. Thank you in advance, and I am sorry for the grammatical mistakes and if I didn't explain the problem clearly!

First group all the words that share the same first k letters and last k letters. Your largest group must sit inside one of these buckets, since two words that differ in their first or last k letters can never be in the same group.
So, within each of these buckets (words that share the same k letters at their start and end), you need to find a maximal set of words such that no two share the (k+1)-th letter from the start, nor the (k+1)-th letter from the end.
Construct a graph whose vertices are the de-duplicated pairs of letters at position k+1 from each end of the words in one of these buckets, and put an edge between (a, b) and (c, d) if a=c or b=d.
You need to find a largest set of vertices with no edges between them. This reduced problem is an instance of the maximum independent set problem, which is NP-hard, so you'll need to solve it with a search and hope the set of words you're given isn't too nasty. Perhaps there's something about these particular graphs that admits a faster solution, but I don't see it.
The solution to the entire problem is the largest solution to one of the reduced problems described above.
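As a rough sketch of the above in Python (not the poster's code; largest_group_size and conflicts are made-up names, and the independent-set step is plain brute force, so it is only workable when each bucket is small):

from collections import defaultdict
from itertools import combinations

def conflicts(p, q):
    # Two (letter k+1 from the start, letter k+1 from the end) pairs clash
    # if they share either letter: the words would then agree on k+1 letters
    # at that end and so cannot be in the same group.
    return (p[0] is not None and p[0] == q[0]) or \
           (p[1] is not None and p[1] == q[1])

def largest_group_size(words, k):
    # Bucket the words by (first k letters, last k letters).
    buckets = defaultdict(set)
    for w in words:
        if len(w) >= k:
            pair = (w[k] if len(w) > k else None,
                    w[-k - 1] if len(w) > k else None)
            buckets[(w[:k], w[-k:])].add(pair)

    best = 0
    for pairs in buckets.values():
        pairs = list(pairs)
        n = len(pairs)
        # Brute-force maximum independent set in the conflict graph.
        for size in range(n, 0, -1):
            if any(all(not conflicts(pairs[i], pairs[j])
                       for i, j in combinations(subset, 2))
                   for subset in combinations(range(n), size)):
                best = max(best, size)
                break
    return best

For example, largest_group_size(["abczefg", "abcefg", "abczaefg"], 3) gives 2, matching the examples in the question; the real problem also asks how many groups reach that size, which this sketch does not track.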
Hope this helps!

Related

How to Solve this Modified Word Ladder Problem?

Here is the word ladder problem:
Given two words (beginWord and endWord), and a dictionary's word list, find the length of the shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
Now, along with changing a letter, we are also allowed to delete or add a letter.
We have to find the minimum number of steps, if possible, to convert string1 to string2.
This problem has a nice BFS structure. Let's illustrate this using the example in the problem statement.
beginWord = "hit",
endWord = "cog",
wordList = ["hot","dot","dog","lot","log","cog"]
Since only one letter can be changed at a time, if we start from "hit", we can only change to those words which have exactly one letter different from it (in this case, "hot"). Putting in graph-theoretic terms, "hot" is a neighbor of "hit". The idea is simply to start from the beginWord, then visit its neighbors, then the non-visited neighbors of its neighbors until we arrive at the endWord. This is a typical BFS structure.
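For reference, a minimal Python sketch of that substitution-only BFS (lowercase words assumed; the function name is just illustrative):

from collections import deque
from string import ascii_lowercase

def word_ladder_length(begin_word, end_word, word_list):
    words = set(word_list)
    if end_word not in words:
        return 0
    queue = deque([(begin_word, 1)])
    visited = {begin_word}
    while queue:
        word, steps = queue.popleft()
        if word == end_word:
            return steps
        # Neighbors: words that differ by exactly one substituted letter.
        for i in range(len(word)):
            for ch in ascii_lowercase:
                candidate = word[:i] + ch + word[i + 1:]
                if candidate in words and candidate not in visited:
                    visited.add(candidate)
                    queue.append((candidate, steps + 1))
    return 0

Allowing additions and deletions would presumably mean generating extra candidates in the same neighbor loop (insert a letter at every position, delete each position), which is exactly the part the question below asks about.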
But now, since we are also allowed to add/delete letters, how should I proceed further?

String Finding Alg w/ Lowest Freq Char

I have 3 text files. One with a set of text to be searched through
(ex. ABCDEAABBCCDDAABC)
One contains a number of patterns to search for in the text
(ex. AB, EA, CC)
And the last containing the frequency of each character
(ex.
A 4
B 4
C 4
D 3
E 1
)
I am trying to write an algorithm to find the least frequently occurring character for each pattern, search the text for occurrences of that character, and then check the surrounding letters to see if the whole pattern matches. Currently, I have the characters and frequencies in their own vectors (where i=0 in each vector corresponds to A and 4, respectively).
Is there a better way to do this? Maybe a faster data structure? Also, what are some efficient ways to check the pattern string against the piece of the text string once the least frequent letter is found?
You can run the Aho-Corasick algorithm. Its complexity, once the preprocessing (whose cost does not depend on the text) is done, is Θ(n + p), where
n is the length of the text
p is the total number of matches found
This is essentially optimal. There is no point in trying to skip over letters that appear to be frequent:
If the letter is not part of a match, the algorithm takes unit time.
If the letter is part of a match, then the match includes all letters, irrespective of their frequency in the text.
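For illustration only, here is a rough from-scratch sketch of the automaton in Python (helper names are mine; a solution in another language would follow the same structure):

from collections import deque

def build_automaton(patterns):
    # Trie with goto transitions, failure links and output lists.
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state]       -> fallback state
    out = [[]]    # out[state]        -> patterns ending at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] += out[fail[s]]
    return goto, fail, out

def find_matches(text, patterns):
    goto, fail, out = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))  # (start index, pattern)
    return matches

# e.g. find_matches("ABCDEAABBCCDDAABC", ["AB", "EA", "CC"])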
You could run a loop that keeps a count of occurrences and checks whether a character has appeared more than a given percentage of the time, based on the number of characters searched for and the total length of the string. E.g., if you have 100 characters and 5 possible values, any character that has already appeared more than 20% of the hundred can be discounted, increasing efficiency by skipping positions holding that value.

Permutation/Combination with license plates

Question: a California license plate has the format #LLL###, where L = a letter of the alphabet. I know the total number of combinations is 10^4 * 26^3. How about if I excluded a certain word, such as "FSS", so that no license plate combination may include the word "FSS"?
How do I go about this? I can still use the letters, but the three can't be together. It's throwing me for a loop. Do I use permutations to exclude the forbidden word? Any help is appreciated.
EDIT: the # = a digit, so from 0-9 there are ten possibilities; sorry, I didn't clarify.
There are only so many ways you can have FSS in a string of seven characters.
FSS####
#FSS###
##FSS##
###FSS#
####FSS
So there are five different license plates with the string FSS in them. If there is no constraint on the four numbers, that means you have 10,000 different license plates for each position of "FSS".
You would want to subtract 10,000 * 5 from your total answer to get the plates allowed.
Edit:
So you want any of 0-9 in the first, fifth, sixth, and seventh positions, and any of A-Z in the second, third, and fourth positions, except for F in the second, S in the third, and S in the fourth, right? If so, it would be 10*25*25*25*10*10*10, or 10^4 * 25^3. Did I get your problem right?
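If it helps, a quick brute-force check in Python over just the three letter slots (assuming # is a digit 0-9 and L is a letter A-Z) shows how the two readings differ:

from itertools import product
from string import ascii_uppercase

blocks = list(product(ascii_uppercase, repeat=3))
# Reading 1: F, S, S each forbidden in its own slot (the 25^3 above).
per_slot = sum(1 for a, b, c in blocks if a != 'F' and b != 'S' and c != 'S')
# Reading 2: only the exact block "FSS" is forbidden.
exact = sum(1 for blk in blocks if blk != ('F', 'S', 'S'))
print(per_slot, exact)   # 15625 (= 25**3) and 17575 (= 26**3 - 1)
# Multiply either count by 10**4 digit combinations to get the plate total.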

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing the text from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about 1/4-1/2 of a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (multithreading), but it still only checks about 10 a second. (I need thousands.)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still won't get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't simply reject impossible combinations of letters, because then that string would be thrown out because of the 'tm' in between 'that' and 'must'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You need to do this only once. Now your searches for individual words will proceed at a faster pace, because the "false starts" will be eliminated.
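A minimal sketch of that in Python (helper names are mine): build the prefix/failure table once per dictionary word, then reuse it for every string you check:

def build_kmp_table(pattern):
    # table[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    table = [0] * len(pattern)
    length = 0
    for i in range(1, len(pattern)):
        while length and pattern[i] != pattern[length]:
            length = table[length - 1]
        if pattern[i] == pattern[length]:
            length += 1
        table[i] = length
    return table

def kmp_contains(text, pattern, table):
    j = 0
    for ch in text:
        while j and ch != pattern[j]:
            j = table[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return True
    return False

# Precompute the tables once, then scan each incoming string:
# tables = {w: build_kmp_table(w) for w in dictionary}
# any(kmp_contains(s, w, tables[w]) for w in dictionary)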
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter occurs more than once, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
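A small Python sketch of Idea 1, under the assumptions above (for each letter keep the suffix starting at its first occurrence, i.e. the longest substring beginning with that letter, and search each dictionary word only inside the suffix for its first letter):

def build_first_letter_index(text):
    # Letter -> the suffix of the text starting at that letter's
    # first (leftmost) occurrence; letters that never occur are absent.
    index = {}
    for i, ch in enumerate(text):
        if ch not in index:
            index[ch] = text[i:]
    return index

def find_dictionary_words(text, dictionary):
    index = build_first_letter_index(text)
    found = []
    for word in dictionary:
        suffix = index.get(word[0])
        if suffix and word in suffix:
            found.append(word)
    return found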
Idea 2
Expand upon that idea by noting the length of the longest word in the dictionary and getting rid of letters in those stored strings that are further away than that distance.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with Ideas 1 and 2. It's an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|    |
|    d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y+))|(o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking for the alternatives and, if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
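A compact Python sketch of that linked structure, here as a nested-dict trie rather than a hand-built regex (the names are illustrative):

END = object()  # marker for "a word ends here" (the * in the diagram)

def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return trie

def contains_dictionary_word(text, trie):
    # Try to start a match at every column; stepping to the next
    # start index plays the role of "move to the next column" above.
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                return True
    return False

# contains_dictionary_word("xxbillyxx",
#     build_trie(["arun", "bob", "bill", "billy", "body", "jose"]))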
Side note: I built one of these some time back by writing a program that wrote the code that ran the algorithm directly, instead of having code looking at the binary tree data structure.
Think of each set of vertical-bar options as a switch statement against a particular character column and each arrow as a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
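As a rough sketch of that approach (a toy Bloom filter built on hashlib; the bit-array size and hash count are arbitrary here, and in practice you would size them from the number of dictionary words and the false-positive rate you can tolerate):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

You would add() every dictionary word once, then call might_contain() on each candidate substring and only fall back to the exact dictionary lookup when it answers True.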
This code was modified from How to split text without spaces into list of words?:
from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k

    return costsum
Using the same dictionary as that answer and testing your string outputs:
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything into single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary; even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: training may take a few minutes (assuming you have already compiled gold-standard "sentences" and "random scrambled strings" texts), but you only need to train once. You can easily save the trained model to disk and reuse it for subsequent runs by loading it back, which takes a few seconds. Making a call on a string takes a trivially small number of floating-point multiplications to get a probability, so after you finish training, it should be very fast.
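A toy sketch of that idea in Python: a character-bigram model trained on space-stripped text (lowercase a-z assumed; the names and the add-one smoothing are my choices, and the decision threshold would be tuned on your gold-standard texts):

import math
from collections import defaultdict

def train_bigram_model(text):
    # Count character-to-character transitions, ignoring spaces.
    text = text.replace(" ", "").lower()
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    # Convert the counts to smoothed log-probabilities.
    model = {}
    for a, following in counts.items():
        total = sum(following.values()) + 26
        model[a] = {b: math.log((following.get(b, 0) + 1) / total)
                    for b in "abcdefghijklmnopqrstuvwxyz"}
    return model

def average_log_prob(s, model, floor=math.log(1e-6)):
    # Higher (less negative) averages mean "looks more like the training text".
    s = s.lower()
    logps = [model.get(a, {}).get(b, floor) for a, b in zip(s, s[1:])]
    return sum(logps) / max(len(logps), 1)

Score a handful of known-English (space-stripped) strings and known-garbage strings with average_log_prob to pick a threshold; real text should come out noticeably higher.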

Comparing the contents of two lists in prolog

I have some homework and I am stuck at one point. I am given some facts like these:
word([h,e,l,l,o]).
word([m,a,n]).
word([w,o,m,a,n]). etc
and I have to make a rule so that the user will input one list of letters and I should compare the list with the words I have and correct any possible mistakes. Here is the code I am using if the first letter is in the correct place:
mistake_letter([],[]).
mistake_letter([X|L1],[X|L2]):-
    word([X|_]),
    mistake_letter(L1,L2).
The problem is I don't know how to move to the next letter in the word fact. The next time backtracking runs, it will use the head of the word, while I would like to use the second letter in the list. Any ideas on how to solve this?
I am sorry for any grammatical mistakes and I appreciate your help.
In order to move to the next letter in the word fact, you need to make the word from the fact a third argument, and take it along for the ride. In your mistake_letter/2, you will pick words one by one, and call mistake_letter/3, passing the word you picked along, like this:
mistake_letter(L1,L2):-
    word(W),
    mistake_letter(L1,L2,W).
Then you'll need to change your base case to do something when the letters in the word being corrected run out before the letters of the word that you picked. What you do depends on your assignment: you could backtrack with mistake_letter([],[],[])., declare a match with mistake_letter([],[],_)., attach the word's tail to the correction with mistake_letter([],W,W)., or do something else.
You also need an easy case to cover the situation when the first letter of the word being corrected matches the first letter of the word that you picked:
mistake_letter([X|L1],[X|L2],[X|WT]):-
    mistake_letter(L1, L2, WT).
Finally, you need the most important case: what to do when the initial letters do not match. This is probably the bulk of your assignment: the rest is just boilerplate recursion. In order to get it right, you may need to change mistake_letter/3 to mistake_letter/4 to be able to calculate the number of matches, and later compare it to the number of letters in the original word. This would let you drop "corrections" like [w,o,r,l,d] --> [h,e,l,l,o] as having only 20% of matching letters.