Levenshtein algorithm: How do I meet these text-editing requirements? - C++

I'm using the Levenshtein algorithm to meet these requirements. When a word of N characters is found, the dictionary words to suggest as corrections are:
Every dictionary word of N characters that differs from the found word by exactly 1 character.
Example:
found word: bearn, dictionary word: bears
Every dictionary word of N+1 characters that has N characters in common with the found word.
Example:
found word: bear, dictionary word: bears
Every dictionary word of N-1 characters that has N-1 characters in common with the found word.
Example:
found word: bears, dictionary word: bear
I'm using the implementation of the Levenshtein algorithm in C++ below to detect when a word has a Levenshtein distance of 1 (which is the distance in all three cases), but then how do I choose the word to suggest? I read about Boyer-Moore-Horspool and Knuth-Morris-Pratt, but I'm not sure how either of them can help here.
#include <string>
#include <vector>
#include <algorithm>
using namespace std;

int levenshtein(const string &s1, const string &s2)
{
    string::size_type N1 = s1.length();
    string::size_type N2 = s2.length();
    string::size_type i, j;
    vector<int> T(N2+1);

    for ( i = 0; i <= N2; i++ )
        T[i] = i;

    for ( i = 0; i < N1; i++ ) {
        T[0] = i+1;
        int corner = i;
        for ( j = 0; j < N2; j++ ) {
            int upper = T[j+1];
            if ( s1[i] == s2[j] )
                T[j+1] = corner;
            else
                T[j+1] = min(T[j], min(upper, corner)) + 1;
            corner = upper;
        }
    }
    return T[N2];
}

You may also want to add Norvig's excellent article on spelling correction to your reading list.
It's been a while since I read it, but I remember it being very similar to what you're writing about.

As I've said elsewhere, Boyer-Moore isn't really apt for this. Since you want to search for multiple strings simultaneously, the algorithm of Wu and Manber should be more to your liking.
I've posted a proof of concept C++ code in answer to another question. Heed the caveats mentioned there.

Why restrict the suggestion to a single word, why not include a set of words? If you are restricted to a single word, you can order your results by some pre-calculated frequency of usage or something. This frequency could be updated based on what users select from the suggestion.
Also, in the case where there isn't a spelling error in the original word, you might want to prioritize the N+1 cases, which would be more like an autocomplete. Anyway, I don't think there is one correct way to do it; if your requirements were more specific, it would be easier to narrow down.
Also, you don't need to know Python to understand the algorithms described in Norvig's article.

If I understand you correctly, then there is no correct answer to your question. You are identifying up to three suggestions for a given word using Levenshtein - it is up to you to come up with a rule to decide which one to use and which ones to filter out. Or perhaps you should use them all?
Just as a matter of interest, the Damerau extension to Levenshtein might be useful to you: there, two swapped characters are also considered to give a distance of 1, instead of the 2 that vanilla Levenshtein returns.

Related

C++ function to find a pattern in a list of floats?

I have a list of 18 floats. Contained in this list of floats is a single pattern that occurs twice. I would like to be able to pick out this pattern without providing this specific pattern to the program.
I have tried to plan this several times both on a whiteboard and in Visual Studio. I get stuck trying to iterate forward/backward to find a continued pattern once I find the first float of each pattern. I also looked online for any examples but I could not find any that found any existing patterns in the list without being given a specific pattern.
Thanks and I appreciate all input/help!
Where I get stuck:
std::vector<float> RandomFloats =
{
    8.74,
    7.76,
    9.45,
    7.41, // Pattern Begin
    8.91,
    9.55,
    7.01,
    9.63, // Pattern End
    10.0,
    8.67,
    7.78,
    7.41, // Pattern Begin
    8.91,
    9.55,
    7.01,
    9.63, // Pattern End
    7.58,
    9.65,
    8.18
};

const static int FloatCount = RandomFloats.size();
for (int i = 0; i < FloatCount; i++)
{
    float fParent = RandomFloats.at(i);
    for (int j = 0; j < FloatCount; j++)
    {
        float fChild = RandomFloats.at(j);
        if (fChild != fParent)
            continue;
        // Check for continued pattern here.
    }
}
It looks like you're on the right track with your current approach. It's not particularly efficient, but it'll get the job done.
One problem is that the j loop starts at index 0. This is unnecessary, and is going to cause you extra confusion. Think about the meaning ("semantics") of that code. The goal is to find the next occurrence of the value at position i. Right now, you're searching the entire vector for it, which will actually find the value you're currently on, or even one that occurs earlier. You don't want that!
So, start the loop at position i + 1, not at 0:
for (int i = 0; i < FloatCount; i++) {
    for (int j = i + 1; j < FloatCount; j++) {
        //       ^^^^^^^^^
    }
}
Then, you just need to write one more loop at your "Check for continued pattern here" part of the code. Think about what that needs to do. It's looking for a sequence of matching values such that the sequences don't overlap.
Think about what you know at that point. You've found that index i and index j mark the potential start of a sequence, because the values in the vector are equal. Now you need to check each value that follows until they either don't match, or you reach the end of the vector.
Putting those words into code:
int ii = i + 1;
int jj = j + 1;
while (jj < FloatCount &&  //<-- don't run off end of array
       ii < j &&           //<-- don't allow sequences to overlap
       RandomFloats[ii] == RandomFloats[jj])
{
    ++ii;
    ++jj;
}
After that, you know that both ii and jj indices are one-past-the-end of the sequence. So the calculation of its length is simple:
int sequenceLength = ii - i;
The last bit is an exercise for you:
If a sequence might contain two identical values, or more generally if any value can appear more than once anywhere at all, then you also need to check whether the sequence you've found is better than any sequence you've found before.
To do that, you'll want some variables that remember the best sequence length so far, and where the two starting points of that sequence are. Then, you can easily check this whenever you find a repeated sequence, and update it if necessary.
If I'm not wrong, what you are looking for is an adaptation of the longest repeated substring problem.
In the original problem, given a string, you want to find the longest repeated substring; this can be done in O(n) operations using a suffix tree. Your question is exactly the same, but with numbers (each number plays the role of a letter), so the suffix tree should be altered to store numbers rather than characters.
Going for the same approach can work here as well; you can check it here. (I have checked with your input, and it works, even though it looks for recurring letters rather than numbers and is therefore less efficient.)
Generally speaking, it is very common to look for a similar problem that already has a solution, as here, and to take a solution from one field and import it into another.

Reversing the positions of words in a string without changing order of special characters in O(1) space limit

I came up with this question during a mock interview. The interviewer first asked it without any space limitation, then continued with the space-limited version. To be on the same page: in the question, a string and a container class consisting of delimiters are given. It is up to you to choose a suitable container class and the language of your response. I think the sample input and output below are enough to show what the question really is.
Input:
"Reverse#Strings Without%Changing-Delimiters"
Output:
"Delimiters#Changing Without%Strings-Reverse"
Note that the positions of "#", "%", " ", "-" are not changed.
I came up with the solution below:
string ChangeOrderWithoutSpecial(string s, unordered_set<char> delimiter)
{
    stack<string> words;   // last word out first
    queue<char> limiter;   // first delimiter out first
    string response = "";  // return value
    int index = -1;        // index of last delimiter visited
    int len = s.length();

    for (int i = 0; i < len; i++)
    {
        if (delimiter.find(s[i]) != delimiter.end()) // i-th char is a delimiter
        {
            string temp = s.substr(index + 1, i - index - 1);
            words.push(temp);
            char t = s.at(i);
            limiter.push(t);
            index = i;
        }
    }
    // I realized this part after the interview; it assumes the string starts
    // with a word and has no consecutive delimiters, i.e. each word is
    // followed by exactly one delimiter.
    if (index != (int)s.length() - 1)
    {
        string temp = s.substr(index + 1, s.length() - index - 1); // until the end
        cout << temp << endl;
        words.push(temp);
    }
    while (!limiter.empty())
    {
        response += words.top() + limiter.front();
        words.pop();
        limiter.pop();
    }
    response += words.top();
    return response;
}
However, I couldn't find an O(1) space solution. Does anyone know how? I also could not figure out the case where there are multiple consecutive delimiters; help with that would also be appreciated. Thank you to anyone who even spends time reading this.
Find the first word and the last word. Rotate the string by length(last_word)-length(first_word): this would put the middle part in the correct position. In the example, that'll produce
ersReverse#Strings Without%Changing-Delimit
Then rotate the first and last part of the string, skipping the middle, by length(first_word):
Delimiters#Strings Without%Changing-Reverse
Repeat this algorithm for the substring between the two outermost delimiters.
The "rotate by m" operation can be performed in O(1) space and O(n) time, where n is the length of the sequence being rotated.
Instead of rotating the string, it can be also solved by successive reversing the string.
Reverse the whole string. This is O(n) operation. In your case the string becomes sretimileD-gnignahC%tuohtiW sgnirtS#esreveR.
Find all words and reverse each of them. This is O(n) operation. String is now equal to Delimiters-Changing%Without Strings#Reverse.
Reverse delimiters. This is O(n) operation. You'll get wanted result: Delimiters#Changing Without%Strings-Reverse.
Each of these operations can be done in place, so the total memory complexity is O(1) and time complexity is O(n).
It is worth noting that with this approach each character is visited 4 times (whole-string reverse, finding words, word reverse, delimiter reverse), so in the general case it should be faster than Igor Tandetnik's answer, where characters in the middle of the string are visited many times. However, in the special case where every word has the same length, Igor's solution will be faster, because the first rotate operation is not needed.
Edit:
Reverse delimiters can be done in O(n) without extra memory in the similar way as the standard reverse. Just iterate through delimiters instead of whole set of characters:
Iterate forward until you reach delimiter;
Reverse iterate until you reach delimiter from the back;
Swap the current delimiters;
Continue procedure until your iterators meet.
Here is a procedure in C++ that does this job:
void reverseDelimiters(string& s, unordered_set<char>& delimiters)
{
    auto i = s.begin();
    auto j = s.end() - 1;
    auto dend = delimiters.end();
    while (i < j) {
        while (i < j && delimiters.find(*i) == dend) i++;
        while (i < j && delimiters.find(*j) == dend) j--;
        if (i < j) { swap(*i, *j); i++; j--; }
    }
}

Choosing an efficient data structure to find rhymes

I've been working on a program that reads in a whole dictionary, and utilizes the WordNet from CMU that splits every word to its pronunciation.
The goal is to utilize the dictionary to find the best rhymes and alliterations of a given word, given the number of syllables in the word we need to find and its part of speech.
I've decided to use std::map<std::string, vector<Sound> > and std::multimap<int, std::string> where the map maps each word in the dictionary to its pronunciation in a vector, and the multimap is returned from a function that finds all the words that rhyme with a given word.
The int is the number of syllables of the corresponding word, and the string holds the word.
I've been working on the efficiency, but can't seem to get it to be more efficient than O(n). The way I'm finding all the words that rhyme with a given word is
vector<string> *rhymingWords = new vector<string>;
for (const auto &entry : soundMap) { // soundMap is the map<std::string, vector<Sound> >
    if (rhymingSyllables(word, entry.first) >= 1 && entry.first != word) {
        rhymingWords->push_back(entry.first);
    }
}
return rhymingWords;
And when I find the best rhyme for a word (a word that rhymes the most syllables with the given word), I do
vector<string> rhymes = *getAllRhymes(rhymesWith);
int x = 0;
for (string s : rhymes) {
    if (countSyllables(s) == numberOfSyllables) {
        int a = rhymingSyllables(s, rhymesWith);
        if (a > x) {
            x = a; // remember the best rhyme count so far
            bestRhyme = s;
        }
    }
}
return bestRhyme;
The drawback is the O(n) access time in terms of the number of words in the dictionary. I'm thinking of ideas to drop this down to O(log n) , but seem to hit a dead end every time. I've considered using a tree structure, but can't work out the specifics.
Any suggestions? Thanks!
The rhymingSyllables function is implemented as such:
int rhymingSyllables(const string &word1, const string &word2)
{
    int syllableCount = 0;
    if ((soundMap.count(word1) == 0) || (soundMap.count(word2) == 0)) {
        return 0;
    }
    vector<Sound> &firstSounds = soundMap.at(word1), &secondSounds = soundMap.at(word2);
    for (int i = firstSounds.size() - 1, j = secondSounds.size() - 1; i >= 0 && j >= 0; --i, --j) {
        if (firstSounds[i] != secondSounds[j]) return syllableCount;
        else if (firstSounds[i].isVowel()) ++syllableCount;
    }
    return syllableCount;
}
P.S.
The vector<Sound> is the pronunciation of the word, where Sound is a class that covers every different sound of a morpheme in English, i.e.:
AA vowel, AE vowel, AH vowel, AO vowel, AW vowel, AY vowel, B stop, CH affricate, D stop, DH fricative, EH vowel, ER vowel, EY vowel, F fricative, G stop, HH aspirate, IH vowel, IY vowel, JH affricate, K stop, L liquid, M nasal, N nasal, NG nasal, OW vowel, OY vowel, P stop, R liquid, S fricative, SH fricative, T stop, TH fricative, UH vowel, UW vowel, V fricative, W semivowel, Y semivowel, Z fricative, ZH fricative
Perhaps you could group the morphemes that will be matched during rhyming and compare not the vectors of morphemes, but vectors of associated groups. Then you can sort the dictionary once and get a logarithmic search time.
After looking at rhymingSyllables implementation, it seems that you convert words to sounds, and then match any vowels to each other, and match other sounds only if they are the same. So applying advice above, you could introduce an extra auxiliary sound 'anyVowel', and then during dictionary building convert each word to its sound, replace all vowels with 'anyVowel' and push that representation to dictionary. Once you're done sort the dictionary. When you want to search a rhyme for a word - convert it to the same representation and do a binary search on the dictionary, first by last sound as a key, then by previous and so on. This will give you m*log(n) worst case complexity, where n is dictionary size and m is word length, but typically it will terminate faster.
You could also exploit the fact that for the best rhyme you consider only words with a certain syllable count, and maintain a separate dictionary per syllable count. Then you count the syllables of the word you are looking for rhymes for and search in the appropriate dictionary. Asymptotically it doesn't give you any gain, but the speedup may be useful in your application.
I've been thinking about this and can suggest an approach to an algorithm.
I would first divide the dictionary into multiple buckets or batches, where each bucket holds the words with a given syllable count. Traversing the vector of strings to distribute the words into buckets is linear. The first bucket holds all the one-syllable words, so there is nothing more to do for it; for every bucket after it, each word needs its syllables separated out. If you have, say, 25 buckets, the first few and the last few will not hold many words, so their processing time is insignificant and they can be done first. The buckets in the middle, holding words of roughly 3-6 syllables, will be the largest, so if a bucket's size is over a certain threshold you could process each of those buckets on its own thread and run them in parallel. Once you are done, each bucket should yield a std::vector<std::shared_ptr<Word>>, where your structure might look like this:
enum SpeechSound {
    SS_AA,
    SS_AE,
    // ...
    SS_ZH
};

enum SpeechSoundType {
    ASPIRATE,
    // ...
    VOWEL
};

struct SyllableMorpheme {
    SpeechSound sound;
    SpeechSoundType type;
};

class Word {
private:
    std::string m_strWord;
    // These two containers should match in size: one string for each
    // syllable, and one matching struct from above containing two enums.
    std::vector<std::string> m_vSyllables;
    std::vector<SyllableMorpheme> m_vMorphemes;

public:
    explicit Word( const std::string& word );

    std::string getWord() const;
    std::string getSyllable( unsigned index ) const;
    unsigned getSyllableCount() const;
    SyllableMorpheme getMorpheme( unsigned index ) const;

    bool operator==( const Word& other ) const;
    bool operator!=( const Word& other ) const;

private:
    Word( const Word& c );                // Not Implemented
    Word& operator=( const Word& other ); // Not Implemented
};
You will now have new buckets, or vectors of shared pointers to these class objects. You can then easily write a function to traverse each bucket, or even several buckets, since the buckets all have the same signature, just a different syllable count. Remember: each bucket is already sorted alphabetically, since we only added words by syllable count and never changed the order in which they were read from the dictionary.
With this you can easily compare whether two words are equal while checking for matching syllables and morphemes. And since these are contained in std::vector<std::shared_ptr<Word>>, you don't have to worry as much about memory cleanup either.
The idea is to use linear search, separation and comparison as much as possible; if your container gets too large, create buckets and run multiple threads in parallel, or maybe use a hash table if it suits your needs.
Another possibility with this class structure is that you could add more to it later if you wanted or needed to, such as another std::vector for definitions, and another std::vector<string> for its part of speech {noun, verb, etc.}. You could even add other vector<string> members for things such as homonyms, homophones, and even a list of all the words that rhyme with it.
Now for your specific task of finding the best matching rhyme you may find that some words may end up having a list of Words that would all be considered a Best Match or Fit! Due to this you wouldn't want to store or return a single string, but rather a vector of strings!
Case Example:
To Too Two Blue Blew Hue Hew Knew New,
Bare Bear Care Air Ayre Heir Fair Fare There Their They're
Plain, Plane, Rain, Reign, Main, Mane, Maine
Yes these are all single syllable rhyming words, but as you can see there are many cases where there are multiple valid answers, not just a single best case match. This is something that does need to be taken into consideration.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave a below-threshold quality score) and every other position is a base (A, C, T, or G). I tried to write a quaternary-tree algorithm, but that method was far too memory-intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j = 0; j < counter; j++) {     //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++) {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //boring stuff
    }
}

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < (int)str2.length(); j++)
    {
        int counter = 0, offset = 0;
        while (offset < (int)str1.length() &&       // stay inside str1
               j + offset < (int)str2.length() &&   // stay inside str2
               str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need half a terabyte of memory even if you just store one byte per string pair.
However, I assume what you are really interested in is long matches, those that have more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know whether two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you, still without knowing whether it fits your application, is:
1) From each string, extract the largest substring(s) that make sense. I guess you want to ignore 'X'es at the start and end entirely, and if the "readable" parts are broken up by long runs of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" question depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring, and order this list by string length. You may, but don't have to, also store the indexes of the original strings together with each substring. You'll get something like (example):
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you have found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurence counter by one, else enter it into the table.
e) Repeat 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
If this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random/you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
I don't see many ways around the fact that you need to compare each string with every other, including shifting them, and that by itself is extremely expensive; a computing cluster seems the best approach.
The only thing I see to improve is the string comparison itself: replace A, C, T, G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, but it is still an option to investigate), and then compare positions quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.

counting of the occurrence of substrings

Is there an efficient algorithm to count the total number of occurrence of a sub-string X in a longer string Y ?
To be more specific, what I want is the total number of ways of selecting A.size() elements from B such that there exists a permutation of the selected elements that matches A.
An example is as follows: find the total number of occurrences of X=AB in the string Y=ABCDBFGHIJ.
The answer is 2: the first A with the B at position 2, and the first A with the B at position 5.
I know we can generate all permutations of the long string (N! strings Y of length N) and use the KMP algorithm to search for and count the occurrences of X in each.
Can we do better than that ?
The original problem I try to solve is as follows: let's say we have a large matrix M of size r by c (r and c in the range of 10000's). Given a small matrix P of size a by b (a and b are in the range of 10's). Find the total number of different selections of a rows and b columns of M (this will give us an a by b "submatrix" H) so that there exists a permutation of the rows and columns of H that gives us a matrix which matches P.
I think once I can solve 1-D case, 2-D may follow the solution.
After some research, I found out that this is a subgraph isomorphism problem, which is NP-hard. There are algorithms that solve it efficiently in practice; you can google it and find many papers on the subject.
After having read, then re-read the question (at @Charlie's suggestion), I have concluded that these answers are not addressing the real issue. I have also concluded that I still do not know exactly what the issue is, but if the OP answers my questions and clarifies the issue, I will come back and make a better attempt at addressing it. For now, I will leave this as a placeholder...
To find occurrences of a letter or other character:
char buf[] = "this is the string to search";
int i, count = 0, len;
len = strlen(buf);
for (i = 0; i < len; i++)
{
    if (buf[i] == 's') count++;
}
or, using strtok(), find occurrences of a sub-string.
Not a pretty, brute-force method. Note that strtok() modifies the buffer it parses, so each pass needs its own copy, and with a multi-character delimiter string it splits on each individual character, not on the substring as a whole:
// strings to search
char str1[] = "is";
char str2[] = "s";
int count = 0;
char buf[]  = "this is the string to search";
char buf2[] = "this is the string to search"; // fresh copy for the second pass
char *tok;

tok = strtok(buf, str1);
while (tok) {
    count++;
    tok = strtok(NULL, str1);
}
tok = strtok(buf2, str2);
while (tok) {
    count++;
    tok = strtok(NULL, str2);
}
count now holds the number of tokens from splitting on the characters of str1, plus the number from splitting on the characters of str2; this is only a rough stand-in for the number of occurrences of "is" and "s".
[EDIT]
First, let me ask for a technical clarification of your question: given A = "AR" and B = "START", the solutions would be "A", "R" and "AR", in this case all found in the 3rd and 4th letters of B. Is that correct? If so, that's easy enough. You can do it with some small modifications and additions to what I have already done above, and if you have questions about that code, I would be happy to address them if I can.
The second part is your real question: searching with better efficiency than, or at least the same efficiency as, the KMP algorithm. That's the real trick. If choosing the best approach is the real question, then some Google searching is in order, because once you find and settle on the best approach (efficiency >= KMP) to the sub-string search, the implementation becomes a set of simple steps (given enough time), possibly, but not necessarily, using some of the same C components used above. (Pointer manipulation will be faster than using the string functions, I think.) But these techniques are just implementation, and should always follow a good design. Here are a few Google searches to help you get started... (you may have already been to some of these)
Validating KMP
KMP - Can we do better?
KMP - Defined
KMP - Improvements using Fibonacci String
If, once you have made your algorithm selection and begun to implement your design, you have questions about techniques or coding suggestions, post them. My guess is there are several people here who would enjoy helping with such a useful algorithm.
If X is a substring in Y (in your sense), then each character of X must appear in Y. So first iterate through X and record the count of each character in an array counts.
Then, for each character with count >= 1, count the number of times it appears in Y, which can be done trivially in O(n).
From there the answer should just be the product of the combinations C(count(Y), count(X)) over those characters.
That is, if after the 3rd time reading your question I have finally understood it correctly.