Levenshtein program variant? - c++

I have a C++ task and I'm not sure how to approach it.
You are given a startString and an ENDSTRING. The goal is to transform startString into ENDSTRING using the following operations:
insert character
delete character
replace substring from dictionary
move substring (to the left)
reverse substring
The number of operations used should be as low as possible.
Searching on Google, I found that this is the string reconstruction problem.
It is an edit distance problem, in particular the Levenshtein distance algorithm.
But the Levenshtein algorithm does not give you the steps taken; it only gives the number of steps. I have to write an algorithm that transforms startString into ENDSTRING with as few operations as possible and writes a file describing the steps taken.
Can you please tell me which algorithm I should use? The Levenshtein one only gives the number of steps, but I need both that number and a list of the actual steps.
Thanks

Basically, you would have to modify the Levenshtein algorithm a bit. Since it is an example of dynamic programming, the value computed for each state of the state space corresponds to a specific choice among the alternatives. To my understanding, you have two options:
1.) use an auxiliary data structure in which, for each state, you store the choice to which its value corresponds;
2.) use no auxiliary data structure, but once the state space has been evaluated, use backtracking to see which choices were possible.
Option 2.) might result in less code, but option 1.) might be easier to understand.
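Option 1.) could look roughly like the sketch below. It is only an illustration under assumptions: it covers just the three classic single-character operations (insert, delete, replace), not the move, reverse or dictionary operations from the question, and the names EditStep and editScript are made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One recorded edit; pos is an index into the original source string.
struct EditStep {
    char op;          // 'I' insert before pos, 'D' delete at pos, 'R' replace at pos
    std::size_t pos;
    char ch;          // inserted or replacement character ('\0' for a delete)
};

// Levenshtein DP that also stores, per cell, which alternative produced its value.
std::vector<EditStep> editScript(const std::string& a, const std::string& b) {
    const std::size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    std::vector<std::vector<char>> choice(n + 1, std::vector<char>(m + 1, 'M'));

    for (std::size_t i = 1; i <= n; ++i) { d[i][0] = static_cast<int>(i); choice[i][0] = 'D'; }
    for (std::size_t j = 1; j <= m; ++j) { d[0][j] = static_cast<int>(j); choice[0][j] = 'I'; }

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const bool same = a[i - 1] == b[j - 1];
            int best = d[i - 1][j - 1] + (same ? 0 : 1);   // match ('M') or replace ('R')
            char op = same ? 'M' : 'R';
            if (d[i - 1][j] + 1 < best) { best = d[i - 1][j] + 1; op = 'D'; }
            if (d[i][j - 1] + 1 < best) { best = d[i][j - 1] + 1; op = 'I'; }
            d[i][j] = best;
            choice[i][j] = op;
        }
    }

    // Follow the stored choices from the bottom-right corner back to (0, 0).
    std::vector<EditStep> steps;
    std::size_t i = n, j = m;
    while (i > 0 || j > 0) {
        switch (choice[i][j]) {
            case 'M': --i; --j; break;
            case 'R': steps.push_back(EditStep{'R', i - 1, b[j - 1]}); --i; --j; break;
            case 'D': steps.push_back(EditStep{'D', i - 1, '\0'}); --i; break;
            default:  steps.push_back(EditStep{'I', i, b[j - 1]}); --j; break;
        }
    }
    std::reverse(steps.begin(), steps.end());   // report steps left to right
    return steps;
}
```

The size of the returned vector is the edit distance, and writing the steps to a file is then a simple loop over it. Ties are broken in favour of the diagonal move, so this yields one optimal script, not all of them.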

Related

How to find the row in a matrix which is closest to a test row?

I am trying to come up with a smart way to store my data so that I can easily search, sort and insert.
My data consists of std::pair<vector<int>,string> elements stored in a std::vector, forming a sort of 2D matrix.
Something like this:
Each vector holds a number sequence that corresponds to a string.
My problem is how to test against it.
A test consists of comparing a test vector, which may be the same length as or shorter than the vectors stored in the matrix. How do I find out which string the test vector matches best?
How do I search for that efficiently?
Some ideas on how to solve the problem, and the problems with them:
One idea was to compute partial sums, depending on the length of the test vector, and then see which partial sum best matches the sum of the test vector.
Problem: I am looking for the same pattern, and the same sum could result from a different pattern.
Another idea was to make a copy of the matrix, sort it column-wise, search each index, and keep track of which string matches best.
Problem: This requires sorting all the columns. The matrix is far larger than shown above, with around 1000 columns, and sorting once per search seems too expensive. Insertion also needs to be supported, so something efficient would be appreciated.

The fastest C++ algorithm for string testing against a list of predefined seeds (case insensitive)

I have a list of seed strings, about 100 predefined strings. All strings contain only ASCII characters.
std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};
My app constantly receives a lot of strings, which can contain any characters. I need to check each received line and decide whether it contains any of the seeds. The comparison must be case-insensitive.
I need the fastest possible algorithm to test a received string.
Right now my app uses this algorithm:
std::wstring testedStr;
for (auto & seed : seeds)
{
    if (boost::icontains(testedStr, seed))
    {
        return true;
    }
}
return false;
It works well, but I'm not sure that this is the most efficient way.
How is it possible to implement the algorithm in order to achieve better performance?
This is a Windows app. App receives valid std::wstring strings.
Update
For this task I implemented the Aho-Corasick algorithm. It would be great if someone could review my code, as I do not have much experience with such algorithms. Link to implementation: gist.github.com
If there is a finite number of matching strings, you can construct a tree such that, read from root to leaves, similar strings occupy similar branches.
This is known as a trie (a radix tree is its compressed variant).
For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:
A search for one of the words starts at the root of the tree. Making it to a leaf corresponds to a match with a seed. At every step, the current character of the string must match one of the current node's children. If it does not, you can terminate the search (e.g. you would not consider any words starting with "g" or any words beginning with "cu").
There are various algorithms for constructing, searching and modifying the tree on the fly, but I thought I would give a conceptual overview of the solution rather than a specific one, since I don't know the best algorithm for it.
Conceptually, searching the tree is related to the idea behind radix sort: at each position, a character can take one of a fixed number of values.
This lets you check one word against your word list. Since you're looking for the entries of this word list as substrings of your input string, there's going to be more to it than this.
Edit: As other answers have mentioned, the Aho-Corasick algorithm is a sophisticated algorithm for string matching, consisting of a trie with additional links for taking "shortcuts" through the tree and a different search pattern to accompany this. (As an interesting note, Alfred Aho is also a contributor to the popular compiler textbook, Compilers: Principles, Techniques, and Tools as well as the algorithms textbook, The Design And Analysis Of Computer Algorithms. He is also a former member of Bell Labs. Margaret J. Corasick does not seem to have too much public information on herself.)
You can use the Aho–Corasick algorithm.
It builds a trie/automaton in which some vertices are marked as terminal; reaching one means the string contains a seed.
It is built in O(sum of dictionary word lengths) and gives the answer in O(test string length).
Advantages:
It is specifically designed to work with several dictionary words, and the check time doesn't depend on the number of words (if we ignore cases where it doesn't fit in memory, etc.).
The algorithm is not hard to implement (compared to suffix structures, at least).
You can make it case-insensitive by lowercasing each symbol if it is ASCII (non-ASCII characters can't match anyway).
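A compact sketch of such an automaton is shown below. It is not the implementation linked from the question. The class name SeedMatcher and its layout are assumptions made for illustration, and it relies on the stated condition that all seeds are ASCII: it uses a fixed 128-entry transition table per node and resets to the root on any non-ASCII input character.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

class SeedMatcher {
public:
    explicit SeedMatcher(const std::vector<std::wstring>& seeds) {
        nodes_.emplace_back();                       // node 0 is the root
        for (const auto& s : seeds) insert(s);
        buildLinks();
    }

    // True if text contains any seed, ignoring ASCII case. O(text length).
    bool containsAny(const std::wstring& text) const {
        int state = 0;
        for (wchar_t wc : text) {
            const int c = fold(wc);
            if (c < 0) { state = 0; continue; }      // non-ASCII cannot be part of a seed
            state = nodes_[state].next[c];
            if (nodes_[state].terminal) return true;
        }
        return false;
    }

private:
    struct Node {
        std::array<int, 128> next;
        int fail = 0;
        bool terminal = false;
        Node() { next.fill(-1); }
    };

    // Lowercased ASCII code, or -1 for anything outside ASCII.
    static int fold(wchar_t wc) {
        if (static_cast<unsigned long>(wc) > 127UL) return -1;
        if (wc >= L'A' && wc <= L'Z') return static_cast<int>(wc - L'A' + L'a');
        return static_cast<int>(wc);
    }

    void insert(const std::wstring& seed) {
        if (seed.empty()) return;
        int state = 0;
        for (wchar_t wc : seed) {
            const int c = fold(wc);
            if (c < 0) return;                       // assumption: seeds are ASCII only
            if (nodes_[state].next[c] < 0) {
                const int created = static_cast<int>(nodes_.size());
                nodes_.emplace_back();
                nodes_[state].next[c] = created;
            }
            state = nodes_[state].next[c];
        }
        nodes_[state].terminal = true;
    }

    // Breadth-first pass that fills failure links and completes all transitions.
    void buildLinks() {
        std::queue<int> pending;
        for (int c = 0; c < 128; ++c) {
            const int child = nodes_[0].next[c];
            if (child < 0) nodes_[0].next[c] = 0;    // missing root edge loops to root
            else pending.push(child);                // fail link of depth-1 nodes stays 0
        }
        while (!pending.empty()) {
            const int s = pending.front();
            pending.pop();
            const int f = nodes_[s].fail;
            // A state is terminal if one of its proper suffixes is a seed.
            nodes_[s].terminal = nodes_[s].terminal || nodes_[f].terminal;
            for (int c = 0; c < 128; ++c) {
                const int child = nodes_[s].next[c];
                if (child < 0) { nodes_[s].next[c] = nodes_[f].next[c]; continue; }
                nodes_[child].fail = nodes_[f].next[c];
                pending.push(child);
            }
        }
    }

    std::vector<Node> nodes_;
};
```

Build one SeedMatcher from the seed list at startup and call containsAny on each received string. The dense 128-entry table trades memory for a branch-free transition; with about 100 short seeds that is a few thousand nodes at most.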
You should try a pre-existing regex utility. It may be slower than a hand-rolled algorithm, but regex is about matching multiple possibilities, so it is likely to be several times faster than a hashmap or a simple comparison against all strings. I believe regex implementations may already use the Aho–Corasick algorithm mentioned by RiaD, so you would basically have a well-tested and fast implementation at your disposal.
If you have C++11, you already have a standard regex library:
#include <string>
#include <regex>
// Returns true if input contains any of the seeds, ignoring case.
bool containsSeed(const std::wstring& input) {
    // Compile the pattern once; icase makes matching case-insensitive.
    static const std::wregex seeds(L"google|yahoo|stackoverflow", std::regex_constants::icase);
    // regex_search looks for a match anywhere in the input;
    // regex_match would require the whole input to equal a seed.
    return std::regex_search(input, seeds);
}
I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)
One of the faster ways is to use a suffix tree https://en.wikipedia.org/wiki/Suffix_tree, but this approach has a huge disadvantage: it is a difficult data structure that is difficult to construct. This algorithm builds the tree from a string in linear time: https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm
Edit: As Matthieu M. pointed out, the OP asked if a string contains a keyword. My answer only works when the string equals the keyword or if you can split the string e.g. by the space character.
Especially with a high number of possible candidates that are known at compile time, using a perfect hash function generated with a tool like gperf is worth a try. The main principle is that you feed the generator your seeds and it generates a function containing a hash function that has no collisions among the seed values. At runtime you give the function a string; it calculates the hash and then checks the string against the only possible candidate corresponding to that hash value.
The runtime cost is hashing the string and then comparing against the only possible candidate (O(1) for seed size and O(1) for string length).
To make the comparison case insensitive you just use tolower on the seed and on your string.
Because the number of strings is small (~100), you can use the following algorithm:
Calculate the maximum length of the words you have. Let it be N.
Create an array of checksums, int checks[N];.
Let the checksum be the sum of all characters in the search phrase. You can calculate this checksum for each word in your list (which is known at compile time) and create a std::map<int, std::vector<std::wstring>>, where the int is the checksum and the vector contains all your strings with that checksum.
Create an array of such maps, one for each length (up to N); this can also be done at compile time.
Now move over the big string with a pointer. When the pointer is at character X, add the value of that character to every entry of checks, and from each entry K (K from 1 to N) subtract the value of the character at position X-K. This way the checks array always holds the correct checksum for every length.
Then look up in the map whether any strings exist with that pair (length and checksum), and if so, compare them.
False positives (the checksum and length are equal, but the phrase is not) should be very rare.
So, let R be the length of the big string. Looping over it takes O(R).
At each step you perform N additions and N subtractions of a small number (a char value), which is very fast. At each step you also look up a counter in the checks array, which is O(1) because it is one memory block.
Also at each step you have to find the map in the array of maps, which is also O(1) for the same reason.
Inside the map you have to search for strings with the right checksum in O(log F), where F is the size of the map; an entry will usually contain no more than 2-3 strings, so in general we can treat this as O(1) too.
You can also check whether any strings share a checksum; if none do (which is likely with just 100 words), you can discard the map altogether and store pairs instead.
So in the end this gives O(R) for a fixed N, with quite a small constant.
The way the checksum is calculated can be changed, but this one is simple and fast, with very rare false positives.
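The steps above can be sketched as follows. This is an illustration, not the answerer's code: the names LengthBucket, buildBuckets and containsAnySeed are invented here, running sums are kept only for lengths that actually occur among the seeds (rather than for every length up to N), and ASCII lowercasing is added so the check is case-insensitive as the question requires.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// All seeds of one length, grouped by their character-sum checksum.
struct LengthBucket {
    std::size_t length = 0;
    std::map<long, std::vector<std::wstring>> bySum;
};

wchar_t lowerAscii(wchar_t c) {
    return (c >= L'A' && c <= L'Z') ? static_cast<wchar_t>(c - L'A' + L'a') : c;
}

// Precompute the (length, checksum) -> seeds tables once.
std::vector<LengthBucket> buildBuckets(const std::vector<std::wstring>& seeds) {
    std::map<std::size_t, LengthBucket> byLength;
    for (const auto& seed : seeds) {
        if (seed.empty()) continue;
        std::wstring lowered;
        long sum = 0;
        for (wchar_t c : seed) {
            lowered += lowerAscii(c);
            sum += static_cast<long>(lowerAscii(c));
        }
        LengthBucket& bucket = byLength[seed.size()];
        bucket.length = seed.size();
        bucket.bySum[sum].push_back(lowered);
    }
    std::vector<LengthBucket> out;
    for (const auto& entry : byLength) out.push_back(entry.second);
    return out;
}

bool containsAnySeed(const std::wstring& text, const std::vector<LengthBucket>& buckets) {
    std::vector<long> sums(buckets.size(), 0);   // one sliding-window sum per length
    std::wstring lowered;
    lowered.reserve(text.size());
    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        lowered += lowerAscii(text[pos]);
        for (std::size_t k = 0; k < buckets.size(); ++k) {
            const std::size_t len = buckets[k].length;
            sums[k] += static_cast<long>(lowered[pos]);                        // char enters window
            if (pos >= len) sums[k] -= static_cast<long>(lowered[pos - len]);  // char leaves window
            if (pos + 1 < len) continue;                                       // window not full yet
            const auto hit = buckets[k].bySum.find(sums[k]);
            if (hit == buckets[k].bySum.end()) continue;
            const std::size_t start = pos + 1 - len;
            for (const auto& seed : hit->second) {
                // Equal checksums are only a hint; confirm with a real comparison.
                if (lowered.compare(start, len, seed) == 0) return true;
            }
        }
    }
    return false;
}
```

Because a plain character sum is order-insensitive, any anagram of a seed triggers the confirming comparison; a position-weighted hash in the style of Rabin-Karp would reduce those false positives at a small extra cost per step.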
As a variant on DarioOO’s answer, you could get a possibly faster implementation of a regular expression match, by coding a lex parser for your strings. Though normally used together with yacc, this is a case where lex on its own does the job, and lex parsers are usually very efficient.
This approach might fall down if all your strings are long, as then an algorithm such as Aho-Corasick, Commentz-Walter or Rabin-Karp would probably offer significant improvements, and I doubt that lex implementations use any such algorithm.
This approach is harder if you have to be able to configure the strings without reconfiguration, but since flex is open source you could cannibalise its code.

Record all optimal sequence alignments when calculating Levenshtein distance in Julia

I'm working on the Levenshtein distance with the Wagner–Fischer algorithm in Julia.
It is easy to get the optimal value, but a little hard to get the optimal operation sequence (insertions, deletions and so on) while backtracing from the bottom-right corner of the matrix.
I can record pointer information for each d[i][j], but that may give me up to three directions to go back: to d[i-1][j-1] for substitution, d[i-1][j] for deletion and d[i][j-1] for insertion. So I'm trying to get all combinations of operation sets that yield the optimal Levenshtein distance.
It seems that I can store one operation set in one array, but I don't know the total number of combinations or their lengths, so it is hard to define an array to store the operation sets during enumeration. How can I generate new arrays while keeping the earlier ones? Or should I use a DataFrame?
If you implement the Wagner–Fischer algorithm, at some point you choose the minimum over three alternatives (see the Wikipedia pseudo-code). At this point, you save the chosen alternative in another matrix, using a statement like:
c[i,j] = indmin([d[i-1,j]+1,d[i,j-1]+1,d[i-1,j-1]+cost])
# indmin returns the index of the minimum element in a collection (it is called argmin in Julia 1.0 and later).
# cost is 0 if the two characters match and 1 otherwise.
Now c[i,j] contains 1, 2 or 3 according to deletion, insertion or substitution.
At the end of the calculation, you have the final d matrix element holding the minimum distance; you then follow the c matrix backwards and read the action at each step. Keeping track of i and j lets you read the exact substitution by looking at which element was in string1 at i and in string2 at j at the current step. Keeping a matrix like c (or the full d matrix, from which the choices can be recomputed) cannot be avoided, because otherwise the information about the intermediate choices (made by min) would be lost by the end of the algorithm.
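Since the question asks for all optimal operation sequences rather than one, here is a hedged sketch of that enumeration. It is written in C++ (not Julia) to match the other examples on this page, and the names distanceTable, collect and allOptimalScripts are invented for illustration. Instead of storing a single choice per cell, it keeps the full d table and, while walking back, follows every predecessor that could have produced the cell's value.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Standard Wagner-Fischer table: d[i][j] is the distance between a[0..i) and b[0..j).
Table distanceTable(const std::string& a, const std::string& b) {
    Table d(a.size() + 1, std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 1; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost});
        }
    }
    return d;
}

// Walks back from (i, j). `trail` holds the operations chosen so far, last one first.
void collect(const Table& d, const std::string& a, const std::string& b,
             std::size_t i, std::size_t j, std::string& trail,
             std::vector<std::string>& out) {
    if (i == 0 && j == 0) {
        out.emplace_back(trail.rbegin(), trail.rend());   // store in forward order
        return;
    }
    if (i > 0 && d[i][j] == d[i - 1][j] + 1) {            // deletion is optimal here
        trail.push_back('D');
        collect(d, a, b, i - 1, j, trail, out);
        trail.pop_back();
    }
    if (j > 0 && d[i][j] == d[i][j - 1] + 1) {            // insertion is optimal here
        trail.push_back('I');
        collect(d, a, b, i, j - 1, trail, out);
        trail.pop_back();
    }
    if (i > 0 && j > 0) {
        const bool same = a[i - 1] == b[j - 1];
        if (d[i][j] == d[i - 1][j - 1] + (same ? 0 : 1)) { // match or substitution
            trail.push_back(same ? 'M' : 'S');
            collect(d, a, b, i - 1, j - 1, trail, out);
            trail.pop_back();
        }
    }
}

// Every optimal script as a string over D (delete), I (insert), S (substitute), M (match).
std::vector<std::string> allOptimalScripts(const std::string& a, const std::string& b) {
    const Table d = distanceTable(a, b);
    std::vector<std::string> out;
    std::string trail;
    collect(d, a, b, a.size(), b.size(), trail, out);
    return out;
}
```

The result container simply grows as scripts are found, which sidesteps the problem of not knowing their number or lengths in advance. Be aware that the number of optimal scripts can grow exponentially with the string lengths, so this is only practical for short inputs.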
I'm not sure that I understood your question, but vectors in Julia are dynamic data structures, so you can always grow them with the appropriate function, e.g. push!(), append!() or prepend!(); it is also possible to reshape() the resulting vector into an array of the desired size.
One particular approach for the above case uses a sparse() matrix:
import Base.zero
Base.zero(ASCIIString)=""
module GSparse
  export insertion,deletion,substitude,result
  s=sparse(ASCIIString[])
  deletion(newval::ASCIIString)=begin
    global s
    s.n+=1
    push!(s.colptr,last(s.colptr))
    s[s.m,s.n]=newval
  end
  insertion(newval::ASCIIString)=begin
    global s
    s.m+=1
    s[s.m,s.n]=newval
  end
  substitude(newval::ASCIIString)=begin
    global s
    s.n+=1
    s.m+=1
    push!(s.colptr,last(s.colptr))
    s[s.m,s.n]=newval
  end
  result()=begin
    global s
    ret=zeros(ASCIIString,size(s))
    le=length(s)
    for (i = 1:le)
      ret[le-i+1]=s[i]
    end
    ret
  end
end
using GSparse
insertion("test");
insertion("testo");
insertion("testok");
substitude("1estok");
deletion("1stok");
result()
I like this approach because for large texts you could have many zero elements. Also, I fill the data structure going forward and create the result by reversing.

Compare similarity algorithms

I want to use string similarity functions to find corrupted data in my database.
I came upon several of them:
Jaro,
Jaro-Winkler,
Levenshtein,
Euclidean and
Q-gram,
I wanted to know what the differences between them are and in which situations they work best.
Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we determine if they're numerically comparable.
From Wikipedia, Jaro-Winkler:
In computer science and statistics, the Jaro–Winkler distance
(Winkler, 1990) is a measure of similarity between two strings. It is
a variant of the Jaro distance metric (Jaro, 1989, 1995) and
mainly used in the area of record linkage (duplicate
detection). The higher the Jaro–Winkler distance for two strings is,
the more similar the strings are. The Jaro–Winkler distance metric is
designed and best suited for short strings such as person names. The
score is normalized such that 0 equates to no similarity and 1 is an
exact match.
Levenshtein distance:
In information theory and computer science, the Levenshtein distance
is a string metric for measuring the amount of difference between two
sequences. The term edit distance is often used to refer specifically
to Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum
number of edits needed to transform one string into the other, with
the allowable edit operations being insertion, deletion, or
substitution of a single character. It is named after Vladimir
Levenshtein, who considered this distance in 1965.
Euclidean distance:
In mathematics, the Euclidean distance or Euclidean metric is the
"ordinary" distance between two points that one would measure with a
ruler, and is given by the Pythagorean formula. By using this formula
as distance, Euclidean space (or even any inner product space) becomes
a metric space. The associated norm is called the Euclidean norm.
Older literature refers to the metric as Pythagorean metric.
And Q- or n-gram encoding:
In the fields of computational linguistics and probability, an n-gram
is a contiguous sequence of n items from a given sequence of text or
speech. The items in question can be phonemes, syllables, letters,
words or base pairs according to the application. n-grams are
collected from a text or speech corpus.
The two core advantages of n-gram models (and algorithms that use
them) are relative simplicity and the ability to scale up – by simply
increasing n a model can be used to store more context with a
well-understood space–time tradeoff, enabling small experiments to
scale up very efficiently.
The trouble is that these algorithms solve different problems and differ in how well they apply, whether to solving the longest common subsequence problem in your data or to deriving a usable metric from it. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.
Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: by using checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.
String similarity helps in a lot of different ways. For example:
Google's "did you mean" results are calculated using string similarity.
String similarity is used to correct OCR errors.
String similarity is used to correct keyboard entry errors.
String similarity is used to find the best-matching sequence of two DNAs in bioinformatics.
But one size does not fit all. Every string similarity algorithm is designed for a specific use, though most of them are similar. For example, Levenshtein distance counts how many characters you must change to make two strings equal.
kitten → sitten
Here the distance is one character change. You may give different weights to deletion, insertion and substitution. For example, for OCR errors and keyboard errors some changes get less weight: in OCR some characters look very similar to others, and on a keyboard some characters are very near each other. Bioinformatics string similarity allows a lot of insertions.
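To make the idea of weights concrete, here is a small sketch of an edit distance with configurable costs. The function name weightedEditDistance and its parameters are assumptions for this example; the substitution cost is passed in as a callback so that, say, visually similar OCR characters or neighbouring keys can be made cheaper.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Edit distance where each operation has its own cost.
// substituteCost(x, y) is only consulted when x != y; equal characters cost 0.
double weightedEditDistance(const std::string& a, const std::string& b,
                            double insertCost, double deleteCost,
                            const std::function<double(char, char)>& substituteCost) {
    const std::size_t n = a.size(), m = b.size();
    std::vector<double> prev(m + 1), cur(m + 1);   // two rows are enough for the value
    for (std::size_t j = 0; j <= m; ++j) prev[j] = static_cast<double>(j) * insertCost;
    for (std::size_t i = 1; i <= n; ++i) {
        cur[0] = static_cast<double>(i) * deleteCost;
        for (std::size_t j = 1; j <= m; ++j) {
            const double sub = (a[i - 1] == b[j - 1]) ? 0.0 : substituteCost(a[i - 1], b[j - 1]);
            cur[j] = std::min({prev[j] + deleteCost,      // delete a[i-1]
                               cur[j - 1] + insertCost,   // insert b[j-1]
                               prev[j - 1] + sub});       // match or substitute
        }
        std::swap(prev, cur);
    }
    return prev[m];
}
```

With all costs set to 1 this reduces to the plain Levenshtein distance, so "kitten" and "sitten" are at distance 1. Passing a callback that returns 0.5 for the pair ('k', 's') would halve that particular distance.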
Your second example, the Jaro–Winkler distance metric, "is designed and best suited for short strings such as person names".
Therefore you should keep your problem in mind.
I want to use string similarity functions to find corrupted data in my database.
How is your data corrupted? Is it a user error, similar to a keyboard input error? Or is it similar to OCR errors? Or something else entirely?

How to compute multiple related Levenshtein distances?

Given two strings of equal length, the Levenshtein distance gives the minimum number of transformations necessary to get the second string from the first. However, I'd like to find a way to adjust the algorithm for multiple pairs of strings, given that they were all generated in the same way.
Reading the comments, it appears that this is the problem:
You are given a set of pairs of strings, all the same length; each pair consists of the input to some function and the output of that function. So, for the pair A,B, we know that f(A)=B. The goal is to reverse engineer f() from a large set of A,B pairs.
Using Levenshtein distance on the entire set will, at most, tell you the maximum number of transformations that must take place.
A better start would be Hamming distance (modified to allow multiple characters) or Jaccard similarity to identify how many positions in strings do not change at all for all of the pairs. Then, you are left only with those that do change.
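As a rough illustration of the Hamming-style comparison, the sketch below marks which positions stay unchanged across every pair. The names StringPair, hammingDistance and stablePositions are invented for this example, and it assumes all strings share the given length.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using StringPair = std::pair<std::string, std::string>;   // (input A, output B = f(A))

// Number of positions at which the two strings differ (over their common length).
std::size_t hammingDistance(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t diff = 0;
    for (std::size_t k = 0; k < n; ++k) {
        if (a[k] != b[k]) ++diff;
    }
    return diff;
}

// stable[k] is true only if position k is identical in input and output for every pair.
std::vector<bool> stablePositions(const std::vector<StringPair>& pairs, std::size_t length) {
    std::vector<bool> stable(length, true);
    for (const auto& p : pairs) {
        for (std::size_t k = 0; k < length; ++k) {
            if (k >= p.first.size() || k >= p.second.size() || p.first[k] != p.second[k]) {
                stable[k] = false;
            }
        }
    }
    return stable;
}
```

Positions that remain marked stable can be set aside, leaving only the positions the unknown function actually touches. A pure position-by-position comparison like this reports every position as changed when the content merely slides over, e.g. hammingDistance("ABCDE", "xABCD") is 5.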
This will fail if the letters shift.
To detect a shift, you want to use global alignment (Needleman-Wunsch). You will then see something like "ABCDE"=>"xABCD", showing that the characters moved one position to the right between the input and the output.
Overall, I feel that Levenshtein distance will do very little to help you get at the original algorithm.