Is it always true that the edit distance of two strings is equal to the edit distance of their substrings? - levenshtein-distance

Suppose we have two strings:
ccttgg
gacgct
The edit distance of these two strings is 6.
Possible substrings are:
cctt--
gacg--
Their edit distance is 4.
The remaining parts needed to make up the original two strings are:
----gg
----ct
and their edit distance is 2.
So 4+2=6, that is the original edit distance.
Is this type of assumption always correct?
If it's not, is there a way to compute the edit distance between two strings using the edit distance of their substrings?
Edit: to be clearer my definition of edit distance is the Levenshtein distance with a cost of 1 for insertion, deletion and replace if the characters are not the same and 0 if the characters are equal.
I'm not considering the Damerau distance with transpositions.

No
Counterexample
Consider the strings:
aba
bab
They have an edit distance of 2 by deleting an "a" from the front and adding a "b" to the end.
If these are broken into substrings such as
ab, a
ba, b
then the first substrings have an edit distance of 2 and the second substrings have an edit distance of 1 for a total of 3.
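Both the example in the question and this counterexample can be checked directly with a minimal Wagner-Fischer sketch (a rolling-array Levenshtein implementation; the function name is made up):

```python
def levenshtein(a, b):
    # Classic Wagner-Fischer DP: after processing row i, d[j] holds the
    # edit distance between a[:i] and b[:j]; cost 1 per insert/delete/
    # substitute, 0 when the characters match.
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,             # delete ca
                                   d[j - 1] + 1,         # insert cb
                                   prev + (ca != cb))    # substitute/match
    return d[len(b)]

# Question's example: the split happens to be additive.
print(levenshtein("ccttgg", "gacgct"))                       # 6
print(levenshtein("cctt", "gacg") + levenshtein("gg", "ct")) # 4 + 2 = 6

# Counterexample: the split overestimates the true distance.
print(levenshtein("aba", "bab"))                             # 2
print(levenshtein("ab", "ba") + levenshtein("a", "b"))       # 2 + 1 = 3
```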

Related

Fastest algorithm to compute average distance between two sets of points

Please see the picture: given the set of points marked in red, I take two consecutive points (labelled 0 and 1 here; these numbers are just for illustration, not indices into the array holding the points).
I take their midpoint. From the midpoint I drop a normal onto each segment in the green set (a segment being the line between two consecutive points).
The blue line is such a normal. Its intersection point falls between points 10 and 11, so I record its length.
The black normal line, however, is a normal on the line through points 12 and 13, but the intersection does not fall between 12 and 13, so I reject it.
I want the median of the lengths of all such accepted lines, measured from the midpoints of the segments in the red set.
My brute-force algorithm runs in O(MN) time.
My questions:
Is there a standard algorithm for what I am seeking? That is to say, I do not know if the parameter I am measuring has a common name.
What is the fastest method of measuring it?
I would love to do some parallel processing, but I am using D, and I am getting:
"core.thread.threadbase.ThreadError#src/core/thread/threadbase.d(1219): Error creating thread"
Thank you.
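For reference, the accept/reject test described in the question (drop a perpendicular from a midpoint onto a segment and keep it only if its foot lies between the segment's endpoints) can be sketched as follows; the function and variable names are made up:

```python
import math

def normal_length(p, a, b):
    """Length of the perpendicular from point p onto segment a-b,
    or None if the foot of the perpendicular falls outside the segment."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return None  # degenerate (zero-length) segment
    # Parameter of the projection along a-b: 0 at a, 1 at b.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    if t < 0 or t > 1:
        return None  # foot outside the segment: reject, like the black line
    foot = (ax + t * dx, ay + t * dy)
    return math.hypot(px - foot[0], py - foot[1])

# A midpoint of a red segment, tested against one green segment:
print(normal_length((1.0, 2.0), (0.0, 0.0), (4.0, 0.0)))  # 2.0 (accepted)
print(normal_length((9.0, 2.0), (0.0, 0.0), (4.0, 0.0)))  # None (rejected)
```

The brute-force loop would call this for every (midpoint, green segment) pair and take the median of the non-None results.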

Extracting operations from Damerau-Levenshtein

The Damerau-Levenshtein distance tells you the number of additions, deletions, substitutions and transpositions between two words (the latter is what differentiates DL from plain Levenshtein distance).
The algorithm is on Wikipedia and is relatively straightforward. However, I want more than just the distance; I want the actual operations.
E.g. a function that takes AABBCC, compares it to ABZ, and returns:
Remove A at index 0 -> ABBCC
Remove B at index 2 -> ABCC
Remove C at index 4 -> ABC
Substitute C at index 5 for Z -> ABZ
(ignore how the indices are affected by removals for now)
It seems you can do something with the matrix produced by the DL calculation. This site produces the output above. The text below says you should walk from the bottom right of the matrix, following each lowest cost operation in each cell (follow the bold cells):
If Delete is lowest cost, go up one cell
For Insert, go left one cell
Otherwise for Substitute, Transpose or Equality go up and left
It seems to prioritise equality or substitution over anything else if there's a tie, so in the example I provided, when the bottom-right cell is 4 for both substitution and removal, it picks substitution.
However once it reaches the top left cell, equality is the lowest scoring operation, with 0. But it has picked deletion, with score 2.
This seems to be the right answer, because if you strictly pick the lowest score, you end up with too many As at the start of the string.
But what's the real steps for picking an operation, if not lowest score? Are there other ways to pick operations out of a DL matrix, and if so do you have a reference?
I missed a vital part of fuzzy-string's explanation of how to reconstruct the operations:
But when you want to see the simplest path, it is determined by working backwards from bottom-right to top-left, following the direction of the minimum Change in each cell. (If the top or left edge is reached before the top-left cell, then the type of Change in the remaining cells is overwritten, with Inserts or Deletes respectively.)
...which explains why the equality operation in cell [1,1] is ignored and the delete is used instead!
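A rule that avoids the "lowest score" trap entirely is: accept a predecessor cell only if its value plus the cost of the corresponding operation equals the current cell's value, i.e. only follow a neighbour that actually explains how the current cell was produced. A minimal sketch for plain Levenshtein (transpositions omitted for brevity; Damerau-Levenshtein adds one more case, and the function name is made up):

```python
def edit_ops(src, dst):
    # Build the standard Levenshtein matrix.
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (src[i-1] != dst[j-1]))
    # Walk back from the bottom-right cell, following only predecessors
    # whose value plus the operation's cost equals the current value.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (src[i-1] != dst[j-1]):
            if src[i-1] != dst[j-1]:
                ops.append(f"Substitute {src[i-1]} at index {i-1} with {dst[j-1]}")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            ops.append(f"Remove {src[i-1]} at index {i-1}")
            i -= 1
        else:
            ops.append(f"Insert {dst[j-1]} at index {j-1}")
            j -= 1
    return list(reversed(ops))

for op in edit_ops("AABBCC", "ABZ"):
    print(op)
```

On the AABBCC/ABZ example this reproduces the operation list from the question: three removals followed by the C-to-Z substitution. The equality cell at [1,1] is never even considered, because the walk only visits cells that lie on a genuine optimal path.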

"Drawing" Numbers on a grid using a "pencil"

How do I approach this problem? The instructions are as follows:
Imagine a square grid. Now "draw" numbers (0-9) using the following commands:
U - draw a line one square to the top
D - draw a line one square to the bottom
L - draw a line one square to the left
R - draw a line one square to the right
^ - lift the "pencil" off the "paper"
_ - put the "pencil" on the "paper"
Input:
first row: int N, represents the amount of numbers to check
the next N rows consist of a string which determines the order of commands
Example:
2
UL^D_RDLR^U
D^LLDRR_U
Output:
3 1
outputs a row of numbers, separated by one space.
I hope I explained it well enough (English is not my first language).
Here's one possible approach.
Convert a sequence of commands to a sequence of line segments. Have a precomputed array of line segments for each digit you must recognize. (Bear in mind that 6 and 9 can be represented in two different ways!)
Now invent a way to compare two arrays of line segments, given that
the order of segments in a picture doesn't matter
segment direction doesn't matter
absolute coordinates don't matter, but relative coordinates do
number of times a segment is drawn doesn't matter if it's greater than zero
When one needs to compare two values and some aspects of these values don't matter, the common strategy is to transform both values such that these aspects are in their canonical form. For example, to compare two strings when case of the characters doesn't matter, one might transform both strings to the upper case, which would be a canonical form for the purpose of case-insensitive comparison. Your task is to come up with a canonical form for each of the enumerated things that don't matter.
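A minimal sketch of such a canonical form, assuming the pen starts on the paper (as the example inputs suggest) and using a made-up function name: each drawn unit segment is stored with its endpoints sorted (direction doesn't matter) in a set (order and repetition don't matter), and the whole picture is translated so its minimum coordinates are (0, 0) (absolute position doesn't matter):

```python
def draw(commands):
    """Convert a command string into a canonical set of drawn segments."""
    moves = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
    x = y = 0
    pen_down = True  # assumption: the pen starts on the paper
    segments = set()
    for c in commands:
        if c == '^':
            pen_down = False
        elif c == '_':
            pen_down = True
        else:
            nx, ny = x + moves[c][0], y + moves[c][1]
            if pen_down:
                # Sorted endpoints in a set: direction, order and
                # repetition all become irrelevant.
                segments.add(tuple(sorted([(x, y), (nx, ny)])))
            x, y = nx, ny
    if not segments:
        return frozenset()
    # Translate so the minimum coordinates become (0, 0).
    minx = min(px for seg in segments for px, _ in seg)
    miny = min(py for seg in segments for _, py in seg)
    return frozenset(tuple((px - minx, py - miny) for px, py in seg)
                     for seg in segments)

# The same L-shape drawn in different orders, directions and places:
print(draw("UR") == draw("^RU_LD"))  # True
print(draw("UR") == draw("UDUR"))    # True (a segment drawn twice)
```

Recognition is then set equality against a precomputed canonical picture per digit (with two reference pictures each for 6 and 9).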

finding the min number of moves by greedy algorithm

There is a 2-D grid which contains chocolates in random cells. In one move, I can take all the chocolates contained in one row or in one column. What is the minimum number of moves required to take all the chocolates?
Example: cells containing chocolates are:
0 0
1 1
2 2
min. no. of moves = 3
0 0
1 0
0 1
min. no. of moves = 2
I guess there is a greedy-algorithm solution to this problem, but how should I approach it?
I think this is a variant of the classic Set Cover problem, which is proved to be NP-hard.
Therefore, a greedy algorithm could only get an approximation, not the optimal solution.
For this specific problem, it is not NP-hard. It can be solved in polynomial time. The solution is as below:
Transform the 2D grid into a bipartite graph. The left side contains nodes representing rows and the right side contains nodes representing columns. For each cell containing chocolate, say with coordinates (x, y), add an edge linking row_x and column_y. After the bipartite graph is established, use the Hungarian algorithm to get the maximum matching. Since in a bipartite graph the size of a maximum matching equals that of a minimum vertex cover (Kőnig's theorem), the answer is exactly what you want.
The Hungarian algorithm runs in O(V*E) time. For more details, please refer to Hungarian algorithm.
For more information about bipartite graphs, please refer to Bipartite graph, where you can also find why maximum matching equals minimum vertex cover.
PS: It's neither a greedy problem nor a dynamic programming problem. It is a graph problem.
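A minimal sketch of the matching step, assuming the cells are given as (row, column) pairs and using Kuhn's augmenting-path formulation of the Hungarian algorithm (function names are made up):

```python
def min_moves(cells):
    """Minimum number of rows+columns covering all chocolate cells.
    By König's theorem this equals the maximum matching in the
    row/column bipartite graph; Kuhn's algorithm, O(V*E)."""
    rows = sorted({x for x, _ in cells})
    adj = {r: [y for x, y in cells if x == r] for r in rows}
    match = {}  # column -> row it is currently matched to

    def try_augment(r, seen):
        # Try to match row r, rerouting already-matched columns if needed.
        for c in adj[r]:
            if c in seen:
                continue
            seen.add(c)
            if c not in match or try_augment(match[c], seen):
                match[c] = r
                return True
        return False

    return sum(try_augment(r, set()) for r in rows)

# The two examples from the question:
print(min_moves([(0, 0), (1, 1), (2, 2)]))  # 3
print(min_moves([(0, 0), (1, 0), (0, 1)]))  # 2
```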

Finding the angle which is located in most intervals of angles

I've got angle intervals (in radians) within [0,2π),
for example the intervals [(2π)/3,(3π)/4], [π/2,π], etc.,
but there may also be an interval that wraps around past 0, such as [(3π)/2,π/3].
I have to find the angle which is located in most intervals.
What's the best way to find it in C++?
How can I represent the angle intervals?
You could implement a simple sweep-line algorithm to solve this problem.
For each interval, add the start and end of the interval to a vector; sort this vector, then iterate through it. If you have any intervals that cross the 2π boundary, simply split each of them into two intervals, both inside [0, 2π).
As you iterate through the list, keep track of how many overlapping intervals there are at the current point, and what the best angle you've seen so far has been (and how many intervals were overlapping at that angle). Once you reach the end, you know what the optimum angle is.
If you need more than one angle, you can rather easily adapt this approach to remember intervals with maximal overlap, rather than single angles.
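A minimal Python sketch of this sweep, with wrapping intervals split at the 2π boundary as described (the function name is made up; starts are processed before ends at equal angles so abutting intervals count their shared endpoint as inside both):

```python
import math

TWO_PI = 2 * math.pi

def busiest_angle(intervals):
    """Return (angle, count) for a point covered by the most intervals.
    Wrapping intervals (start > end) are split into two pieces first."""
    events = []  # (angle, +1 for interval start / -1 for interval end)
    for start, end in intervals:
        if start > end:  # crosses the 2*pi boundary: split it
            events += [(start, +1), (TWO_PI, -1), (0.0, +1), (end, -1)]
        else:
            events += [(start, +1), (end, -1)]
    # Sort by angle; at equal angles, starts (+1) come before ends (-1).
    events.sort(key=lambda e: (e[0], -e[1]))
    best_angle, best_count, count = 0.0, 0, 0
    for angle, delta in events:
        count += delta
        if count > best_count:
            best_angle, best_count = angle, count
    return best_angle, best_count

# The intervals from the question, including the wrapping one:
print(busiest_angle([(2 * math.pi / 3, 3 * math.pi / 4),
                     (math.pi / 2, math.pi),
                     (3 * math.pi / 2, math.pi / 3)]))
# angle 2*pi/3 lies in two intervals
```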
I'd do it by maintaining a partition of [0, 2π] into ranges corresponding to interval coverage, with a count for each range. First, here's how the algorithm would work under the condition that none of the intervals crosses 0 (or 2π). The intervals are also assumed to be normalized as follows: if an interval ends at 0, it is changed to end at 2π; if it starts at 2π, it is changed to start at 0.
create a list of (range, count) pairs, initialized with a single range [0, 2π] and a count of 0. (The list will be ordered by the start of the range. The ranges in the list will only overlap at their endpoints and will always cover [0, 2π]).
process each interval as described below
scan the list for a (range, count) pair with the highest count. Resolve ties arbitrarily. Return an arbitrary angle within the range.
To process an interval i:
Find the (range, count) pair (call it s) whose range contains i.start. (Note that if i.start is the end of one range, then it is also the start of another; this picks the pair for which it is the start.)
Find the (range, count) pair e whose range contains i.end. (Note that if i.end is the start of one range, then it is also the end of another; this picks the pair for which it is the end.)
If i.start > s.range.start (i.range starts in the interior of s), split s into two (range, count) pairs s1 = ([s.range.start, i.start], s.count) and s2 = ([i.start, s.range.end], s.count). Replace s in the list with s1 and s2 (in that order).
If i.end < e.range.end, replace e in a manner parallel to the previous step, using i.end to do the split.
For each pair from s (or s2 if s was split) up to and including e (or e1 if e was split), add 1 to the count.
If you don't care to keep track of the actual number of intervals that contain a particular angle, just that it's the maximum, the bookkeeping for intervals that cross 0 (or 2π) is easier: just take the complement of the interval (reverse the start and end) and subtract one from the counts in step 5 instead of adding. If you do need the absolute counts, then do the complement trick and then add 1 to every count on the list.
The above will not deal correctly with intervals that abut (e.g.: [0, π/3] and [π/3, π]; or [2π/3, 2π] and [0, 1]). In those cases, as I understand it, the angle at which they abut (π/3 or 0) should be counted as being in two intervals. The above algorithm could be tweaked so that when an interval start coincides with a range end point, a new (range, count) pair is inserted after the pair in question; the new pair would have a single-angle range (that is, range.start == range.end). A similar procedure would apply for the range that starts at the end of an interval. I think that with those tweaks the above algorithm correctly handles all cases.
My solution would involve a list of pairs of start of the interval and how many intervals overlap it:
1 2 3 2 1
|---------|--------|-----|---------------|------|
|------------------|
|--------------|
|---------------------|
|----------------------------|
So, sort all the start and end points and traverse the list, keeping a running count of how many intervals are currently open (increment at a start point, decrement at an end point). Then take the maximum of the overlap counts.
I think you'll run into weird edge cases if you don't do this symbolically.
Your angular ranges are not only not exactly representable as binary fractions (introducing rounding errors), they're irrational. (Pi is greater than 3.14159265358 but less than 3.14159265359; how do you say that an angle equals Pi/2 other than symbolically?)
The most robust way I see to do it is to take all combinations of intervals in turn, determine their intersection, and see which of these combined intervals are the result of the intersection of the most individual intervals.
This also has a bonus of giving you not just one, but all of the angles that satisfy your condition.