Extracting operations from Damerau-Levenshtein - levenshtein-distance

The Damerau-Levenshtein distance tells you the number of additions, deletions, substitutions and transpositions between two words (the latter is what differentiates DL from Levenshtein distance).
The algorithm is on Wikipedia and is relatively straightforward. However, I want more than just the distance; I want the actual operations.
E.g. a function that takes AABBCC, compares it to ABZ, and returns:
Remove A at index 0 -> ABBCC
Remove B at index 2 -> ABCC
Remove C at index 4 -> ABC
Substitute C at index 5 for Z -> ABZ
(ignore how the indices are affected by removals for now)
It seems you can do something with the matrix produced by the DL calculation. This site produces the output above. The text below says you should walk from the bottom right of the matrix, following each lowest cost operation in each cell (follow the bold cells):
If Delete is lowest cost, go up one cell
For Insert, go left one cell
Otherwise for Substitute, Transpose or Equality go up and left
It seems to prioritise equality or substitution over anything else if there's a tie, so in the example I provided, when the bottom-right cell is 4 for both substitution and removal, it picks substitution.
However, once it reaches the top-left cell, equality is the lowest-scoring operation, with 0. But it has picked deletion, with score 2.
This seems to be the right answer, because if you strictly pick the lowest score, you end up with too many As at the start of the string.
But what are the real steps for picking an operation, if not the lowest score? Are there other ways to pick operations out of a DL matrix, and if so, do you have a reference?

I missed a vital part of fuzzy-string's explanation of how to reconstruct the operations:
But when you want to see the simplest path, it is determined by working backwards from bottom-right to top-left, following the direction of the minimum Change in each cell. (If the top or left edge is reached before the top-left cell, then the type of Change in the remaining cells is overwritten, with Inserts or Deletes respectively.)
...which explains why the equality operation in cell [1,1] is ignored and the delete is used instead!
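The backtrace described above can be sketched roughly as follows. This is only an illustration, assuming the restricted (optimal string alignment) variant of Damerau-Levenshtein, with ties broken in favour of match/substitute as the quoted site appears to do; the function and operation strings are my own naming, not from any library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Build the restricted Damerau-Levenshtein matrix for a -> b.
static std::vector<std::vector<int>> dlMatrix(const std::string& a, const std::string& b) {
    size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (size_t i = 0; i <= n; ++i) d[i][0] = (int)i;
    for (size_t j = 0; j <= m; ++j) d[0][j] = (int)j;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,         // delete a[i-1]
                                d[i][j - 1] + 1,         // insert b[j-1]
                                d[i - 1][j - 1] + cost}); // match / substitute
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1); // transpose
        }
    return d;
}

// Walk back from the bottom-right cell, preferring match/substitute on ties.
std::vector<std::string> dlOps(const std::string& a, const std::string& b) {
    auto d = dlMatrix(a, b);
    std::vector<std::string> ops;
    size_t i = a.size(), j = b.size();
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] == d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)) {
            if (a[i - 1] != b[j - 1])
                ops.push_back("substitute " + std::string(1, b[j - 1]) +
                              " at index " + std::to_string(i - 1));
            --i; --j;                                    // match or substitute
        } else if (i > 1 && j > 1 && a[i - 1] == b[j - 2] &&
                   a[i - 2] == b[j - 1] && d[i][j] == d[i - 2][j - 2] + 1) {
            ops.push_back("transpose at index " + std::to_string(i - 2));
            i -= 2; j -= 2;
        } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) { // up one cell
            ops.push_back("delete " + std::string(1, a[i - 1]) +
                          " at index " + std::to_string(i - 1));
            --i;
        } else {                                          // left one cell
            ops.push_back("insert " + std::string(1, b[j - 1]) +
                          " at index " + std::to_string(i));
            --j;
        }
    }
    std::reverse(ops.begin(), ops.end());
    return ops;
}
```

For AABBCC vs ABZ this walk reproduces the four operations from the example above, including the deletion at the left edge.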

Related

Explore a matrix with undefined size

I am trying to explore an environment by modelling it with a 2-dimensional matrix. However, I don't know the size of the matrix beforehand.
Currently, I am using a std::vector<std::vector<...>> structure to abstract the matrix and resize it to a certain size. If my application reaches the limit of my original resize, I do that operation again.
I am exploring this matrix with a combination of DFS and A* algorithms. My explorer agent can move forward, backward, left and right. Every time the explorer reaches a position, he adds the neighbors to the stack of DFS. For example, if he is at position (25, 25), it will add the neighbors (25,24), (25, 26), (24, 25) and (26, 25).
So far, it has worked properly. However, there is a scenario that I did not think of. I was always testing my algorithm with the explorer beginning at a corner of the matrix, which behaves great. But if the explorer starts in the middle of the room, or at any other position that is not a corner, my algorithm does not work properly.
That happens because I start my explorer at position 0,0 in the matrix. Therefore, if the explorer begins in the middle of the room, some positions would not be explored, because they would generate negative indexes for my explorer. Does anyone have any idea what I can do to solve this?
One way is to simplify it like you said and force it to start from a corner.
The more complicated way would be to, whenever you encounter an index that WOULD be negative, resize the array and shift all previously generated indexes to force them positive. For performance, do it in large chunks, like simply adding 10 or 100 to everything.
So you add a check for negative numbers when you go to add neighbors and if any of them are negative you apply the same addition to all indexes you've generated so far to force every index positive.
It's just an imaginary coordinate system; the important part is the relative positions. At the end, decide which cell should be 0,0 and subtract its x,y from ALL indexes to normalize the vector back.
Also, as a performance concern: if you start from a large enough positive number, you may be able to reduce or eliminate the need for this coordinate-map shifting until the very end. For example, if you start from 100,100 then you would need to travel 100 nodes before you went negative. If there were fewer than 100 nodes in any direction, you wouldn't have to translate until you've completed mapping.

Maze least turns

I have a problem that I can't solve. I have a maze and I have to find a path from a point S to a point E which has the fewest turns. It is known that E is reachable. I can move only in 4 directions: left, right, up and down. It doesn't have to be the shortest path, just the one with the fewest turns.
I tried to store the number of turns in a priority queue. For example, when I reach a certain place in the maze I add the number of turns taken to get there. From there I add its neighbours to the priority queue (if they weren't already visited and aren't walls) with the value t + x of the current block, where x is 0 if the neighbour continues in the same direction I was facing when I got near it, or 1 if it is in a different direction. It seems this approach doesn't work for every case.
I would appreciate it if somebody could offer me some hints, without any code.
You are on the right track. What you need to implement for this problem is Dijkstra's algorithm. You just need to consider not just points as graph vertices, but pairs of (point, direction). From every vertex (p,d) you have 5 edges (although the last one can be blocked by a wall): (p,0), (p,1), (p,2), (p,3), and (neighbour of p in direction d, d). The first four edges have weight 1 (you turn there), and the last one has weight 0 (no turn, just a move forward). The algorithm handles loops and edges of weight 0 just fine. You should stop when any vertex (end point, _) is extracted from the priority queue.
This method has one issue: too many vertices are inspected in the process. If your maze is small, that's not a problem. Otherwise, consider a slight modification known as A*. You need a good heuristic function describing a lower bound on the number of turns to the goal.
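The (point, direction) construction above can be sketched like this. It's a minimal illustration, assuming a grid of strings with '#' for walls and any other character walkable; the function name and encoding are my own, not from the question.

```cpp
#include <cassert>
#include <climits>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Minimum number of turns from (sr,sc) to (er,ec) in a grid maze ('#' = wall).
// States are (row, col, direction); stepping forward costs 0, turning costs 1.
int leastTurns(const std::vector<std::string>& maze,
               int sr, int sc, int er, int ec) {
    const int R = (int)maze.size(), C = (int)maze[0].size();
    const int dr[4] = {-1, 1, 0, 0}, dc[4] = {0, 0, -1, 1};
    std::vector<std::vector<std::vector<int>>> dist(
        R, std::vector<std::vector<int>>(C, std::vector<int>(4, INT_MAX)));
    using State = std::tuple<int, int, int, int>;   // (turns, r, c, dir)
    std::priority_queue<State, std::vector<State>, std::greater<State>> pq;
    for (int d = 0; d < 4; ++d) {                   // the starting direction is free
        dist[sr][sc][d] = 0;
        pq.emplace(0, sr, sc, d);
    }
    while (!pq.empty()) {
        auto [t, r, c, d] = pq.top(); pq.pop();
        if (t > dist[r][c][d]) continue;            // stale queue entry
        if (r == er && c == ec) return t;
        for (int nd = 0; nd < 4; ++nd)              // turn in place, cost 1
            if (nd != d && t + 1 < dist[r][c][nd]) {
                dist[r][c][nd] = t + 1;
                pq.emplace(t + 1, r, c, nd);
            }
        int nr = r + dr[d], nc = c + dc[d];         // step forward, cost 0
        if (nr >= 0 && nr < R && nc >= 0 && nc < C && maze[nr][nc] != '#' &&
            t < dist[nr][nc][d]) {
            dist[nr][nc][d] = t;
            pq.emplace(t, nr, nc, d);
        }
    }
    return -1;                                      // E unreachable
}
```

Since the only weights are 0 and 1, a deque-based 0-1 BFS would also work here and avoids the priority queue's log factor.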

Finding the angle which is located in most intervals of angles

I've got angle intervals (in radians) within [0, 2π),
for example the intervals [(2π)/3, (3π)/4], [π/2, π], etc.
But there may also be an interval like [(3π)/2, π/3] that wraps across 0.
I have to find the angle which is located in most intervals.
What's the best way to find it in C++?
How can I represent the angle intervals?
You could implement a simple sweep-line algorithm to solve this problem.
For each interval, add the start and the end of the interval to a vector; sort this vector, then iterate through it. If you have any intervals that cross the 2π boundary, simply split them into two intervals that both lie inside [0, 2π).
As you iterate through the list, keep track of how many overlapping intervals there are at the current point, and of the best angle you've seen so far (and how many intervals overlapped at that angle). Once you reach the end, you know the optimal angle.
If you need more than one angle, you can rather easily adapt this approach to remember intervals with maximal overlap, rather than single angles.
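The sweep described above can be sketched as follows. This is an illustrative version (the function name is mine) that represents each interval as a (start, end) pair of doubles, splits wrapping intervals, and processes starts before ends at equal angles so touching intervals count as overlapping at the touch point.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// An angle covered by the most intervals on the circle [0, 2*pi).
// Intervals are (start, end) pairs; those that wrap across 0 are split.
double busiestAngle(const std::vector<std::pair<double, double>>& intervals) {
    const double kTwoPi = 2.0 * std::acos(-1.0);
    std::vector<std::pair<double, int>> events;   // (angle, +1 start / -1 end)
    for (const auto& iv : intervals) {
        if (iv.first <= iv.second) {
            events.push_back({iv.first, +1});
            events.push_back({iv.second, -1});
        } else {                                  // wraps across 0: split in two
            events.push_back({iv.first, +1});
            events.push_back({kTwoPi, -1});
            events.push_back({0.0, +1});
            events.push_back({iv.second, -1});
        }
    }
    // Sort by angle; at equal angles, starts come before ends so that
    // intervals that merely touch still overlap at the touch point.
    std::sort(events.begin(), events.end(),
              [](const std::pair<double, int>& a, const std::pair<double, int>& b) {
                  return a.first != b.first ? a.first < b.first : a.second > b.second;
              });
    int depth = 0, best = -1;
    double bestAngle = 0.0;
    for (const auto& e : events) {
        depth += e.second;                        // running overlap count
        if (e.second == +1 && depth > best) { best = depth; bestAngle = e.first; }
    }
    return bestAngle;
}
```

As noted in the last answer below, comparing floating-point multiples of π exactly is fragile; if exact boundary behaviour matters, do the sweep on a symbolic or rational representation of the endpoints instead.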
I'd do it by maintaining a partition of [0, 2π] into ranges corresponding to interval coverage, with a count for each range. First, here's how the algorithm would work under the condition that none of the intervals crosses 0 (or 2π). The intervals are also assumed to be normalized as follows: if an interval ends at 0, it is changed to end at 2π; if it starts at 2π, it is changed to start at 0.
create a list of (range, count) pairs, initialized with a single range [0, 2π] and a count of 0. (The list will be ordered by the start of the range. The ranges in the list will only overlap at their endpoints and will always cover [0, 2π]).
process each interval as described below
scan the list for a (range, count) pair with the highest count. Resolve ties arbitrarily. Return an arbitrary angle within the range.
To process an interval i:
Find the last (range, count) pair (call it s) for which s.range.start <= i.start (i.e., the range contains i.start). (Note that if i.start is the end of one range, then it will be the start of another; this picks the pair for which it is the start.)
Find the first (range, count) pair e for which i.end <= e.range.end. (Note that if i.end is the start of one range, then it will be the end of another; this picks the pair for which it is the end.)
If i.start > s.range.start (i.range starts in the interior of s), split s into two (range, count) pairs s1 = ([s.range.start, i.start], s.count) and s2 = ([i.start, s.range.end], s.count). Replace s in the list with s1 and s2 (in that order).
If i.end < e.range.end, replace e in a manner parallel to the previous step, using i.end to do the split.
For each pair from s (or s2 if s was split in step 3) up to and including e (or e1 if e was split in step 4), add 1 to the count.
If you don't care to keep track of the actual number of intervals that contain a particular angle, just that it's the maximum, the bookkeeping for intervals that cross 0 (or 2π) is easier: just take the complement of the interval (reverse the start and end) and subtract one from the counts in step 5 instead of adding. If you do need the absolute counts, then do the complement trick and then add 1 to every count on the list.
The above will not deal correctly with intervals that abut (e.g.: [0, π/3] and [π/3, π]; or [2π/3, 2π] and [0, 1]). In those cases, as I understand it, the angle at which they abut (π/3 or 0) should be counted as being in two intervals. The above algorithm could be tweaked so that when an interval start coincides with a range end point, a new (range, count) pair is inserted after the pair in question; the new pair would have a single-angle range (that is, range.start == range.end). A similar procedure would apply for the range that starts at the end of an interval. I think that with those tweaks the above algorithm correctly handles all cases.
My solution would involve a list of pairs of start of the interval and how many intervals overlap it:
1 2 3 2 1
|---------|--------|-----|---------------|------|
|------------------|
|--------------|
|---------------------|
|----------------------------|
So, sort all the start and end points and traverse the list, assigning each new segment the count of intervals that overlap it (increasing the count at a start point, decreasing it at an end point). Then take the maximum of the overlap counts.
I think you'll run into weird edge cases if you don't do this symbolically.
Your angular ranges are not only not exactly representable as binary fractions (introducing rounding errors), they're irrational. (Pi is greater than 3.14159265358 but less than 3.14159265359; how do you say that an angle equals Pi/2 other than symbolically?)
The most robust way I see to do it is to take all combinations of intervals in turn, determine their intersection, and see which of these combined intervals are the result of the intersection of the most individual intervals.
This also has a bonus of giving you not just one, but all of the angles that satisfy your condition.

How to find the nonidentical elements from multiple vectors?

Given several vectors/sets, each of which contains multiple integers that are distinct within that vector: I want to check whether there exists a set composed by extracting exactly one element from each given vector/set such that the extracted numbers are pairwise distinct.
For example, given sets a, b, c, d as:
a <- (1,3,5);
b <- (3,6,8);
c <- (2,3,4);
d <- (2,4,6)
I can find out sets like (1, 8, 4, 6) or (3, 6, 2, 4) ..... actually, I only need to find out one such set to prove the existence.
Applying brute-force search, there can be up to m^k combinations to check, where m is the size of the given sets and k is the number of given sets.
Are there any cleverer ways?
Thank you!
You can reformulate your problem as a matching in a bipartite graph:
the nodes on the left side are your sets,
the nodes on the right side are the integers appearing in the sets.
There is an edge between a "set" node and an "integer" node if the set contains that integer. Then you are trying to find a matching in this bipartite graph: each set will be associated with one integer, and no integer will be used twice. The running time of a simple algorithm for finding such a matching is O(|V||E|); here |V| is smaller than (m+1)k and |E| is equal to mk. So you have a solution in O(m^2 k^2). See: Matching in bipartite graphs.
Algorithm for bipartite matching:
The algorithm works on oriented graphs. At the beginning, all edges are oriented from left to right. Two nodes are matched if the edge between them is oriented from right to left, so at the beginning the matching is empty. The goal of the algorithm is to find "augmenting paths" (or alternating paths), i.e. paths that increase the size of the matching.
An augmenting path is a path in the directed graph starting from an unmatched left node and ending at an unmatched right node. Once you have an augmenting path, you just have to flip all the edges along the path to increment the size of the matching. (The size of the matching increases because you end up with one more edge belonging to the matching. It is called an alternating path because the path alternates between edges not belonging to the matching, left to right, and edges belonging to the matching, right to left.)
Here is how you find an augmenting path:
1. All the nodes are marked as unvisited.
2. Pick an unvisited and unmatched left node.
3. Do a depth-first search until you find an unmatched right node (then you have an augmenting path). If you cannot find an unmatched right node, go to 2.
If you cannot find an augmenting path, then the matching is optimal.
Finding an augmenting path is of complexity O(|E|), and you do this at most min(k, m) times, since the size of best matching is bounded by k and m. So for your problem, the complexity will be O(mk min(m, k)).
You can also see this reference, section 1., for a more complete explanation with proofs.
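The augmenting-path algorithm above can be sketched as follows (this is the standard Kuhn-style formulation; the function name and the choice to return one representative per set are mine, not from the answer).

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Augmenting-path bipartite matching: match each set (left node) to a
// distinct integer (right node) it contains, if possible.
// Returns one chosen integer per set, or an empty vector if impossible.
std::vector<int> distinctRepresentatives(const std::vector<std::vector<int>>& sets) {
    std::map<int, int> id;                          // integer -> right-node index
    for (const auto& s : sets)
        for (int v : s)
            id.emplace(v, (int)id.size());
    std::vector<int> matchRight(id.size(), -1);     // right node -> left node
    std::vector<char> visited;

    // Try to find an augmenting path starting from left node u.
    std::function<bool(int)> tryAugment = [&](int u) -> bool {
        for (int v : sets[u]) {
            int r = id[v];
            if (visited[r]) continue;
            visited[r] = 1;
            // Either r is free, or its current owner can be re-matched elsewhere.
            if (matchRight[r] == -1 || tryAugment(matchRight[r])) {
                matchRight[r] = u;
                return true;
            }
        }
        return false;
    };

    for (int u = 0; u < (int)sets.size(); ++u) {
        visited.assign(id.size(), 0);
        if (!tryAugment(u)) return {};              // no perfect matching exists
    }
    std::vector<int> pick(sets.size());
    for (const auto& [value, r] : id)
        if (matchRight[r] != -1) pick[matchRight[r]] = value;
    return pick;
}
```

On the question's example a=(1,3,5), b=(3,6,8), c=(2,3,4), d=(2,4,6) this finds one valid system of distinct representatives, which is all the question needs to prove existence.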

Finding largest rectangle in 2D array

I need an algorithm which can parse a 2D array and return the largest continuous rectangle. For reference, look at the image I made demonstrating my question.
Generally you solve these sorts of problems using what are called scan line algorithms. They examine the data one row (or scan line) at a time building up the answer you are looking for, in your case candidate rectangles.
Here's a rough outline of how it would work.
Number all the rows in your image from 0..6, I'll work from the bottom up.
Examining row 0 you have the beginnings of two rectangles (I am assuming you are only interested in the black square). I'll refer to rectangles using (x, y, width, height). The two active rectangles are (1,0,2,1) and (4,0,6,1). You add these to a list of active rectangles. This list is sorted by increasing x coordinate.
You are now done with scan line 0, so you increment your scan line.
Examining row 1 you work along the row seeing if you have any of the following:
new active rectangles
space for existing rectangles to grow
obstacles which split existing rectangles
obstacles which require you to remove a rectangle from the active list
As you work along the row you will see that you have a new active rect (0,1,8,1), we can grow one of the existing active ones to (1,0,2,2), and we need to remove the active (4,0,6,1), replacing it with two narrower ones. We need to remember this one: it is the largest we have seen so far. It is replaced with two new active ones: (4,0,4,2) and (9,0,1,2).
So at the end of scan line 1 we have:
Active List: (0,1,8,1), (1,0,2,2), (4,0,4,2), (9, 0, 1, 2)
Biggest so far: (4,0,6,1)
You continue in this manner until you run out of scan lines.
The tricky part is coding up the routine that runs along the scan line updating the active list. If you do it correctly you will consider each pixel only once.
Hope this helps. It is a little tricky to describe.
I like a region growing approach for this.
For each open point in ARRAY:
    grow EAST as far as possible
    grow WEST as far as possible
    grow NORTH as far as possible by adding rows
    grow SOUTH as far as possible by adding rows
    save the resulting area for the seed pixel used
After looping through each point in ARRAY, pick the seed pixel with the largest area result
...would be a thorough, but maybe not-the-most-efficient way to go about it.
I suppose you need to answer the philosophical question "Is a line of points a skinny rectangle?" If a line == a thin rectangle, you could optimize further by:
Create a second array of integers called LINES that has the same dimensions as ARRAY
Loop through each point in ARRAY
Determine the longest valid line to the EAST that begins at each point and save its length in the corresponding cell of LINES.
After doing this for each point in ARRAY, loop through LINES
For each point in LINES, determine how many neighbors SOUTH have the same length value or less.
Accept a SOUTHERN neighbor with a smaller length if doing so will increase the area of the rectangle.
The largest rectangle using that seed point is (Number_of_acceptable_southern_neighbors*the_length_of_longest_accepted_line)
As the largest rectangular area for each seed is calculated, check to see if you have a new max value and save the result if you do.
And... you could do this without allocating an array LINES, but I thought using it in my explanation made the description simpler.
And... I think you need to do this same sort of thing with VERTICAL_LINES and EASTERN_NEIGHBORS, or some cases might miss big rectangles that are tall and skinny. So maybe this second algorithm isn't so optimized after all.
Use the first method to check your work. I think Knuth said "...premature optimization is the root of all evil."
HTH,
Perry
ADDENDUM: Several edits later, I think this answer deserves a group upvote.
A straightforward approach would be to loop through all the potential rectangles in the grid, figure out their area, and if it is greater than the current highest area, select it as the highest:
var biggestFound
for each potential rectangle:
    if area(this potential rectangle) > area(biggestFound)
        biggestFound = this potential rectangle
Then you simply need to find the potential rectangles.
for each square in grid:
    recursive loop 1:
        if not occupied:
            grow right until occupied, and return a rectangle
            grow down one and recurse (call loop 1)
This will duplicate a lot of work (for example you will re-evaluate a lot of sub-rectangles), but it should give you an answer.
Edit
An alternate approach might be to start with a single square the size of the grid, and "subtract" occupied squares to end up with a final set of potential rectangles. There might be optimization opportunities here using quadtrees, and in ensuring that you keep split rectangles "in order", top to bottom, left to right, in case you need to re-combine rectangles farther down in the algorithm.
If you are actually starting out with rectangular data (for your "populated grid" set), instead of a loose pixel grid, then you could easily get better perf out of a rectangle/region subtracting algorithm.
I'm not going to post pseudo-code for this because the idea is completely experimental, and I have no idea if the perf will be any better for a loose pixel grid ;)
Windows system "regions" and "dirty rectangles", as well as general "temporal caching" might be good inspiration here for more efficiency. There are also a lot of z-buffer tricks if this is for a graphics algorithm...
Use dynamic programming approach. Consider a function S(x,y) such that S(x,y) holds the area of the largest rectangle where (x,y) are the lowest-right-most corner cell of the rectangle; x is the row co-ordinate and y is the column co-ordinate of the rectangle.
For example, in your figure, S(1,1) = 1, S(1,2)=2, S(2,1)=2, and S(2,2) = 4. But, S(3,1)=0, because this cell is filled. S(8,5)=40, which says that the largest rectangle for which the lowest-right cell is (8,5) has the area 40, which happens to be the optimum solution in this example.
You can write a dynamic-programming recurrence for S(x,y) in terms of S(x-1,y), S(x,y-1) and S(x-1,y-1). Using it you can obtain the values of all S(x,y) in O(mn) time, where m and n are the row and column dimensions of the given table. Once S(x,y) is known for all 1 <= x <= m and 1 <= y <= n, you simply need to find the x and y for which S(x,y) is largest; this step also takes O(mn) time. By keeping additional data, you can also find the side lengths of the largest rectangle.
The overall complexity is O(mn). To understand more on this, read Chapter 15 of Cormen's algorithm book, specifically Section 15.4.
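For comparison, here is a sketch of a different but well-known O(mn) solution to the same problem (not the recurrence described in the answer above): treat each row as the base of a histogram of consecutive free cells, and find the largest rectangle in each histogram with a stack. The grid encoding ('.' free, '#' filled) and function names are my own for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <stack>
#include <string>
#include <vector>

// Largest rectangle under a histogram, via the classic stack scan.
static int largestInHistogram(const std::vector<int>& h) {
    std::stack<int> st;                  // indices of bars with increasing heights
    int best = 0;
    for (int i = 0; i <= (int)h.size(); ++i) {
        int cur = (i == (int)h.size()) ? 0 : h[i];   // sentinel flushes the stack
        while (!st.empty() && h[st.top()] >= cur) {
            int height = h[st.top()]; st.pop();
            int width = st.empty() ? i : i - st.top() - 1;
            best = std::max(best, height * width);
        }
        st.push(i);
    }
    return best;
}

// Largest all-free rectangle in a grid ('.' = free, '#' = filled):
// row by row, maintain a histogram of consecutive free cells above.
int largestRectangle(const std::vector<std::string>& grid) {
    if (grid.empty()) return 0;
    std::vector<int> heights(grid[0].size(), 0);
    int best = 0;
    for (const auto& row : grid) {
        for (size_t c = 0; c < row.size(); ++c)
            heights[c] = (row[c] == '.') ? heights[c] + 1 : 0;
        best = std::max(best, largestInHistogram(heights));
    }
    return best;
}
```

Each cell is touched a constant number of times (every bar index is pushed and popped at most once per row), which gives the O(mn) bound.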