Pattern matching with finite automata - c++

Recently I was reading the famous algorithms book CLRS (Cormen, Leiserson, Rivest, Stein, 3rd edition). Between the classical KMP and Rabin-Karp algorithms there is a part about string matching with finite automata. The algorithm builds the automaton from the pattern and then runs it over the input string.
In this example, the algorithm searches for the pattern "ababaca" in the input string. Everything seems logical to me, besides two things.
Why is there no path back to a previous state from state 4 when I read "b"? In that case I will have "ababb", which is already a mismatch. And what happens when I read "b" or "c" from state 6? Is there something that I misunderstood? Also, there is no case for reading "c" from states 0 to 4, and so on.

Check the table (b).
All the transitions you are talking about are marked as 0 there, so you go back to the beginning.
Drawing them in the image would add a lot of edges back to state 0, so they are left out for clarity.
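To see those hidden edges, here is a minimal sketch that builds the full transition table straight from the definition (the next state is the length of the longest prefix of the pattern that is a suffix of what has been read). The function names and the alphabet "abc" are my own assumptions for illustration, and this is the naive construction, not an optimized one.

```python
def build_transitions(pattern, alphabet):
    """Return delta, where delta[q][ch] is the next state from state q on ch."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for ch in alphabet:
            # Longest prefix of pattern that is a suffix of pattern[:q] + ch.
            candidate = pattern[:q] + ch
            k = min(m, q + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def find_matches(text, pattern, alphabet):
    """Run the automaton over text and collect the shifts where pattern occurs."""
    delta = build_transitions(pattern, alphabet)
    m = len(pattern)
    state = 0
    shifts = []
    for i, ch in enumerate(text):
        # Characters outside the alphabet send the automaton back to state 0.
        state = delta[state].get(ch, 0)
        if state == m:
            shifts.append(i - m + 1)
    return shifts


delta = build_transitions("ababaca", "abc")
print(delta[4]["b"], delta[6]["b"], delta[6]["c"])  # the "missing" edges
print(find_matches("abababacaba", "ababaca", "abc"))
```

The three printed transitions are the ones asked about, and all of them lead to state 0.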

Related

Levenshtein Automata

I implemented a Levenshtein trie to find words similar to a given word.
My goal was to have a fast way to do spell correction.
However, I found out that there is an even faster way to do that:
Levenshtein Automata
I just have a problem: I don't understand a word of what's written here.
Can someone explain the idea and the basic functionality of a Levenshtein automaton in simple words?
A deterministic finite automaton (DFA) is
an alphabet (set of possible input characters)
a set of states (just abstract objects with no special properties)
a transition function (given any state and an input character, it returns a unique state)
a distinguished start state
a set of accepting states.
You can draw a DFA as a diagram like those in the paper. Conventionally, circular nodes are states. Directed edges each labeled with one character are transitions. Accepting states are marked as double-line circles. The start state has an inward pointing arrow with nothing at its tail.
A DFA accepts word W if and only if you can move a marker from the start state along transitions whose concatenated labels equal W to an accepting state. That is, if T is the transition function, and W = "cat", then T(T(T(Start, 'c'), 'a'), 't') must be an accepting state. Cycles in the transition function allow DFAs to accept strings of arbitrary length even though the DFA is finite.
In software a DFA is a simple loop and a table T(state, char) that implements the transition function.
current_state = START
while not end-of-input
    c = get character from input
    current_state = T(current_state, c)
end
if current_state is an accepting state return ACCEPT, else REJECT
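As a concrete, runnable version of that loop, here is a small sketch in Python. The example DFA (binary strings with an even number of 1s) and the names T, ACCEPTING and DEAD are assumptions made for illustration. Unknown characters fall into a dead state, which is one common way of keeping the transition function total.

```python
START = "even"
DEAD = "dead"  # hypothetical sink state for characters outside the alphabet
ACCEPTING = {"even"}

# Transition table T(state, char) stored as a dictionary.
T = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}


def accepts(word):
    """Return True if the DFA ends in an accepting state after reading word."""
    state = START
    for c in word:
        # Missing entries (including anything from DEAD) lead to DEAD.
        state = T.get((state, c), DEAD)
    return state in ACCEPTING


print(accepts("1010"), accepts("1"), accepts("10x1"))
```

The loop body is a single table lookup per character, which is where the O(N) running time comes from.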
The Wikipedia page on DFAs is not bad.
DFAs have nice properties. Accepting/rejecting an input of length N requires O(N) time (as long as the transition function runs in constant time). There is a unique minimum version of every DFA (among all those accepting the same set of words) and an efficient algorithm to find that minimum DFA. It's easy to compare DFAs for equality in time linear in the DFAs' size.
A Levenshtein Automaton L(W, d) for word W and Levenshtein distance d is just a DFA that accepts all words having Levenshtein distance at most d from W. That is, the automaton accepts W plus a bunch of other words that are W with no more than d "mistakes" in the usual sense of Levenshtein distance.
The contribution of the paper is a fast algorithm for computing Levenshtein DFAs and then a more advanced algorithm that accomplishes the same thing without computing the DFA explicitly.
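To make the definition of L(W, d) concrete, here is a rough sketch that simulates the nondeterministic version of the automaton directly. This is not the paper's algorithm: it keeps a set of (position in W, errors so far) pairs and updates it per input character. The function name within_distance is made up for this example.

```python
def within_distance(word, candidate, d):
    """Return True if candidate is within Levenshtein distance d of word."""
    n = len(word)

    def closure(states):
        # Deleting a character of `word` advances the position without
        # consuming input, at the cost of one error.
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            nxt = (i + 1, e + 1)
            if i < n and e < d and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
        return seen

    current = closure({(0, 0)})
    for c in candidate:
        following = set()
        for i, e in current:
            if i < n and word[i] == c:
                following.add((i + 1, e))          # match
            if e < d:
                following.add((i, e + 1))          # insertion (extra char in candidate)
                if i < n:
                    following.add((i + 1, e + 1))  # substitution
        current = closure(following)
        if not current:
            return False  # no state survives, so more than d errors are needed
    return any(i == n for i, _ in current)


print(within_distance("food", "fod", 1), within_distance("food", "fd", 1))
```

A real Levenshtein DFA precomputes these state sets once per word, so that checking each dictionary entry afterwards is just the table-driven loop shown earlier.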
Gene has provided a fantastic high-level description of Levenshtein Automata!
With that said, if you're looking for some code to further your understanding, you may find the Java LevenshteinAutomaton library useful. It implements both algorithms described in the paper you stumbled upon (among others), and is well-structured, easy to follow, and extensively commented. It is also maintained by yours truly :) .

Create regex from samples algorithm

AFAIK no one has implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I gave my algorithm these two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give back more than one regex, or even no regex at all, and you might find 1000 reasons why such an algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.
I don't have a proof, but I suspect no such discrete algorithm with a finite output could exist, since you are asking for the creation of a regular language which could be "large" with respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
Flash Fill, a new feature of MS Excel 2013, does exactly the task you want, but it does not give you the regular expression. It's an NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, go to the Flash Fill official website and read a few papers. They have pseudo-code and demo movies as well.
There are in fact many such algorithms. This is a research area called "grammatical inference".
I know RPNI, for example (you could also look at the probabilistic branch: ALERGIA, MDI, DEES). These algorithms generate deterministic finite automata (DFA). In fact, you don't need to enter the full strings from your example at all, only the substrings.
There are also some algorithms that directly generate nondeterministic automata.
Of course, getting a regular expression from a nondeterministic automaton is easy.
The main ideas are simple:
Generate a prefix tree acceptor (PTA) from your sample.
Then try to "merge" some states. These merges produce loops (i.e. * in the regular expression). The whole difficulty is choosing the right rule for merging.
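The first step above, building the prefix tree from the samples, can be sketched as follows. This is only an illustration with made-up function names. The merging step, which is where RPNI and its relatives differ, is deliberately not shown.

```python
def build_prefix_tree(samples):
    """Build a prefix tree automaton: one state per distinct prefix."""
    transitions = [{}]  # transitions[state][char] -> next state; state 0 is the root
    accepting = set()
    for sample in samples:
        state = 0
        for ch in sample:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.add(state)  # the state reached by a full sample accepts
    return transitions, accepting


def accepts(tree, word):
    """Check whether the prefix tree accepts word exactly."""
    transitions, accepting = tree
    state = 0
    for ch in word:
        if ch not in transitions[state]:
            return False
        state = transitions[state][ch]
    return state in accepting


tree = build_prefix_tree(["1234", "456", "12"])
print(len(tree[0]), accepts(tree, "1234"), accepts(tree, "123"))
```

At this point the automaton accepts exactly the samples and nothing else. Generalisation only appears once states are merged.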
Here you go:
import re
pattern = '|'.join(re.escape(s) for s in substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an input string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it by the fewest letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These of course could be immense. Another option would be to try permutations of the word with spaces inserted, then do a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance; there are libraries available in C++ for reference.
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive, you can utilize certain types of heuristics; however, this can get out of scope quickly.
You may also consider using an existing spelling library, such as aspell.
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
Just a thought, good luck....
I will suppose that you have an existing index on which you run your Levenshtein distance (for example, a trie, but any sorted index generally works well).
You can consider the addition of white-spaces as a regular edit operation, it's just that there is a twist: you need (then) to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.
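Here is a rough sketch of the same idea without the trie: a plain dynamic program that tries every split point and every dictionary word, reusing an ordinary Levenshtein function. It is much slower than an index traversal, and inserting a blank is free here, which is an assumption of this sketch (the approach above counts it as an edit). The function names are made up.

```python
def levenshtein(a, b):
    """Classic row-by-row edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def segment(text, dictionary):
    """Return (total_cost, words) for the cheapest split of text into words."""
    n = len(text)
    # best[j] holds the cheapest correction found for text[:j].
    best = [(0, [])] + [None] * n
    for j in range(1, n + 1):
        for i in range(j):
            if best[i] is None:
                continue
            chunk = text[i:j]
            for word in dictionary:
                cost = best[i][0] + levenshtein(chunk, word)
                if best[j] is None or cost < best[j][0]:
                    best[j] = (cost, best[i][1] + [word])
    return best[n]


print(segment("iloevcokies", ["i", "love", "cookies"]))
```

With a real word list you would want to bound the chunk length and prune by cost, which is exactly what walking a trie gives you for free.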

Is it possible to calculate the edit distance between a regexp and a string?

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want efficient, approximate searching, a less flexible but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match, i.e., the distance to the match, from your perspective.
If you mean the smallest Levenshtein distance between a sample and the closest string that matches, then I'm pretty sure it can be done, but you'd have to convert the regex to a DFA yourself, then try to match and, whenever something fails, non-deterministically continue as if it had passed and keep track of the number of differences. You could use A* search or something similar for this; it would be quite inefficient though (O(2^n) worst case).
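As a sketch of that search, the following runs Dijkstra over pairs (position in the string, DFA state), with one unit of cost per inserted, deleted or substituted character. The DFA for [0-9]{3} is written out by hand here, since converting an arbitrary regex to a DFA is a separate job. The function name and the table layout are assumptions made for this example.

```python
import heapq


def regex_distance(delta, start, accepting, s):
    """Minimum number of edits turning s into a string the DFA accepts."""
    n = len(s)
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        cost, i, q = heapq.heappop(heap)
        if cost > dist.get((i, q), float("inf")):
            continue  # stale heap entry
        if i == n and q in accepting:
            return cost
        moves = []
        if i < n:
            moves.append((cost + 1, i + 1, q))  # delete s[i]
            for c, nq in delta.get(q, {}).items():
                # Free if c equals s[i], otherwise a substitution.
                moves.append((cost + (c != s[i]), i + 1, nq))
        for c, nq in delta.get(q, {}).items():
            moves.append((cost + 1, i, nq))  # insert c before s[i]
        for ncost, ni, nq in moves:
            if ncost < dist.get((ni, nq), float("inf")):
                dist[(ni, nq)] = ncost
                heapq.heappush(heap, (ncost, ni, nq))
    return None  # the DFA accepts nothing


# Hand-built DFA for [0-9]{3}: states 0..3, state 3 accepting.
DIGITS = "0123456789"
delta = {q: {d: q + 1 for d in DIGITS} for q in range(3)}

print(regex_distance(delta, 0, {3}, "a23"), regex_distance(delta, 0, {3}, "ab3"))
```

Because the search space is only (string length + 1) times the number of DFA states, this particular formulation stays polynomial once the DFA is available. The expensive part is the regex-to-DFA conversion.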

Short example of regular expression converted to a state machine?

In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed:
Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.
I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.
A rather convenient way to look at this is to use Python's little-known re.DEBUG flag on any pattern:
>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
  in
    range (65, 90)
  max_repeat 0 65535
    in
      range (65, 90)
      range (48, 57)
at at_boundary
max_repeat 0 65535
  not_literal 62
literal 62
subpattern 2
  min_repeat 0 65535
    any None
literal 60
literal 47
groupref 1
literal 62
The numbers after 'literal' and 'range' refer to the integer values of the ASCII characters they're supposed to match.
Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:
^[A-Za-z][A-Za-z0-9_]*$
which is a typical identifier (must start with alpha and can have any number of alphanumeric and underscore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:
state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
        if char is not in the set "A-Z" or "a-z":
            error "Invalid first character"
        state = SUBSEQUENTCHARS
        next char
    if state == SUBSEQUENTCHARS:
        if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
            error "Invalid subsequent character"
        state = SUBSEQUENTCHARS
        next char
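For comparison, here is a runnable Python sketch of the same two-state machine, checked against the regular expression. The function name is_identifier is made up, and unlike the pseudo-code it also rejects the empty string, since the RE requires at least one character. I use re.fullmatch instead of ^...$ for the comparison because in Python $ also matches just before a trailing newline.

```python
import re
import string

FIRST_CHARS = set(string.ascii_letters)
SUBSEQUENT_CHARS = FIRST_CHARS | set(string.digits) | {"_"}


def is_identifier(text):
    """Two-state machine equivalent to [A-Za-z][A-Za-z0-9_]* on the whole string."""
    state = "FIRSTCHAR"
    for ch in text:
        if state == "FIRSTCHAR":
            if ch not in FIRST_CHARS:
                return False  # invalid first character
            state = "SUBSEQUENTCHARS"
        else:
            if ch not in SUBSEQUENT_CHARS:
                return False  # invalid subsequent character
    # Ending in FIRSTCHAR means no character was read at all.
    return state == "SUBSEQUENTCHARS"


IDENTIFIER_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
for sample in ["abc_1", "1abc", "", "a", "a-b"]:
    assert is_identifier(sample) == bool(IDENTIFIER_RE.fullmatch(sample))
print(is_identifier("abc_1"), is_identifier("1abc"))
```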
Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.
That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.
The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.
Make your own on the fly!
http://osteele.com/tools/reanimator/???
This is a really nicely put together tool which visualises regular expressions as FSMs. It doesn't have support for some of the syntax you'll find in real-world regular expression engines, but certainly enough to understand exactly what's going on.
Is the question "How do I choose the states and the transition conditions?", or "How do I implement my abstract state machine in Foo?"
How do I choose the states and the transition conditions?
I usually use FSMs for fairly simple problems and choose them intuitively. In my answer to another question about regular expressions, I just looked at the parsing problem as one of being either inside or outside a tag pair, and wrote out the transitions from there (with a beginning and ending state to keep the implementation clean).
How do I implement my abstract state machine in Foo?
If your implementation language supports a structure like C's switch statement, then you switch on the current state and process the input to see which action and/or transition to perform next.
Without switch-like structures, or if they are deficient in some way, you use if-style branching. Ugh.
Written all in one place in C, the example I linked would look something like this:
token_t token;
state_t state=BEGIN_STATE;
do {
    switch ( state.getValue() ) {
    case BEGIN_STATE:
        state=OUT_STATE;
        break;
    case OUT_STATE:
        switch ( token.getValue() ) {
        case CODE_TOKEN:
            state = IN_STATE;
            output(token.string());
            break;
        case NEWLINE_TOKEN:
            output("<break>");
            output(token.string());
            break;
        ...
        }
        break;
    ...
    }
} while (state != END_STATE);
which is pretty messy, so I usually rip the state cases out to separate functions.
I'm sure someone has better examples, but you could check this post by Phil Haack, which has an example of a regular expression and a state machine doing the same thing (there's a previous post with a few more regex examples in there as well I think..)
Check the "HenriFormatter" on that page.
I don't know what academic papers you've already read, but it really isn't that difficult to understand how to implement a finite state machine. There is some interesting mathematics, but the idea is actually very easy to understand. The easiest way to understand an FSM is through input and output (actually, this comprises most of the formal definition, which I won't describe here). A "state" is essentially just describing a set of inputs and outputs that have occurred and can occur from a certain point.
Finite state machines are easiest to understand via diagrams. For example:
[Diagram of a finite state machine: http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3.gif]
All this is saying is that if you begin in some state q0 (the one with the Start symbol next to it) you can go to other states. Each state is a circle. Each arrow represents an input or output (depending on how you look at it). Another way to think of a finite state machine is in terms of "valid" or "acceptable" input. There are certain output strings that are not possible for certain finite state machines; this is what allows you to "match" expressions.
Now suppose you start at q0. Now, if you input a 0 you will go to state q1. However, if you input a 1 you will go to state q2. You can see this by the symbols above the input/output arrows.
Let's say you start at q0 and get this input
0, 1, 0, 1, 1, 1
This means you have gone through states (no input for q0, you just start there):
q0 -> q1 -> q0 -> q1 -> q0 -> q2 -> q3 -> q3
Trace the picture with your finger if it doesn't make sense. Notice that q3 goes back to itself for both inputs 0 and 1.
Another way to say all this is "If you are in state q0 and you see a 0 go to q1 but if you see a 1 go to q2." If you make these conditions for each state you are nearly done defining your state machine. All you have to do is have a state variable and then a way to pump input in and that is basically what is there.
Ok, so why is this important regarding Joel's statement? Well, building the "ONE TRUE REGULAR EXPRESSION TO RULE THEM ALL" can be very difficult, and the result can be hard to maintain, modify, or even for others to come back to and understand. Also, in some cases a state machine is more efficient.
Of course, state machines have many other uses. Hope this helps in some small way. Note, I didn't bother going into the theory but there are some interesting proofs regarding FSMs.