I've been reading the dragon book and was really interested in the algorithm that converts a regex directly to a DFA (without an explicit NFA). Suppose my lexical file layout is like lex:
...
%%
if { return IF_TOK; }
else { return ELSE_TOK; }
then { return THEN_TOK; }
[a-zA-z][a-zA-Z0-9]* { return IDENT; }
...more tokens
%%
...
I've implemented the algorithm and was wondering how I could extract tokens from it. I know that when an NFA constructed with Thompson's construction (the dragon book's variation), then converted to a DFA, you can get the tokens from the NFA states a DFA is composed of, but I am unsure in this case.
A (f)lex description -- indeed, any lexical description -- is a collection of regular expressions. However, all practical algorithms match a single regular expression. This isn't a problem because we can simply construct a single regular expression consisting of an alternation between all the possible tokens. So the first four patterns in your example would be combined into
(if|then|else|[a-zA-z][a-zA-Z0-9]*)
The direct regex->DFA algorithm given in the Dragon book works on an augmented regular expression, constructed by adding a single end-of-input marker at the end of the regex. States which include the position of this augmented marker are accepting states.
That's convenient for the description of an algorithm, but not very helpful for the construction of an actual tokeniser, because the tokeniser rarely reaches the end of the input. It just keeps going until no more transitions are possible, at which point it backs up to the last accepting state.
States in the regex->DFA algorithm correspond to sets of positions in the original regular expression. (If you've written the algorithm you know what a position is, but for the sake of other people reading along, a position is roughly speaking the offset in the regular expression of the character literal (or wildcard character) which has just matched. Not all offsets in the regular expression are positions in this sense; parentheses and operators do not correspond to any matched character, for example. And an entire character class literal is a single position. But it's often easier to think about positions as though they were simply byte offsets in the regular expression itself.)
That's useful, because we can see instantly from the name of the state where in the regular expression the scan has reached. And since each pattern in the combined regular expression has a different range of positions, we know which rule we're in. If the pattern has succeeded, the rule number tells us which action to choose. If the accepting state is in more than one rule, we just choose the rule with the smallest position, because that's the (f)lex algorithm for selecting between two longest matches of the same length: choose the first one in the scanner definition.
But, as we noted earlier, the only accepting state is the one corresponding to the end-of-input marker at the end of the augmented pattern, and that is not in any of the original patterns. So if we want to identify the actual pattern, we need to augment each individual rule with its own end marker, rather than having all the rules share a single end marker. Furthermore, the end marker cannot be restricted to just matching the (fictional) character at the end of the input, because we're interested in the longest matching prefix rather than a complete match. So we have to consider the end marker as matching any single character, including the fictional end-of-input character. (That makes it a wilder wildcard than ., since . only matches real characters, not fictional end-of-input characters, and also -- in the real (f)lex implementation -- only real characters which happen not to be newline characters. But in principle, it's a kind of wildcard character, except that it doesn't absorb the character it matches.)
So the end result is that we transform the collection of patters into:
(if#|then#|else#|[a-zA-z][a-zA-Z0-9]*#)
where # is the end-of-match wildcard (or, more accurately, zero-length assertion which always succeeds) as described above. Then, every state which includes one (or more) of the # wildcards is an accepting state. The position it corresponds to is the smallest position in its name which corresponds to a #, and the rule action which should be taken is the one whose augmented pattern includes that #.
I found a state diagram of a DFA (deterministic finite automaton) with its RegEx in a script, but this diagram is just a sample without any explanations. So I tried to derive the RegEx from the DFA state diagram by myself and got the expression: ab+a+b(a*b)*. I dont understand how I get the original RegEx (ab+a*)+ab+ mentioned in the script. Here my derivation:
I am grateful for any help, links, references and hints!
You've derived the regex correctly here. The expression you have ab+a+b(a*b)* is equivalent to (ab+a*)+ab+ - once you've finished the DFA state elimination (you have a single transition from the start state to an accepting state), there aren't any more derivations to do. You may however get different final regexes depending on the order you eliminate the states in, and they should all be valid assuming you did the eliminations correctly. The state elimination method is also not guaranteed to be able to produce all equivalent regular expressions for a particular DFA, so it's okay that you didn't arrive at exactly the original regex. You can also check the equivalence of two regular expressions here.
For your particular example though to show that this DFA is equivalent to this original regex (ab+a*)+ab+, take a look at the DFA at this state of the elimination (somewhere in between the second and third steps you've shown above):
Let's expand our expression (ab+a*)+ab+ to (ab+a*)(ab+a*)*ab+. So in the DFA, the first (ab+a*) gets us from state 0 to partway between states 2 and 3 (the a* in a*a).
Then the next part (ab+a*)* means we're allowed to have 0 or more copies of (ab+a*). If there are 0 copies, we'll just finish with an ab+, reading an a from the second half of the a*a transition from 2 to 3 and a b from the 3 to 4 transition, landing us in state 4 which is accepting and where we can take the self loop and read as many b's as we want.
Otherwise we have 1 or more copies of (ab+a*), again reading an a from the second half of the a*a transition from 2 to 3 and a b from the 3 to 4 transition. The a* comes from the first half of the a*ab self loop on state 4 and the second half ab is either the final ab+ of the regex or the start of another copy of (ab+a*). I'm not sure if there's a state elimination that arrives at exactly the expression (ab+a*)+ab+ but for what it's worth, I think the regex you derived more clearly captures the structure of this DFA.
I am reading my textbook ("Introduction to the theory of computation" by Michael Sipser) and there is a part that is trying to convert the regular expression (ab U b)* into a non-deterministic finite automata. The book is telling me to break the expression apart and write a NFA for each piece. When it reaches ab it shows:
An a to go from the first state to the second
An empty string to go from the second state to the third
A b to go from the third state to the final accepting state.
I am really confused trying to understand the second step. What's the purpose of the empty string. Would it be incorrect if it wasn't there?
You're probably making reference to example 1.56, right? First of all, remember he is describing a process - think of it as an algorithm, that will be executed always in the same matter. What he's saying is that, if you have a regexp concatenation as P.Q, no matter what P and Q are, you can join their AFNs by connecting the ending states of P with an empty string to the starting state of Q.
So, in his example, he's creating an AFN to represent a.b. AFNs to represent a and b isolated are trivial, as:
So, in order to join both, you just follow the algorithm, putting an empty string connection from the final state of the first AFN to the starting state of the second algorithm, as:
In this particular case, since it's very easy to create a a.b AFD, we look at it and perceive this empty string is not necessary. But the process works with every single concatenation - and, in fact, being self evident that this is equivalent of a trivially correct AFD just reinforces his statement: the process works.
Hope that helps.
I have a given DFA that represent a regular expression.
I want to match the DFA against an input stream and get all possible matches back, not only the leastmost-longest match.
For example:
regex: a*ba|baa
input: aaaaabaaababbabbbaa
result:
aaaaaba
aaba
ba
baa
Assumptions
Based on your question and later comments you want a general method for splitting a sentence into non-overlapping, matching substrings, with non-matching parts of the sentence discarded. You also seem to want optimal run-time performance. Also I assume you have an existing algorithm to transform a regular expression into DFA form already. I further assume that you are doing this by the usual method of first constructing an NFA and converting it by subset construction to DFA, since I'm not aware of any other way of accomplishing this.
Before you go chasing after shadows, make sure your trying to apply the right tool for the job. Any discussion of regular expressions is almost always muddied by the fact that folks use regular expressions for a lot more things than they are really optimal for. If you want to receive the benefits of regular expressions, be sure you're using a regular expression, and not something broader. If what you want to do can't be somewhat coded into a regular expression itself, then you can't benefit from the advantages of regular expression algorithms (fully)
An obvious example is that no amount of cleverness will allow a FSM, or any algorithm, to predict the future. For instance, an expression like (a*b)|(a), when matched against the string aaa... where the ellipsis is the portion of the expression not yet scanned because the user has not typed them yet, cannot give you every possible right subgroup.
For a more detailed discussion of Regular expression implementations, and specifically Thompson NFA's please check this link, which describes a simple C implementation with some clever optimizations.
Limitations of Regular Languages
The O(n) and Space(O(1)) guarantees of regular expression algorithms is a fairly narrow claim. Specifically, a regular language is the set of all languages that can be recognized in constant space. This distinction is important. Any kind of enhancement to the algorithm that does something more sophisticated than accepting or rejecting a sentence is likely to operate on a larger set of languages than regular. On top of that, if you can show that some enhancement requires greater than constant space to implement, then you are also outside of the performance guarantee. That being said, we can still do an awful lot if we are very careful to keep our algorithm within these narrow constraints.
Obviously that eliminates anything we might want to do with recursive backtracking. A stack does not have constant space. Even maintaining pointers into the sentence would be verboten, since we don't know how long the sentence might be. A long enough sentence would overflow any integer pointer. We can't create new states for the automaton as we go to get around this. All possible states (and a few impossible ones) must be predictable before exposing the recognizer to any input, and that quantity must be bounded by some constant, which may vary for the specific language we want to match, but by no other variable.
This still allows some room for adding additonal behavior. The usual way of getting more mileage is to add some extra annotations for where certain events in processing occur, such as when a subexpression started or stopped matching. Since we are only allowed to have constant space processing, that limits the number of subexpression matches we can process. This usually means the latest instance of that subexpression. This is why, when you ask for the subgrouped matched by (a|)*, you always get an empty string, because any sequence of a's is implicitly followed by infinitely many empty strings.
The other common enhancement is to do some clever thing between states. For example, in perl regex, \b matches the empty string, but only if the previous character is a word character and the next is not, or visa versa. Many simple assertions fit this, including the common line anchor operators, ^ and $. Lookahead and lookbehind assertions are also possible, but much more difficult.
When discussing the differences between various regular language recognizers, it's worth clarifying if we're talking about match recognition or search recognition, the former being an accept only if the entire sentence is in the language, and the latter accepts if any substring in the sentence is in the language. These are equivalent in the sense that if some expression E is accepted by the search method, then .*(E).* is accepted in the match method.
This is important because we might want to make it clear whether an expression like a*b|a accepts aa or not. In the search method, it does. Either token will match the right side of the disjunction. It doesn't match, though, because you could never get that sentence by stepping through the expression and generating tokens from the transitions, at least in a single pass. For this reason, i'm only going to talk about match semantics. Obviously if you want search semantics, you can modify the expression with .*'s
Note: A language defined by expression E|.* is not really a very manageable language, regardless of the sublanguage of E because it matches all possible sentences. This represents a real challenge for regular expression recognizers because they are really only suited to recognizing a language or else confirming that a sentence is not in that same language, rather than doing any more specific work.
Implementation of Regular Language Recognizers
There are generally three ways to process a regular expression. All three start the same, by transforming the expression into an NFA. This process produces one or two states for each production rule in the original expression. The rules are extemely simple. Here's some crude ascii art: note that a is any single literal character in the language's alphabet, and E1 and E2 are any regular expression. Epsilon(ε) is a state with inputs and outputs, but ignores the stream of characters, and doesn't consume any input either.
a ::= > a ->
E1 E2 ::= >- E1 ->- E2 ->
/---->
E1* ::= > --ε <-\
\ /
E1
/-E1 ->
E1|E2 ::= > ε
\-E2 ->
And that's it! Common uses such as E+, E?, [abc] are equivalent to EE*, (E|), (a|b|c) respectively. Also note that we add for each production rule a very small number of new states. In fact each rule adds zero or one state (in this presentation). characters, quantifiers and dysjunction all add just one state, and the concatenation doesn't add any. Everything else is done by updating the fragments' end pointers to start pointers of other states or fragments.
The epsilon transition states are important, because they are ambiguous. When encountered, is the machine supposed to change state to once following state or another? should it change state at all or stay put? That's the reason why these automatons are called nondeterministic. The solution is to have the automaton transition to the right state, whichever allows it to match the best. Thus the tricky part is to figure out how to do that.
There are fundamentally two ways of doing this. The first way is to try each one. Follow the first choice, and if that doesn't work, try the next. This is recursive backtracking, appears in a few (and notable) implementations. For well crafted regular expressions, this implementation does very little extra work. If the expression is a bit more convoluted, recursive backtracking is very, very bad, O(2^n).
The other way of doing this is to instead try both options in parallel. At each epsilon transition, add to the set of current states both of the states the epsilon transition suggests. Since you are using a set, you can have the same state come up more than once, but you only need to track it once, either you are in that state or not. If you get to the point that there's no option for a particular state to follow, just ignore it, that path didn't match. If there are no more states, then the entire expression didn't match. as soon as any state reaches the final state, you are done.
Just from that explanation, the amount of work we have to do has gone up a little bit. We've gone from having to keep track of a single state to several. At each iteration, we may have to update on the order of m state pointers, including things like checking for duplicates. Also the amount of storage we needed has gone up, since now it's no longer a single pointer to one possible state in the NFA, but a whole set of them.
However, this isn't anywhere close to as bad as it sounds. First off, the number of states is bounded by the number of productions in the original regular expression. From now on we'll call this value m to distinguish it from the number of symbols in the input, which will be n. If two state pointers end up transitioning to the same new state, you can discard one of them, because no matter what else happens, they will both follow the same path from there on out. This means the number of state pointers you need is bounded by the number of states, so that to is m.
This is a bigger win in the worst case scenario when compared to backtracking. After each character is consumed from the input, you will create, rename, or destroy at most m state pointers. There is no way to craft a regular expression which will cause you to execute more than that many instructions (times some constant factor depending on your exact implementation), or will cause you to allocate more space on the stack or heap.
This NFA, simultaneously in some subset of its m states, may be considered some other state machine who's state represents the set of states the NFA it models could be in. each state of that FSM represents one element from the power set of the states of the NFA. This is exactly the DFA implementation used for matching regular expressions.
Using this alternate representation has an advantage that instead of updating m state pointers, you only have to update one. It also has a downside, since it models the powerset of m states, it actually has up to 2m states. That is an upper limit, because you don't model states that cannot happen, for instance the expression a|b has two possible states after reading the first character, either the one for having seen an a, or the one for having seen a b. No matter what input you give it, it cannot be in both of those states at the same time, so that state-set does not appear in the DFA. In fact, because you are eliminating the redundancy of epsilon transitions, many simple DFA's actually get SMALLER than the NFA they represent, but there is simply no way to guarantee that.
To keep the explosion of states from growing too large, a solution used in a few versions of that algorithm is to only generate the DFA states you actually need, and if you get too many, discard ones you haven't used recently. You can always generate them again.
From Theory to Practice
Many practical uses of regular expressions involve tracking the position of the input. This is technically cheating, since the input could be arbitrarily long. Even if you used a 64 bit pointer, the input could possibly be 264+1 symbols long, and you would fail. Your position pointers have to grow with the length of the input, and now your algorithm now requires more than constant space to execute. In practice this isn't relevant, because if your regular expression did end up working its way through that much input, you probably won't notice that it would fail because you'd terminate it long before then.
Of course, we want to do more than just accept or reject inputs as a whole. The most useful variation on this is to extract submatches, to discover which portion of an input was matched by a certain section of the original expression. The simple way to achieve this is to add an epsilon transition for each of the opening and closing braces in the expression. When the FSM simulator encounters one of these states, it annotates the state pointer with information about where in the input it was at the time it encountered that particular transition. If the same pointer returns to that transition a second time, the old annotation is discarded and replaced with a new annotation for the new input position. If two states pointers with disagreeing annotations collapse to the same state, the annotation of a later input position wins again.
If you are sticking to Thompson NFA or DFA implementations, then there's not really any notion of greedy or non-greedy matching. A backtracking algorithm needs to be given a hint about whether it should start by trying to match as much as it can and recursively trying less, or trying as little as it can and recursively trying more, when it fails it first attempt. The Thompson NFA method tries all possible quantities simultaneously. On the other hand, you might still wish to use some greedy/nongreedy hinting. This information would be used to determine if newer or older submatch annotations should be preferred, in order to capture just the right portion of the input.
Another kind of practical enhancement is assertions, productions which do not consume input, but match or reject based on some aspect of the input position. For instance in perl regex, a \b indicates that the input must contain a word boundary at that position, such that the symbol just matched must be a word character, but the next character must not be, or visa versa. Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, then the state pointer continues, otherwise it is discarded.
Lookahead and lookbehind assertions can be achieved with a bit more work. A typical lookbehind assertion r0(?<=r1)r2 is transformed into two separate expressions, .*r1 and r0εr2. Both expressions are applied to the input. Note that we added a .* to the assertion expression, because we don't actually care where it starts. When the simulator encounters the epsilon in the second generated fragment, it checks up on the state of the first fragment. If that fragment is in a state where it could accept right there, the assertion passes with the state pointer flowing into r2, but otherwise, it fails, and both fragments continue, with the second discarding the state pointer at the epsilon transition.
Lookahead also works by using an extra regex fragment for the assertion, but is a little more complex, because when we reach the point in the input where the assertion must succeed, none of the corresponding characters have been encountered (in the lookbehind case, they have all been encountered). Instead, when the simulator reaches the assertion, it starts a pointer in the start state of the assertion subexpression and annotates the state pointer in the main part of the simulation so that it knows it is dependent on the subexpression pointer. At each step, the simulation must check to see that the state pointer it depends upon is still matching. If it doesn't find one, then it fails wherever it happens to be. You don't have to keep any more copies of the assertion subexpressions state pointers than you do for the main part, if two state pointers in the assertion land on the same state, then the state pointers each of them depend upon will share the same fate, and can be reannotated to point to the single pointer you keep.
While were adding special instructions to epsilon transitions, it's not a terrible idea to suggest an instruction to make the simulator pause once in a while to let the user see what's going on. Whenever the simulator encounters such a transition, it will wrap up its current state in some kind of package that can be returned to the caller, inspected or altered, and then resumed where it left off. This could be used to match input interactively, so if the user types only a partial match, the simulator can ask for more input, but if the user types something invalid, the simulator is empty, and can complain to the user. Another possibility is to yield every time a subexpression is matched, allowing you to peek at every sub match in the input. This couldn't be used to exclude some submatches, though. For instance, if you tried to match ((a)*b) against aaa, you could see three submatches for the a's, even though the whole expression ultimately fails because there is no b, and no submatch for the corresponding b's
Finally, there might be a way to modify this to work with backreferences. Even if it's elegent, it's sure to be inefficient, specifically, regular expressions plus backreferences are in NP-Complete, so I won't even try to think of a way to do this, because we are only interested (here) in (asymptotically) efficient possibilities.
I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?
There's no easy way.
As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.
Given any pair of NFA, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).
The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFA. Alternatively, if you've already shown how to compute the complement of a NFA, then you can (DeMorgan's law style) get the intersection by complement(union(complement(A),complement(B))).
Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:
Some classes of regular languages can
only be described by deterministic
finite automata whose size grows
exponentially in the size of the
shortest equivalent regular
expressions. The standard example are
here the languages L_k consisting of
all strings over the alphabet {a,b}
whose kth-last letter equals a.
By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:
if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')
and X ->a Y (in the other NFA)
then (A,X) ->a (B,Y) in the intersection
(C,Z) is final iff C is final in the one NFA and Z is final in the other.
To start the process off, you start in the pair of start states for the two NFAs e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. You'd store the fact that you expanded a state's arcs (e.g. in a hash table) and end up exploring all the states reachable from the start.
If you allow epsilon transitions (that don't output a letter), that's fine:
if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.
This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).
This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
In theory, the problem you describe is impossible.
In practice, if you have a manageable number of regular expressions that use a limited subset or of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.
Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.