I am trying to implement a pattern matching "syntax" and language.
I know of regular expressions but these aren't enough for my scopes.
I have individuated some "mathematical" operators.
In the examples that follow I will suppose that the subject of pattern mathing are character strings but it isn't necessary.
Having read the description bellow: The question is, does any body knows of a mathematical theory explicitating that or any language that takes the same approach implementing it ? I would like to look at it in order to have ideas !
Descprition of approach:
At first we have characters. Characters may be aggregated to form strings.
A pattern is:
a) a single character
b) an ordered group of patterns with the operator matchAny
c) an ordered group of patterns with the operator matchAll
d) other various operators to see later on.
Explanation:
We have a subject character string and a starting position.
If we check for a match of a single character, then if it matches it moves the current position forward by one position.
If we check for a match of an ordered group of patterns with the operator matchAny then it will check each element of the group in sequence and we will have a proliferation of starting positions that will get multiplied by the number of possible matches being advanced by the length of the match.
E.G suppose the group of patterns is { "a" "aba" "ab" "x" "dd" } and the string under examination is:
"Dabaxddc" with current position 2 ( counting from 1 ).
Then applying matchAny with the previous group we have that "a" mathces "aba" matches and "ab" matches while "x" and "dd" do not match.
After having those matches there are 3 starting positions 3 4 5 ( corresponding to "a" "ab" "aba" ).
We may continue our pattern matching by accepting to have more then one starting positions. So now we may continue to the next case under examination and check for a matchAll.
matchAll means that all patterns must match sequentially and are applied sequentially.
subcases of matchAll are match0+ match1+ etc.
I have to add that the same fact to try to ask the question has already helped me and cleared me out some things.
But I would like to know of similar approaches in order to study them.
Please only languages used by you and not bibliography !!!
I suggest you have a look at the paper "Parsing Permutation Phrases". It deals with recognizing a set of things in any order where the "things" can be recognizers themselves. The presentation in the paper might be a little different than what you expect; they don't compile to finite automaton. However, they do give an implementation in a functional language and that should be helpful to you.
Your description of matching strings against patterns is exactly what a compiler does. In particular, your description of multiple potential matches is highly reminiscent of the way an LR parser works.
If the patterns are static and can be described by an EBNF, then you could use an LR parser generator (such as YACC) to generate a recogniser.
If the patterns are dynamic but can still be formulated as EBNF there are other tools that can be applied. It just gets a bit more complicated.
[In Australia at least, Computer Science was a University course in 1975, when I did mine. YACC dates from around 1970 in its original form. EBNF is even older.]
I'm trying to implement a lexer for fun. I have already implemented a basic regular expression matcher(by first converting a pattern to a NFA and then to a DFA). Now I'm clueless about how to proceed.
My lexer would be taking a list of tokens and their corresponding regexs. What is the general algorithm used to create a lexer out of this?
I thought about (OR)ing all the regex, but then I can't identify which specific token was matched. Even if I extend my regex module to return the pattern matched when a match is successful, how do I implement lookahead in the matcher?
Assuming you have a working regex, regex_match which returns a boolean (True if a string satisfies the regex). First, you need to have an ordered list of tokens (with regex for each) tokens_regex, the order being important as order will prescribe precedence.
One algorithm could be (this is not necessarily the only one):
Write a procedure next_token which takes a string, and returns the first token, its value and the remaining string (or - if an illegal/ignore character - None, the offending character and the remaining string). Note: this has to respect precedence, and should find the longest token.
Write a procedure lex which recursively calls next_token.
.
Something like this (written in Python):
tokens_regex = [ (TOKEN_NAME, TOKEN_REGEX),...] #order describes precedence
def next_token( remaining_string ):
for t_name, t_regex in tokens_regex: # check over in order of precedence
for i in xrange( len(remaining_string), 0, -1 ): #check longest possibilities first (there may be a more efficient method).
if regex_match( remaining_string[:i], t_regex ):
return t_name, remaining_string[:i], remaining_string[i:]
return None, remaining_string[0], remaining_string[1:] #either an ignore or illegal character
def lex( string ):
tokens_so_far = []
remaining_string = string
while len(remaining_string) > 0:
t_name, t_value, string_remaining = next_token(remaining_string)
if t_name is not None:
tokens_so_far.append(t_name, t_value)
#elif not regex_match(t_value,ignore_regex):
#check against ignore regex, if not in it add to an error list/illegal characters
return tokens_so_far
Some things to add to improve your lexer: ignore regex, error lists and locations/line numbers (for these errors or for tokens).
Have fun! And good luck making a parser :).
I've done pretty much the same thing. The way I did it was to combine all the expressions in one pretty big NFA and converted that same thing into one DFA. When doing that keep track of the states that previously were accepting states in each corresponding original DFA and their precedence.
The generated DFA will have many states that are accepting states. You run this DFA until it recieves a character that it has no corresponding transitions for. If the DFA is then in an accepting state you will then look at which of your original NFAs that had that accepting state in them. The one that has the highest precedence is the token you're going to return.
This does not handle regular expression lookaheads. These are typically not really needed for lexer work anyway. That would be the job of the parser.
Such a lexer runs in much the same speed as a single regular expression since there is basically only one DFA for it to run. You can omit converting the NFA altogether for a faster-to-construct algorithm but slower to run. The algorithm is basically the same.
The source code for the lexer I wrote is freely available on github, should you want to see how I did it.
In "modern compiler implementation in Java" by Andrew Appel he claims in an exercise that:
Lex has a lookahead operator / so that the regular expression abc/def matches abc only when followed by def (but def is not part of the matched string, and will be part of the next token(s)). Aho et al. [1986] describe, and Lex [Lesk 1975] uses, an incorrect algorithm for implementing lookahead (it fails on (a|ab)/ba with input aba, matching ab where it should match a). Flex [Paxson 1995] uses a better mechanism that works correctly for (a|ab)/ba but fails (with a warning message on zx*/xy*. Design a better lookahead mechanism.
Does anyone know the solution to what he is describing?
"Does not work how I think it should" and "incorrect" are, not always the same thing. Given the input
aba
and the pattern
(ab|a)/ab
it makes a certain amount of sense for the (ab|a) to match greedily, and then for the /ab constraint to be applied separately. You're thinking that it should work like this regular expression:
(ab|a)(ab)
with the constraint that the part matched by (ab) is not consumed. That's probably better because it removes some limitations, but since there weren't any external requirements for what lex should do at the time it was written, you cannot call either behavior correct or incorrect.
The naive way has the merit that adding a trailing context doesn't change the meaning of a token, but simply adds a totally separate constraint about what may follow it. But that does lead to limitations/surprises:
{IDENT} /* original code */
{IDENT}/ab /* ident, only when followed by ab */
Oops, it won't work because "ab" is swallowed into IDENT precisely because its meaning was not changed by the trailing context. That turns into a limitation, but maybe it's a limitation that the author was willing to live with in exchange for simplicity. (What is the use case for making it more contextual, anyway?)
How about the other way? That could have surprises also:
{IDENT}/ab /* input is bracadabra:123 */
Say the user wants this not to match because bracadabra is not an identifier followed by (or ending in) ab. But {IDENT}/ab will match bracad and then, leaving abra:123 in the input.
A user could have expectations which are foiled no matter how you pin down the semantics.
lex is now standardized by The Single Unix specification, which says this:
r/x
The regular expression r shall be matched only if it is followed by an occurrence of regular expression x ( x is the instance of trailing context, further defined below). The token returned in yytext shall only match r. If the trailing portion of r matches the beginning of x, the result is unspecified. The r expression cannot include further trailing context or the '$' (match-end-of-line) operator; x cannot include the '^' (match-beginning-of-line) operator, nor trailing context, nor the '$' operator. That is, only one occurrence of trailing context is allowed in a lex regular expression, and the '^' operator only can be used at the beginning of such an expression.
So you can see that there is room for interpretation here. The r and x can be treated as separate regexes, with a match for r computed in the normal way as if it were alone, and then x applied as a special constraint.
The spec also has discussion about this very issue (you are in luck):
The following examples clarify the differences between lex regular expressions and regular expressions appearing elsewhere in this volume of IEEE Std 1003.1-2001. For regular expressions of the form "r/x", the string matching r is always returned; confusion may arise when the beginning of x matches the trailing portion of r. For example, given the regular expression "a*b/cc" and the input "aaabcc", yytext would contain the string "aaab" on this match. But given the regular expression "x*/xy" and the input "xxxy", the token xxx, not xx, is returned by some implementations because xxx matches "x*".
In the rule "ab*/bc", the "b*" at the end of r extends r's match into the beginning of the trailing context, so the result is unspecified. If this rule were "ab/bc", however, the rule matches the text "ab" when it is followed by the text "bc". In this latter case, the matching of r cannot extend into the beginning of x, so the result is specified.
As you can see there are some limitations in this feature.
Unspecified behavior means that there are some choices about what the behavior should be, none of which are more correct than the others (and don't write patterns like that if you want your lex program to be portable). "As you can see, there are some limitations in this feature".
I have a given DFA that represent a regular expression.
I want to match the DFA against an input stream and get all possible matches back, not only the leastmost-longest match.
For example:
regex: a*ba|baa
input: aaaaabaaababbabbbaa
result:
aaaaaba
aaba
ba
baa
Assumptions
Based on your question and later comments you want a general method for splitting a sentence into non-overlapping, matching substrings, with non-matching parts of the sentence discarded. You also seem to want optimal run-time performance. Also I assume you have an existing algorithm to transform a regular expression into DFA form already. I further assume that you are doing this by the usual method of first constructing an NFA and converting it by subset construction to DFA, since I'm not aware of any other way of accomplishing this.
Before you go chasing after shadows, make sure your trying to apply the right tool for the job. Any discussion of regular expressions is almost always muddied by the fact that folks use regular expressions for a lot more things than they are really optimal for. If you want to receive the benefits of regular expressions, be sure you're using a regular expression, and not something broader. If what you want to do can't be somewhat coded into a regular expression itself, then you can't benefit from the advantages of regular expression algorithms (fully)
An obvious example is that no amount of cleverness will allow a FSM, or any algorithm, to predict the future. For instance, an expression like (a*b)|(a), when matched against the string aaa... where the ellipsis is the portion of the expression not yet scanned because the user has not typed them yet, cannot give you every possible right subgroup.
For a more detailed discussion of Regular expression implementations, and specifically Thompson NFA's please check this link, which describes a simple C implementation with some clever optimizations.
Limitations of Regular Languages
The O(n) and Space(O(1)) guarantees of regular expression algorithms is a fairly narrow claim. Specifically, a regular language is the set of all languages that can be recognized in constant space. This distinction is important. Any kind of enhancement to the algorithm that does something more sophisticated than accepting or rejecting a sentence is likely to operate on a larger set of languages than regular. On top of that, if you can show that some enhancement requires greater than constant space to implement, then you are also outside of the performance guarantee. That being said, we can still do an awful lot if we are very careful to keep our algorithm within these narrow constraints.
Obviously that eliminates anything we might want to do with recursive backtracking. A stack does not have constant space. Even maintaining pointers into the sentence would be verboten, since we don't know how long the sentence might be. A long enough sentence would overflow any integer pointer. We can't create new states for the automaton as we go to get around this. All possible states (and a few impossible ones) must be predictable before exposing the recognizer to any input, and that quantity must be bounded by some constant, which may vary for the specific language we want to match, but by no other variable.
This still allows some room for adding additonal behavior. The usual way of getting more mileage is to add some extra annotations for where certain events in processing occur, such as when a subexpression started or stopped matching. Since we are only allowed to have constant space processing, that limits the number of subexpression matches we can process. This usually means the latest instance of that subexpression. This is why, when you ask for the subgrouped matched by (a|)*, you always get an empty string, because any sequence of a's is implicitly followed by infinitely many empty strings.
The other common enhancement is to do some clever thing between states. For example, in perl regex, \b matches the empty string, but only if the previous character is a word character and the next is not, or visa versa. Many simple assertions fit this, including the common line anchor operators, ^ and $. Lookahead and lookbehind assertions are also possible, but much more difficult.
When discussing the differences between various regular language recognizers, it's worth clarifying if we're talking about match recognition or search recognition, the former being an accept only if the entire sentence is in the language, and the latter accepts if any substring in the sentence is in the language. These are equivalent in the sense that if some expression E is accepted by the search method, then .*(E).* is accepted in the match method.
This is important because we might want to make it clear whether an expression like a*b|a accepts aa or not. In the search method, it does. Either token will match the right side of the disjunction. It doesn't match, though, because you could never get that sentence by stepping through the expression and generating tokens from the transitions, at least in a single pass. For this reason, i'm only going to talk about match semantics. Obviously if you want search semantics, you can modify the expression with .*'s
Note: A language defined by expression E|.* is not really a very manageable language, regardless of the sublanguage of E because it matches all possible sentences. This represents a real challenge for regular expression recognizers because they are really only suited to recognizing a language or else confirming that a sentence is not in that same language, rather than doing any more specific work.
Implementation of Regular Language Recognizers
There are generally three ways to process a regular expression. All three start the same, by transforming the expression into an NFA. This process produces one or two states for each production rule in the original expression. The rules are extemely simple. Here's some crude ascii art: note that a is any single literal character in the language's alphabet, and E1 and E2 are any regular expression. Epsilon(ε) is a state with inputs and outputs, but ignores the stream of characters, and doesn't consume any input either.
a ::= > a ->
E1 E2 ::= >- E1 ->- E2 ->
/---->
E1* ::= > --ε <-\
\ /
E1
/-E1 ->
E1|E2 ::= > ε
\-E2 ->
And that's it! Common uses such as E+, E?, [abc] are equivalent to EE*, (E|), (a|b|c) respectively. Also note that we add for each production rule a very small number of new states. In fact each rule adds zero or one state (in this presentation). characters, quantifiers and dysjunction all add just one state, and the concatenation doesn't add any. Everything else is done by updating the fragments' end pointers to start pointers of other states or fragments.
The epsilon transition states are important, because they are ambiguous. When encountered, is the machine supposed to change state to once following state or another? should it change state at all or stay put? That's the reason why these automatons are called nondeterministic. The solution is to have the automaton transition to the right state, whichever allows it to match the best. Thus the tricky part is to figure out how to do that.
There are fundamentally two ways of doing this. The first way is to try each one. Follow the first choice, and if that doesn't work, try the next. This is recursive backtracking, appears in a few (and notable) implementations. For well crafted regular expressions, this implementation does very little extra work. If the expression is a bit more convoluted, recursive backtracking is very, very bad, O(2^n).
The other way of doing this is to instead try both options in parallel. At each epsilon transition, add to the set of current states both of the states the epsilon transition suggests. Since you are using a set, you can have the same state come up more than once, but you only need to track it once, either you are in that state or not. If you get to the point that there's no option for a particular state to follow, just ignore it, that path didn't match. If there are no more states, then the entire expression didn't match. as soon as any state reaches the final state, you are done.
Just from that explanation, the amount of work we have to do has gone up a little bit. We've gone from having to keep track of a single state to several. At each iteration, we may have to update on the order of m state pointers, including things like checking for duplicates. Also the amount of storage we needed has gone up, since now it's no longer a single pointer to one possible state in the NFA, but a whole set of them.
However, this isn't anywhere close to as bad as it sounds. First off, the number of states is bounded by the number of productions in the original regular expression. From now on we'll call this value m to distinguish it from the number of symbols in the input, which will be n. If two state pointers end up transitioning to the same new state, you can discard one of them, because no matter what else happens, they will both follow the same path from there on out. This means the number of state pointers you need is bounded by the number of states, so that to is m.
This is a bigger win in the worst case scenario when compared to backtracking. After each character is consumed from the input, you will create, rename, or destroy at most m state pointers. There is no way to craft a regular expression which will cause you to execute more than that many instructions (times some constant factor depending on your exact implementation), or will cause you to allocate more space on the stack or heap.
This NFA, simultaneously in some subset of its m states, may be considered some other state machine who's state represents the set of states the NFA it models could be in. each state of that FSM represents one element from the power set of the states of the NFA. This is exactly the DFA implementation used for matching regular expressions.
Using this alternate representation has an advantage that instead of updating m state pointers, you only have to update one. It also has a downside, since it models the powerset of m states, it actually has up to 2m states. That is an upper limit, because you don't model states that cannot happen, for instance the expression a|b has two possible states after reading the first character, either the one for having seen an a, or the one for having seen a b. No matter what input you give it, it cannot be in both of those states at the same time, so that state-set does not appear in the DFA. In fact, because you are eliminating the redundancy of epsilon transitions, many simple DFA's actually get SMALLER than the NFA they represent, but there is simply no way to guarantee that.
To keep the explosion of states from growing too large, a solution used in a few versions of that algorithm is to only generate the DFA states you actually need, and if you get too many, discard ones you haven't used recently. You can always generate them again.
From Theory to Practice
Many practical uses of regular expressions involve tracking the position of the input. This is technically cheating, since the input could be arbitrarily long. Even if you used a 64 bit pointer, the input could possibly be 264+1 symbols long, and you would fail. Your position pointers have to grow with the length of the input, and now your algorithm now requires more than constant space to execute. In practice this isn't relevant, because if your regular expression did end up working its way through that much input, you probably won't notice that it would fail because you'd terminate it long before then.
Of course, we want to do more than just accept or reject inputs as a whole. The most useful variation on this is to extract submatches, to discover which portion of an input was matched by a certain section of the original expression. The simple way to achieve this is to add an epsilon transition for each of the opening and closing braces in the expression. When the FSM simulator encounters one of these states, it annotates the state pointer with information about where in the input it was at the time it encountered that particular transition. If the same pointer returns to that transition a second time, the old annotation is discarded and replaced with a new annotation for the new input position. If two states pointers with disagreeing annotations collapse to the same state, the annotation of a later input position wins again.
If you are sticking to Thompson NFA or DFA implementations, then there's not really any notion of greedy or non-greedy matching. A backtracking algorithm needs to be given a hint about whether it should start by trying to match as much as it can and recursively trying less, or trying as little as it can and recursively trying more, when it fails it first attempt. The Thompson NFA method tries all possible quantities simultaneously. On the other hand, you might still wish to use some greedy/nongreedy hinting. This information would be used to determine if newer or older submatch annotations should be preferred, in order to capture just the right portion of the input.
Another kind of practical enhancement is assertions, productions which do not consume input, but match or reject based on some aspect of the input position. For instance in perl regex, a \b indicates that the input must contain a word boundary at that position, such that the symbol just matched must be a word character, but the next character must not be, or visa versa. Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, then the state pointer continues, otherwise it is discarded.
Lookahead and lookbehind assertions can be achieved with a bit more work. A typical lookbehind assertion r0(?<=r1)r2 is transformed into two separate expressions, .*r1 and r0εr2. Both expressions are applied to the input. Note that we added a .* to the assertion expression, because we don't actually care where it starts. When the simulator encounters the epsilon in the second generated fragment, it checks up on the state of the first fragment. If that fragment is in a state where it could accept right there, the assertion passes with the state pointer flowing into r2, but otherwise, it fails, and both fragments continue, with the second discarding the state pointer at the epsilon transition.
Lookahead also works by using an extra regex fragment for the assertion, but is a little more complex, because when we reach the point in the input where the assertion must succeed, none of the corresponding characters have been encountered (in the lookbehind case, they have all been encountered). Instead, when the simulator reaches the assertion, it starts a pointer in the start state of the assertion subexpression and annotates the state pointer in the main part of the simulation so that it knows it is dependent on the subexpression pointer. At each step, the simulation must check to see that the state pointer it depends upon is still matching. If it doesn't find one, then it fails wherever it happens to be. You don't have to keep any more copies of the assertion subexpressions state pointers than you do for the main part, if two state pointers in the assertion land on the same state, then the state pointers each of them depend upon will share the same fate, and can be reannotated to point to the single pointer you keep.
While were adding special instructions to epsilon transitions, it's not a terrible idea to suggest an instruction to make the simulator pause once in a while to let the user see what's going on. Whenever the simulator encounters such a transition, it will wrap up its current state in some kind of package that can be returned to the caller, inspected or altered, and then resumed where it left off. This could be used to match input interactively, so if the user types only a partial match, the simulator can ask for more input, but if the user types something invalid, the simulator is empty, and can complain to the user. Another possibility is to yield every time a subexpression is matched, allowing you to peek at every sub match in the input. This couldn't be used to exclude some submatches, though. For instance, if you tried to match ((a)*b) against aaa, you could see three submatches for the a's, even though the whole expression ultimately fails because there is no b, and no submatch for the corresponding b's
Finally, there might be a way to modify this to work with backreferences. Even if it's elegent, it's sure to be inefficient, specifically, regular expressions plus backreferences are in NP-Complete, so I won't even try to think of a way to do this, because we are only interested (here) in (asymptotically) efficient possibilities.
Why repeated strings such as
[wcw|w is a string of a's and b's]
cannot be denoted by regular expressions?
pls. give me detailed answer as i m new to lexical analysis.
Thanks ...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain nested structures as those languages can be described by a simple finite state machine. Simplified you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is, that no information whatsoever from previous states can be carried over to later states (a few characters further in the input). So if you have your symbol w you can't specify that the input must have exactly the same string w later in the sequence. Similarly you can't ensure that each opening paranthesis needs a closin paren as well (so regular expressions themselves are not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically only consisting of sequence, alternative (|) and repetition (*), everything else can be described with those operations.
However, usually regex engines allow grouping of certain sub-patterns into matches which can then be referenced or extracted later. Some engines even allow to use such a backreference in the search expression string itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly such use of backreferences can even yield languages that are not context-free.
Additional pointers:
This StackOverflow question
Wikipedia
It can be, you just can't assure that it's the same string of "a"s and "b"s because there's no way to retain the information acquired in traversing the first half for use in traversing the second.