How to efficiently implement longest match in a lexer generator?

I'm interested in learning how to write a lexer generator like flex. I've been reading "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic idea of how flex works.
My initial approach is this: the user will supply a hash map from regexes to token enums. The program will loop through the regexes one by one, in the order given, and check whether each one matches the start of the string (I could add a ^ to the beginning of each regex to anchor it). If one matches, I can add the token for that regex to the program's list of tokens.
My first question is: is this the most efficient way to do it? Currently I have to loop through every regex, but in theory I could construct a single DFA from all of the regexes combined and step through that more efficiently. However, there will be some up-front overhead from creating this DFA.
My second question is: how would I implement the longest-match tie-breaker, like flex does? I.e., I want to match ifa as an identifier, not as the keyword if followed by the letter a. I don't see any efficient way to do this with plain regexes. I think I'll have to loop through all of the regexes, try to match each one, and if I have more than one match, take the longest result. However, if I converted the regexes to a DFA (that is, my own DFA data structure), then I could keep stepping through characters until there is no transition out of the current DFA state. At that point, the last acceptance state I passed through marks the actual token match, since that should be the longest match.
Both of my questions point toward writing my own regex-to-DFA translator. Is this required, or can I still do this efficiently with plain regexes (as implemented by a standard library) and still get the longest match?
EDIT: I kept the regex engine I'm using out of this because I wanted a general answer, but I'm using Rust's regex library: http://static.rust-lang.org/doc/master/regex/index.html
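For concreteness, here is a minimal sketch of my current approach using the regex crate (Token, next_token, and the rule list are just placeholders, not part of any real API):

use regex::Regex;

#[derive(Debug, Clone, Copy)]
enum Token { If, Ident }

// Try every anchored rule at the start of `input` and keep the longest hit.
fn next_token(input: &str, rules: &[(Regex, Token)]) -> Option<(Token, usize)> {
    let mut best: Option<(Token, usize)> = None;
    for (re, tok) in rules {
        // Each pattern starts with ^, so find() can only match at offset 0.
        if let Some(m) = re.find(input) {
            // Strictly longer matches win; ties go to the earlier rule.
            if best.map_or(true, |(_, len)| m.end() > len) {
                best = Some((*tok, m.end()));
            }
        }
    }
    best
}

fn main() {
    let rules = vec![
        (Regex::new(r"^if").unwrap(), Token::If),
        (Regex::new(r"^[A-Za-z_][A-Za-z0-9_]*").unwrap(), Token::Ident),
    ];
    // "ifa" matches both rules, but the identifier match is longer.
    println!("{:?}", next_token("ifa", &rules)); // Some((Ident, 3))
}

The caller would advance by the returned length and repeat until the input is exhausted.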

Timewise, it's much more efficient to compile all the regexes down into a single automaton that matches all patterns in parallel. It might blow up the space usage significantly, though (DFAs can have exponentially many states relative to the pattern sizes), so it's worth investigating whether this will hurt.
Typically, the way you'd implement maximal-munch (matching the longest string you can) is to run the matching automaton as normal. Keep track of the index of the last match that you find. When the automaton enters a dead state and stops, you can then output the substring from the beginning of the token up through the match point, then jump back in the input sequence to the point right after the match finished. This can be done pretty efficiently and without much slowdown at all.
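Here's a rough sketch of that loop over an already-built DFA (the table-driven representation is an assumption; the regex-to-DFA construction that fills it in is out of scope):

const DEAD: usize = usize::MAX; // marker for a missing transition

struct Dfa {
    start: usize,
    trans: Vec<[usize; 256]>,  // trans[state][byte] -> next state or DEAD
    accept: Vec<Option<u32>>,  // token id if the state is accepting
}

/// Longest match starting at `pos`: (token id, match length).
fn longest_match(dfa: &Dfa, input: &[u8], pos: usize) -> Option<(u32, usize)> {
    let mut state = dfa.start;
    let mut last: Option<(u32, usize)> = None;
    for (i, &b) in input[pos..].iter().enumerate() {
        state = dfa.trans[state][b as usize];
        if state == DEAD {
            break; // dead state: stop and fall back to the last accept
        }
        if let Some(tok) = dfa.accept[state] {
            last = Some((tok, i + 1)); // remember the most recent accept
        }
    }
    last // the caller resumes scanning at pos + length
}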
In case it helps, here are some lecture slides from a compilers course I taught that explore scanning techniques.
Hope this helps!

Related

Can I change the NFA algorithm by removing repeated states to prevent regex DoS?

As far as I know, a regex engine works by converting the regex to an NFA.
By definition, the regex tries all possibilities, so the engine records all the possible states in an array.
But for a bad regex, too many states get saved in that array, and the algorithm blows up.
A bad regex looks like /(a+)+/.
My question is: can I simply remove the repeated states from the array, so that memory doesn't blow up, to prevent regex DoS?
Would this reduce the power of the regex engine?
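Deduplicating the states like that is exactly what the classic state-set (Thompson) simulation of an NFA does, and for backreference-free regexes it does not reduce the power of the engine at all. A sketch of the idea in Rust, with an assumed adjacency-list NFA representation:

use std::collections::HashSet;

struct Nfa {
    trans: Vec<Vec<(Option<u8>, usize)>>, // state -> [(byte label or epsilon, target)]
    start: usize,
    accept: usize,
}

// Add everything reachable through epsilon edges; the set removes duplicates.
fn eps_closure(nfa: &Nfa, states: &mut HashSet<usize>) {
    let mut stack: Vec<usize> = states.iter().copied().collect();
    while let Some(s) = stack.pop() {
        for &(label, t) in &nfa.trans[s] {
            if label.is_none() && states.insert(t) {
                stack.push(t);
            }
        }
    }
}

fn matches(nfa: &Nfa, input: &[u8]) -> bool {
    let mut current = HashSet::from([nfa.start]);
    eps_closure(nfa, &mut current);
    for &b in input {
        let mut next = HashSet::new();
        for &s in &current {
            for &(label, t) in &nfa.trans[s] {
                if label == Some(b) {
                    next.insert(t); // the "repeated states" collapse here
                }
            }
        }
        eps_closure(nfa, &mut next);
        current = next;
    }
    current.contains(&nfa.accept)
}

Because the set can never hold more entries than the NFA has states, memory and per-character work are bounded, which is why engines built this way don't exhibit the /(a+)+/ blow-up.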

Is there any way to optimize a generic regular expression?

I code in Eclipse, and when I do a Ctrl-F to find some string, I see that apart from the standard options of whole word and case sensitive, there is also an option for regular expression search (it is there in Notepad++ too).
I have tried it once or twice, and generally the results are almost instantaneous. But then, the code files are not humongous; the biggest ones are no more than 500 lines long, with most lines less than half full. Is there any way to optimize so that any user-supplied regex will run much faster on a large piece of text, say 10-15 MB in size?
I can't think of any method, because no standard search algorithm like Rabin-Karp or a suffix tree would seem to apply here!
I have no idea how regular expressions are implemented in Eclipse or why they are so slow there. Here are just some thoughts:
First of all, there are a few concepts you should know: nondeterministic finite automata (NFA) and deterministic finite automata (DFA). In theory, regular expressions, NFAs, and DFAs are equivalent, which means they have exactly the same power to describe languages (sets of strings). This implies that any one of them can be converted to another (see this site).
A regular expression can be implemented by converting it to a DFA, and using a DFA to match text takes only linear time (many string-matching algorithms, e.g. KMP, are in effect special-purpose DFAs). However, the trouble is that most modern regular expression implementations have introduced features like backreferences, which make a pure DFA implementation impossible.
So, if discarding those complex features is possible, implementing a fast regular expression engine (one that matches in linear time) is feasible. You may find more in this article.
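As a tiny concrete illustration, here is a hand-built DFA for the textbook pattern (a|b)*abb; a real engine would generate such a table automatically:

// State 3 is accepting; every input character costs one table lookup,
// so the running time is linear no matter what the pattern looks like.
fn matches_abb(input: &str) -> bool {
    const TRANS: [[usize; 2]; 4] = [
        [1, 0], // state 0: no useful suffix seen
        [1, 2], // state 1: input so far ends with "a"
        [1, 3], // state 2: input so far ends with "ab"
        [1, 0], // state 3: input so far ends with "abb" (accepting)
    ];
    let mut state = 0;
    for c in input.chars() {
        state = match c {
            'a' => TRANS[state][0],
            'b' => TRANS[state][1],
            _ => return false, // outside the alphabet
        };
    }
    state == 3
}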
What makes you think suffix tree isn't a suitable algorithm for this problem? From http://en.wikipedia.org/wiki/Suffix_tree:
Once [the suffix tree is] constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc.
I think a modified Boyer–Moore string search algorithm would also be possible.

Detect if a regexp is exponential

This article shows that some regexps take O(2^n) time when matched by backtracking.
The example is (x+x+)+y.
When attempting to match a string like xxxx...p, the engine backtracks for a while before figuring out that it can't match.
Is there a way to detect such regexps?
Thanks
If your regexp engine exhibits exponential runtime for (x+x+)+y, then it is broken, because a DFA or NFA can recognize this pattern in linear time:
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"
Both answer immediately.
In fact, there are only a few cases (like backreferences) where backtracking is really needed, mainly because a regexp with a backreference is no longer a regular expression in the language-theoretic sense. A capable implementation should switch to backtracking only when those corner cases are present.
In fairness, DFAs have a dark side too, because some regexps have exponential size requirements as DFAs; but a size constraint is easier to enforce than a time constraint, and the huge DFA runs in linear time on the input, so it's a better bargain than a small backtracker choking on a couple of x's.
You should really read Russ Cox's excellent article series about the implementation of regexps (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/
To answer your question about decidability: you can't, because there is no single backtracking algorithm for regexps. Every implementation has its own strategies for dealing with exponential growth in certain cases and does not cover others. One rule might fit here and be catastrophic there.
UPDATE:
For example, one implementation could contain an optimizer that uses algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same as xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next such expression, and the problem is there again. Here someone describes how to craft a regexp that fools Perl's optimizer:
http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html
No, I don't think so, but you can use these guidelines:
If it contains two quantifiers that are open-ended at the high end and they are nested, then it might be O(2^n).
If it does not contain two such quantifiers, then I think it cannot be O(2^n).
Quantifiers that can cause this are *, + and {k,}.
Also note that the worst case complexity of evaluating a regular expression might be very different from the complexity on typical strings and that the complexity depends on the specific regular expression engine.
Any regex without backreferences can be matched in linear time, though many regex engines out in the real world don't do it that way: at least, many of the regex engines embedded in programming language runtimes support backreferences, and they don't switch to a more efficient execution model when no backreferences are present.
There's no easy way to find out how much time a regex with backreferences is going to consume.
You could detect and reject nested repetitions using a regex parser; that corresponds to requiring a star height of at most 1. I've just written a module that computes star heights and rejects heights >1, using a regex parser from npm; a sketch of the star-height computation follows the examples below.
$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true
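The core of the check is just a recursion over the parsed pattern; here's a sketch in Rust for illustration (the Ast type stands in for whatever your parser produces; it is not the npm module's actual output):

// Star height: how deeply repetition operators are nested.
enum Ast {
    Char(char),
    Concat(Vec<Ast>),
    Alt(Vec<Ast>),
    Repeat(Box<Ast>), // *, +, {k,} or {k,m} -- any repetition node
}

fn star_height(ast: &Ast) -> u32 {
    match ast {
        Ast::Char(_) => 0,
        Ast::Concat(parts) | Ast::Alt(parts) =>
            parts.iter().map(star_height).max().unwrap_or(0),
        Ast::Repeat(inner) => 1 + star_height(inner),
    }
}

// Reject anything with a repetition nested inside another repetition.
fn is_safe(ast: &Ast) -> bool {
    star_height(ast) <= 1
}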

Mutually exclusive regular expressions

If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?
That is, the list is valid if and only if, for every string, at most one item in the list matches the entire string.
It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.
The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.
If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible.
What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine each pair of them into a DFA that recognizes the intersection of the two languages. If that DFA has a path from the start state to an accepting state, then some string is accepted by both input regexen.
The problem with this is that the first step of the usual regex-to-DFA algorithm is to convert the regex to an NFA, then convert the NFA to a DFA. That last step can result in an exponential blowup in the number of DFA states, so this will only be feasible for fairly simple regular expressions.
If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.
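A sketch of that pairwise check, assuming the regexes have already been compiled to table-driven DFAs (the Dfa type here is illustrative):

use std::collections::{HashSet, VecDeque};

struct Dfa {
    start: usize,
    trans: Vec<[Option<usize>; 256]>, // None = no transition on that byte
    accept: Vec<bool>,
}

/// True if some string is accepted by both DFAs (i.e. the regexes overlap).
fn overlaps(a: &Dfa, b: &Dfa) -> bool {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([(a.start, b.start)]);
    seen.insert((a.start, b.start));
    // Breadth-first search over the product automaton, built lazily:
    // only reachable state pairs are ever materialized.
    while let Some((s, t)) = queue.pop_front() {
        if a.accept[s] && b.accept[t] {
            return true; // some string drives both DFAs to acceptance
        }
        for byte in 0..256usize {
            if let (Some(s2), Some(t2)) = (a.trans[s][byte], b.trans[t][byte]) {
                if seen.insert((s2, t2)) {
                    queue.push_back((s2, t2));
                }
            }
        }
    }
    false // empty intersection: the two regexes are mutually exclusive
}

Building the product lazily sidesteps part of the blowup, since you only pay for state pairs that are actually reachable.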
The Wikipedia article on regular expressions does state:
It is possible to write an algorithm which for two given regular expressions decides whether the described languages are essentially equal, reduces each expression to a minimal deterministic finite state machine, and determines whether they are isomorphic (equivalent).
but gives no further hints.
Of course, the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.
You can't do that by only looking at the regular expression.
Consider the case where you have [0-9] and [0-9]+. They are obviously different expressions, but when applied to the string "1", they both produce the same result. When applied to string "11" they produce different results.
The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.

Create a program that inputs a regular expression and outputs strings that satisfy that regular expression

I think that the title accurately summarizes my question, but just to elaborate a bit.
Instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.
Note: The function doesn't need to generate every string that satisfies the regular expression (because that would be an infinite number of strings for a lot of regexes). Just a sampling of the many valid strings is sufficient.
How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.NET) that do this.
Well, a regex is convertible to a DFA, which can be thought of as a graph. To generate a string given this DFA graph, you'd just find a path from the start state to an accepting state. You'd just have to think about how you want to handle cycles (maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.
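For instance, a breadth-first search from the start state yields the shortest accepted string. A sketch, with an assumed table-driven Dfa type (a real sampler would randomize the walk instead):

use std::collections::{HashMap, VecDeque};

struct Dfa {
    start: usize,
    trans: Vec<[Option<usize>; 256]>, // None = no edge on that byte
    accept: Vec<bool>,
}

/// Shortest string the DFA accepts, or None if its language is empty.
fn sample_string(dfa: &Dfa) -> Option<Vec<u8>> {
    let mut parent: HashMap<usize, (usize, u8)> = HashMap::new();
    let mut seen = vec![false; dfa.trans.len()];
    let mut queue = VecDeque::from([dfa.start]);
    seen[dfa.start] = true;
    while let Some(s) = queue.pop_front() {
        if dfa.accept[s] {
            // Follow parent links back to the start to recover the string.
            let mut bytes = Vec::new();
            let mut cur = s;
            while let Some(&(prev, b)) = parent.get(&cur) {
                bytes.push(b);
                cur = prev;
            }
            bytes.reverse();
            return Some(bytes);
        }
        for byte in 0..256usize {
            if let Some(t) = dfa.trans[s][byte] {
                if !seen[t] {
                    seen[t] = true;
                    parent.insert(t, (s, byte as u8));
                    queue.push_back(t);
                }
            }
        }
    }
    None
}

To get a sampling rather than a single witness, replace the BFS with a random walk that bounds how many times each cycle may be taken.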
This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:
[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
#|TH[12]
#(#|TH[12])?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
This can be done by traversing the DFA (includes pseudocode), or else by walking the regex's abstract syntax tree directly or converting to an NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)
These all work on regular expressions without backreferences -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features, it'd be easiest to add a post-filter.
Added: code for this in Python, by Peter Norvig and me.
Since it is trivially possible to write a regular expression that matches no strings, and I believe it is also possible to write one for which finding a matching string requires an exhaustive search over strings of all lengths, you'll probably need to put an upper bound on the search when requesting an answer.
The easiest way to implement, though definitely the most CPU-intensive approach, would be simply to brute-force it.
Set up a character table with the characters that your string should contain, and then just sequentially generate strings and do a Regex.IsMatch on them.
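The .NET specifics aside, here's the same brute-force idea sketched in Rust, with regex::Regex::is_match playing the role of Regex.IsMatch (the alphabet and the length cap are assumptions, without which this never terminates; the pattern should be anchored ^...$ so a hit means a full match):

use regex::Regex;

// Enumerate every string over `alphabet` up to `max_len` and test each one.
// Exponential in max_len -- that's the CPU-intensive part.
fn brute_force(re: &Regex, alphabet: &[char], max_len: usize) -> Vec<String> {
    let mut found = Vec::new();
    let mut frontier = vec![String::new()];
    for _ in 0..=max_len {
        let mut next = Vec::new();
        for s in &frontier {
            if re.is_match(s) {
                found.push(s.clone());
            }
            for &c in alphabet {
                let mut longer = s.clone();
                longer.push(c);
                next.push(longer);
            }
        }
        frontier = next;
    }
    found
}

For example, brute_force(&Regex::new(r"^[ab]{2}$").unwrap(), &['a', 'b'], 3) returns the four two-letter strings aa, ab, ba, bb.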
I, personally, believe that this is the holy grail of reg-ex. If you could implement this -- even only 3/4 working -- I have no doubt that you'd be rich in about 5 minutes.
All joking aside, I'm not sure that what you are truly going after is feasible. Regex is a very open, flexible language, and giving the computer enough sample input to truly and accurately find what you need is probably not feasible.
If I'm proven wrong, I wish kudos to that developer.
To look at this from a different perspective, this is almost (though not quite) like giving a computer its output and having it write a program for you based on that. It's a little overboard, but it kind of illustrates my point.