library for converting regular expressions to NFAs? - regex

Is there a good library for converting Regular Expressions into NFAs? I see lots of academic papers on the subject, which are helpful, but not much in the way of working code.
My question is due partially to curiosity, and partially to an actual need to speed up regular expression matching on a production system I'm working on. Although it might be fun to explore this subject for learning's sake, I'm not sure it's a "practical" solution to speeding up our pattern matching. We're a Java shop, but would happily take pointers to good code in any language.
Edit:
Interesting, I did not know that Java's regexps were already NFAs. The title of this paper led me to believe otherwise. Incidentally, we are currently doing our regexp matching in Postgres; if the simple solution is to move the matching into the Java code, that would be great.

Addressing your need to speed up your regexes:
Java's implementation of its regex engine is NFA based. As such, to tune your regexes, you would benefit from a deeper understanding of how the engine is implemented.
And so I direct you to Mastering Regular Expressions. The book gives substantial treatment to the NFA engine and how it performs matches, including how to tune your regexes specifically for an NFA engine.
Additionally, look into Atomic Grouping for tuning your regex.
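For example (a minimal sketch; the pattern and input are made up for illustration), Java supports atomic groups via the (?>...) syntax, which throws away backtracking positions once the group has matched:

import java.util.regex.Pattern;

public class AtomicGroupDemo {
    public static void main(String[] args) {
        // Almost-matching input forces the engine to exhaust alternatives
        String input = "a".repeat(25) + "X"; // String.repeat needs Java 11+
        // The nested quantifier backtracks catastrophically on failure
        Pattern naive = Pattern.compile("(a+)+b");
        // The atomic group commits to its first match of a+, so failure is immediate
        Pattern atomic = Pattern.compile("(?>a+)b");
        System.out.println(atomic.matcher(input).find()); // false, returned quickly
        System.out.println(naive.matcher(input).find());  // false, but may take a very long time
    }
}

Both patterns accept exactly the strings a+b; only their failure behaviour differs, which is the point of the tuning advice.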

Disclaimer: I'm not an expert on java+regexes. But, if I understand correctly...
If Java's regular expression matcher is similar to most others, it does use NFAs - but not in the way you might expect. Instead of the forward-only implementation you may have heard about, it uses a backtracking approach, which simplifies subexpression matching and is probably required for backreference support. However, it performs poorly on alternation.
You want to see http://swtch.com/~rsc/regexp/regexp1.html, which covers the edge cases that perform poorly on this backtracking architecture.
I've also written a question which I suppose comes down to the same thing:
Regex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
But basically, it looks like, for some very odd reason, all common major-vendor regex implementations have terrible performance when used on certain regexes, even though this is unnecessary.
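To make the backreference point concrete, here is a small java.util.regex sketch (the pattern is my own illustrative choice): \1 must repeat exactly what group 1 captured, which no finite automaton can track, so engines fall back on backtracking:

import java.util.regex.Pattern;

public class BackrefDemo {
    public static void main(String[] args) {
        // Matches strings of the form a^k b a^k: the same run of a's twice
        Pattern doubled = Pattern.compile("(a+)b\\1");
        System.out.println(doubled.matcher("aabaa").matches());  // true
        System.out.println(doubled.matcher("aabaaa").matches()); // false
    }
}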

Disclaimer: I'm a googler, not an expert on regexes.
There are a bunch of faster-than-JDK regex libraries, one of which is dk.brics.automaton. According to the benchmark linked in the article, it is approximately 20x faster than the JDK implementation.
This library was written by Anders Møller and has also been mavenized.
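For a sense of the API, here is a minimal sketch based on the library's documented classes (note that it supports only truly regular constructs, so no capturing groups or backreferences):

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class BricsDemo {
    public static void main(String[] args) {
        // Compile once to a DFA; matching then runs in linear time
        Automaton a = new RegExp("ab(c|d)*").toAutomaton();
        RunAutomaton r = new RunAutomaton(a);
        System.out.println(r.run("abcdcd")); // true
        System.out.println(r.run("abx"));    // false
    }
}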

Related

Why do most languages implement wildcard regular expressions inefficiently?

I was given a link to the following article regarding the implementation of regular expressions in many modern languages.
http://swtch.com/~rsc/regexp/regexp1.html
TL;DR: Certain regular expressions, such as $(a?)^n a^n$ for fixed $n$, take exponential time to match against, say, $a^n$, because the engine backtracks over the string when matching the ? sections. Implementing these as an NFA by keeping state lists makes this much more efficient, for obvious reasons.
The article isn't very detailed about how each language actually implements regexes (and it is old), but I'm curious: what, if any, are the drawbacks of using an NFA as opposed to other implementation techniques? The only thing I can come up with is that, with all the bells and whistles of most libraries, either a) building an NFA with all those features is impractical, or b) there is some conflicting performance issue between the expression above and some other, possibly more common, operation.
While it is possible to construct DFAs that handle these complex cases well (the Tcl RE engine, which was written by Henry Spencer, is a proof by example; the article linked indicated this with its performance data) it's also exceptionally hard.
One key thing though is that if you can detect that you never need the matching group information, you can then (for many REs, especially those without internal backreferences) transform the RE into one that only uses parentheses for grouping allowing a more efficient RE to be generated (so (a?){n}a{n} — I'm using modern conventional syntax — becomes effectively equivalent to a{n,2n}). Backreferences break that major optimisation; it's not for nothing that in Henry's RE code (alluded to above) there is a code comment describing them as the “Feature from the Black Lagoon”. It is one of the best comments I've ever read in code (with the exception of references to academic papers that describe the algorithm encoded).
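As a quick sanity check of that transformation, here is a sketch with n = 3 using java.util.regex (the claim being that, when you don't need the captures, the two patterns accept exactly the same strings):

import java.util.regex.Pattern;

public class RewriteCheck {
    public static void main(String[] args) {
        Pattern original  = Pattern.compile("(a?){3}a{3}");
        Pattern rewritten = Pattern.compile("a{3,6}");
        for (int len = 0; len <= 8; len++) {
            String s = "a".repeat(len); // Java 11+
            System.out.printf("len=%d original=%b rewritten=%b%n",
                    len, original.matcher(s).matches(), rewritten.matcher(s).matches());
        }
    }
}

Both columns agree for every length: the patterns accept a^k exactly when 3 <= k <= 6.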
On the other hand, the Perl/PCRE-style engines, with their recursive-descent evaluation schemes, can ascribe a much saner set of semantics to mixed-greediness REs, and many other things besides. (At the extreme end of this, recursive patterns, (?R) et al., are completely impossible with automata-theoretic approaches. They require a stack to match, making them formally not regular expressions.)
On a practical level, the cost of building the NFA, and then the DFA you compile it to, can be quite high. You need clever caching to make it not too expensive. And also on a practical level, the PCRE and Perl implementations have had a lot more developer effort applied to them.
My understanding is that the main reason is we're not just interested in whether a string matches, but in how it matches, e.g. with capturing groups. For example, (x*)x needs to know how many xs were in the group so it can be returned as a capturing group. Similarly it "promises" to consume as many x characters as possible, which matters if we continue matching more things against the remaining string.
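To see those semantics in action, here is a small java.util.regex sketch using the pattern from the paragraph above:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("(x*)x").matcher("xxxx");
        if (m.matches()) {
            // The greedy x* must give one x back to the trailing literal,
            // so the group captures "xxx" out of "xxxx"
            System.out.println(m.group(1)); // prints xxx
        }
    }
}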
Some simpler types of expressions could be matched in the efficient way the article describes, and I have no special knowledge of why this isn't done. Presumably it's more effort to write two separate engines, and perhaps the extra time analyzing an expression to determine which engine to use on it is expensive enough that it's better to skip that step for the common case, and live with the very poor performance in the worst case.
Here:
http://haifux.org/lectures/156/PCRE-Perl_Compatible_Regular_Expression_Library.pdf
They write that PCRE uses an NFA-based implementation. But this link isn't the youngest thing on the web either...
Around page 36 there is a comparison between engines, which may also be relevant to the original question.

Special construct of expression use

Given the ability to use if-then-else conditionals in regular expressions, I would like to know the possible outcome of combining many such constructs into a single expression for multiple matches.
Let's take this example below.
foo(bar)?(?(1)baz|quz)
Now this is combined with an expression which matches the previous conditions, and then we add the following conditions on top:
foo(bar)?(?(1)baz|quz)|(?(?=.*baz)bar|foo)
Mainly I am asking should you construct a regular expression in this way, and what would ever be the purpose that you would need to use it in this way?
should you construct a regular expression in this way, and what would ever be the purpose that you would need to use it in this way?
In this case, and probably most cases, I would say "no".
I often find that conditionals can be rewritten as lookarounds or simplified within alternations.
For instance, it seems to me that the regex you supplied,
foo(bar)?(?(1)baz|quz)|(?(?=.*baz)bar|foo)
could be replaced for greater clarity by
bar(?=.*baz)|foo(?:quz|barbaz)?
which gives us two simple match paths.
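As a side benefit, the rewritten pattern avoids conditionals entirely, so it even runs on engines that lack them. A quick java.util.regex sketch (the sample inputs are my own):

import java.util.regex.Pattern;

public class RewriteDemo {
    public static void main(String[] args) {
        // java.util.regex has no (?(1)...) conditionals, but alternation
        // plus lookahead is supported almost everywhere
        Pattern p = Pattern.compile("bar(?=.*baz)|foo(?:quz|barbaz)?");
        for (String s : new String[]{"foobarbaz", "fooquz", "foo", "barbaz", "bar"}) {
            System.out.println(s + " -> " + p.matcher(s).find());
        }
    }
}

Only "bar" with no following "baz" fails; the other inputs all take one of the two match paths.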
But it's been six months since you posted the question. In that time, while answering a great many questions on SO, have you ever felt the need for that kind of construction?
I believe the answer to this would ultimately be specific to the regex library being used or to a language's implementation. ("Comparison of regular expression engines", Wikipedia.)
There isn't an official RFC or specification for regular expressions, and the diversity of implementations leads to frustration with even "simple" expressions; the nuances you're considering are probably implementation-specific rather than inherent to regex.
And even beyond your specific question, I think most of the regex-related questions on Stack Overflow would be improved if people were more specific about the language being used or the library employed by the application. When troubleshooting my own regular expressions in various applications (text editors, for example), it took me a while to realize that I needed to understand the specific library in use.

How to figure out if a regex implementation uses DFA or NFA?

I'm facing the question of whether a certain regex implementation is based on a DFA or an NFA.
What are the starting points for figuring this out? One could also ask: what am I looking for? What are the basic patterns and/or characteristics? A good, explanatory link or a small comparison (even if not directly dedicated to regex) would be perfectly fine.
If it's a black box, then give it some input and measure its time characteristics with a pathological case, with reference to the graphs in this discussion of NFA vs. backtracking regex implementations (note the NFA graph is in microseconds, not seconds).
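A minimal black-box probe along those lines (a sketch; the pathological pattern is the classic one from the linked article, and exact timings will vary by machine):

import java.util.regex.Pattern;

public class EngineProbe {
    public static void main(String[] args) {
        // On a backtracking engine the time roughly doubles at each step;
        // on a Thompson-NFA/DFA engine it stays near-linear.
        for (int n = 5; n <= 25; n += 5) {
            String input = "a".repeat(n); // Java 11+
            Pattern p = Pattern.compile("(a?){" + n + "}a{" + n + "}");
            long start = System.nanoTime();
            p.matcher(input).matches();
            System.out.printf("n=%2d: %d us%n", n, (System.nanoTime() - start) / 1_000);
        }
    }
}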
Also, if it's a pure NFA, then it won't have the non-regular features found in some 'regular expression' parsers, which require backtracking.
Alternatively, look at the documentation of the RxParser class; the documentation appears to be unavailable on the web and requires a Squeak runtime to browse.
I think you mean "regex implementation" rather than algorithm (in the usual sense).
You could test with expressions that are known to cause problems for one approach or the other. Also, look for features that are easier to implement in one or the other (though this is not a reliable approach – the developers of regex engines keep finding new ways to implement previously hard things).
Normally the answer is to read the documentation, or look in a known reference ("Mastering Regular Expressions" documents many popular cases). Finally why not ask the authors?

When a string is being matched against a regular expression, what's going on behind the scenes?

I'd be interested to know what kinds of algorithms are used for matching, and how they are optimised, because I imagine that some regexes could produce a vast number of possible matches and cause serious problems in a poorly written regex engine.
Also, I recently discovered the concept of a ReDoS, why do regexes such as (a|aa)+ or (a|a?)+ cause problems?
EDIT: I have used them most in C# and Python, so that's what was in my mind when I was considering the question. I assume Python's is written in C like the rest of the interpreter, but I have no idea about C#.
I find http://www.regular-expressions.info has really useful info about regular expressions.
The author specifically discusses catastrophic backtracking in regular expressions.
Regex Buddy has this debug page which "offers you a unique view inside a regular expression engine".
http://www.regexbuddy.com/debug.html
There are two kinds of regular expression engine: NFA and DFA. I am quite rusty, so I don't dare go into specifics from memory, but here is a page that goes through the algorithms. Some engines cope better than others with poorly written expressions. (As for patterns like (a|aa)+: a run of a's can be split between the alternatives in exponentially many ways, and a backtracking engine may try them all before reporting failure.) A good book on the subject (sitting on my shelf) is Mastering Regular Expressions.

do regex comparisons consume lots of resources?

I don't know, but will your machine suffer a great slowdown if you use a very complex regex?
Like, for example, the famous email validation regex proposed just recently, which can be found here: RFC822.
Update: sorry, I had to ask this question in a hurry; anyway, I've posted the link to the email regex I was talking about.
It highly depends on the individual regex: features like look-behind or look-ahead can get very expensive, while simple regular expressions are fine for most situations.
Tutorials on http://www.regular-expressions.info/ offer performance advice, so that can be a good start.
Regexes are usually implemented as one of two algorithms (NFA or DFA) that correspond to two different FSMs. Different languages and even different versions of the same language may have a different type of regex. Naturally, some regexes work faster in one and some work faster in the other. If it's really critical, you might want to find what type of regex FSM is implemented.
I'm no expert here. I got all this from reading Mastering Regular Expressions by Jeffrey E. F. Friedl. You might want to look that up.
It also depends on how well you optimise your pattern, and on knowing the internal workings of the regex engine.
Using a negated character class, for example, saves the cost of having the engine backtrack over characters (e.g. /<[^>]+>/ instead of /<.+?>/)(*). Trivial in small matches, but it saves a lot of cycles when you have to match inside a big chunk of text.
And there are many other ways to save resources in regex operations, so performance can vary wildly.
(*) Example taken from http://www.regular-expressions.info/repeat.html
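A rough timing sketch of that difference (not a rigorous benchmark; the input sizes are arbitrary):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    static long timeAllMs(Pattern p, String text) {
        long start = System.nanoTime();
        Matcher m = p.matcher(text);
        while (m.find()) { /* consume every match */ }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Long "tags" make the per-character re-test of the lazy dot visible
        String text = ("<" + "a".repeat(5_000) + ">").repeat(1_000); // Java 11+
        System.out.println("lazy dot:      " + timeAllMs(Pattern.compile("<.+?>"), text) + " ms");
        System.out.println("negated class: " + timeAllMs(Pattern.compile("<[^>]+>"), text) + " ms");
    }
}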
You might be interested in articles like Regular Expression Matching Can Be Simple And Fast or Understanding Regular Expressions.
It is, alas, easy to write inefficient REs, which may match quite quickly on success but can search for hours when there is no match, because the engine stupidly tries a long match at every position of a long string!
There are a few recipes for this, like anchoring whenever it is possible, avoiding greediness if possible, etc.
Note that the giant e-mail expression isn't recent, and not necessarily slow: a short, simple expression can be slower than a more convoluted one!
Note also that in some situations (like e-mail, precisely), it can be more efficient (and more maintainable!) to use a mix of regexes and code to handle the cases, like splitting at @ and handling the parts separately (does the first part start with a quote or not, is the second part an IP address or a domain, etc.).
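A hypothetical sketch of that mixed approach (the helper name and the simplified sub-patterns are mine, and deliberately far looser than RFC 822):

import java.util.regex.Pattern;

public class EmailSketch {
    private static final Pattern LOCAL  = Pattern.compile("[A-Za-z0-9._%+-]+");
    private static final Pattern DOMAIN = Pattern.compile("[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Split at '@' in code, then apply a simple regex to each half
    static boolean looksLikeEmail(String s) {
        int at = s.lastIndexOf('@');
        if (at <= 0 || at == s.length() - 1) return false;
        return LOCAL.matcher(s.substring(0, at)).matches()
            && DOMAIN.matcher(s.substring(at + 1)).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com")); // true
        System.out.println(looksLikeEmail("not-an-email"));     // false
    }
}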
Regexes are not the ultimate tool able to do everything, but they are a very useful tool well worth mastering!
It depends on your regexp engine. As explained here (Regular Expression Matching Can Be Simple And Fast) there may be some important difference in the performance depending on the implementation.
You can't talk about regexes in general any more than you can talk about code in general.
Regular expressions are little programs on their own. Just as any given program may be fast or slow, any given regex may be fast or slow.
One thing to remember, however, is that the regular expression handler is very well optimized to do its job and run the regex quickly.
I once made a program that analyzed a lot of text (a big code base, >300k lines). First I used regexes, but when I switched to plain string functions it got a lot faster, taking about 40% of the time of the regex version. So while it of course depends, in my case things got a lot faster.
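The general trick behind that kind of speedup (a sketch; the sample text and search term are mine): for fixed substrings, plain string search skips the regex machinery entirely.

public class PlainSearchDemo {
    public static void main(String[] args) {
        String haystack = "lots of source code with a TODO marker somewhere inside";
        // Regex engine: compiles a pattern and allocates a matcher per call
        boolean viaRegex = java.util.regex.Pattern.compile("TODO").matcher(haystack).find();
        // Plain string search: a tight loop over the characters
        boolean viaContains = haystack.contains("TODO");
        System.out.println(viaRegex + " " + viaContains); // true true
    }
}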
Once I had accidentally, of course :-), written a greedy multi-line regex and had it search/replace over 10 * 200 GB of text files. It was damn slow... So it depends on what you write, and on what you match against.
Depends on the complexity of the expression and the language the expression is used with.
In JavaScript, you have to optimize everything; in C#, not so much.