Optimization techniques used by std::regex_constants::optimize - c++

I am working with std::regex, and whilst reading about the various constants defined in std::regex_constants, I came across std::regex_constants::optimize. From what I have read, it sounds like it is useful in my application: I only need one instance of the regex, initialized at the beginning, but it is used multiple times throughout the loading process.
According to the working paper n3126 (pg. 1077), std::regex_constants::optimize:
Specifies that the regular expression engine should pay more attention to the speed with which regular expressions are matched, and less to the speed with which regular expression objects are constructed. Otherwise it has no detectable effect on the program output.
I was curious as to what type of optimization would be performed, but there doesn't seem to be much literature about it (indeed, it seems to be implementation-defined), and one of the few things I found was at cppreference.com, which states that std::regex_constants::optimize:
Instructs the regular expression engine to make matching faster, with the potential cost of making construction slower. For example, this might mean converting a non-deterministic FSA to a deterministic FSA.
However, I have no formal background in computer science, and whilst I'm aware of the basics of what an FSA is, and understand the difference between a deterministic FSA (each state has at most one successor for a given input character) and a non-deterministic FSA (where several successors are possible for the same input), I do not understand how this improves matching time. Also, I would be interested to know if there are any other optimizations used in various STL implementations.
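For reference, the way I construct and reuse the regex looks roughly like this (the pattern and names below are placeholders, not my actual code), i.e. construct once with the flag, then match many times:

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    // Built once up front; optimize trades construction time for
    // (potentially) faster matching. Whether it changes anything at all
    // is up to the implementation.
    static const std::regex number_re(R"(\d+)",
                                      std::regex::ECMAScript | std::regex::optimize);

    const std::vector<std::string> lines = {"id=42", "no digits here", "count: 1234"};
    for (const auto& line : lines)
        std::cout << line << " -> "
                  << (std::regex_search(line, number_re) ? "match" : "no match")
                  << '\n';
}
```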

There's some useful information on the topic of regex engines and performance trade-offs (far more than can fit in a Stack Overflow answer) in Mastering Regular Expressions by Jeffrey Friedl.
It's worth noting that Boost.Regex, which was the source for N3126, documents optimize as "This currently has no effect for Boost.Regex."
P.S.
indeed, it seems to be implementation-defined
No, it's unspecified. Implementation-defined means an implementation is required to define the choice of behaviour. Implementations are not required to document how their regex engines are implemented or what (if any) difference the optimize flag makes.
P.S. 2
in various STL implementations
std::regex is not part of the STL, the C++ Standard Library is not the same thing as the STL.

See http://swtch.com/~rsc/regexp/regexp1.html for a nice explanation of how NFA/DFA-based regex implementations (in the Thompson simulation sense) can avoid the exponential blow-up that backtracking matchers suffer in certain circumstances.

Related

Optimization techniques for backtracking regex implementations

I'm trying to implement a regular expression matcher based on the backtracking approach sketched in Exploring Ruby’s Regular Expression Algorithm. The compiled regex is translated into an array of virtual machine commands; for backtracking, the current command index and input string index, as well as capturing group information, are maintained on a stack.
In Regular Expression Matching: the Virtual Machine Approach, Cox gives more detailed information about how to compile certain regex components into VM commands, though the implementations discussed there are a bit different. Based on those articles, my implementation works quite well for the standard grouping, character class and repetition components.
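To make the setup concrete, here is a minimal sketch of the kind of VM I mean (instruction names follow Cox's VM article; capture-group bookkeeping, character classes and most other features are omitted, and the example program is hand-assembled rather than compiled):

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Instruction set from Cox's VM article: Char consumes one input character,
// Split records an alternative to try later, Jmp branches, Match succeeds.
enum class Op { Char, Split, Jmp, Match };

struct Inst {
    Op op;
    char c = 0;       // for Char
    int x = 0, y = 0; // branch targets for Split / Jmp
};

// Iterative backtracking interpreter: untried alternatives are kept as
// (pc, sp) pairs on an explicit stack, which is the layout described above
// (minus capture groups). Reports success as soon as Match is reached,
// i.e. a prefix match.
bool run(const std::vector<Inst>& prog, const std::string& input) {
    struct Thread { int pc; std::size_t sp; };
    std::vector<Thread> stack{{0, 0}};

    while (!stack.empty()) {
        auto [pc, sp] = stack.back();
        stack.pop_back();
        for (;;) {
            const Inst& i = prog[pc];
            if (i.op == Op::Char) {
                if (sp < input.size() && input[sp] == i.c) { ++pc; ++sp; continue; }
                break;                          // dead thread: backtrack
            } else if (i.op == Op::Jmp) {
                pc = i.x; continue;
            } else if (i.op == Op::Split) {
                stack.push_back({i.y, sp});     // try i.x now, i.y on backtrack
                pc = i.x; continue;
            } else {                            // Op::Match
                return true;
            }
        }
    }
    return false;
}

int main() {
    // Program for the pattern "a+b":
    //   0: Char 'a'   1: Split 0, 2   2: Char 'b'   3: Match
    const std::vector<Inst> prog = {
        {Op::Char, 'a'},
        {Op::Split, '\0', /*x=*/0, /*y=*/2},
        {Op::Char, 'b'},
        {Op::Match},
    };
    std::cout << run(prog, "aaab") << ' ' << run(prog, "b") << '\n';  // 1 0
}
```

Each Split records the exact state (pc, sp) that has to be restored when the current thread dies, which is the part I would like to optimize.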
Now I would like to see what extensions and optimization options there are for this type of implementation. In his article, Cox gives a lot of useful information on the DFA/NFA approach, but the information about extensions or optimization techniques for the backtracking approach is a bit sparse.
For example, about backreferences he states
Backreferences are trivial in backtracking implementations.
and gives an idea for the DFA approach. But it's not clear to me how this can be "trivially" done with the VM approach. When the backreference command is reached, you'd have to compile the previously matched string from the corresponding group into another list of VM commands and somehow incorporate those commands into the current VM, or maintain a second VM and switch execution temporarily to that one.
He also mentions a possible optimization for repetitions using look-aheads, but doesn't elaborate on how that would work. It seems to me this could be used to reduce the number of items on the backtracking stack.
tl;dr What general optimization techniques exist for VM-based backtracking regex implementations, and how do they work? Note that I'm not looking for optimizations specific to a certain programming language, but for general techniques for this type of regex implementation.
Edit: As mentioned in the first link, the Oniguruma library implements a regex matcher with exactly that stack-based backtracking approach. Perhaps someone can explain the optimizations done by that library which can be generalized to other implementations. Unfortunately, the library doesn't seem to provide any documentation on the source code and the code also lacks comments.
Edit 2: When reading about parsing expression grammars (PEGs), I stumbled upon a paper on a Lua PEG implementation which makes use of a similar VM-based approach. The paper mentions several optimization options to reduce the number of executed VM commands and an unnecessary growth of the backtracking stack.
I suggest you watch the full lecture, it is very interesting, but here is an outline:
Complexity explosion in backtracking. This happens when the pattern has
ambiguity in it ([a-x]*[a-x0-9]*z in the video, as an example), so the engine has to backtrack and test all alternatives until it becomes certain the pattern did (or didn't) match.
It can take up to O(Nᵖ), where p is the "measure of ambiguity" of the pattern.
To get O(pN), we need to avoid evaluating equivalent threads again and again.
...
Solution:
At each step, advance all threads by one character; "breadth-first" execution results in linear complexity (see the sketch below).
Tricks to save every bit of performance
Inside std::regex
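To illustrate the breadth-first idea (this is my own sketch, not code from the lecture), here is a hand-built simulation of the [a-x]*[a-x0-9]*z example. All live NFA states are advanced together by one input character per step, so the work per character is bounded by the number of states and the total cost stays linear in the input length:

```cpp
#include <iostream>
#include <set>
#include <string>

// Hand-built NFA for [a-x]*[a-x0-9]*z, matched against the whole input.
// States: 0 = before [a-x]*, 1 = before [a-x0-9]*, 2 = before 'z', 3 = accept.
using StateSet = std::set<int>;

// Follow epsilon edges: 0 -> 1 (skip the first star), 1 -> 2 (skip the second).
StateSet epsilonClosure(StateSet s) {
    if (s.count(0)) s.insert(1);
    if (s.count(1)) s.insert(2);
    return s;
}

// Advance every live state by one character ("breadth-first" / lockstep step).
StateSet step(const StateSet& current, char c) {
    StateSet next;
    for (int state : current) {
        switch (state) {
            case 0: if (c >= 'a' && c <= 'x') next.insert(0); break;
            case 1: if ((c >= 'a' && c <= 'x') || (c >= '0' && c <= '9'))
                        next.insert(1);
                    break;
            case 2: if (c == 'z') next.insert(3); break;
            default: break; // state 3 (accept) has no outgoing edges
        }
    }
    return epsilonClosure(next);
}

bool matches(const std::string& input) {
    StateSet current = epsilonClosure({0});
    for (char c : input)
        current = step(current, c);
    return current.count(3) > 0;  // did any thread reach the accept state?
}

int main() {
    std::cout << matches("abc123z") << '\n';    // 1
    std::cout << matches("abcabcabc") << '\n';  // 0, and no exponential blow-up
}
```

Because at most three states are ever live at once, a long ambiguous input like "abcabc..." costs a constant amount of work per character instead of triggering the backtracking explosion described above.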
Hope this helps!
P.S. The lecturer's repository

Why do most languages implement wildcard regular expressions inefficiently?

I was given a link to the following article regarding the implementation of regular expressions in many modern languages.
http://swtch.com/~rsc/regexp/regexp1.html
TL;DR: Certain regular expressions, such as (a?)ⁿaⁿ for fixed n, take exponential time to match against, say, aⁿ, because the matching is implemented via backtracking over the string when handling the ? parts. Implementing these as an NFA by keeping state lists makes this much more efficient, for obvious reasons.
The article doesn't go into much detail about how each language actually implements these (and it's old), but I'm curious: what, if any, are the drawbacks of using an NFA as opposed to other implementation techniques? The only thing I can come up with is that, with all the bells and whistles of most libraries, either a) building an NFA for all those features is impractical, or b) there is some conflicting performance issue between the expression above and some other, possibly more common, operation.
While it is possible to construct DFAs that handle these complex cases well (the Tcl RE engine, which was written by Henry Spencer, is a proof by example; the linked article indicates this with its performance data), it's also exceptionally hard.
One key thing though is that if you can detect that you never need the matching group information, you can then (for many REs, especially those without internal backreferences) transform the RE into one that only uses parentheses for grouping allowing a more efficient RE to be generated (so (a?){n}a{n} — I'm using modern conventional syntax — becomes effectively equivalent to a{n,2n}). Backreferences break that major optimisation; it's not for nothing that in Henry's RE code (alluded to above) there is a code comment describing them as the “Feature from the Black Lagoon”. It is one of the best comments I've ever read in code (with the exception of references to academic papers that describe the algorithm encoded).
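As a quick sanity check of that equivalence (this is only an illustration using std::regex with n = 3, not anything from Henry's code), both forms accept exactly the strings of three to six a's:

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // As pure recognizers (no captures needed), (a?){3}a{3} and a{3,6}
    // accept the same language: strings of 'a' with length 3 to 6.
    std::regex with_group("(a?){3}a{3}");
    std::regex collapsed("a{3,6}");

    for (int len = 0; len <= 8; ++len) {
        std::string s(len, 'a');
        bool a = std::regex_match(s, with_group);
        bool b = std::regex_match(s, collapsed);
        std::cout << "len=" << len << "  (a?){3}a{3}: " << a
                  << "  a{3,6}: " << b
                  << (a == b ? "" : "  MISMATCH") << '\n';
    }
}
```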
On the other hand, the Perl/PCRE-style engines, with their recursive-descent evaluation schemes, can ascribe a much saner set of semantics to mixed-greediness REs, and many other things besides. (At the extreme end of this, recursive patterns, (?R) et al., are completely impossible with automata-theoretic approaches. They require a stack to match, making them formally not regular expressions.)
On a practical level, the cost of building the NFA, and the DFA you then compile it to, can be quite high; you need clever caching to make it not too expensive. Also on a practical level, the PCRE and Perl implementations have had a lot more developer effort applied to them.
My understanding is that the main reason is that we're not just interested in whether a string matches, but in how it matches, e.g. with capturing groups. For example, (x*)x needs to know how many xs were matched inside the group so that they can be returned as a capture. Similarly, it "promises" to consume as many x characters as possible, which matters if we continue matching more things against the remaining string.
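As an illustration of that point (using std::regex here; any engine that reports captures behaves similarly):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // The (x*)x example: the engine must report *how* the match happened,
    // not just that it happened. With a greedy x*, matching "xxxx" leaves
    // "xxx" in the capturing group after one step of backtracking.
    std::regex re("(x*)x");
    std::smatch m;
    std::string s = "xxxx";
    if (std::regex_match(s, m, re))
        std::cout << "group 1 = \"" << m[1].str() << "\"\n";  // group 1 = "xxx"
}
```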
Some simpler types of expressions could be matched in the efficient way the article describes, and I have no special knowledge of why this isn't done. Presumably it's more effort to write two separate engines, and perhaps the extra time analyzing an expression to determine which engine to use on it is expensive enough that it's better to skip that step for the common case, and live with the very poor performance in the worst case.
Here:
http://haifux.org/lectures/156/PCRE-Perl_Compatible_Regular_Expression_Library.pdf
They write that PCRE uses an NFA-based implementation. But this link is also not the youngest thing on the web...
Around page 36 there is a comparison between engines. It may also be relevant to the original question.

Regular vs LALR(1): what is faster

Supposing we have two grammars which define the same language: a regular one and an LALR(1) one.
Both the regular and the LALR(1) algorithms are O(n), where n is the input length.
Regexps are usually preferred for parsing regular languages. Why? Is there a formal proof (or maybe that's obvious) that they are faster?
You should prefer a stackless automaton over a pushdown one, as the mathematics of regular-language automata is much better developed.
Both automata here can be made deterministic (LALR(1) parsing is deterministic by construction), but there is no efficient minimization for PDAs. It is a well-known fact that for every PDA there exists an equivalent one with only a single state, which means any minimization would have to be with respect to transition count, maximum stack depth, or some other criterion.
Also the problem of checking whether two different PDAs are equivalent with respect to the language they recognize is undecidable.
There is a big difference between parsing and recognizing. Although you could build a regular-language parser, it would be extremely limited, since most useful languages are not parseable with a useful unambiguous regular grammar. However, most (if not all) regular expression libraries recognize, possibly with the addition of a finite number of "captures".
In any event, parsing really isn't the performance bottleneck anymore. IMHO, it's much better to use tools which demonstrably parse the language they appear to parse.
On the other hand, if all you want to do is recognize a language -- and the language happens to be regular -- regular expressions are a lot easier and require much less infrastructure (parser generators, special-purpose DSLs, slightly more complicated Makefiles, etc.)
(As an example of a language feature which is not regular, I give you: parentheses.)
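To make that concrete, here is a minimal recognizer for balanced parentheses; it needs an unbounded counter (a degenerate stack), which is exactly what a finite automaton cannot provide:

```cpp
#include <iostream>
#include <string>

// Balanced parentheses require remembering an unbounded nesting depth.
// A single counter suffices here, but no fixed number of states does.
bool balanced(const std::string& s) {
    long depth = 0;
    for (char c : s) {
        if (c == '(') ++depth;
        else if (c == ')' && --depth < 0) return false;  // closed too early
    }
    return depth == 0;
}

int main() {
    std::cout << balanced("(()())") << ' ' << balanced("(()") << '\n';  // 1 0
}
```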
People prefer regular expressions because they're easier to write. If your language is a regular language, why bother creating a CFG grammar for it?

How to figure out if a regex implementation uses DFA or NFA?

I'm facing the question of whether a certain regex implementation is based on a DFA or an NFA.
What are the starting points for me to figure this out? One could also ask: what am I looking for? What are the basic patterns and/or characteristics? A good explanatory link or a small comparison (even if not directly dedicated to regex) would be perfectly fine.
If it's a black box, then give it some input and measure its time characteristics with a pathological case, with reference to the graphs in this discussion of NFA vs backtracking regex implementations (note the NFA graph is in microseconds, not seconds).
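For instance, a rough sketch of such a black-box probe, here using std::regex and the pathological (a?){n}a{n} family from that discussion (exact timings, and whether it slows down or throws, depend entirely on the engine under test):

```cpp
#include <chrono>
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Time (a?){n}a{n} against "a" repeated n times. A backtracking engine
    // slows down super-linearly as n grows; a Thompson NFA/DFA engine stays
    // roughly linear. Some implementations may instead throw std::regex_error
    // (complexity or stack limits) for larger n.
    for (int n = 5; n <= 25; n += 5) {
        std::string pattern = "(a?){" + std::to_string(n) + "}a{" +
                              std::to_string(n) + "}";
        std::string input(n, 'a');
        std::regex re(pattern);

        auto start = std::chrono::steady_clock::now();
        bool matched = false;
        try {
            matched = std::regex_match(input, re);
        } catch (const std::regex_error& e) {
            std::cout << "n=" << n << " threw: " << e.what() << '\n';
            continue;
        }
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::cout << "n=" << n << " matched=" << matched
                  << " time=" << us << "us\n";
    }
}
```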
Also, if it's a pure NFA, then it won't have some of the non-regular features found in some 'regular expression' parsers, features which require backtracking.
Alternatively, look at the documentation of the RxParser class; the documentation appears to be unavailable on the web and requires a Squeak runtime to browse.
I think you mean "regex implementation" rather than algorithm (in the usual sense).
You could test with expressions that are known to cause problems with one approach or the other. Also look for features that are easier to implement in one or the other (though this is not a reliable approach; the developers of regex engines keep finding new ways to implement previously hard things).
Normally the answer is to read the documentation, or look in a known reference ("Mastering Regular Expressions" documents many popular cases). Finally why not ask the authors?

library for converting regular expressions to NFAs?

Is there a good library for converting Regular Expressions into NFAs? I see lots of academic papers on the subject, which are helpful, but not much in the way of working code.
My question is due partially to curiosity, and partially to an actual need to speed up regular expression matching on a production system I'm working on. Although it might be fun to explore this subject for learning's sake, I'm not sure it's a "practical" solution to speeding up our pattern matching. We're a Java shop, but would happily take pointers to good code in any language.
Edit:
Interesting, I did not know that Java's regexes were already NFA-based. The title of this paper led me to believe otherwise. Incidentally, we are currently doing our regex matching in Postgres; if the simple solution is to move the matching into the Java code, that would be great.
Addressing your need to speed up your regexes:
Java's implementation of its regex engine is NFA based. As such, to tune your regexes, I would say that you would benefit from a deeper understanding of how the engine is implemented.
And as such I direct you to Mastering Regular Expressions. The book gives substantial treatment to the NFA engine and how it performs matches, including how to tune your regexes specifically for an NFA engine.
Additionally, look into Atomic Grouping for tuning your regex.
Disclaimer: I'm not an expert on java+regexes. But, if I understand correctly...
If Java's regular expression matcher is similar to most others, it does use NFAs, but not in the way you might expect. Instead of the forward-only implementation you may have heard about, it uses a backtracking solution, which simplifies subexpression matching and is probably required for backreference support. However, it handles alternation poorly.
You want to see: http://swtch.com/~rsc/regexp/regexp1.html (concerning edge cases which perform poorly on this altered architecture).
I've also written a question which I suppose comes down to the same thing:
Regex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
But basically, it looks like, for some very odd reason, all common major-vendor regex implementations have terrible performance when used on certain regexes, even though this is unnecessary.
Disclaimer: I'm a googler, not an expert on regexes.
There is a bunch of faster-than-JDK regex libraries, one of which is dk.brics.automaton. According to the benchmark linked in the article, it is approximately 20x faster than the JDK implementation.
This library was written by Anders Møller and has also been mavenized.