Which regular expression requires backtracking? - regex

There are three different approaches to implementing regular expression matching: DFA, NFA and backtracking. I am looking for examples:
a regexp that can be matched with a DFA, and the reason why a DFA is sufficient.
a regexp that requires an NFA, and the reason why an NFA is necessary.
a regexp that requires backtracking, and the reason why backtracking is necessary.
A recommendation for some good literature about this topic would be nice, too.

I guess there is more than one meaning to the word backtracking - even '.*a' has to backtrack to match the string "lalaiiiiiii": .* will first match the whole string, so the a won't match anything, and only then will it give up one character at a time, so the final match is "lala".
I highly recommend http://www.regular-expressions.info/
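In case it helps to see that backtracking in action, here is a quick Python sketch (Python chosen purely for illustration):
import re
# ".*" first grabs the whole string, then gives characters back one at a
# time until the trailing "a" can match, so the final match is "lala".
m = re.search(r".*a", "lalaiiiiiii")
print(m.group(0))  # lala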

What I found out so far is:
Every regular expression that can be implemented with an NFA can also be implemented with a DFA; every NFA can be transformed into an equivalent DFA.
Regular expressions that require backtracking are those that contain backreferences, like /(a)\1/.
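To make the backreference point concrete, here is a small Python illustration (a sketch only, using the standard re module):
import re
# \1 must repeat the exact text captured by group 1, which is what pushes
# the pattern beyond what a plain DFA/NFA construction can handle.
print(bool(re.search(r"(a)\1", "aa")))       # True
print(bool(re.search(r"(\w+)\1", "abab")))   # True: group 1 captures "ab"
print(bool(re.search(r"(\w+)\1", "abcd")))   # False: nothing repeats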

Related

Automata - Regular Expression

I've been trying to make a regular expression for the language below:
L = {01, 0011, 000111, 00001111, 0000011111, 000000111111, ...}
but I just could not figure it out. The first thing that came to my mind was
0(0)*1(1)*
but I'm not sure if that is the answer to the language. Is there an app where I could test it out?
If this can't be done with a regular expression, can it be done with an NFA or DFA?
Could some good Samaritan kindly help me with this? Appreciate it.
A subroutine may suit your needs:
(?<!0)(0(?1)?1)(?!1)
(?1) means recall the pattern captured in the first group, i.e. between the parens. This isn't available in all regex engines though - neither is the (negative) lookbehind (?<!...) by the way.
The difference between (?1) and \1 is that (?1) recalls the captured pattern while \1 recalls the captured data.
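As a rough demonstration, the pattern can be tried out with Python's third-party regex module (https://pypi.org/project/regex/), which, as far as I know, supports (?1) subroutine calls; the standard library re module does not:
import regex
# (?1) recursively re-applies group 1, so the group matches 0^n 1^n; the
# lookarounds reject matches preceded by an extra 0 or followed by an extra 1.
pattern = regex.compile(r"(?<!0)(0(?1)?1)(?!1)")
for s in ["01", "0011", "000111", "0011x000111", "001", "011"]:
    print(s, [m.group(0) for m in pattern.finditer(s)])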
I'm not sure what you meant by saying it should be a regex, since the question mentions automata/regular expressions too.
As per automata theory:
If you are talking about the regular expression for this formal language (strings with an equal number of 0's and 1's, where all the 0's come before the 1's), it is not a regular language. This can be proved with the pumping lemma: pumping the block of 0's in 0^p1^p produces a string with more 0's than 1's, which is not in the language.
But this language can be expressed as {0^i 1^i | i > 0}, where i ranges over the positive integers.

Can regex match intersection between two regular expressions?

Given several regular expressions, can we write a regular expressions which is equal to their intersection?
For example, given two regular expressions c[a-z][a-z] and [a-z][aeiou]t, their intersection contains cat and cut and possibly more. How can we write a regular expression for their intersection?
Thanks.
A logical AND in regex is represented by
(?=...)(?=...)
So,
(?=[a-z][aeiou]t)(?=c[a-z][a-z])
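For example, a rough Python sketch; the anchors and the trailing [a-z]{3} are my addition so that the pattern actually consumes the string instead of just asserting:
import re
# Both lookaheads must succeed at the start of the string; [a-z]{3} then
# consumes the three characters that satisfy both patterns.
pattern = re.compile(r"^(?=[a-z][aeiou]t$)(?=c[a-z][a-z]$)[a-z]{3}$")
for word in ["cat", "cut", "cot", "bat", "car"]:
    print(word, bool(pattern.match(word)))  # True for cat, cut, cot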
The lookahead examples are easy to use, but technically they are no longer plain regular expressions. However, it is possible to take the intersection of two regular languages, and that intersection is itself regular.
First note that Regular Expressions can be converted to and from NFAs; they both are ways of expressing regular languages.
Second, by De Morgan's law, the intersection of two languages is the complement of the union of their complements: A ∩ B = ¬(¬A ∪ ¬B).
Thus these are the steps to compute the intersection of two RegExs:
Convert both RegExs to NFAs.
Compute the complement of both NFAs (determinize them first; you cannot complement an NFA just by swapping accepting and non-accepting states).
Compute the union of the two complements.
Compute the complement of that union.
Convert the resulting NFA to a RegEx.
Some sources:
Union and RegEx to NFA: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_06.pdf
NFA to RegEx: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_08.pdf
Complement of NFA: https://cs.stackexchange.com/questions/13282/complement-of-non-deterministic-finite-automata
Mathematically speaking, an intersection of two regular languages is regular, so there has to be a regular expression that accepts it.
Building it via corresponding NFAs is probably the easiest. Consider the two NFAs that correspond to the two regexes. The new states Q are pairs (Q1,Q2) from the two NFAs. If there is a transition (P1,x,Q1) in the first NFA and (P2,x,Q2) in the second NFA, then and only then there is a transition ((P1,P2),x,(Q1,Q2)) in the new NFA. A new state (Q1,Q2) is initial/final iff both Q1 and Q2 are initial/final.
If you use NFAs with ε-moves, then also for each transition (P1,ε,Q1) there will be a transition ((P1,P2),ε,(Q1,P2)) for all states P2. Likewise for ε-moves in the second NFA.
Now convert the new NFA to a regular expression with any known algorithm, and that's it.
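Here is a minimal sketch of that product construction, assuming for simplicity that both automata are already complete DFAs (so there are no ε-moves to handle); all names are illustrative:
from itertools import product

# Each DFA is (states, start, accepting, delta) with delta[(state, symbol)] -> state.
def intersect(dfa1, dfa2, alphabet):
    states1, start1, accept1, delta1 = dfa1
    states2, start2, accept2, delta2 = dfa2
    states = set(product(states1, states2))
    accepting = {(q1, q2) for (q1, q2) in states if q1 in accept1 and q2 in accept2}
    delta = {((q1, q2), a): (delta1[(q1, a)], delta2[(q2, a)])
             for (q1, q2) in states for a in alphabet}
    return states, (start1, start2), accepting, delta

def accepts(dfa, word):
    _, state, accepting, delta = dfa
    for a in word:
        state = delta[(state, a)]
    return state in accepting

# Example: "even number of a's" intersected with "ends in b".
even_a = ({"E", "O"}, "E", {"E"},
          {("E", "a"): "O", ("O", "a"): "E", ("E", "b"): "E", ("O", "b"): "O"})
ends_b = ({"N", "Y"}, "N", {"Y"},
          {("N", "a"): "N", ("Y", "a"): "N", ("N", "b"): "Y", ("Y", "b"): "Y"})
both = intersect(even_a, ends_b, "ab")
print(accepts(both, "aab"))  # True: two a's, ends in b
print(accepts(both, "ab"))   # False: only one a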
As for PCRE, they are not, strictly speaking, regular expressions. There is no way to do it in the general case. Sometimes you can use lookaheads, like ^(?=regex1$)(?=regex2$) but this is only good for matching the entire string and is no good for either searching or embedding in other regexps. Without anchoring, the two lookaheads may end up matching strings of different lengths. This is not intersection.
First, let's agree on terms. My syntactical assumption will be that
The intersection of several regexes is one regex that matches strings
that each of the component regexes also match.
The General Option
To check for the intersection of two patterns, the general method is (pseudo-code):
if match(regex1) && match(regex2) { champagne for everyone! }
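For instance, a literal (if unglamorous) Python version of that check, reusing the two patterns from the question:
import re
# A string is in the intersection iff both patterns match the whole string.
regex1, regex2 = r"c[a-z][a-z]", r"[a-z][aeiou]t"

def in_intersection(s):
    return bool(re.fullmatch(regex1, s)) and bool(re.fullmatch(regex2, s))

print(in_intersection("cat"))  # True: both patterns match
print(in_intersection("cab"))  # False: the second pattern needs a vowel followed by t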
The Regex Option
In some cases, you can do the same with lookaheads, but for a complex regex there is little benefit in doing so, apart from making your regex more obscure to your enemies. Why little benefit? Because the engine will have to parse the whole string multiple times anyway.
Boolean AND
The general pattern for an AND checking that a string exactly meets regex1 and regex2 would be:
^(?=regex1$)(?=regex2$)
The $ in each lookahead ensures that each pattern matches the whole string and nothing more.
Matching when AND
Of course, if you don't want to just check the boolean value of the AND but also do some actual matching, after the lookaheads, you can add a dot-star to consume the string:
^(?=regex1$)(?=regex2$).*
Or... After checking the first condition, just match the second:
^(?=regex1$)regex2$
This is a technique used for instance in password validation. For more details on this, see Mastering Lookahead and Lookbehind.
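As an illustration, here is a hypothetical password rule done this way in Python (the specific requirements are made up for the example):
import re
# At least one digit AND at least one letter, checked by two lookaheads
# before .{8,} consumes (and length-checks) the whole string.
pattern = re.compile(r"^(?=.*\d)(?=.*[A-Za-z]).{8,}$")
for pw in ["hunter42isgood", "12345678", "abcdefgh"]:
    print(pw, bool(pattern.match(pw)))  # True, False, False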
Bonus section: Union of Regexes
Instead of working on an intersection, let's say you are interested in the union of the following regexes, i.e., a regex that matches either of those regexes:
catch
cat1
cat2
cat3
cat5
This is accomplished with the alternation | operator:
catch|cat1|cat2|cat3|cat5
Furthermore, such a regex can often be compressed, as in:
cat(?:ch|[1-35])
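A quick Python check of the compressed form, if you want to convince yourself it still matches exactly those five strings:
import re
# [1-35] is the character class 1, 2, 3, 5 (not the range 1-35).
pattern = re.compile(r"^cat(?:ch|[1-35])$")
for s in ["catch", "cat1", "cat2", "cat3", "cat5", "cat4"]:
    print(s, bool(pattern.match(s)))  # True for the first five, False for cat4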
For an AND operation, we have something like this in regex:
(REGEX)(REGEX)
Taking your example:
'Cat'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
// ["Cat", "C", "a", "t"]
'Ca'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
// null
'Cat123'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
// null
where
([A-Za-z]+) // matches one or more letters
and
([aeiouAEIOU]+) // matches one or more vowels
Combining them both will match
([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)
e.g.:
'Hmmmmmm'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
// null
'Stckvrflw'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
// null
'StackOverflow'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
// ["StackOverflow", "StackOverfl", "o", "w"]

How to match Regular Expression with String containing a wildcard character?

Regular expression:
/Hello .*, what's up?/i
String which may contain any number of wildcard characters (%):
"% world, what's up?" (matches)
"Hello world, %?" (matches)
"Hello %, what's up?" (matches)
"Hey world, what's up?" (no match)
"Hello %, blabla." (no match)
I have thought of a solution myself, but I'd like to see what you are able to come up with (considering performance is a high priority). A requirement is the ability to use any regular expression; I only used .* in the example, but any valid regular expression should work.
A little automata theory might help you here. You say
this is a simplified version of matching a regular expression with a regular expression[1]
Actually, that does not seem to be the case. Instead of matching the text of a regular expression, you want to find regular expressions that can match the same string as a given regular expression.
Luckily, this problem is solvable :-) To see whether such a string exists, you would need to compute the intersection of the two regular languages and test whether the result is not the empty language. This might be a non-trivial problem and solving it efficiently [enough] may be hard, but standard algorithms for this already exist. Basically you would need to translate each expression into an NFA, then into a DFA, which you can then intersect.
[1]: Indeed, the wildcard strings you're using in the question form a regular language of their own, and can be translated to corresponding regular expressions.
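As a footnote to the footnote, a small Python sketch of that translation (wildcard_to_regex is just an illustrative name):
import re
# Escape the literal pieces and turn every % into ".*".
def wildcard_to_regex(wildcard):
    return ".*".join(re.escape(part) for part in wildcard.split("%"))

print(wildcard_to_regex("Hello %, what's up?"))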
Not sure that I fully understand your question, but if you're looking for performance, avoid regular expressions. Instead you can split the string on %. Then, take a look at the first and last matches:
// Split the wildcard string (wildcardString here) on %
var splits = wildcardString.split('%');
// Anything before the % should match at the start of the string
targetString.indexOf(splits[0]) === 0;
// Anything after the % should match at the end of the string
targetString.indexOf(splits[1]) + splits[1].length === targetString.length;
If you can use % multiple times within the string, then the first and last splits should follow the above rules. Anything else just needs to be in the string, and .indexOf is how you can check that.
I came to realize that this is impossible with a regular language, and therefore the only solution to this problem is to replace the wildcard symbol % with .* and then match the two regular expressions against each other. This cannot, however, be done with traditional regular expressions; see this SO question and its answers for details.
Or perhaps you could modify the underlying regular expression engine to support wildcard-based strings. Anyone able to answer this question by extending the default implementation will be accepted as the answer to this question ;-)

What are the zero width elements in a regular expression?

Recently, I have been seeing "zero width elements" in regular expressions. What are they? Can they be treated as ghost data, so that for replacement, they won't be replaced, and for ( ) matching, they won't go into the matches[1], matches[2], etc?
Is there a good tutorial for all its various uses? Have they been here for a long time? Which version of O'Reilly's Regular Expression book was the first to discuss them?
The point of zero-width lookaround assertions is that they check if a certain regex can or cannot be matched looking forward or backwards from the current position, without actually adding them to the match. So, yes, they won't count towards the capturing groups, and yes, their matches won't be replaced (because they aren't matched in the first place).
However, you can have a capturing group inside a lookaround assertion that will go into matches[1] etc.
For example, in C#:
Regex.Replace("ab", "(a)(?=(b))", "$1$2");
will return abb: the pattern consumes only the a (the lookahead does not consume the b), so the replacement "ab" is written in place of that a and the original b still follows.
A very good online tutorial about regular expressions in general can be found at http://www.regular-expressions.info (even though it's a little out of date in some areas).
It contains a specific section about zero-width lookaround assertions (and Part II).
And of course they are covered in-depth in both Mastering Regular Expressions and the Regular Expressions Cookbook.

DFA vs NFA engines: What is the difference in their capabilities and limitations?

I am looking for a non-technical explanation of the difference between DFA vs NFA engines, based on their capabilities and limitations.
Deterministic Finite Automatons (DFAs) and Nondeterministic Finite Automatons (NFAs) have exactly the same capabilities and limitations. The only difference is notational convenience.
A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.
A deterministic finite automaton is in one state at a time, which is implementable. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.
Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.
It's a matter of which is easier to read or write or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.
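To make the "set of states as a state" idea concrete, here is a small Python sketch with a made-up NFA that accepts strings over {0,1} whose second-to-last symbol is 1:
# Simulate the NFA by tracking the set of states it could be in; that set is
# exactly what a single state of the equivalent DFA represents.
nfa = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},   # the nondeterministic choice
    ("q1", "0"): {"q2"},
    ("q1", "1"): {"q2"},
}
start, accepting = {"q0"}, {"q2"}

def accepts(word):
    states = set(start)
    for symbol in word:
        states = set().union(*(nfa.get((q, symbol), set()) for q in states))
    return bool(states & accepting)

print(accepts("0110"))  # True: second-to-last symbol is 1
print(accepts("0101"))  # False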
Here's a non-technical answer from Microsoft:
DFA engines run in linear time because they do not require backtracking (and thus they never test the same character twice). They can also guarantee matching the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
Traditional NFA engines run so-called "greedy" match backtracking algorithms, testing all possible expansions of a regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and matching backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it can run exponentially slowly in the worst case. Because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.
POSIX NFA engines are like traditional NFA engines, except that they continue to backtrack until they can guarantee that they have found the longest match possible. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.
Traditional NFA engines are favored by programmers because they are more expressive than either DFA or POSIX NFA engines. Although in the worst case they can run slowly, you can steer them to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.
[http://msdn.microsoft.com/en-us/library/0yzc2yb0.aspx]
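To see the worst-case behaviour the quote mentions, here is a small, deliberately pathological Python experiment; the pattern and input are made up, and the runtime grows roughly exponentially as you add more a's, so keep the count modest:
import re
import time

# Nested quantifiers plus an impossible final "b" force a backtracking engine
# to try a huge number of ways to split the run of a's before giving up.
pattern = re.compile(r"(a+)+b")
subject = "a" * 24 + "c"   # no "b" anywhere, so the match must fail

start = time.perf_counter()
print(pattern.search(subject))   # None, eventually
print(f"took {time.perf_counter() - start:.2f}s")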
A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.
CAVEAT:
While this book is generally considered the "regex bible", there appears to be some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what a "regular" expression really is, deterministic or not. After the controversy started, I deleted this answer because of it, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?
Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.
Original answer:
A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.
Imagine the string AAB and the regex A*AB. We now step through our string letter by letter.
A:
First branch: Can be matched by A*.
Second branch: Can be matched by ignoring the A* (zero repetitions are allowed) and using the second A in the regex.
A:
First branch: Can be matched by expanding A*.
Second branch: Can't be matched by B. Second branch fails. But:
Third branch: Can be matched by not expanding A* and using the second A instead.
B:
First branch: Can't be matched by expanding A* or by moving on in the regex to the next token A. First branch fails.
Third branch: Can be matched. Hooray!
A DFA engine never backtracks in the string.
An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.
Imagine the same string and the same regex as before. We now step through our regex token by token:
A*: Match AA. Remember the backtracking positions 0 (start of string) and 1.
A: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. Now A matches.
B: Matches. End of regex reached (with one backtracking position to spare). Hooray!
Both NFAs and DFAs are finite automata, as their names say.
Both can be represented as a starting state, a success (or "accept") state (or set of success states), and a state table listing transitions.
In the state table of a DFA, each <state₀, input> key will transit to one and only one state₁.
In the state table of an NFA, each <state₀, input> will transit to a set of states.
When you take a DFA, reset it to its start state, and give it a sequence of input symbols, you will know exactly what end state it's in and whether or not that is a success state.
When you take an NFA, however, it will, for each input symbol, look up the set of possible result states, and (in theory) randomly, "nondeterministically," select one of them. If there exists a sequence of random selections which leads to one of the success states for that input string, then the NFA is said to succeed for that string. In other words, you are expected to pretend that it magically always selects the right one.
One early question in computing was whether NFAs were more powerful than DFAs, due to that magic, and the answer turned out to be no, since any NFA can be translated into an equivalent DFA. Their capabilities and limitations are exactly the same.
For those wondering how a real, non-magical NFA engine can "magically" select the right successor state for a given symbol, this page describes the two common approaches.
I find the explanation given in Regular Expressions, The Complete Tutorial by Jan Goyvaerts to be the most usable. See page 7 of this PDF:
https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
Among other points made on page 7: "There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. ...certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines."