Rules of regex engines. Greediness, eagerness and laziness of regexes - regex

As we all know, regex engine use two rules when it goes about its work:
Rule 1: The Match That Begins Earliest Wins or regular expressions
are eager.
Rule 2: Regular expressions are greedy.
These lines appear in tutorial:
The two of these rules go hand in hand.
It's eager to give you a result, so what it does is it tries to just
keep letting that first one do all the work.
While we're already in the middle of it, let's keep going, get to the
end of the string and then when it doesn't work out, then it will
backtrack and try another one.
It doesn't backtrack back to the beginning; it doesn't try all sorts
of other combinations.
It's still eager to get you a result, so it says, what if I just gave
back one?
Would that allow me to give a result back?
If it does, great, it's done. It's able to just finish there.
It doesn't have to keep backtracking further in the string, looking
for some kind of a better match or match that's further along.
I don't quite understand these lines (especially 2nd ("While we're...") and last ("It doesn't have to keep backtracking") sentences).
And these lines about lazy mode.
It still defers to the overall match just like the greedy one does
clearly.
I don't understand the following analogy:
It's not necessarily any faster or slower to choose a lazy strategy or
a greedy strategy, but it will probably match different things.
Now as far as is faster or slower, it's a little bit like saying, if
you've lost your car keys and your sunglasses inside your house, is it
better to start looking in the kitchen or to start looking in the
living room?
You don't know which one's going to yield the best result, and you
don't know which one's going to find the sunglasses first or the keys
first; it's just about different strategies of starting the search.
So you will likely get different results depending on where you start,
but it's not necessarily faster to start in one place or the other.
What 'faster or slower' means?
I'm going to draw scheme how it work (in both case). So I will contemplate this questions until I find out what's going on around here!)
I need understand it exactly and unambiguously.
Thanks.

Let's try by the exemple
for an input of this is input for test input on regex and a regex like /this.*input/
The match will be this is input for test input
What will be done is
starting to examine the string and it will get a match with this is input
But now its at the middle of the string, it will continue to see if it could match more on it (this is the While we're already in the middle of it, let's keep going )
It will match till this is input for test input and continue till the end of the string
at the end, there's things wich are not part of the match, so the interpreter "backtrack" to the last time it matches.
For the last part its more about the ored regexes
Consider input string as cdacdgabcdef and the regex (ab|a).*
A common mistake is thinking it will return the more precise one (in this case 'abcdef') but it will return 'acdgabcdef' because the a match is the first one to match.
what happens here is: There's something matching this part, let's continue to the next part of the pattern and forget about the other options in this part.
For the lazy and greedy questions, the link of #AvinashRaj is clear enough, I won't repeat it here.

Related

Get value from a column with regex

I have lines of text similar to this:
value1|value2|value3 etc.
The values itself are not interesting. The type of delimeter is also unimportant, just like the number of fields. There could be 100 column, there could be only 5.
I would like to know what is the usual way to write a regexp which will put any given column's value into a capture group.
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
My problem is, that after a while I run into catastrophic backtracking or simply extremely low running times.
What is the usual way to solve a problem like this?
Edit:
For further clarification: this is a use case where further string operations are simply not permitted. Workarounds are not possible. I would like to know if there's a way simply based on regex or not.
As you stated:
My problem is, that after a while I run into catastrophic backtracking
or simply extremely low running times.
What is the usual way to solve a problem like this?
IMHO, You should prefer to perform string operations when you have a predefined structure in string (like for your case | character is used as a separator) because string operations are faster than using Regex which is designed to find a pattern. Like, in case the separators may change and we have to identify it first and then split based on separator, here the need of a Regex comes.
e.g.,
value1|value2;value3-value4
For your case, you can simply perform string split based on the separator character and access the respected index from an array.
EDIT:
If Regex is your only option then try using this regex:
^((.+?)\|){200}
Here 200 is the element I wish to access and seems a bit less time consuming than yours.
Demo
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
As a matter of "steps", using capture groups will cost more step.
However, using capture groups will allow you to condense your pattern and use a curly bracketed quantifier.
In your first pattern above, you can get away with "greedy" negated character classes (remove the ?) because they will halt at the next |, and you don't need to escape the pipe inside the square brackets.
When you want to access a "much later" positioned substring in the input string, not using a quantifier is going to require a horrifically long pattern and make it very, very difficult to comprehend the exact point that will be matched. In these cases, it would be pretty silly not to use a capture group and a quantifier.
I agree with Toto's comment; accessing an array of split results is going to be a very sensible solution if possible.

What does it mean that there is faster failure with atomic grouping

NOTE :- The question is bit long as it includes a section from book.
I was reading about atomic groups from Mastering Regular Expression.
It is given that atomic groups leads to faster failure. Quoting that particular section from the book
Faster failures with atomic grouping. Consider ^\w+: applied to
Subject. We can see, just by looking at it, that it will fail
because the text doesn’t have a colon in it, but the regex engine
won’t reach that conclusion until it actually goes through the
motions of checking.
So, by the time : is first checked, the \w+
will have marched to the end of the string. This results in a lot of
states — one skip me state for each match of \w by the plus
(except the first, since plus requires one match). When then checked
at the end of the string, : fails, so the regex engine backtracks to
the most recently saved state:
at
which point the : fails again, this time trying to match t. This
backtrack-test fail cycle happens all the way back to the oldest state:
After the attempt from the final state
fails, overall failure can finally be announced.
All that backtracking is a lot of work that after just a glance we
know to be unnecessary. If the colon can’t match after the last
letter, it certainly can’t match one of the letters the + is forced
to give up!
So, knowing that none of the states left by \w+, once
it’s finished, could possibly lead to a match, we can save the regex
engine the trouble of checking them: ^(?>\w+):. By adding the atomic
grouping, we use our global knowledge of the regex to enhance the
local working of \w+ by having its saved states (which we know to be
useless) thrown away. If there is a match, the atomic grouping won’t
have mattered, but if there’s not to be a match, having thrown away
the useless states lets the regex come to that conclusion more
quickly.
I tried these regex here. It took 4 steps for ^\w+: and 6 steps for ^(?>\w+): (with internal engine optimization disabled)
My Questions
In the second paragraph from above section, it is mentioned that
So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one skip me state for each match of \w by the plus (except the first, since plus requires one match).When then checked
at the end of the string, : fails, so the regex engine backtracks to
the most recently saved state:
at
which point the : fails again, this time trying to match t. This
backtrack-test fail cycle happens all the way back to the oldest state:
but on this site, I see no backtracking. Why?
Is there some optimization going on inside(even after it is disabled)?
Can the number of steps taken by a regex decide whether one regex is having good performance over other regex?
The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:
After \w+ consumes all the letters, it tries to match : and fails. Then it gives back one character, tries the : again, and fails again. And so on, until there's nothing left to give back. Fifteen steps total. Now look at the atomic version (^(?>\w+):):
After failing to match the : the first time, it gives back all the letters at once, as if they were one character. A total of five steps, and two of those are entering and leaving the group. And using a possessive quantifier (^\w++:) eliminates even those:
As for your second question, yes, the number-of-steps metric from regex debuggers is useful, especially if you're just learning regexes. Every regex flavor has at least a few optimizations that allow even badly written regexes to perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you're doing something wrong.

Can one find out which input characters matched which part of a regex?

I'm trying to build a tool that uses something like regexes to find patterns in a string (not a text string, but that is not important right now). I'm familiar with automata theory, i.e. I know how to implement basic regex matching, and output true or false if the string matches my regex, by simulating an automaton in the textbook way.
Say I'm interested in all as that comes before bs, with no more as before the bs, so, this regex: a[^a]*b. But I don't just want to find out if my string contains such a part, I want to get as output the a, so that I can inspect it (remember, I'm not actually dealing with text).
In summary: Let's say I mark the a with parentheses, like so: (a)[^a]*b and run it on the input string bcadacb then I want the second a as output.
Or, more generally, can one find out which characters in the input string matches which part of the regex? How is it done in text editors? They at least know where the match started, because they can highlight the matches. Do I have to use a backtracking approach, or is there a smarter, less computationally expensive, way?
EDIT: Proper back references, i.e. capturing with parens and referencing with \1, etc. may not be necessary. I do know that back references do introduce the need for backtracking (or something similar) and make the problem (IIRC) NP-hard. My question, in essence, is: Is the capturing part, without the back referencing, less computationally expensive than proper back references?
Most text editors do this by using a backtracking algorithm, in which case recording the match locations is trivial to add.
It is possible to do with a direct NFA simulation too, by augmenting the state lists with parenthesis location information. This can be done in a way that preserves the linear time guarantee. See http://swtch.com/~rsc/regexp/regexp2.html#submatch.
Timos's answer is on the right track, but you cannot tag DFA states, because a DFA state corresponds to a collection of possible NFA states, and so one DFA state might represent the possibility of having passed a paren (but maybe something else too) and if that turns out not to be the case, it would be incorrect to record it as fact. You really need to work on the NFA simulation instead.
After you constructed your DFA for the matching, mark all states which correspond to the first state after an opening parenthesis in the regex. When you visit such a state, save the index of the current input character, when you visit a state which corresponds to a closing parenthesis, also save the index.
When you reach an accepting state, output the two indices. I am not sure if this is the algorithm used in text editors, but that's how I would do it.

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

How to detect deviations in a sequence from a pattern?

i want to detect the passages where a sequence of action deviates from a given pattern and can't figure out a clever solution to do this, although the problem sound really simple.
The objective of the pattern is to describe somehow a normal sequence. More specific: "What actions should or shouldn't be contained in an action sequence and in which order?" Then i want to match the action sequence against the pattern and detect deviations and their locations.
My first approach was to do this with regular expressions. Here is an example:
Example 1:
Pattern: A.*BC
Sequence: AGDBC (matches)
Sequence: AGEDC (does not match)
Example 2:
Pattern: ABCD
Sequence: ABD (does not match)
Sequence: ABED (does not match)
Sequence: ABCED (does not match)
Example 3:
Pattern: ABCDEF
Sequence: ABXDXF (does not match)
With regular expressions it is simple to detect a error but not where it occurs. My approach was to successively remove the last regex block until I can find the pattern in the sequence. Then i will know the last correct actions and have at least found the first deviation. But this doesn't seem the best solution to me. Further i can't all deviations.
Other soultions in my mind are working with state machines, order tools like ANTLR. But I don't know if they can solve my problem.
I want to detect errors of omission and commission and give an user the possibility to create his own pattern. Do you know a good way to do this?
While matching your input, a regular expression engine has information on the location of a mismatch -- it may however not provide it in a way where you can easily access it.
Consider e.g. a DFA implementing the expression. It fetches characters sequentially, matching them with expectation, and you are interested at the point in the sequence where there is no valid match.
Other implementations may go back and forth, and you would be interested in the maximum address of any character that was fetched.
In Java, this could possibly be done by supplying a CharSequence implementation to
java.util.regex.Pattern.matches(String regex, CharSequence input)
where the accessor methods keep track of the maximum index.
I have not tried that, though. And it does not help you categorizing an error, either.
have you looked at markov chains? http://en.wikipedia.org/wiki/Markov_chain - it sounds like you want unexpected transitions. maybe also hidden markov models http://en.wikipedia.org/wiki/Hidden_Markov_Models
Find an open-source implementation of regexs, and add in a hook to return/set/print/save the index at which it fails, if a given comparison does not match. Alternately, write your own RE engine (not for the faint of heart) so it'll do exactly what you want!