How does the ? make a quantifier lazy in regex - regex

I've been looking into regex lately and figured out that the ? operator makes the *, +, or ? lazy. My question is: how does it do that? Is *?, for example, a special operator, or does the ? have an effect on the * operator? In other words, does regex recognize *? as one operator in itself, or as the two separate operators * and ? in sequence? If *? is recognized as two separate operators, how does the ? affect the * to make it lazy? If ? means that the * is optional, shouldn't this mean that the * doesn't have to exist at all? If so, then in a pattern like .*? wouldn't regex just match separate letters and the whole string instead of the shorter string? Please explain, I'm desperate to understand. Many thanks.

? can mean a lot of different things in different contexts.
Following a normal regex token (a character, a shorthand, a character class, a group...), it means "Match the previous item 0-1 times".
Following a quantifier like ?, *, +, {n,m}, it takes on a different meaning: "Make the previous quantifier lazy instead of greedy" (if greedy is the default; that can be changed, though. For example, in PHP the /U modifier makes all quantifiers lazy by default, so the additional ? makes them greedy).
Right after an opening parenthesis, it marks the start of a special construct, for example
a) (?s): mode modifiers ("turn on dotall mode")
b) (?:...): make the group non-capturing
c) (?=...) or (?!...): lookahead assertion
d) (?<=...) or (?<!...): lookbehind assertion
e) (?>...): atomic group
f) (?<foo>...): named capturing group
g) (?#comment): inline comments, ignored by the regex engine
h) (?(?=if)then|else): conditionals
and others. Not all constructs are available in all regex flavors.
Within a character class ([?]), it simply matches a verbatim ?.
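To make the cases above concrete, here is a minimal sketch using Python's re module. That choice is an assumption (the answer itself is flavor-agnostic), the sample strings are made up for illustration, and only constructs that Python's re supports are shown.

```python
import re

# 1. After a normal token: "match the previous item 0-1 times".
optional_short = re.fullmatch(r"colou?r", "color") is not None
optional_long = re.fullmatch(r"colou?r", "colour") is not None

# 2. After a quantifier: make that quantifier lazy.
tagged = "<b>bold</b>"
greedy = re.match(r"<.+>", tagged).group()
lazy = re.match(r"<.+?>", tagged).group()

# 3. Right after "(": the start of a special construct.
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()
lookahead = re.findall(r"\w+(?=!)", "hi! there ok!")

# 4. Inside a character class: a literal question mark.
literal = re.findall(r"[?]", "what? why?")

print(optional_short, optional_long)  # True True
print(greedy)         # <b>bold</b>
print(lazy)           # <b>
print(non_capturing)  # ('c',)
print(lookahead)      # ['hi', 'ok']
print(literal)        # ['?', '?']
```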

I think a little history will make it easier to understand. When Larry Wall wanted to grow regex syntax to support new features, his options were severely limited. He couldn't just decree (for example) that % is now a metacharacter that supports new feature "XYZ". That would break the millions of existing regexes that happened to use % to match a literal percent sign.
What he could do is take an already-defined metacharacter and use it in such a way that its original function wouldn't make sense. For example, any regex that contained two quantifiers in a row would be invalid, so it was safe to say a ? after another quantifier now turns it into a reluctant quantifier (a much better name than "lazy" IMO; "non-greedy" is good too). So the answer to your question is that ? doesn't modify the *; *? is a single entity: a reluctant quantifier. The same is true of the + in possessive quantifiers (*+, {0,2}+ etc.).
A similar process occurred with group syntax. It would never make sense to have a quantifier after an unescaped opening parenthesis, so it was safe to say (? now marks the beginning of a special group construct. But the question mark alone would only support one new feature, so the ? itself has to be followed by at least one more character to indicate which kind of group it is ((?:...), (?<!...), etc.). Again, the (?: is a single entity: the opening delimiter of a non-capturing group.
I don't know offhand why he used the question mark both times. I do know Perl 6 Rules (a bottom-up rewrite of Perl 5 regexes) has done away with all that crap and uses an infinitely more sensible syntax.

Imagine you have the following text:
BAAAAAAAAD
The following regexes will match:
/B(A+)/ => 'BAAAAAAAA'
/B(A+?)/ => 'BA'
/B(A*)/ => 'BAAAAAAAA'
/B(A*?)/ => 'B'
The addition of the "?" to the + and * operators makes them "lazy", i.e. they will match the absolute minimum required for the expression to be true. By default, the * and + operators are "greedy" and try to match AS MUCH AS POSSIBLE for the expression to be true.
Remember + means "one or more" so the minimum will be "one if possible, more if absolutely necessary" whereas the maximum will be "all if possible, one if absolutely necessary".
And * means "zero or more" so the minimum will be "nothing if possible, more if absolutely necessary" whereas the maximum will be "all if possible, zero if absolutely necessary".
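The four matches listed above can be reproduced with a short sketch in Python's re module (an assumption; the answer uses Perl-style /.../ notation). The last line is an extra illustration, not from the answer, of "more if absolutely necessary": a lazy quantifier still expands when the rest of the pattern requires it.

```python
import re

text = "B" + "A" * 8 + "D"  # the sample text "BAAAAAAAAD"

patterns = [r"B(A+)", r"B(A+?)", r"B(A*)", r"B(A*?)"]
# group(0) is the whole match, which is what the list above shows.
matched = {p: re.search(p, text).group(0) for p in patterns}

for pattern, whole in matched.items():
    print(f"{pattern:8} => {whole!r}")

# A lazy quantifier is not "as short as possible, full stop": here the
# trailing D forces A+? to grow until the D can match.
forced = re.search(r"B(A+?)D", text).group(1)
print(forced)  # AAAAAAAA
```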

This very much depends on the implementation, I guess. But since every quantifier I am aware of can be modified with ? it might be reasonable to implement it that way.
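As one data point on how an implementation can treat this, CPython's re module can print its parsed form of a pattern via the re.DEBUG flag. The sketch below is specific to that implementation and says nothing about other engines; the exact debug output format is not a stable API and may differ between Python versions. debug_dump is a hypothetical helper name.

```python
import contextlib
import io
import re


def debug_dump(pattern: str) -> str:
    """Return whatever re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


greedy_dump = debug_dump(r"a*")
lazy_dump = debug_dump(r"a*?")

print(greedy_dump)
print(lazy_dump)
```

In the versions I am aware of, the lazy pattern is reported as a single MIN_REPEAT node rather than as a repeat wrapped in an "optional" node, which suggests the parser treats *? as one quantifier.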

Related

Python re, get the first element found with findall [duplicate]

How do I make a python regex like "(.*)" such that, given "a (b) c (d) e" python matches "b" instead of "b) c (d"?
I know that I can use "[^)]" instead of ".", but I'm looking for a more general solution that keeps my regex a little cleaner. Is there any way to tell python "hey, match this as soon as possible"?
You seek the all-powerful *?
From the docs, Greedy versus Non-Greedy
the non-greedy qualifiers *?, +?, ??, or {m,n}? [...] match as little
text as possible.
>>> x = "a (b) c (d) e"
>>> re.search(r"\(.*\)", x).group()
'(b) c (d)'
>>> re.search(r"\(.*?\)", x).group()
'(b)'
According to the docs:
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
Would not \\(.*?\\) work? That is the non-greedy syntax.
Using an ungreedy match is a good start, but I'd also suggest that you reconsider any use of .* -- what about this?
groups = re.search(r"\([^)]*\)", x)
Do you want it to match "(b)"? Do as Zitrax and Paolo have suggested. Do you want it to match "b"? Do
>>> x = "a (b) c (d) e"
>>> re.search(r"\((.*?)\)", x).group(1)
'b'
As the others have said using the ? modifier on the * quantifier will solve your immediate problem, but be careful, you are starting to stray into areas where regexes stop working and you need a parser instead. For instance, the string "(foo (bar)) baz" will cause you problems.
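A small sketch of the nested-parentheses problem mentioned above, using the "(foo (bar)) baz" string. The outer_group helper is hypothetical and only illustrates the "use a parser instead" idea with a simple depth counter; it is not a full parser.

```python
import re

nested = "(foo (bar)) baz"

greedy = re.search(r"\(.*\)", nested).group()
lazy = re.search(r"\(.*?\)", nested).group()
innermost = re.search(r"\([^()]*\)", nested).group()

print(greedy)     # (foo (bar))
print(lazy)       # (foo (bar)   <- unbalanced
print(innermost)  # (bar)


def outer_group(text):
    """Return the first balanced (...) span, or None, by counting depth."""
    depth = 0
    start = None
    for index, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = index
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:index + 1]
    return None


print(outer_group(nested))           # (foo (bar))
print(outer_group("a (b) c (d) e"))  # (b)
```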
To start with, I do not suggest using "*" in regexes. Yes, I know, it is the most used quantifier, but it is nevertheless a bad idea. This is because, while it does match any amount of repetition for that character, "any" includes 0, which is usually something you want to throw a syntax error for, not accept. Instead, I suggest using the + sign, which matches any repetition of length >= 1 (one or more). What's more, from what I can see, you are dealing with fixed-length parenthesized expressions. As a result, you can probably use the {x,y} syntax (no space after the comma) to specify the desired length.
However, if you really do need non-greedy repetition, I suggest consulting the all-powerful ?. This, when placed at the end of any regex repetition specifier, will force that part of the regex to find the least amount of text possible.
That being said, I would be very careful with the ? as it, like the Sonic Screwdriver in Dr. Who, has a tendency to do, how should I put it, "slightly" undesired things if not carefully calibrated. For example, to use your example input, it would identify ((1) (note the lack of a second rparen) as a match.
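The difference between *, + and a bounded {m,n} inside the parentheses can be seen in a short sketch. The sample strings are invented for illustration and the bounds 1 and 3 are arbitrary.

```python
import re

empty = "a () b"

# *? accepts an empty pair of parentheses; +? requires at least one character.
star = re.search(r"\((.*?)\)", empty)
plus = re.search(r"\((.+?)\)", empty)

print(repr(star.group(1)))  # ''
print(plus)                 # None

# {1,3}? limits the content to between one and three characters.
bounded_ok = re.search(r"\((.{1,3}?)\)", "a (bcd) e")
bounded_fail = re.search(r"\((.{1,3}?)\)", "a (bcde) f")

print(bounded_ok.group(1))  # bcd
print(bounded_fail)         # None
```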

Regex query efficient?

I came up with the regular expression below to look for terms like Password, Passphrase, Pass001, etc. and the word following them. Is it efficient, or can it be made better? Thanks for the help
"([Pp][aA][sS][Ss]([wW][oO][rR][dD][sS]?|[Pp][hH][rR][aA][sS][eE])?|[Pp]([aA][sS]([sS])?)?[wW][Dd])[0-9]?[0-9]?[0-9]?[\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*"
I will be using it to scan files up to 300K for these terms. When I try to scan a whole C: drive with this expression, it takes 5 hours or, in the worst case I have encountered, 5 days
You may use the following enhancement:
(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*
See the regex demo
Instead of [sS], you may make the regex case insensitive by adding the (?i) case-insensitive modifier. Use the corresponding option in your software if it does not work like this.
Make sure your alternatives do not match at the same location in the string. It is not quite easy here, but the p at the start of each alternative in the first group decreases the regex efficiency. So, move it outside (e.g. (?:pass|port) => p(?:ass|ort)).
Use non-capturing groups rather than capturing ones if you are not going to access submatches; that also has a slight impact on performance.
Use limiting quantifiers instead of repeating ?-quantified patterns. Instead of a?a?a?, use a{0,3}.
Do not overescape chars inside the character class. I only left \/, \] and \[ as I am not sure what regex flavor you are using; it may turn out you can avoid escaping altogether.
Note that the performance penalty is big if you have consecutive non-fixed-width patterns that may match the same type of chars. You have [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*: [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+ matches 1 or more special chars, and \S* matches 0 or more chars other than whitespace, which includes some of the chars matched by the preceding pattern. Remove the + from the preceding subpattern.
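The rewritten pattern can be tried out with a short sketch in Python's re module. This is an assumption, since the question does not say which flavor or tool is used, and the sample strings and the first_match helper are invented for illustration.

```python
import re

# The pattern from the answer, split across two literals for readability.
improved = re.compile(
    r"(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}"
    r"[-\s:=_\/#&'\]\[()+*\r\n]\S*"
)


def first_match(text):
    """Return the first match as a string, or None if there is none."""
    found = improved.search(text)
    return found.group() if found else None


samples = [
    "Password=hunter2",
    "pwd:secret",
    "PASS001-abc",
    "compass",
    "Password: hunter2",
]
results = {sample: first_match(sample) for sample in samples}

for sample, result in results.items():
    print(f"{sample!r:22} -> {result!r}")
```

One thing this makes visible: with the + removed, the separator class consumes exactly one character, so "Password: hunter2" yields only "Password:" (the space after the colon stops \S*). Whether that matters depends on what the scan needs to capture.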

Why is a character class faster than alternation?

It seems that using a character class is faster than the alternation in an example like:
[abc] vs (a|b|c)
I have heard about it being recommended and with a simple test using Time::HiRes I verified it (~10 times slower).
Also using (?:a|b|c) in case the capturing parenthesis makes a difference does not change the result.
But I cannot understand why. I think it is because of backtracking, but the way I see it, at each position there are 3 character comparisons, so I am not sure how backtracking comes into play for the alternation. Is it a result of how alternation is implemented?
This is because the "OR" construct | backtracks between the alternatives: if the first alternative does not match, the engine has to return to the position the pointer was at before it attempted that alternative, and then try the next one, whereas the character class can advance sequentially. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
Pattern: [rf]at
Match string: carat
But to be short, the fact that pcre engine optimizes this (single literal characters -> character class) away is already a decent hint that alternations are inefficient.
Because a character class like [abc] is irreducible and can be optimised, whereas an alternation like (?:a|b|c) may also be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t).
The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.
There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".
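A rough way to measure this yourself is sketched below with Python's re and timeit (an assumption; the question used Perl with Time::HiRes). No particular ratio is claimed: some engines, possibly including Python's, rewrite an alternation of single characters into a character class at compile time, in which case the two timings may come out close. The haystack string and repetition count are arbitrary.

```python
import re
import timeit

# A long string in which the only match is the very last character.
haystack = "xyz" * 20000 + "c"

char_class = re.compile(r"[abc]")
alternation = re.compile(r"(?:a|b|c)")

# Both patterns must agree on where the match is.
same_position = (
    char_class.search(haystack).start() == alternation.search(haystack).start()
)

class_seconds = timeit.timeit(lambda: char_class.search(haystack), number=20)
alternation_seconds = timeit.timeit(lambda: alternation.search(haystack), number=20)

print("same match position:", same_position)
print(f"[abc]      : {class_seconds:.4f} s")
print(f"(?:a|b|c)  : {alternation_seconds:.4f} s")
```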

What does (.+_)* mean when using Henry Spencer regular expression library?

With reference to the Henry Spencer regex library, I want to know the difference between (.+_)* and (._)*.
(.+_)* tries to match the string from the back as well. From my understanding, . matches any single character, and .+ means a non-zero number of occurrences of that character. _ will mean space or { or } or , etc.
Parentheses imply that any one can be considered for a match, and the final * signifies 0 or more occurrences.
I feel (._)* would also achieve the same thing. The + after . might be redundant.
Can someone explain the subtle difference between the two?
For example, aa aa will be matched by (.+_)* but not by (._)* because the latter expects only one character before the space.
I don't recall that underscore has any special meaning. The special thing about the Henry Spencer regex library is that it combines both regex engine techniques: deterministic and non-deterministic.
This has a pro and a con.
The pro is that your regexps will be the fastest possible even when simply built, while in other engines you might have to use lookahead and advanced regexp techniques (like making it fail early if there is no match) to achieve the same speed.
The con is that the entire regexp will be either greedy or non-greedy. That is, if you use a * or + without a following ?, then the entire regexp will be greedy, even if you use ? after a later quantifier. If the first time you use a * or + you follow it with a ?, then the entire regexp will be non-greedy.
This makes it slightly trickier to craft the regexp, but only slightly.
The Henry Spencer library is the engine behind Tcl's regexp command, which makes this language very efficient for regexps.
As far as I know, the _ doesn't have a special meaning; it is just a "_". See regular-expressions.info
Your two regexes are not the same.
(._)* will match one character followed by an underscore (if the underscore has a special meaning in your implementation replace "underscore" by that meaning), this sequence will be matched 0 or more times, e.g. "a_%_._?_"
(.+_)* will match at least one character followed by an underscore, this sequence will be matched 0 or more times, e.g. "abc45_%_.;,:_?#'+*~_"
(.+_)* will match everything that can be matched by (._)* but not the other way round.
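The difference can be checked with a short sketch. Two assumptions: the underscore is treated as a literal "_", and Python's re stands in for the Henry Spencer library (the engines differ internally, but which whole strings these two patterns accept should not depend on that). The sample strings are shortened versions of the ones above.

```python
import re

one_char = re.compile(r"(._)*")     # exactly one character before each "_"
many_chars = re.compile(r"(.+_)*")  # one or more characters before each "_"

samples = ["a_%_._?_", "abc45_%_", ""]

# For each sample: (accepted by (._)*, accepted by (.+_)*)
results = {
    s: (one_char.fullmatch(s) is not None, many_chars.fullmatch(s) is not None)
    for s in samples
}

for sample, (short_ok, long_ok) in results.items():
    print(f"{sample!r:12} (._)*: {short_ok!s:5}  (.+_)*: {long_ok}")
```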

What is the difference between the regex (.*?) and (.*)?

I've been doing regex for a while, but I'm not an expert on the subtleties of what particular rules do. I've always used (.*?) for matching with a restriction, in that I understood it would stop at the first chance it got, whereas (.*)? would continue and be more greedy.
But I have no real reason for thinking that; I just think it because I read it once upon a time.
Now I'd like to know: is there a difference? And if so, what is it?
(.*?) is a group containing a non-greedy match.
(.*)? is an optional group containing a greedy match.
Others have pointed out the difference between greedy and non-greedy matches. Here is an example of different results you can see in practice. Since regular expressions are often embedded in a host language, I'm going to use Perl as the host. In Perl, enclosing parts of a pattern in parentheses assigns what they matched to special variables. Therefore in this case, both patterns may match, but what's assigned to those variables may not be the same:
For example, let's say your match string is 'hello'. Both patterns would match it, but the matched portions ($1) differ:
'hello' =~ /(.*?)l/;
# $1 == 'he'
'hello' =~ /(.*)?l/;
# $1 == 'hel'
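For readers without Perl at hand, a Python counterpart of the two matches above might look like this (a sketch; group(1) plays the role of $1, and group(0) is the whole match):

```python
import re

lazy_group = re.search(r"(.*?)l", "hello")
optional_greedy_group = re.search(r"(.*)?l", "hello")

# (.*?)l stops at the first "l" it can reach.
print(lazy_group.group(0), lazy_group.group(1))  # hel he

# (.*)?l grabs as much as it can and backs off to the last "l".
print(optional_greedy_group.group(0), optional_greedy_group.group(1))  # hell hel
```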
Because * means "zero or more", it all gets slightly confusing. Both ?'s are quite different, which can be more clearly shown with a different example of each:
fo*? will match only f if you supply it foo. That is, this ? makes the match non-greedy. Removing it makes it match foo.
fo? will match f, but also fo. That is, this ? makes the match optional: the part that it applies to (in this case only o) must be present 0 or 1 times. Removing it makes the match required: it must then be present exactly once, so only fo will still match.
And while we're at different meanings of the ? in regexps, there's one more: a ? immediately following a ( is a prefix for several special operations, such as lookaround. That is, its meaning is not like any of the things you ask.
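The two roles of ? described above can be checked side by side with a minimal sketch in Python's re (an assumption; the answer does not name a flavor). The last example, fo??, combines both roles and is an addition for illustration.

```python
import re

# ? after a quantifier: non-greedy.
lazy_star = re.search(r"fo*?", "foo").group()
greedy_star = re.search(r"fo*", "foo").group()
print(lazy_star, greedy_star)  # f foo

# ? after a plain token: optional (0 or 1 times).
optional_matches_f = re.fullmatch(r"fo?", "f") is not None
optional_matches_fo = re.fullmatch(r"fo?", "fo") is not None
required_matches_f = re.fullmatch(r"fo", "f") is not None
print(optional_matches_f, optional_matches_fo, required_matches_f)  # True True False

# Both at once: an optional "o" that prefers to be absent.
lazy_optional = re.search(r"fo??", "foo").group()
print(lazy_optional)  # f
```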
The ? has different meanings.
When it follows a character or a group it is a quantifier, matching 0 or 1 occurrence of the preceding construct. See here for details
When it follows a quantifier it modifies the matching behaviour of that quantifier, making it match lazy/ungreedy. See here for details