What is the difference between the regex (.*?) and (.*)? - regex

I've been using regex for a while, but I'm not an expert on the subtleties of what particular rules do. I've always used (.*?) for matching with restriction, as I understood it would stop at the first chance it got, whereas (.*)? would continue and be more greedy.
But I have no real reason for thinking that; I just think it because I read it once upon a time.
Now I'd like to know: is there a difference? And if so, what is it?

(.*?) is a group containing a non-greedy match.
(.*)? is an optional group containing a greedy match.

Others have pointed out the difference between greedy and non-greedy matches. Here is an example of different results you can see in practice. Since regular expressions are often embedded in a host language, I'm going to use Perl as the host. In Perl, enclosing matches in parentheses assigns the results of those matches to special variables. Therefore in this case, the matches may be the same but what's assigned to those variables may not:
For example, let's say your match string is 'hello'. Both patterns would match it, but the matched portions ($1) differ:
'hello' =~ /(.*?)l/;
# $1 == 'he'
'hello' =~ /(.*)?l/;
# $1 == 'hel'
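The same comparison can be sketched in Python, assuming the standard `re` module as the host instead of Perl. `group(1)` plays the role of `$1` here, and the variable names are only illustrative.

```python
import re

text = "hello"

# (.*?)l : a group with a lazy quantifier inside; it stops at the
# first "l" it can reach.
lazy = re.search(r"(.*?)l", text)

# (.*)?l : an optional group with a greedy quantifier inside; it backs
# off only far enough to leave one "l" for the rest of the pattern.
greedy_optional = re.search(r"(.*)?l", text)

print(lazy.group(1))             # he
print(greedy_optional.group(1))  # hel
```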

Because * means "zero or more", it all gets slightly confusing. Both ?'s are quite different, which can be more clearly shown with a different example of each:
fo*? will match only f if you supply it foo. That is, this ? makes the match non-greedy. Removing it makes it match foo.
fo? will match f, but also fo. That is, this ? makes the match optional: the part that it applies to (in this case only o) must be present 0 or 1 times. Removing it makes the match required: it must then be present exactly once, so only fo will still match.
And while we're at different meanings of the ? in regexps, there's one more: a ? immediately following a ( is a prefix for several special operations, such as lookaround. That is, its meaning is not like any of the things you ask.
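A minimal sketch of the two meanings, assuming Python's `re` module (the variable names are made up for illustration):

```python
import re

# ? after a quantifier: o*? takes as few "o" characters as possible.
lazy_star = re.match(r"fo*?", "foo").group()    # "f"
greedy_star = re.match(r"fo*", "foo").group()   # "foo"

# ? after an ordinary token: the "o" may appear zero or one time.
optional_on_f = re.fullmatch(r"fo?", "f") is not None     # True
optional_on_fo = re.fullmatch(r"fo?", "fo") is not None   # True

# Without the ?, the "o" is required exactly once.
required_on_f = re.fullmatch(r"fo", "f") is not None      # False
```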

The ? has different meanings.
When it follows a character or a group it is a quantifier, matching 0 or 1 occurrence of the preceding construct.
When it follows a quantifier it modifies the matching behaviour of that quantifier, making it match lazy/ungreedy.

Related

Why can "a*a+" and "(a{2,3})*a{2,3}" match "aaaa" while "(a{2,3})*" cannot?

My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary. Therefore, in a*a+, a* would give one (or maybe more?) character back to a+ so it can match.
However, in (a{2,3})*, why doesn't the first "instance" of a{2,3} give a character to the second "instance" so the second one can match?
Also, in (a{2,3})*a{2,3} the first part does seem to give a character to the second part.
A simple workaround for your question is to match aaaa with regex ^(a{2,3})*$.
Your problem is that:
In the case of (a{2,3})*, regex doesn't seem to consume as many characters as possible.
I suggest not thinking in terms of giving back characters. Instead, the key is acceptance.
Once the regex accepts your string, the matching is over. The pattern a{2,3} only matches aa or aaa. So when matching aaaa with (a{2,3})*, the greedy engine first matches aaa. It then can't match another a{2,3}, because only one a remains. The regex engine could backtrack and match an extra a{2,3}, but it doesn't: aaa is already accepted by the regex, so the engine skips the expensive backtracking.
If you add a $ to the end of the regex, it simply tells the regex engine that a partial match is unacceptable. Moreover, it's easy to explain the (a{2,3})*a{2,3} case with accepting and backtracking.
The main problem is this:
My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary
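This is not quite right, and it is where the confusion comes from.
Greedy simply means "try the longest possible match first". The engine only gives characters back when the rest of the pattern would otherwise fail, and in (a{2,3})* nothing follows the group that could fail.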
Once you interpret the expressions with this new understanding everything makes sense.
a*a+ - zero or more a followed by one or more a
(a{2,3})*a{2,3} - zero or more of either two or three a followed by either two or three a (note: the KEY THING to remember is "zero or more", the first part not matching any character is considered a match)
(a{2,3})* - zero or more of either two or three a (this means that after matching three as the last single a left cannot match)
Backtracking is done only if the match fails, and aaa is a valid match. A negative lookahead (?!a) can be used to prevent the match from being followed by an a.
compare
(aaa?)*
and
(aaa?)*(?!a)
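The cases discussed in these answers can be checked with a short sketch, assuming Python's `re` module. `re.match` only anchors at the start of the string, while `re.fullmatch` plays the role of `^...$`; the variable names are illustrative.

```python
import re

s = "aaaa"

# Unanchored: the engine accepts "aaa" and stops, because nothing
# after the group forces it to reconsider.
unanchored = re.match(r"(a{2,3})*", s).group()           # "aaa"

# Anchored at both ends: "aaa" plus a leftover "a" is not acceptable,
# so the engine backtracks and settles on "aa" + "aa".
anchored = re.fullmatch(r"(a{2,3})*", s).group()         # "aaaa"

# Patterns with something after the starred part also force backtracking.
star_plus = re.match(r"a*a+", s).group()                 # "aaaa"
with_tail = re.match(r"(a{2,3})*a{2,3}", s).group()      # "aaaa"

# The lookahead comparison from the last answer.
plain = re.match(r"(aaa?)*", s).group()                  # "aaa"
with_lookahead = re.match(r"(aaa?)*(?!a)", s).group()    # "aaaa"
```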

How does the ? make a quantifier lazy in regex

I've been looking into regex lately and figured that the ? operator makes the *, +, or ? lazy. My question is how does it do that? Is it that *? for example is a special operator, or does the ? have an effect on the *? In other words, does regex recognize *? as one operator in itself, or does regex recognize *? as the two separate operators * and ?? If it is the case that *? is being recognized as two separate operators, how does the ? affect the * to make it lazy? If ? means that the * is optional, shouldn't this mean that the * doesn't have to exist at all? If so, then in a statement .*? wouldn't regex just match separate letters and the whole string instead of the shorter string? Please explain, I'm desperate to understand. Many thanks.
? can mean a lot of different things in different contexts.
Following a normal regex token (a character, a shorthand, a character class, a group...), it means "Match the previous item 0-1 times".
Following a quantifier like ?, *, +, {n,m}, it takes on a different meaning: "Make the previous quantifier lazy instead of greedy" (if that's the default; that can be changed, though. For example, in PHP the /U modifier makes all quantifiers lazy by default, so the additional ? makes them greedy).
Right after an opening parenthesis, it marks the start of a special construct like for example
a) (?s): mode modifiers ("turn on dotall mode")
b) (?:...): make the group non-capturing
c) (?=...) or (?!...): lookahead assertion
d) (?<=...) or (?<!...): lookbehind assertion
e) (?>...): atomic group
f) (?<foo>...): named capturing group
g) (?#comment): inline comments, ignored by the regex engine
h) (?(?=if)then|else): conditionals
and others. Not all constructs are available in all regex flavors.
Within a character class ([?]), it simply matches a verbatim ?.
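As a rough illustration of these four contexts, here is a sketch assuming Python's `re` module. Only a few of the special groups from the list are shown, and some constructs are spelled differently or unavailable in Python (named groups, for instance, are written `(?P<foo>...)` there).

```python
import re

# 1. After an ordinary token: "match the previous item 0-1 times".
optional_u = [w for w in ("color", "colour") if re.fullmatch(r"colou?r", w)]

# 2. After a quantifier: make that quantifier lazy.
greedy_tag = re.match(r"<.+>", "<a><b>").group()    # "<a><b>"
lazy_tag = re.match(r"<.+?>", "<a><b>").group()     # "<a>"

# 3. Right after "(": a special construct, here a lookahead
#    and a non-capturing group.
before_bang = re.findall(r"\w+(?=!)", "hi! there")  # ["hi"]
groups = re.match(r"(?:ab)+(c)", "ababc").groups()  # ("c",)

# 4. Inside a character class: a literal question mark.
literal_marks = re.findall(r"[?]", "a?b?")          # ["?", "?"]
```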
I think a little history will make it easier to understand. When Larry Wall wanted to grow regex syntax to support new features, his options were severely limited. He couldn't just decree (for example) that % is now a metacharacter that supports new feature "XYZ". That would break the millions of existing regexes that happened to use % to match a literal percent sign.
What he could do is take an already-defined metacharacter and use it in such a way that its original function wouldn't make sense. For example, any regex that contained two quantifiers in a row would be invalid, so it was safe to say a ? after another quantifier now turns it into a reluctant quantifier (a much better name than "lazy" IMO; non-greedy good too). So the answer to your question is that ? doesn't modify the *, *? is a single entity: a reluctant quantifier. The same is true of the + in possessive quantifiers (*+, {0,2}+ etc.).
A similar process occurred with group syntax. It would never make sense to have a quantifier after an unescaped opening parenthesis, so it was safe to say (? now marks the beginning of a special group construct. But the question mark alone would only support one new feature, so the ? itself has to be followed by at least one more character to indicate which kind of group it is ((?:...), (?<!...), etc.). Again, the (?: is a single entity: the opening delimiter of a non-capturing group.
I don't know offhand why he used the question mark both times. I do know Perl 6 Rules (a bottom-up rewrite of Perl 5 regexes) has done away with all that crap and uses an infinitely more sensible syntax.
Imagine you have the following text:
BAAAAAAAAD
The following regexs will return:
/B(A+)/ => 'BAAAAAAAA'
/B(A+?)/ => 'BA'
/B(A*)/ => 'BAAAAAAAA'
/B(A*?)/ => 'B'
The addition of the "?" to the + and * operators makes them "lazy", i.e. they will match the absolute minimum required for the expression to be true. By default, the * and + operators are "greedy" and try to match AS MUCH AS POSSIBLE for the expression to be true.
Remember + means "one or more" so the minimum will be "one if possible, more if absolutely necessary" whereas the maximum will be "all if possible, one if absolutely necessary".
And * means "zero or more" so the minimum will be "nothing if possible, more if absolutely necessary" whereas the maximum will be "all if possible, zero if absolutely necessary".
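The table above can be reproduced with a small sketch, assuming Python's `re` module. The extra `forced` case is my own addition, to show the "more if absolutely necessary" part: a lazy quantifier still grows when the rest of the pattern demands it.

```python
import re

text = "BAAAAAAAAD"
patterns = (r"B(A+)", r"B(A+?)", r"B(A*)", r"B(A*?)")

# Whole match for each pattern, mirroring the list above.
results = {p: re.search(p, text).group() for p in patterns}
for pattern, matched in results.items():
    print(pattern, "=>", matched)

# A lazy quantifier followed by a required "D" has to take every A.
forced = re.search(r"B(A+?)D", text).group(1)   # "AAAAAAAA"
```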
This very much depends on the implementation, I guess. But since every quantifier I am aware of can be modified with ? it might be reasonable to implement it that way.

What does ?: do in regex

I have a regex that looks like this
/^(?:\w+\s)*(\w+)$*/
What is the ?:?
It indicates that the subpattern is a non-capture subpattern. That means whatever is matched in (?:\w+\s), even though it's enclosed by () it won't appear in the list of matches, only (\w+) will.
You're still looking for a specific pattern (in this case, a single whitespace character following at least one word), but you don't care what's actually matched.
It means: only group, but do not remember the grouped part.
By default ( ) tells the regex engine to remember the part of the string that matches the pattern between them. But at times we just want to group a pattern without triggering the regex memory; to do that we use (?: in place of (.
Further to the excellent answers provided, its usefulness is also to simplify the code required to extract groups from the matched results. For example, your (\w+) group is known as group 1 without having to be concerned about any groups that appear before it. This may improve the maintainability of your code.
Let's understand this with an example.
Say we are given the string s = "a eeee".
Your regex (/^(?:\w+\s)(\w+)$/) starts at the beginning of the string, finds 'a', and then the whitespace character after it. Without the ?:, that part would be captured and returned as 'a ' (a followed by the whitespace character).
If you don't want that in the result, you write (?:\w+\s) instead: the 'a ' still has to be there for the match to succeed, but it is not returned, so the only captured group is the (\w+) that follows it, here 'eeee'.
PS:I am beginner in regex.This is what i have understood with ?:.Feel free to pinpoint the wrong things explained.
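To see what the answers above describe, here is a minimal sketch assuming Python's `re` module and a simplified form of the question's pattern (the trailing `$*` from the question is not reproduced; the string is the one from the example above).

```python
import re

s = "a eeee"

# Non-capturing: "a " must be present, but it is not remembered,
# so the word at the end is group 1.
non_capturing = re.match(r"^(?:\w+\s)*(\w+)$", s).groups()   # ("eeee",)

# Capturing: the same text matches, but "a " now occupies group 1
# and the last word moves to group 2.
capturing = re.match(r"^(\w+\s)*(\w+)$", s).groups()         # ("a ", "eeee")
```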

How do I match a pattern with optional surrounding quotes?

How would one write a regex that matches a pattern that can contain quotes, but if it does, must have matching quotes at the beginning and end?
"?(pattern)"?
Will not work because it will allow patterns that begin with a quote but don't end with one.
"(pattern)"|(pattern)
Will work, but is repetitive. Is there a better way to do that without repeating the pattern?
You can get a solution without repeating by making use of backreferences and conditionals:
/^(")?(pattern)(?(1)\1|)$/
Matches:
pattern
"pattern"
Doesn't match:
"pattern
pattern"
This pattern is somewhat complex, however. It first looks for an optional quote, and puts it into backreference 1 if one is found. Then it searches for your pattern. Then it uses conditional syntax to say "if group 1 matched a quote, match that quote again here, otherwise match nothing". The whole pattern is anchored (which means that it needs to appear by itself on a line) so that unmatched quotes won't be captured (otherwise the pattern in pattern" would match).
Note that support for conditionals varies by engine and the more verbose but repetitive expressions will be more widely supported (and likely easier to understand).
Update: A much simpler version of this regex would be /^(")?(pattern)\1$/, which does not need a conditional. When I was testing this initially, the tester I was using gave me a false negative, which led me to discount it (oops!).
I'll leave the solution with the conditional up for posterity and interest, but this is a simpler version that is more likely to work in a wider variety of engines (backreferences are the only feature being used here which might be unsupported).
This is quite simple as well: (".+"|.+). Make sure the first match is with quotes and the second without.
Depending on the language you're using, you should be able to use backreferences. Something like this, say:
(["'])(pattern)\1|^(pattern)$
That way, you're requiring that either there are no quotes, or that the SAME quote is used on both ends.
This should work with recursive regex (which needs longer to get right). In the meantime: in Perl, you can build a self-modifying regex. I'll leave that as an academic example ;-)
my @stuff = ( '"pattern"', 'pattern', 'pattern"', '"pattern' );
foreach (@stuff) {
    print "$_ OK\n" if /^
        (")?
        \w+
        (??{defined $1 ? '"' : ''})
        $
    /x
}
Result:
"pattern" OK
pattern OK
Generally, @Daniel Vandersluis's response would work. However, some engines treat a backreference to an optional group (") that did not take part in the match as a failure, so \1 never matches when there is no opening quote.
In order to avoid this problem a more robust solution would be:
/^("|)(pattern)\1$/
Then the first group always takes part in the match, even if it captures an empty string. This expression can also be modified if there is some prefix in the expression and you want to capture it first:
/^(key)=("|)(value)\2$/
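The variants discussed in this thread can be compared with a short sketch, assuming Python's `re` module, where a backreference to a group that did not take part in the match fails. The `accepted` helper and the sample list are hypothetical names for illustration, and `re.fullmatch` stands in for the `^...$` anchors.

```python
import re

candidates = ['pattern', '"pattern"', '"pattern', 'pattern"']

def accepted(regex):
    """Return the candidates that the regex matches in full."""
    return [s for s in candidates if re.fullmatch(regex, s)]

# Group 1 always participates (it may capture the empty string),
# so \1 is always defined.
empty_alternative = accepted(r'("|)(pattern)\1')

# Conditional: require a closing quote only if group 1 matched.
conditional = accepted(r'(")?(pattern)(?(1)")')

# Optional group plus backreference: in Python the unquoted input
# fails, because \1 refers to a group that never matched.
optional_group = accepted(r'(")?(pattern)\1')

print(empty_alternative)  # ['pattern', '"pattern"']
print(conditional)        # ['pattern', '"pattern"']
print(optional_group)     # ['"pattern"']
```

Other engines may treat the unset backreference as an empty match instead, which is why the simpler version works in some testers and not in others.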

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
but this doesn't seem to work. If I have the regex
/[a|b]./
the string "abc" returns false with both my regex and the converse suggested by zijab:
/^((?!^[a|b].).)*$/
Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
      ^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
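A quick way to convince yourself that the two patterns are complementary is to run both over a few inputs. This sketch assumes Python's `re` module, where `search` follows the Perl-style "match anywhere" semantics; the sample strings are arbitrary and contain no newlines.

```python
import re

original = re.compile(r"[ab].")
inverse = re.compile(r"^(?:(?![ab].).)*$")

samples = ["abc", "xyz", "xa", "", "cab"]

# For each sample: (original matches somewhere, inverse matches).
table = {s: (bool(original.search(s)), bool(inverse.search(s)))
         for s in samples}

for s, (found, inverted) in table.items():
    # Exactly one of the two patterns should accept each string.
    print(repr(s), found, inverted)
```

Note that "xa" is accepted by the inverse: the a is the last character, so there is nothing left for the . in [ab]. to match.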
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression, that does match anything else but a string that contains foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three characters long string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
But note: this applies only to foo, and even then it is only approximate. It rejects oo, for example, which does not contain foo.
You can also do this (in Python) by using re.split and splitting on your regular expression, which returns all the parts that don't match the regex (see "how to find the converse of a regex").
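A minimal sketch of that idea, assuming Python's `re` module and a made-up input string. The second half shows the host-language negation suggested in an earlier answer.

```python
import re

text = "ab12cd345ef"

# Splitting on the regex keeps only the pieces the regex did NOT match.
# (If the pattern contained capturing groups, re.split would also
# include the captured text in the result.)
non_matching_parts = re.split(r"\d+", text)
print(non_matching_parts)   # ['ab', 'cd', 'ef']

# Asserting "no match at all" in the host language instead of the regex.
has_no_digits = re.search(r"\d", "abcdef") is None
print(has_no_digits)        # True
```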
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java regexes have an interesting way of doing this, where you create a possessive optional match (?+) for the string you want to exclude, and then match data after it. If the optional match fails, it's optional so it doesn't matter; if it succeeds, the rest of the expression still needs some extra data to match, and because a possessive quantifier never gives back what it matched, the overall match fails.
It looks counter-intuitive, but works.
E.g. (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but I couldn't get it to work myself (they seem more willing to backtrack if the second match fails?)