Regular Expression nongreedy is greedy - regex

I have the following text
tooooooooooooon
According to this book I'm reading, when the ? follows after any quantifier, it becomes non greedy.
My regex to*?n is still returning tooooooooooooon.
It should return ton shouldn't it?
Any idea why?

A regular expression can only match a fragment of text that actually exists.
Because the substring 'ton' doesn't exist anywhere in your string, it can't be the result of a match. A match will only ever return a substring of the original string.
EDIT: To be clear, if you were using the string below, with an extra 'n'
toooooooonoooooon
this regular expression (which doesn't specify 'o's)
t.*n
would match the following (as many characters as possible before an 'n')
toooooooonoooooon
but the regular expression
t.*?n
would only match the following (as few characters as possible before an 'n')
toooooooon
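The difference is easy to check in any regex engine. The sketch below uses Python's standard re module (my choice of engine, not the asker's); the variable names are only illustrative.

```python
import re

text = "toooooooonoooooon"  # the example string with the extra 'n'

greedy = re.search(r"t.*n", text).group()   # as many characters as possible
lazy = re.search(r"t.*?n", text).group()    # as few characters as possible

# The original question: the only 'n' is at the very end, so the lazy
# quantifier still has to take every 'o' before the match can succeed.
original = re.search(r"to*?n", "tooooooooooooon").group()

print(greedy)    # the whole string, up to the last 'n'
print(lazy)      # stops at the first 'n'
print(original)  # the whole string again
```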

A regular expression is always eager to match.
Your expression says this:
A 't', followed by *as few as possible* 'o's, followed by a 'n'.
That means any o's necessary will be matched, because there is an 'n' at the end, which the expression is eager to reach. Matching all the o's is its only way to succeed.

Regexps try to match everything in them. Because there is no way to reach the 'n' in toooon with fewer 'o's than all of them, every 'o' is matched. Also, because you are using o*? instead of o+?, you are not requiring an 'o' to be present at all.
Example, in Perl
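$a = "toooooo";    # no 'n' at all
$b = "toooooon";   # ends in 'n'
if ($a =~ m/(to*?)/) {     # nothing follows o*?, so it settles for zero 'o's
    print $1,"\n";
}
if ($b =~ m/(to*?n)/) {    # the 'n' forces o*? to take every 'o'
    print $1,"\n";
}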
~>perl ex.pl
t
toooooon

The regex engine always does its best to match. The only thing the minimal quantifier does in this case is slow the match down, because the engine has to backtrack into the /o*?/ node once for every single 'o' in "tooooon". With normal (greedy) matching, it takes as many 'o's as it can the first time through. Since the next element to match is 'n', which 'o' can never match, there is little point in using minimal matching here.
When normal matching fails, however, it takes quite a while to fail: it has to backtrack through every 'o' until there are none left to give back. In this case I would actually use maximal (possessive) matching, /to*+n/. The 'o*+' takes all it can and never gives any of it back, so when the match fails, it fails quickly.
Minimal RE succeeding:
'toooooon' ~~ /to*?n/
t o o o o o o n
{t} match [t]
[t] match [o] 0 times
[t]<n> fail to match [n] -> retry [o]
[t]{o} match [o] 1 times
[t][o]<n> fail to match [n] -> retry [o]
[t][o]{o} match [o] 2 times
[t][o][o]<n> fail to match [n] -> retry [o]
. . . .
[t][o][o][o][o]{o} match [o] 5 times
[t][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o} match [o] 6 times
[t][o][o][o][o][o][o]{n} match [n]
Normal RE succeeding:
(NOTE: Similar for Maximal RE)
'toooooon' ~~ /to*n/
t o o o o o o n
{t} match [t]
[t]{o}{o}{o}{o}{o}{o} match [o] 6 times
[t][o][o][o][o][o][o]{n} match [n]
Failure of Minimal RE:
'toooooo' ~~ /to*?n/
t o o o o o o
. . . .
. . . .
[t][o][o][o][o]{o} match [o] 5 times
[t][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o} match [o] 6 times
[t][o][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o][o]<o> fail to match [o] 7 times -> match failed
Failure of Normal RE:
'toooooo' ~~ /to*n/
t o o o o o o
{t} match [t]
[t]{o}{o}{o}{o}{o}{o} match [o] 6 times
[t][o][o][o][o][o][o]<n> fail to match [n] -> retry [o]
[t][o][o][o][o][o] match [o] 5 times
[t][o][o][o][o][o]<n> fail to match [n] -> retry [o]
. . . .
[t][o] match [o] 1 times
[t][o]<n> fail to match [n] -> retry [o]
[t] match [o] 0 times
[t]<n> fail to match [n] -> match failed
Failure of Maximal RE:
'toooooo' ~~ /to*+n/
t o o o o o o
{t} match [t]
[t]{o}{o}{o}{o}{o}{o} match [o] 6 times
[t][o][o][o][o][o][o]<n> fail to match [n] -> match failed
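The traces above can be reproduced with a toy matcher. This is only a sketch of the three strategies for this one pattern ('t', then repeated 'o', then 'n'), not how any real engine is implemented; the function name and the "attempts" counter are my own inventions. It counts how often the matcher tries to match the final 'n'.

```python
def match_t_o_star_n(text, mode):
    """Toy matcher for 't', repeated 'o', then 'n', anchored at index 0.

    Returns (matched_text_or_None, attempts_at_matching_n).
    """
    if not text.startswith("t"):
        return None, 0

    # How many 'o's directly follow the 't'.
    max_os = 0
    while 1 + max_os < len(text) and text[1 + max_os] == "o":
        max_os += 1

    if mode == "lazy":            # o*?  : try 0 'o's first, then grow
        counts = list(range(0, max_os + 1))
    elif mode == "greedy":        # o*   : try all 'o's first, then give back
        counts = list(range(max_os, -1, -1))
    elif mode == "possessive":    # o*+  : take all 'o's, never give back
        counts = [max_os]
    else:
        raise ValueError("unknown mode: " + mode)

    attempts = 0
    for count in counts:
        attempts += 1
        end = 1 + count
        if end < len(text) and text[end] == "n":
            return text[: end + 1], attempts
    return None, attempts


with_n = "t" + "o" * 6 + "n"
without_n = "t" + "o" * 6
for mode in ("lazy", "greedy", "possessive"):
    print(mode, match_t_o_star_n(with_n, mode), match_t_o_star_n(without_n, mode))
```

On the failing input, the lazy and greedy variants both try the 'n' seven times, while the possessive variant gives up after one attempt.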

The string you are searching in (the haystack as it were) does not contain the substring "ton".
It does however contain the substring "tooooooooooooon".

Related

Regex Camel Case only first Uppercase by group

This is my current regex:
/([A-Z])(?![A-Z])/gm
Below is how it's evaluated:
https://regex101.com/r/K1gvmr/1
In the output you can see I'm getting:
[F] oo [B] ar XYBA [Z]
But instead I need these matches:
[F] oo [B] ar [X] YBAZ
How can a negative lookahead (or another approach) stop evaluation at the first character of each group only?
Try Regex: (?<![A-Z])[A-Z]
Explanation:
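A negative lookbehind for [A-Z], followed by any character in [A-Z]. In other words, an uppercase letter matches only when the character before it is not an uppercase letter.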
Demo
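A quick comparison of the two patterns, sketched with Python's re module. The sample string "FooBarXYBAZ" is an assumption, reconstructed from the bracketed output shown in the question.

```python
import re

sample = "FooBarXYBAZ"  # assumed input, inferred from the question's output

# Original attempt: an uppercase letter NOT followed by an uppercase letter.
lookahead_hits = re.findall(r"([A-Z])(?![A-Z])", sample)

# Suggested fix: an uppercase letter NOT preceded by an uppercase letter.
lookbehind_hits = re.findall(r"(?<![A-Z])[A-Z]", sample)

print(lookahead_hits)   # last letter of each uppercase run
print(lookbehind_hits)  # first letter of each uppercase run
```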

Find all chars occurring 1-3 times

I'm looking for a regex here. I have a partial solution, but it has to be completed. I'm looking for a literal B here, but I want to look for [A-Z] and use backreferences in the corresponding places:
What I tried
^[^B]*(B)(?!(?:[^B]*B){3}) finds the first 'B' if it occurs 1-3 times.
regex101
Matches non B 0-n times
Matches first B
Does not match if there are 3(or more) B ahead
What it should be
What I'm looking for is not just to match B, but to match [A-Z]. So I tried ^[^\1]*([A-Z])(?!(?:[^\1]*\1){3}) (replaced the matched B with [A-Z] and backreferenced it).
Problems to solve
The problem here is that [^\1] does not seem to work.
I need to negate the backreference and quantify it, two things it doesn't want to do :D
Desired Results (examples)
AAA result [A] because A is a [A-Z] and no more than 3 times in the string
AABB result [A,B] because A and B no more than 3 times
ABAB same as AABB
123AABBAA result [B] because [A-Z] and no more than 3 times
EDIT
Something that may help: (?:(?<=(?!\1)))* as replacement for [^\1]*
regex101
This kind of works: it returns A on (A)AA and no match on AAAA, but it won't match BAAA (which should result in [A,B]).
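For comparison, the desired results listed above can be computed without a regex at all. This is not an answer to the backreference problem, only a sketch of the expected output that candidate patterns can be checked against; the function name is made up.

```python
from collections import Counter


def letters_up_to_three_times(text):
    """Return the letters A-Z that occur at most 3 times, in order of first appearance."""
    counts = Counter(ch for ch in text if "A" <= ch <= "Z")
    return [ch for ch, n in counts.items() if n <= 3]


for example in ["AAA", "AABB", "ABAB", "123AABBAA", "AAAA", "BAAA"]:
    print(example, letters_up_to_three_times(example))
```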

Matching a^n b^n c^n for n > 0 with PCRE

How would you match a^n b^n c^n for n > 0 with PCRE?
The following cases should match:
abc
aabbcc
aaabbbccc
The following cases should not match:
abbc
aabbc
aabbbccc
Here's what I've "tried"; /^(a(?1)?b)$/gmx but this matches a^n b^n for n > 0:
ab
aabb
aaabbb
Online demo
Note: This question is the same as this one with the change in language.
Qtax trick
(The mighty self-referencing capturing group)
^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$
This solution is also named "The Qtax trick" because it uses the same technique as from "vertical" regex matching in an ASCII "image" by Qtax.
The problem in question boils down to a need to assert that three groups of the same length are matched. As a simplified version, to match:
xyz
Here x, y and z are really just the subpatterns a, b and c, each repeated the same variable number of times n. With an expression that uses lookaheads with self-referencing capturing groups, a character we specify is added to the group on each repetition of the lookahead, which can effectively be used to "count":
aaabbbccc
^ ^ ^
This is achieved by the following:
(?:a…)+ A character of subpattern a is matched. With (?=a*, we skip directly to the "counter".
(\1?+b) Capturing group (\1) effectively consumes whatever has previously been matched, if it is there, and uses a possessive match which does not permit backtracking, and the match fails if the counter goes out of sync - That is, there has been more of subpatterns b than subpattern a. On the first iteration, this is absent, and nothing is matched. Then, a character of subpattern b is matched. It is added to the capturing group, effectively "counting" one of b in the group. With b*, we skip directly to the next "counter".
(\2?+c) Capturing group (\2) effectively consumes whatever has previously been matched just like the above. Because this additional character capture works just like the previous group, characters are allowed to sync up in length within these character groups. Assuming continuous sequences of a..b..c..:
(Excuse my art.)
First iteration:
| The first 'a' is matched by the 'a' in '^(?:a…)'.
| The pointer is stuck after it as we begin the lookahead.
v,- Matcher pointer
aaaa...bbbbbbbb...cccc...
^^^ |^^^ ^
skipped| skipped Matched by c in (\2?+c);
by a* | by b* \2 was "nothing",
| now it is "c".
Matched by b
in (\1?+b).
\1 was "nothing", now it is "b".
Second iteration:
| The second 'a' is matched by the 'a' in '^(?:a…)'.
| The pointer is stuck after it as we begin the lookahead.
v,- Matcher pointer
aaaa...bbbbbbbb...cccc...
/|^^^ |^
eaten by| skipped |Matched by c in (\2?+c);
\1?+ | by b* | '\2' was "c",
^^ | \2?+ now it is "cc".
skipped|
by a* \ Matched by b
in (\1?+b).
'\1' was "b", now it is "bb".
As the three groups discussed above each "consume" one of a, b and c respectively, they are matched in round-robin style and "counted" by the (?:a…)+, (\1?+b) and (\2?+c) groups respectively. With the additional anchoring, and by matching what we captured (\1\2) at the end, we can assert that we match xyz (representing each group above) where x, y and z are a^n, b^n and c^n respectively.
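The counting can also be followed step by step outside a regex engine. Python's re module rejects a backreference to a group that is still open, so the pattern itself cannot be run there; the sketch below is a hand simulation of what ^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$ does. The function and variable names are hypothetical, and it only follows the one path through the pattern that can succeed.

```python
def matches_anbncn(s):
    """Hand simulation of ^(?:a(?=a*(\\1?+b)b*(\\2?+c)))+\\1\\2$."""
    pos = 0          # the real match position
    group1 = ""      # contents of \1 (grows by one 'b' per iteration)
    group2 = ""      # contents of \2 (grows by one 'c' per iteration)
    iterations = 0

    while pos < len(s) and s[pos] == "a":
        pos += 1                 # the 'a' consumed by (?:a...)
        look = pos               # the lookahead starts here and consumes nothing

        while look < len(s) and s[look] == "a":    # a*
            look += 1

        # (\1?+b): re-match the previous capture, then one more 'b'.
        if not s.startswith(group1 + "b", look):
            return False
        group1 += "b"
        look += len(group1)

        while look < len(s) and s[look] == "b":    # b*
            look += 1

        # (\2?+c): re-match the previous capture, then one more 'c'.
        if not s.startswith(group2 + "c", look):
            return False
        group2 += "c"
        iterations += 1

    # The + needs at least one iteration; then \1\2$ must consume the rest.
    return iterations > 0 and s[pos:] == group1 + group2


for candidate in ["abc", "aabbcc", "aaabbbccc", "abbc", "aabbc", "aabbbccc"]:
    print(candidate, matches_anbncn(candidate))
```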
As a bonus, to "count" more, one can do this:
Pattern: ^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1{3}\2$
Matches: abbbc
aabbbbbbcc
aaabbbbbbbbbccc
Pattern: ^(?:a(?=a*(\1?+bbb)b*(\2?+c)))+\1\2$
Matches: abbbc
aabbbbbbcc
aaabbbbbbbbbccc
First, let's explain the pattern you have:
^ # Assert begin of line
( # Capturing group 1
a # Match a
(?1)? # Recurse group 1 optionally
b # Match b
) # End of group 1
$ # Assert end of line
With the following modifiers:
g: global, match all
m: multiline, match start and end of line with ^ and $ respectively
x: extended, whitespace in the pattern is ignored and comments can be added with #
The recursion part is optional in order to exit the "endless" recursion eventually.
We could use the above pattern to solve the problem. We need to add some regex to match the c part. The problem is that when aabb is matched in aabbcc, it is already consumed, which means we cannot go back and count the b's again.
The solution? Using lookaheads! Lookaheads are zero-width, which means they don't consume anything or move the match position forward. Check it out:
^ # Assert begin of line
(?= # First zero-width lookahead
( # Capturing group 1
a # Match a
(?1)? # Recurse group 1 optionally
b # Match b
) # End of group 1
c+ # Match c one or more times
) # End of the first lookahead
(?= # Second zero-width lookahead
a+ # Match a one or more times
( # Capturing group 2
b # Match b
(?2)? # Recurse group 2 optionally
c # Match c
) # End of group 2
$ # Assert end of line, so the run of c cannot be longer than the run of b
) # End of the second lookahead
a+b+c+ # Match each of a,b and c one or more times
$ # Assert end of line
Online demo
Basically we first assert that there's a^n b^n and then we assert b^n c^n, which together give a^n b^n c^n.
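Python's re module has no recursion, so the pattern above cannot be run there directly. The idea of checking the a/b balance and the b/c balance as two separate conditions can still be sketched in plain Python; the function name is made up and the length comparisons stand in for the two recursive groups.

```python
import re


def check_two_assertions(s):
    """Check a^n b^n and b^n c^n as two separate conditions."""
    runs = re.fullmatch(r"(a+)(b+)(c+)", s)   # plays the role of ^a+b+c+$
    if runs is None:
        return False
    a_run, b_run, c_run = runs.groups()
    first = len(a_run) == len(b_run)    # stands in for the a^n b^n assertion
    second = len(b_run) == len(c_run)   # stands in for the b^n c^n assertion
    return first and second


for candidate in ["abc", "aabbcc", "aaabbbccc", "abbc", "aabbc", "aabbbccc"]:
    print(candidate, check_two_assertions(candidate))
```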

Why is bracket mandatory here?

1 . ^([0-9A-Za-z]{5})+$
vs
2 . ^[a-zA-Z0-9]{5}+$
My intention is to match any string of length n such that n is a multiple of 5.
Check here : https://regex101.com/r/sS6rW8/1.
Please elaborate why case 1 matches the string whereas case 2 does not.
Because {n}+ doesn't mean what you think it does. In PCRE syntax, this turns {n} into a possessive quantifier. In other words, a{5}+ is the same as (?>a{5}). It's like the second + in the expression a++, which is the same as using an atomic group (?>a+).
This has no use with a fixed-length {n} but is more meaningful when used with {min,max}. So, a{2,5}+ is equivalent to (?>a{2,5}).
As a simple example, consider these patterns:
^(a{1,2})(ab) will match aab -> $1 is "a", $2 is "ab"
^(a{1,2}+)(ab) won't match aab -> $1 consumes "aa" possessively and $2 can't match
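The same contrast can be reproduced in Python. Possessive quantifiers were only added to Python's re module in version 3.11, so this sketch emulates the possessive group with the lookahead-plus-backreference idiom (?=(...))\1 instead, which behaves like an atomic group; that substitution is mine, not part of the answer.

```python
import re

# Ordinary quantifier: a{1,2} first takes "aa", then gives one 'a' back.
backtracking = re.match(r"^(a{1,2})(ab)", "aab")

# Atomic emulation: the lookahead captures "aa" once and is never re-entered,
# so \1 must consume "aa" and nothing is left for (ab).
atomic_like = re.match(r"^(?=(a{1,2}))\1(ab)", "aab")

print(backtracking.groups())
print(atomic_like)
```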
In ^([0-9A-Za-z]{5})+$ you're saying: any digit or letter, 5 characters long, 1 or more times. The + applies to the entire group (whatever is inside the parentheses) and the {5} applies to [0-9A-Za-z].
Your second example has a no-backtrack (possessive) quantifier, {5}+, which is different from (stuff{5})+.
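A short check of the grouped version, sketched with Python's re module; the test strings are made up for illustration.

```python
import re

# The + repeats the whole parenthesised group of exactly five characters.
multiple_of_five = re.compile(r"^([0-9A-Za-z]{5})+$")

results = {
    candidate: bool(multiple_of_five.match(candidate))
    for candidate in ["abcde", "abcde12345", "abcd", "abcdef"]
}
print(results)
```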

Result of "regsub -all -- (a+)(ba*) aabaabxab {z\2} x"

What's the meaning of {z\2} in this regular expression, especially \2?
regsub -all -- (a+)(ba*) aabaabxab {z\2} x
I got this result:
(bin) 58 % regsub -all -- (a+)(ba*) aabaabxab {z\2} x
2
(bin) 59 % puts $x
zbaabxzb
How does the expression match aabaabxab?
{z\2} tells your command to substitute all matches with z and the content of the second capturing group (\2).
The whole expression itself doesn't match aabaabxab!
What it matches are aabaa and ab at the end:
aabaabxab
^^^^^  ^^
Why? Because (a+)(ba*) means:
one or more a's
(a+)
followed by a b
(b)
optionally followed by a's (zero or more)
(a*)
So, the first match will be aabaa (seen from the beginning of the string). Now aa is the content of the first capturing group, baa is the content of the second. You now replace aabaa with z\2, thus zbaa.
See if you can figure it out yourself for the second match. :-)
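To check your reasoning, the same substitution can be reproduced outside Tcl. The sketch below uses Python's re.subn, which returns both the new string and the number of replacements, much like regsub -all stores the result in x and returns the count; for this simple pattern the two engines behave the same way.

```python
import re

# r"z\2" replaces each match with 'z' plus the second capturing group.
result, count = re.subn(r"(a+)(ba*)", r"z\2", "aabaabxab")

# The individual matches and their two groups, for inspection.
matches = [(m.group(0), m.group(1), m.group(2))
           for m in re.finditer(r"(a+)(ba*)", "aabaabxab")]

print(count)
print(result)
print(matches)
```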