Consider this regex.
a*b
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 67 steps in debugger to fail.
Now consider this regex.
(?>a*)b
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 133 steps in debugger to fail.
And lastly this regex:
a*+b (a variant of atomic group)
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 67 steps in debugger to fail.
When I check the benchmark atomic group (?>a*)b performs 179% faster.
Now atomic groups disable backtracking. So performance in match is good.
But why are the number of steps more? Can somebody explain on this?
Why is there a diff. in steps between two atomic groups (?>a*)b and a*+b.
Do they work differently?
Author note:
This answer targets question 1 as delivered by the bounty text "I am looking forward to the exact reason why more steps are being needed by the debugger.I dont need answers explaining how atomic groups work."; Jerry's answer addresses the other concerns very well, while my other answer takes a ride through the mentioned constructs, how they work, and why they are important. For full knowledge, simply reading this post is not enough!
Every group in a regular expression takes a step to step into and out of the group.
WHAT?!
Yeah, I'm serious, read on...
Firstly, I would like to present you with quantified non-capturing groups, over without the group:
Pattern 1: (?:c)at
Pattern 2: cat
So what exactly happens here? We'll match the patterns with the test string "concat" on a regex engine with optimizations disabled:
While we're at it, I present you some more groups:
Oh no! I'm going to avoid using groups!
But wait! Please note that the number of steps taken to match has no correlation with the performance of the match. pcre engines optimizes away most of the "unnecessary steps" as I've mentioned. Atomic groups are still the most efficient, despite more steps taken on an engine with optimizations disabled.
Perhaps relevant:
Why is a character class faster than alternation?
So what's backtracking?
The engine comes to quantifiers that are greedy by default. Greedy modifiers matches all possible and backtracks by demand, allowing efficient matches,
as referenced by Greedy vs. Reluctant vs. Possessive Quantifiers:
A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less thing (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it "backtracks" one more step, making the greedy quantifier match one less thing again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success! [...]
What does this have to do with a*+b?
In /a*+b/:
a The literal character "a".
*+ Zero or more, possessive.
b The literal character "b".
As referenced by Greedy vs. Reluctant vs. Possessive Quantifiers:
A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.
Why does it matter?
The machine won't realize if it's doing an (in)efficient match on its own. See here for a decent example: Program run forever when matching regex. In many scenarios, regexes written quickly may not be efficient and may well easily be problematic in deployment.
So what's an atomic group?
After the pattern within the atomic group finishes matching, it will not let go, ever. Study this example:
Pattern: (?>\d\w{2}|12)c
Matching string: 12c
Looks perfectly legitimate, but this match fails. The steps are simple: The first alternation of the atomic group matches perfectly - \d\w{2} consumes 12c. The group then completes its match - now here is our pointer location:
Pattern: (?>\d\w{2}|12)c
^
Matching string: 12c
^
The pattern advances. Now we try to match c, but there is no c. Instead of trying to backtrack (releasing \d\w{2} and consuming 12), the match fails.
Well that's a bad idea then! Why would we prevent backtracking, Unihedron?
Now imagine we're manipulating with a JSON object. This file is not small. Backtracking from the end is going to be a bad idea.
"2597401":[{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2- 5":{"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338497979",
"runTime":"1022",
"execType":"user:binary",
"exec":"ft.D.64",
"numNodes":"4",
"sha1":"5a79879235aa31b6a46e73b43879428e2a175db5",
"execEpoch":1336766742,
"execModify":"Fri May 11 15:05:42 2012",
"startTime":"Thu May 31 15:59:39 2012",
"numCores":"64",
"sizeT":{"bss":"1881400168","text":"239574","data":"22504"}},
{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338497946",
"runTime":"33" "execType":"user:binary",
"exec":"cg.C.64",
"numNodes":"4",
"sha1":"caf415e011e28b7e4e5b050fb61cbf71a62a9789",
"execEpoch":1336766735,
"execModify":"Fri May 11 15:05:35 2012",
"startTime":"Thu May 31 15:59:06 2012",
"numCores":"64",
"sizeT":{"bss":"29630984","text":"225749","data":"20360"}},
{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2-5": {"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338500447",
"runTime":"145",
"execType":"user:binary",
"exec":"mg.D.64",
"numNodes":"4",
"sha1":"173de32e1514ad097b1c051ec49c4eb240f2001f",
"execEpoch":1336766756,
"execModify":"Fri May 11 15:05:56 2012",
"startTime":"Thu May 31 16:40:47 2012",
"numCores":"64",
"sizeT":{"bss":"456954120","text":"426186","data":"22184"}},{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338499002",
"runTime":"1444",
"execType":"user:binary",
"exec":"lu.D.64",
"numNodes":"4",
"sha1":"c6dc16d25c2f23d2a3321d4feed16ab7e10c2cc1",
"execEpoch":1336766748,
"execModify":"Fri May 11 15:05:48 2012",
"startTime":"Thu May 31 16:16:42 2012",
"numCores":"64",
"sizeT":{"bss":"199850984","text":"474218","data":"27064"}}],
Uh oh...
Do you get what I mean now? :P
I'll leave you to figure out the rest, and try to find out more about possessive quantifiers and atomic groups; I'm not writing anything else into this post. Here is where the JSON came from, I saw the answer a few days ago, very inspiring: REGEX reformatting.
Read also
The Stack Overflow Regex Reference
ReDoS - Wikipedia
I think something is wrong...
I don't know how you did your benchmarking but both a*+b and (?>a*)b should be the same. To quote regular-expressions.info (emphasis mine):
Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.
And just to confirm the above, I ran the following on ideone:
$tests = 1000000;
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/a*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /a*b/ : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/(?>a*)b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /(?>a*)b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/a*+b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /a*+b/ : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/(?>a)*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /(?>a)*b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
Getting this as the output:
For /a*b/ : 0.000000879034996 per iteration for 1000000 iterations
For /(?>a*)b/: 0.000000876362085 per iteration for 1000000 iterations
For /a*+b/ : 0.000000880002022 per iteration for 1000000 iterations
For /(?>a)*b/: 0.000000883045912 per iteration for 1000000 iterations
Now, I am by no means a PHP expert so I don't know if this is the right way to benchmark stuff in there, but that's they all have about the same performance, which is kind of expected given the simplicity of the task.
Still, couple of things I note from the above:
Neither (?>a)*b nor (?>a*)b are 179% faster than another regex; the above all are within 7% from each other.
Back to the actual question
But why are the number of steps more? Can somebody explain on this?
It is to be noted that the number of steps is not a direct representation of the performance of a regex. It is a factor, but not the ultimate determinant. There are more steps because the breakdown has steps before entering the group, and after entering the group...
1 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^
2 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^^^^^^^^
3 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^^
4 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^
5 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaa...
^
That's 2 more steps because of the group than the 3 steps...
1 / a*+ b /x aaaaaaaaaaaaaaaaaaaa...
^
2 / a*+ b /x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^^^
3 / a*+ b /x aaaaaaaaaaaaaaaaaaaaa...
^
Where you can say that a*b is the same as (?:a*)b but the latter having more steps:
1 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^
2 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaa...
^^^^^^^^
3 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^^
4 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
^
5 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaa...
Note: Even there, you see that regex101 has the steps optimised a bit for the number of steps in a*b.
Conclusion
Possessive quantifiers and atomic groups work differently depending on how you use them. If I take regular-expression's example and tweaking it a little:
(?>a|b)*ac match aabaac
But
(?:a|b)*+ac and (?>(?:a|b)*)ac don't match aabaac.
Related
I have been doing regular expression for 25+ years but I don't understand why this regex is not a match (using Perl syntax):
"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'
Can someone help solve that riddle?
The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
The drawback to this, as well as to consuming the whole such substring, is the literal spelling out of all three subpatterns; what when the number need be decided dynamically? Or when it's twelve? The pattern can be built at runtime of course. In Perl, one way
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
For the sake of performance, for long strings and many repetitions, make that .* non-greedy
(?=[iny].*?[iny].*?[iny])
(when forming the pattern dynamically join with .*?)
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);
my $str = 'aa'
. join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
. 'a'x($n+1)
. 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
my ($s, $pla) = #_;
return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}
cmpthese(-$runfor, {
greedy => sub { match_repeated($str, $pat_greedy) },
non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
Rate greedy non_greedy
greedy 56.3/s -- -100%
non_greedy 90169/s 159926% --
So ... some 1600 times better non-greedy. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in greedy case); so there's a lot going on there. With default 2, so for a short string and a simpler pattern, the difference is 10%.
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
Rate greedy non_greedy
greedy 56.5/s -- -100%
non_greedy 171949/s 304117% --
The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing three characters three times, with replacement, means there are 33 = 27 matching substrings:
iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
To match non-adjacent letters you can use one of these:
[iny].*[iny].*[iny]
[iny](.*[iny]){2}
([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.
Perhaps you meant to use [inf] or [ify]?
Looks like you are looking for 3 consecutive letters, so yours should not match
[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
Hope that helps somewhat :)
The {3} atom means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive as they are separated by other characters in your string.
It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.
I am not clear on the use/need of the \G operator.
I read in the perldoc:
You use the \G anchor to start the next match on the same string where
the last match left off.
I don't really understand this statement. When we use \g we usually move to the character after the last match anyway.
As the example shows:
$_ = "1122a44";
my #pairs = m/(\d\d)/g; # qw( 11 22 44 )
Then it says:
If you use the \G anchor, you force the match after 22 to start with
the a:
$_ = "1122a44";
my #pairs = m/\G(\d\d)/g;
The regular expression cannot match there since it does not
find a digit, so the next match fails and the match operator returns
the pairs it already found
I don't understand this either. "If you use the \G anchor, you force the match after 22 to start with a." But without the \G the matching will be attempted at a anyway right? So what is the meaning of this sentence?
I see that in the example the only pairs printed are 11 and 22. So 44 is not tried.
The example also shows that using c option makes it index 44 after the while.
To be honest, from all these I can not understand what is the usefulness of this operator and when it should be applied.
Could someone please help me understand this, perhaps with a meaningful example?
Update
I think I did not understand this key sentence:
If you use the \G anchor, you force the match after 22 to start with
the a . The regular expression cannot match there since it does not
find a digit, so the next match fails and the match operator returns
the pairs it already found.
This seems to mean that when the match fails, the regex does not proceed further attempts and is consistent with the examples in the answers
Also:
After the match fails at the letter a , perl resets pos() and the next
match on the same string starts at the beginning.
\G is an anchor; it indicates where the match is forced to start. When \G is present, it can't start matching at some arbitrary later point in the string; when \G is absent, it can.
It is most useful in parsing a string into discrete parts, where you don't want to skip past other stuff. For instance:
my $string = " a 1 # ";
while () {
if ( $string =~ /\G\s+/gc ) {
print "whitespace\n";
}
elsif ( $string =~ /\G[0-9]+/gc ) {
print "integer\n";
}
elsif ( $string =~ /\G\w+/gc ) {
print "word\n";
}
else {
print "done\n";
last;
}
}
Output with \G's:
whitespace
word
whitespace
integer
whitespace
done
without:
whitespace
whitespace
whitespace
whitespace
done
Note that I am demonstrating using scalar-context /g matching, but \G applies equally to list context /g matching and in fact the above code is trivially modifiable to use that:
my $string = " a 1 # ";
my #matches = $string =~ /\G(?:(\s+)|([0-9]+)|(\w+))/g;
while ( my ($whitespace, $integer, $word) = splice #matches, 0, 3 ) {
if ( defined $whitespace ) {
print "whitespace\n";
}
elsif ( defined $integer ) {
print "integer\n";
}
elsif ( defined $word ) {
print "word\n";
}
}
But without the \G the matching will be attempted at a anyway right?
Without the \G, it won't be constrained to start matching there. It'll try, but it'll try starting later if required. You can think of every pattern as having an implied \G.*? at the front.
Add the \G, and the meaning becomes obvious.
$_ = "1122a44";
my #pairs = m/\G (\d\d)/xg; # qw( 11 22 )
my #pairs = m/\G .*? (\d\d)/xg; # qw( 11 22 44 )
my #pairs = m/ (\d\d)/xg; # qw( 11 22 44 )
To be honest, from all these I can not understand what is the usefulness of this operator and when it should be applied.
As you can see, you get different results by adding a \G, so the usefulness is getting the result you want.
Interesting answers and alot are valid I guess, but I can also guess that is still doesn't explain alot.
\G 'forces' the next match to occur at the position the last match ended.
Basically:
$str="1122a44";
while($str=~m/\G(\d\d)/g) {
#code
}
First match = "11"
Second match is FORCED TO START at 22 and yes, that's \d\d, so result is "22"
Third 'try' is FORCED to start at "a", but that's not \d\d, so it fails.
I have the (what I believe to be) negative lookahead assertion <#> *(?!QQQ) that I expect to match if the tested string is a <#> followed by any number of spaces (zero including) and then not followed by QQQ.
Yet, if the tested string is <#> QQQ the regular expression matches.
I fail to see why this is the case and would appreciate any help on this matter.
Here's a test script
use warnings;
use strict;
my #strings = ('something <#> QQQ',
'something <#> RRR',
'something <#>QQQ' ,
'something <#>RRR' );
print "$_\n" for map {$_ . " --> " . rep($_) } (#strings);
sub rep {
my $string = shift;
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
return $string;
}
This prints
something <#> QQQ --> something at w/o QQQ
something <#> RRR --> something at w/o RRR
something <#>QQQ --> something at w/ QQQ
something <#>RRR --> something at w/o RRR
And I'd have expected the first line to be something <#> QQQ --> something at w/ QQQ.
It matches because zero is included in "any number". So no spaces, followed by a space, matches "any number of spaces not followed by a Q".
You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):
<#> *(?!QQQ)(?! )
ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one less space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc) take a back seat - if it can match more than one way, they determine which way is chosen. But matching always wins over not matching.
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
One problem of yours here is that you are viewing the two regexes separately. You first ask to replace the string without QQQ, and then to replace the string with QQQ. This is actually checking the same thing twice, in a sense. For example: if (X==0) { ... } elsif (X!=0) { ... }. In other words, the code may be better written:
unless ($string =~ s,<#> *QQQ,at w/ QQQ,) {
$string =~ s,<#> *,at w/o,;
}
You always have to be careful with the * quantifier. Since it matches zero or more times, it can also match the empty string, which basically means: it can match any place in any string.
A negative look-around assertion has a similar quality, in the sense that it needs to only find a single thing that differs in order to match. In this case, it matches the part "<#> " as <#> + no space + space, where space is of course "not" QQQ. You are more or less at a logical impasse here, because the * quantifier and the negative look-ahead counter each other.
I believe the correct way to solve this is to separate the regexes, like I showed above. There is no sense in allowing the possibility of both regexes being executed.
However, for theoretical purposes, a working regex that allows both any number of spaces, and a negative look-ahead would need to be anchored. Much like Mark Reed has shown. This one might be the simplest.
<#>(?! *QQQ) # Add the spaces to the look-ahead
The difference is that now the spaces and Qs are anchored to each other, whereas before they could match separately. To drive home the point of the * quantifier, and also solve a minor problem of removing additional spaces, you can use:
<#> *(?! *QQQ)
This will work because either of the quantifiers can match the empty string. Theoretically, you can add as many of these as you want, and it will make no difference (except in performance): / * * * * * * */ is functionally equivalent to / */. The difference here is that spaces combined with Qs may not exist.
The regex engine will backtrack until it finds a match, or until finding a match is impossible. In this case, it found the following match:
+--------------- Matches "<#>".
| +----------- Matches "" (empty string).
| | +--- Doesn't match " QQQ".
| | |
--- ---- ---
'something <#> QQQ' =~ /<#> [ ]* (?!QQQ)/x
All you need to do is shuffle things around. Replace
/<#>[ ]*(?!QQQ)/
with
/<#>(?![ ]*QQQ)/
Or you can make it so the regex will only match all the spaces:
/<#>[ ]*+(?!QQQ)/
/<#>[ ]*(?![ ]|QQQ)/
/<#>[ ]*(?![ ])(?!QQQ)/
PS — Spaces are hard to see, so I use [ ] to make them more visible. It gets optimised away anyway.
I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.
What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.
As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.
Further examples:
56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match
What I would get using current expression (same strings used):
no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match
In Perl you can do it with one expression with help of (??{ code }):
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
Output:
101
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.
To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.
The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).
Thus the full expression would have the form:
(?=(.+)\1)(?!.+?(..{N,})\2}))
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
Usage example:
use v5.10;
sub longest_rep{
$_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}
say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
Output:
101
10001
191919
101
You can do it in a single regex, you just have to pick the longest match from the list of results manually.
def longestrepeating(strg):
regex = re.compile(r"(?=(.+)\1)")
matches = regex.findall(strg)
if matches:
return max(matches, key=len)
This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):
>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919",
"191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']
Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.
It's pretty rudimentary code, but hopefully you'll get the gist of it.
use v5.10;
use strict;
use warnings;
while (<DATA>) {
chomp;
print "$_ : ";
my $longest = foo($_);
if ($longest) {
say $longest;
} else {
say "No matches found";
}
}
sub foo {
my $num = shift;
my #hits;
for my $i (0 .. length($num)) {
my $part = substr $num, $i;
push #hits, $part =~ /(.+)(?=\1)/g;
}
my $long = shift #hits;
for (#hits) {
if (length($long) < length) {
$long = $_;
}
}
return $long;
}
__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919
Not sure if anyone's thought of this...
my $originalstring="pdxabababqababqh1234112341";
my $max=int(length($originalstring)/2);
my #result;
foreach my $n (reverse(1..$max)) {
#result=$originalstring=~m/(.{$n})\1/g;
last if #result;
}
print join(",",#result),"\n";
The longest doubled match cannot exceed half the length of the original string, so we count down from there.
If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
my $n;
foreach (1..$max) {
unless (#result=$originalstring=~m/(.{$_,})\1/g) {
$n=--$_;
last;
}
}
#result=$originalstring=~m/(.{$n})\1/g;
print join(",",#result),"\n";
Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
if($s =~ m/(.{$i})\1/)
{
$one = $1;
last;
}
}
# now $one is '101'
How should I handle regex-features labeled with "warning" like "(?{ code })", "(??{ code })" or "Special Backtracking Control Verbs"? How serious should I take the warnings?
I kinda think they’re here to stay, one way or the other — especially code escapes. Code escapes have been with us for more than a decade.
The scariness of them — that they can call code in unforeseen ways — is taken care of by use re "eval". Also, the regex matcher hasn’t been reëntrant until 5.12 IIRC, which could limit their usefulness.
The string-eval version, (??{ code }), used to be the only way to do recursion, but since 5.10 we have a much better way to do that; benchmarking the speed differences shows the eval way is way slower in most cases.
I mostly use the block-eval version, (?{ code}), for adding debugging, which happens at a different granualarity than use re "debug". It used to vaguely bother me that the return value from the block-eval version’s wasn’t usable, until I realized that it was. You just had to use it as the test part of a conditional pattern, like this pattern for testing whether a number was made up of digits that were decreasing by one each position to the right:
qr{
^ (
( \p{Decimal_Number} )
(?(?= ( \d )) | $)
(?(?{ ord $3 == 1 + ord $2 }) (?1) | $)
) $
}x
Before I figured out conditionals, I would have written that this way:
qr{
^ (
( \p{Decimal_Number} )
(?= $ | (??{ chr(1+ord($2)) }) )
(?: (?1) | $ )
) $
}x
which is much less efficient.
The backtracking control verbs are newer. I use them mostly for getting all possible permutations of a match, and that requires only (*FAIL). I believe it is the (*ACCEPT) feature that is especially marked “highly experimental”. These have only been with us since 5.10.