Perl 5 - longest token matching in regexp (using alternation) - regex

Is possible to force a Perl 5 regexp match longest possible string, if the regexp is, for example:
a|aa|aaa
I found is probably default in perl 6, but in perl 5, how i can get this behavior?
EXAMPLE pattern:
[0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9]
If I have string 2.10.2014, then first match will be 2, which is ok; but the next match will be 1, and this is not ok because it should be 10. Then 2014 will be 4 subsequently matches 2,0,1,4, but it should be 2014 using [0-9][0-9][0-9][0-9]. I know I could use [0-9]+, but I can't.

General solution: Put the longest one first.
my ($longest) = /(aaa|aa|a)/
Specific solution: Use
my ($longest) = /([0-9]{4}|[0-9]{1,2})/
If you can't edit the pattern, you'll have to find every possibility and find the longest of them.
my $longest;
while (/([0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9])/g) {
$longest = $1 if length($1) > length($longest);
}

The sanest solution I can see for unknown patterns is to match every possible pattern, look at the length of the matched substrings and select the longest substring:
my #patterns = (qr/a/, qr/a(a)/, qr/b/, qr/aaa/);
my $string = "aaa";
my #substrings = map {$string =~ /($_)/; $1 // ()} #patterns;
say "Matched these substrings:";
say for #substrings;
my $longest_token = (sort { length $b <=> length $a } #substrings)[0];
say "Longest token was: $longest_token";
Output:
Matched these substrings:
a
aa
aaa
Longest token was: aaa
For known patterns, one would sort them manually so that first-match is the same as longest-match:
"aaa" =~ /(aaa|aa|b|a)/;
say "I know that this was the longest substring: $1";

The alternation will use the first alternative that matches, so just write /aaa|aa|a/ instead.
For the example you have shown in your question, just put the longest alternative first like I said:
[0-9][0-9][0-9][0-9]|[0-9][0-9]|[0-9]

perl -Mstrict -Mre=/xp -MData::Dumper -wE'
{package Data::Dumper;our($Indent,$Sortkeys,$Terse,$Useqq)=(1)x4}
sub _dump { Dumper(shift) =~ s{(\[.*?\])}{$1=~s/\s+/ /gr}srge }
my ($count, %RS);
my $s= "aaaabbaaaaabbab";
$s =~ m{ \G a+b? (?{ $RS{ $+[0] - $-[0] } //= [ ${^MATCH}, $-[0] ]; $count++ }) (*FAIL) };
say sprintf "RS: %s", _dump(\%RS);
say sprintf "count: %s", $count;
'
RS: {
"1" => [ "a", 0 ],
"2" => [ "aa", 0 ],
"3" => [ "aaa", 0 ],
"4" => [ "aaaa", 0 ],
"5" => [ "aaaab", 0 ]
}
count: 5

Related

Determining the parts that match a regex in perl

I'm looking for the accumulation of possibly overlapping matches of a regex (the final goal being to do further searches in the resulting substrings).
I want to skip the matches that have already been "accumulated", while avoiding to make copies with substr (I might be wrong about avoiding substr), but the condition that I wrote for it with pos($...) = ... and a next if $... =~ /.../ doesn't work:
#!/usr/bin/env perl
# user inputs
$regexp = "abc|cba|b";
$string = "_abcbabc_bacba";
$length = length($string);
$result = "0" x $length;
while ( pos($string) < $length and $string =~ /$regexp/go ) {
pos($string) = $-[0] + 1;
next unless ($len = $+[0] - $-[0]);
# The failing condition is here:
# pos($result) = $-[0];
# next if $result =~ /1{$len}/;
substr($result, $-[0], $len) = "1" x $len;
printf "%s\n", $string;
printf "%".$-[0]."s%s\n", "", "^" x $len;
}
printf "%s\n", $result;
By commenting those lines I can get the desired result which is 01111111010111:
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
01111111010111
But my expected output (with a working condition) would be:
_abcbabc_bacba
^^^
_abcbabc_bacba
^^^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
01111111010111
notes:
for each iteration I print the original string; the ^ just below show the characters that have been matched in the current iteration.
the 0 & 1 at the end represent the overall result. The characters that have been matched at least once during the process are set to 1.
My commented condition is meant to skip the current match when its corresponding characters are already set to 1 in the result.
I think you really want to find the longest overlapping sub match. If you can guarantee that the alternation will have the substrings in the order that you prefer them, that approach might work, but it also requires to know a lot about what is happening besides the match, and in future matches. That is, you don't know if you can output anything until you've the future matches that might overlap, and you can't tell how far into the future you need to look.
You can mess around with pos, but I think I'd just match each substring separately, remember the starting positions, then compare later. Decompose the problem into separate tasks for finding the matching positions and for deciding which ones you want.
Even if I wrote the same code you presented, it's unlikely that I'd remember everything that must happen just right to make it all work out if I had to see it again after a long absence (even if I did highlight #- and #+ in the first chapter of Mastering Perl ;)
use v5.10;
use strict;
my $target = "_abcbabc_bacba";
my #looking_for = qw( abc cba b );
my #found;
foreach my $want ( #looking_for ) {
my $pos = 0;
while( my $found_at = index $target, $want, $pos ) {
last if $found_at == -1;
push #found, $found_at;
$pos = $found_at + 1;
}
}
my #found = sort { $a->[1] <=> $b->[1] } #found;
use Data::Dumper;
say Dumper( \#found );
Now you have a data structure that you can massage any way that you like instead of thinking about all this stuff while in regex land. How you decide to do that is left as an exercise for the reader.
$VAR1 = [
[
'abc',
1
],
[
'b',
2
],
[
'cba',
3
],
[
'b',
4
],
[
'abc',
5
],
[
'b',
6
],
[
'b',
9
],
[
'cba',
11
],
[
'b',
12
]
];
Part of this may be inline. You can build up this data structure to the point where you know that everything you have so far can produce output (i.e. the thing you just matched does not overlap with the prior thing).

How to get the position of all capture groups with quantifiers?

I have a small problem. I have a perl regexp with multiple capture groups. Some of them have quantifiers (like '+'). If no quantifier is added, then #- & #+ array are filled nicely with the matched position of the capture groups, but if a quantifier is added only the last match is detected. But I would like to have all of them.
An example:
my $s = 'xx1a2b3cyy';
my $re = qr/^xx(\d\w)+/;
So I'd like to know that matches are '1a', '2b', '3c' at 2, 4, 6.
Simple matching gives:
if ($s =~ $re) {
print "Match #-, #+\n";
for (my $i = 0; $i < #-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
Gives:
Match 0 6, 8 8
i: 0 - 'xx1a2b3c
i: 1 - '3c
So only the last capture group match is remembered.
My next simple try was which is not really what I want as the RE is different:
$re = qr/(\d\w)/;
my #s = ($s =~ /$re/g);
print "RE: '#s'\n";
while ($s =~ /$re/g) {
print "Match #-, #+\n";
for (my $i = 0; $i < #-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
gives:
RE: '1a 2b 3c'
Match 2 2, 4 4
i: 0 - '1a
i: 1 - '1a
Match 4 4, 6 6
i: 0 - '2b
i: 1 - '2b
Match 6 6, 8 8
i: 0 - '3c
i: 1 - '3c
But this not what I want, as it would match a string like 'ZZ1aAA2bBB3cZZ'.
So somehow I have to combine the two. The best what I could get:
$re = '^xx(?:\d\w)*?\G(\d\w)';
pos($s) = 2;
while ($s =~ m($re)g) {
print "Match pos: ", pos($s), ', G: ', $1, ", '#-', '#+'\n"
}
gives:
Match pos: 4, G: 1a, '0 2', '4 4'
Match pos: 6, G: 2b, '0 4', '6 6'
Match pos: 8, G: 3c, '0 6', '8 8'
This is almost nice, but for this I need to know the position of the first possible match. If it is not set properly it will not match anything. I can only determine the first position if I remove the non greedy part:
$re = '^xx(\d\w)';
if ($s =~ m($re)) {
print "Match: '#-', '#+'\n";
}
which gives:
Match: '0 2', '4 4'
So $-[1] gives the first position, but for this I have to modify the RE "manually".
If I add code execution into the pattern I almost get what I need:
use re 'eval';
$re = '^xx(\d\w)+(??{print "Code: <#-> <#+>\n"})';
$s =~ m($re) and print "Match\n";
gives:
Code: <0 6> <8 8>
Code: <0 4> <6 6>
Code: <0 2> <4 4>
For this I need to add the (?{ code }) part.
Does anybody know a simpler method (I mean not need to modify the original RE) to get all the possible matches of a capture group having a quantifier?
Thanks in advance!
There's no general solution; the regex engine simply doesn't store the necessary information. You're asking to use a regex as a parser, and that's a no-go.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my #matches;
push #matches, $1 while /\G (\d\w) /xg;
return #matches;
}
}
or
sub extract {
my ($pairs) = $_[0] =~ /^xx((?:\d\w)+)/
or return ();
return unpack('(a2)*', $pairs);
}
If you just want the positions, it's the same.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my #matches;
push #matches, $-[1] while /\G (\d\w) /xg;
return #matches;
}
}
or
sub extract {
$_[0] =~ /^xx((?:\d\w)+)/
or return ();
return map { $-[1] + ( $_ - 1 )*2 } 1..length($1)/2;
}
Even a non-general purpose solution is extremely hard using regular expressions. Say you had the following pattern:
xx(\d\w)+yy(\d\w)+zz
The correct solution would be:
use Storable qw( dclone );
my $s = "xx1a2byy3c4dZZ...xx5a6byy7c8dzz";
local our $rv;
if (
$s =~ /
(?{ [] })
xx
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[0] }, $^N; $r }) )+
yy
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[1] }, $^N; $r }) )+
zz
(?{ $rv = $^R; })
/x
) {
say "\$1: #{ $rv->[0] }";
say "\$2: #{ $rv->[1] }";
}
Output:
$1: 5a 6b
$2: 7c 8d
And something like
(zz(\d\w)+)+
would need
use Storable qw( dclone );
my $s = "zz1a2bzz3c4d";
local our $rv;
if (
$s =~ /
(?{ [] })
(?:
(?{ my $r = dclone($^R); push #$r, []; $r })
zz
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[-1] }, $^N; $r }) )+
)+
(?{ $rv = $^R; })
/x
) {
say "\$1: #$_" for #$rv;
}
Output:
$1: 1a 2b
$1: 3c 4d
I think I can give some explanation for the behavior you see:
In the first example I can see only one capture group. The quantifier allows it to be used multiple times, but it's one capture group nonetheless. So every new occurence of a matching subpattern would overwrite the value previously captured there. Even if the RE engine is already advanced behind it, but backtracking would occur (for e.g. a more advanced pattern with branching and the likes), it could be that the now again visited capture group would change. And since #- and #+ hold the positions to the capture groups (as opposed to occuring subpattern matches), this would explain why there's only the last occurence of the subpattern contained.
You could even play around with named subpatterns and %+/%- and would experience the same thing. It becomes more obvious with the already used (?{ }), at least for debugging purposes. But use re 'debug' is fine for shorter regexes / strings to match.
So be aware of the effects of backtracking to capture groups while matching is still in progress!
But if you don't have to care about backtracking, I can think of kind of a recipe to handle a capture group with a quantifier:
If your capture group is (bla) and your quantifier {0,3}, transform it into
(?:(bla)(?{ print $-[$#-],$+[$#-]."\n" })){0,3}.
You practically put the subpattern into another (non-capturing) group. If the RE engine is done with it, execute code regarding the last capture group matched so far. The quantifier outside the surrounding group is then responsible for the correct number of execution of the code fragment.
So you example becomes this:
use Data::Dumper;
my $s = 'xx1a2b3cyy';
my #submatches;
sub getem { push #submatches, [$-[$#-],$+[$#-]]; }
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+/;
print Dumper(\#submatches);
This also works with multiple capture groups transformed this way:
my $s = 'xx1a2b3cyy4de5fg6hihhh2';
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+yy(?:(\d\w{2})(?{ getem() }))+hh/;
You have to adjust the index used, if your capture group contains more capture groups. That's why I prefer names capture groups.
Hope this helps.

Perl all matches of a regexp in a given string

Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.
local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;

Perl regex with a negative lookahead behaves unexpectedly

I'm attempting to match /ezmlm-(any word except 'weed' or 'return')\s+/ with a regex. The following demonstrates a foreach loop which does the right thing, and an attempted regex which almost does:
#!/usr/bin/perl
use strict;
use warnings;
my #tests = (
{ msg => "want 'yes', string has ezmlm, but not weed or return",
str => q[|/usr/local/bin/ezmlm-reject '<snip>'],
},
{ msg => "want 'yes', array has ezmlm, but not weed or return",
str => [ <DATA> ],
},
{ msg => "want 'no' , has ezmlm-weed",
str => q[|/usr/local/bin/ezmlm-weed '<snip>'],
},
{ msg => "want 'no' , doesn't have ezmlm-anything",
str => q[|/usr/local/bin/else '<snip>'],
},
{ msg => "want 'no' , ezmlm email pattern",
str => q[crazy/but/legal/ezmlm-wacky#example.org],
},
);
print "foreach regex\n";
foreach ( #tests ) {
print doit_fe( ref $_->{str} ? #{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t";
print doit_re( ref $_->{str} ? #{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t<--- $_->{msg}\n";
};
# for both of the following subs:
# #_ will contain one or more lines of data
# match the pattern /ezmlm-(any word except 'weed' or 'return')\s+/
sub doit_fe {
my $has_ezmlm = 0;
foreach ( #_ ) {
next if $_ !~ m/ezmlm-(.*?)\s/;
return 0 if $1 eq 'weed' or $1 eq 'return';
$has_ezmlm++;
};
return $has_ezmlm;
};
sub doit_re { return grep /ezmlm-(?!weed|return)/, #_; };
__DATA__
|/usr/local/bin/ezmlm-reject '<snip>'
|/usr/local/bin/ezmlm-issubn '<snip>'
|/usr/local/bin/ezmlm-send '<snip>'
|/usr/local/bin/ezmlm-archive '<snip>'
|/usr/local/bin/ezmlm-warn '<snip>'
The output of the sample program is as follows:
foreach regex
yes yes <--- want 'yes', string has ezmlm, but not weed or return
yes yes <--- want 'yes', array has ezmlm, but not weed or return
no no <--- want 'no' , has ezmlm-weed
no no <--- want 'no' , doesn't have ezmlm-anything
no yes <--- want 'no' , ezmlm email pattern
In the last instance, the regex fails, matching a goofy but legal email address. If I amend the regex placing a \s after the negative lookahead pattern like so:
grep /ezmlm-(?!weed|return)\s+/
The regex fails to match at all. I'm supposing it has to do with the how the negative pattern works. I've tried making the negation non-greedy, but it seems there's some lesson buried in 'perldoc perlre' that is escaping me. Is it possible to do this with a single regex?
The negative look-ahead is zero-width which means that the regex
/ezmlm-(?!weed|return)\s+/
will only match if one or more space characters immediately follow "ezmlm-".
The pattern
/ezmlm-(?!weed|return)/
will match
"crazy/but/legal/ezmlm-wacky#example.org"
because it contains "ezmlm-" not followed by "weedy" or "return".
Try
/ezmlm-(?!weed|return)\S+\s+/
where \S+ is one or more non-space characters (or instead use [^#\s]+ if you want to deny email addresses even if followed by a space).

How can I escape a literal string I want to interpolate into a regular expression?

Is there a built-in way to escape a string that will be used within/as a regular expression? E.g.
www.abc.com
The escaped version would be:
www\.abc\.com
I was going to use:
$string =~ s/[.*+?|()\[\]{}\\]/\\$&/g; # Escapes special regex chars
But I just wanted to make sure that there's not a cleaner built-in operation that I'm missing?
Use quotemeta or \Q...\E.
Consider the following test program that matches against $str as-is, with quotemeta, and with \Q...\E:
#! /usr/bin/perl
use warnings;
use strict;
my $str = "www.abc.com";
my #test = (
"www.abc.com",
"www/abc!com",
);
sub ismatch($) { $_[0] ? "MATCH" : "NO MATCH" }
my #match = (
[ as_is => sub { ismatch /$str/ } ],
[ qmeta => sub { my $qm = quotemeta $str; ismatch /$qm/ } ],
[ qe => sub { ismatch /\Q$str\E/ } ],
);
for (#test) {
print "\$_ = '$_':\n";
foreach my $method (#match) {
my($name,$match) = #$method;
print " - $name: ", $match->(), "\n";
}
}
Notice in the output that using the string as-is could produce spurious matches:
$ ./try
$_ = 'www.abc.com':
- as_is: MATCH
- qmeta: MATCH
- qe: MATCH
$_ = 'www/abc!com':
- as_is: MATCH
- qmeta: NO MATCH
- qe: NO MATCH
For programs that accept untrustworthy inputs, be extremely careful about using such potentially nasty bits as regular expressions: doing so could create unexpected runtime errors, denial-of-service vulnerabilities, and security holes.
The best way to do this is to use \Q to begin a quoted string and \E to end it.
my $foo = 'www.abc.com';
$bar =~ /blah\Q$foo\Eblah/;
You can also use quotemeta on the variable first. E.g.
my $quoted_foo = quotemeta($foo);
The \Q trick is documented in perlre under "Escape Sequences."