Perl regular expression isn't greedy enough - regex

I'm writing a regular expression in perl to match perl code that starts the definition of a perl subroutine. Here's my regular expression:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
$regex matches code that starts a subroutine. I'm also trying to capture the name of the subroutine in $1 and any white space and comments between the subroutine name and the initial open brace in $2. It's $2 that is giving me a problem.
Consider the following perl code:
my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
}
When I put this perl code into a string and match it against $regex, $2 is "# This is comment 3.\n", not the three lines of comments that I want. I thought the regular expression would greedily put all three lines of comments into $2, but that seems not to be the case.
I would like to understand why $regex isn't working and to design a simple replacement. As the program below shows, I have a more complex replacement ($re3) that works. But I think it's important for me to understand why $regex doesn't work.
use strict;
use English;
my $code_string = <<END_CODE;
my \$x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
\$x = 2;
return;
}
END_CODE
my $re1 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
my $re2 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{';
my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';
print "\$code_string is '$code_string'\n";
if ($code_string =~ /$re1/) {print "For '$re1', \$2 is '$2'\n";}
if ($code_string =~ /$re2/) {print "For '$re2', \$2 is '$2'\n";}
if ($code_string =~ /$re3/) {print "For '$re3', \$2 is '$2'\n";}
exit 0;
__END__
The output of the perl script above is the following:
$code_string is 'my $x = 1;
sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
$x = 2;
return;
} # sub zz
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{', $2 is '
# This is comment 1.
# This is comment 2.
# This is comment 3.
'

Look at only the part of your regex that captures $2. It is (\s*#.*\n). By itself, this can only capture a single comment line. You have an asterisk after it in order to capture multiple comment lines, and this works just fine. It captures multiple comment lines and puts each of them into $2, one by one, each time replacing the previous value of $2. So the final value of $2 when the regex is done matching is the last thing that the capturing group matched, which is the final comment line. Only. To fix it, you need to put the asterisk inside the capturing group. But then you need to put another set of parentheses (non-capturing, this time) to make sure the asterisk applies to the whole thing. So instead of (\s*#.*\n)*, you need ((?:\s*#.*\n)*).
Your third regex works because you unwittingly surrounded the whole expression in parentheses so that you could put a question mark after it. This caused $2 to capture all the comments at once, and $3 to capture only the final comment.
When you are debugging your regex, make sure you print out the values of all the match variables you are using: $1, $2, $3, etc. You would have seen that $1 was just the name of the subroutine and $2 was only the third comment. This might have led you to wonder how on earth your regex skipped over the first two comments when there is nothing between the first and second capturing groups, which would eventually lead you in the direction of discovering what happens when a capturing group matches multiple times.
By the way, it looks like you are also capturing any whitespace after the subroutine name into $1. Is this intentional? (Oops, I messed up my mnemonics and thought \w was "w for whitespace".)

If you add repetition to a capturing group, it will only capture the final match of that group. This is why $regex only matches the final comment line.
Here is how I would rewrite you regex:
my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';
This is very similar to your $re3, except for the following changes:
The white space and comment matching portion is now in a non-capturing group
I changed that portion of the regex from ((...)+)? to ((...)*) which is equivalent.

The problem is that by default the \n isn't part of the string. The regex stops matching at \n.
You need to use the s modifier for multi-line matches:
if ($code_string =~ /$re1/s) {print "For '$re1', \$2 is '$2'\n";}
Note the s after the regex.

Related

regex: confirm if a optional portion was matched

I have a string that can be of two forms, and it is unknown which form it will be each time:
hello world[0:10]; or hello world;
There may or may not be the brackets with numbers. The two words (hello and world) can vary. If the brackets and numbers are there, the first number is always 0 and the second number (10) varies.
I need to capture the first word (hello) and, if it exists, the second number (10). I also need to know which form of the string it was.
hello world[0:10]; I would capture {hello, 10, form1}, and hello world; I would capture {hello, form2}. I don't really care how the "form" is formatted, I just need to be able to differentiate. It can be a bit (1=form1, 0=form2), structural (form1 puts me in one scope and form2 another), etc.
I currently have the following (now working) regex:
/(\w*) \s \w* (?:\[0:(\d*)\])?;/x
This gives me $1 = hello and potentially $2 = 10. I now just need to know if the bracketed numbers were there or not. This will be repeated many times, so I can't assume $2 = undef going into the regex. $2 could also be the same thing a few times in a row so I can't just look for a change in $2 before and after the regex.
My best solution so far is to run the regex twice, the first time with the brackets and the second time without:
if( /(\w*) \s \w* \[0:(\d*)\];/x ) {...}
elsif( /(\w*) \s \w*;/x ) {...}
This seems very inefficient and inelegant though so I was wondering if there is a better way?
You can use ? to optionally match portions of your regex. Then you can capture the output directly as a return value from the regex.
my $re = qr{ (\w*) \s* (?:\[0:(\d+)\])?; }x;
if( my($word, $num) = $line =~ $re ) {
say "Word: $word";
say "Num: $num" if defined $num;
}
else {
say "No match";
}
(?:\[0:(\d+)\])? says there may be a [0:\d+]. (?:) makes the grouping non-capturing so only \d+ is captured.
$1 and $2 are also safe to use, they are reset on each match, but using lexical variables makes things more explicit.

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
^ ^ ^^^
+--------+ |
Note 1 Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around left side expression of = operator to force array context for regexp evaluation. See m// and // in perlop documentation.[1] You can write
parentheses also around =~ binding operator to improve readability but it is not necessary because =~ has pretty high priority.
Use POSIX Character Classes word
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience because you can write my ($a, $b, $c, #rest) as equivalent of (my $a, my $b, my $c, my #rest) but you can also control scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, #_).
This Regex to be matched:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
^\s+ the word between two spaces
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]|\s]+(\S+)/);

Sub-pattern in regex can't be dereferenced?

I have following Perl script to extract numbers from a log. It seems that the non-capturing group with ?: isn't working when I define the sub-pattern in a variable. It's only working when I leave out the grouping in either the regex-pattern or the sub-pattern in $number.
#!/usr/bin/perl
use strict;
use warnings;
my $number = '(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)';
#my $number = '-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?';
open(FILE,"file.dat") or die "Exiting with: $!\n";
while (my $line = <FILE>) {
if ($line =~ m{x = ($number). y = ($number)}){
print "\$1= $1\n";
print "\$2= $2\n";
print "\$3= $3\n";
print "\$4= $4\n";
};
}
close(FILE);
The output for this code looks like:
$1= 12.15
$2= 12.15
$3= 3e-5
$4= 3e-5
for an input of:
asdf x = 12.15. y = 3e-5 yadda
Those doubled outputs aren't desired.
Is this because of the m{} style in contrast to the regular m// patterns for regex? I only know the former style to get variables (sub-strings) in my regex expressions. I just noticed this for the backreferencing so possibly there are other differences for metacharacters?
The delimiters you use for the regular expression aren't causing any problems but the following is:
(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)
^^
Notice this isn't a capturing group, it is an optional colon :
Probably a typo mistake but it is causing the trouble.
Edit: It looks that it is not a typo mistake, i substituted the variables in the regex and I got this:
x = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)). y = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?))
^^ first and second group ^^ ^^ third and fourth grouop ^^
As you can see the first and second capturing group are capturing exactly the same thing, the same is happening for the third and fourth capturing group.
You're going to kick yourself...
Your regexp reads out as:
capture {
maybe-colon
maybe-minus
cluster { (?:(?:\d+\.?\d*)|(?:\.\d+))
cluster { (?:\d+\.?\d*)
1+ digits
maybe-dot
0+ digits
}
-or-
cluster { (?:\.\d+)
dot
1+digits
}
}
maybe cluster {
E or e
maybe + or -
1+ digets
} (?:[Ee][+-]?\d+)?
}
... which is what you're looking for.
However, when you then do your actual regexp, you do:
$line =~ m{x = $number. y = $number})
(the curly braces are a distraction.... you may use any \W if the m or s has been specified)
What this is asking is to capture whatever the regexp defined in $number is.... which is, itself, a capture.... hence $1 and $2 being the same thing.
Simply remove the capture braces from either $number or the regexp line.

Finding the N th Occurrence of a Match line

I have list (multiline text string) with same number of line (order of items may differ in many ways and numbers of line may be however):
Ardei
Mere
Pere
Ardei
Castraveti
I want to find 2 th occurrence of a match line that contain 'Ardei' and replace name of item with another name and, separately in another regex, find 1 st occurrence of 'Ardei' and replace name with something else (perl).
Let's say you want to replace the 2nd "Ardei" with "XYZ". You could do that like this (PCRE syntax):
^(?s)(.*?Ardei.*?)Ardei
and replace it with:
$1XYZ
The $1 contains everything that is captured in (.*?Ardei.*?) and the (?s) will cause the . to match really every character (also line break chars).
A little demo:
#!/usr/bin/perl -w
my $text = 'Ardei
Mere
Pere
Ardei
Castraveti
Ardei';
$text =~ s/^(?s)(.*?Ardei.*?)Ardei/$1XYZ/;
# or just: $text =~ s/^(.*?Ardei.*?)Ardei/$1XYZ/s;
print $text;
will print:
Ardei
Mere
Pere
XYZ
Castraveti
Ardei
Ardei[\W\w]*?(Ardei)
will match exactly the second "Ardei" by its \1, so you can use it to replace exactly the second instance.

Why can't I match a substring which may appear 0 or 1 time using /(subpattern)?/

The original string is like this:
checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19
The last part "fail1:19" may appear 0 or 1 time. And I tried to match the number after "fail1:", which is 19, using this:
($reg_suc, $reg_fail) = ($1, $2) if $line =~ /^checksession\s+ok:(\d+).*(fail1:(\d+))?/;
It doesn't work. The $2 variable is empty even if the "fail1:19" does exist. If I delete the "?", it can match only if the "fail1:19" part exists. The $2 variable will be "fail1:19". But if the "fail1:19" part doesn't exist, $1 and $2 neither match. This is incorrect.
How can I rewrite this pattern to capture the 2 number correctly? That means when the "fail1:19" part exist, two numbers will be recorded, and when it doesn't exit, only the number after "ok:" will be recorded.
First, the number in fail field would end in $3, as those variables are filled according to opening parentheses. Second, as codaddict shows, the .* construct in RE is hungry, so it will eat even the fail... part. Third, you can avoid numbered variables like this:
my $line = "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
if(my ($reg_suc, $reg_fail, $addend)
= $line =~ /^checksession\s+ok:(\d+).*?(fail1:(\d+))?$/
) {
warn "$reg_suc\n$reg_fail\n$addend\n";
}
Try the regex:
^checksession\s+ok:(\d+).*?(fail1:(\d+))?$
Ideone Link
Changes made:
.* in the middle has been made
non-greedy and
$ (end anchor) has been added.
As a result of above changes .*? will try to consume as little as possible and the end anchor forces the regex to match till the end of the string, matching fail1:number if present.
I think this is one of the few cases where a split is actually more robust than a regex:
$bar[0]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
$bar[1]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081";
for $line (#bar){
(#fields) = split/ /,$line;
$reg_suc = $fields[1];
$reg_fail = $fields[5];
print "$reg_suc $reg_fail\n";
}
I try to avoid the non-greedy modifier. It often bites back. Kudos for suggesting split, but I'd go a step further:
my %rec = split /\s+|:/, ( $line =~ /^checksession (.*)/ )[0];
print "$rec{ok} $rec{fail1}\n";