$1 and multiple groupings - regex

If I have a regular expression and if I am expecting one or the other terms to match, say
ab*cc or ab*dd
I have a RegEx like
while($line =~ /(ab*cc)|(ab*dd)/g)
{
# print match whether its abcc or abdd
# print $1?
}
But I am unsure how $1 will work.
Is there such a thing as $2 meaning that it will be $1 if it matches abcc or $2 if it matches abdd? How can I extend this if I have say 3 groupings or so, meaning it could either be X or Y or Z?

You could use:
while ($line =~ m/((ab*cc)|(ab*dd))/g)
Now $1 will be whichever of the two terms matches, and $2 will be the same if the first term matches but undefined otherwise, while $3 will be the same if the second term matches but undefined otherwise. The extension to three or more terms should be obvious.
The m// notation is a marginally more explicit notation equivalent to //. Otherwise, it does not alter things. The $1 etc values are determined by the order of the open parentheses (. The outer pair wraps everything that's matched; the two inner pairs capture the terms. Note that if you had m/((ab*cc)+|(ab*dd)+)/g, the contents of $2 or $3 would be the last of the repeated terms, not the complete set of the repeated terms.
Example 1
$ cat example1.pl
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <>)
{
chomp $line;
print "Line: <<$line>>\n";
while ($line =~ m/((ab*cc)|(ab*dd))/g)
{
printf "\$1 = <<%s>>; \$2 = <<%s>>; \$3 = <<%s>>\n",
$1 // "undef", $2 // "undef", $3 // "undef";
}
}
$ perl example1.pl
abbccabccaccaddabddabbdddabbbdddd
Line: <<abbccabccaccaddabddabbdddabbbdddd>>
$1 = <<abbcc>>; $2 = <<abbcc>>; $3 = <<undef>>
$1 = <<abcc>>; $2 = <<abcc>>; $3 = <<undef>>
$1 = <<acc>>; $2 = <<acc>>; $3 = <<undef>>
$1 = <<add>>; $2 = <<undef>>; $3 = <<add>>
$1 = <<abdd>>; $2 = <<undef>>; $3 = <<abdd>>
$1 = <<abbdd>>; $2 = <<undef>>; $3 = <<abbdd>>
$1 = <<abbbdd>>; $2 = <<undef>>; $3 = <<abbbdd>>
$
Example 2
$ cat example2.pl
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <>)
{
chomp $line;
print "Line: <<$line>>\n";
while ($line =~ m/((ab*cc)+|(ab*dd)+)/g)
{
printf "\$1 = <<%s>>; \$2 = <<%s>>; \$3 = <<%s>>\n",
$1 // "undef", $2 // "undef", $3 // "undef";
}
}
$ perl example2.pl
abbccabccacc
Line: <<abbccabccacc>>
$1 = <<abbccabccacc>>; $2 = <<acc>>; $3 = <<undef>>
$

Capturing parentheses are numbered in the order of their appearance. In an or'ed group like that, either one or the other will match. To test which matched, simply use defined:
while($line =~ /(ab*cc)|(ab*dd)/g)
{
if (defined $1) {
print "first group matched: $1";
} elsif (defined $2) {
print "second group matched: $2";
}
}
If you do not care about distinguishing which group matched, just use a single pair of parentheses around the entire expression:
while($line =~ /(ab*cc|ab*dd)/g)
{
print "Will hold whichever matched: $1";
}
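The same idea extends to three (or more) alternatives. Here is a minimal sketch, assuming a hypothetical third term ab*ee that is not in the question: the first sub reports which group matched by testing defined, the second uses a branch reset group (?|...), available since Perl 5.10, so that whichever alternative matches lands in $1. The sub names are made up for illustration.

```perl
use strict;
use warnings;

# Return [group_number, matched_text] pairs for every match in the line.
sub classify_matches {
    my ($line) = @_;
    my @found;
    while ($line =~ /(ab*cc)|(ab*dd)|(ab*ee)/g) {
        # After each successful match exactly one of $1, $2, $3 is defined.
        my $which = defined $1 ? 1 : defined $2 ? 2 : 3;
        push @found, [ $which, $& ];
    }
    return @found;
}

# Branch reset: every alternative captures into $1.
sub all_matches {
    my ($line) = @_;
    my @found;
    while ($line =~ /(?|(ab*cc)|(ab*dd)|(ab*ee))/g) {
        push @found, $1;
    }
    return @found;
}

print join(", ", map { "$_->[0]:$_->[1]" } classify_matches("abccaddabbee")), "\n";
print join(", ", all_matches("abccaddabbee")), "\n";
```

The first form is the one to pick when the code needs to know which term matched; the second is enough when only the matched text matters.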

Related

perl regex matching, why is it not finding all matches, why is the order important?

I ran into a problem with Perl's regex matching. I distilled it down to a small example on the command line. Why is the order in which the matches are attempted important here?
1.
$ echo "XYG" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
2.
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } else { print "No match on G\n"; } '
Matches X
Matches Y
No match on G
The first example matches all three letters as expected, but the second example does not match the letter G. Why?
However if I create an intermediate variable, here named $aa:
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; $aa = $_; if ($aa =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
Then the match works again ?
My perl version is:
$ perl -e 'print "$]\n";'
5.022001
On a LM 18.2 machine
$ lsb_release -d
Description: Linux Mint 18.2 Sonya
Because if you match a regex in a scalar context like that, and you set the g flag (for global matching) it's iterative - that's to allow you to do things like while ( m/somepattern/g ) { and have it trigger multiple times.
That's because g means:
g - globally match the pattern repeatedly in the string
It wouldn't be particularly useful if it reset each time you tried it. But you can also use it slightly differently, in list context:
my @matches = $str =~ m/(some_capture)/g;
And that'll select them all into a list.
But with your code and regex debugging:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
$_ = 'GXY';
if ( $_ =~ m/X/gi ) { print "Matches X\n"; }
if ( $_ =~ m/Y/gi ) { print "Matches Y\n"; }
if ( $_ =~ m/G/gi ) { print "Matches G\n"; }
else { print "No match on G\n"; }
You'll get (snipped for brevity):
Matching REx "X" against "GXY"
Matching REx "Y" against "Y"
Matching REx "G" against ""
The first match 'eats' "GX" to find "X", leaving "Y" for the next match, but nothing at all for the "G" match.
The simple workaround is to omit the g flag, because then you're saying explicitly 'match once' and you'll get:
Matches X
Matches Y
Matches G
Alternatively, you can use the global match with a character class:
$_ = 'GXY';
my @matches = m/([GYX])/g; # implicitly operates on $_
print "Match on $_\n" for @matches;
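The position that a scalar-context /g match remembers can be inspected with pos(). The following is only a small sketch of the behaviour described above, using a throwaway variable $s and a log array that are not part of the question: after X is found the next /g match starts behind it, the failed search for G resets the position, and a second attempt then succeeds from the start of the string.

```perl
use strict;
use warnings;

my $s = 'GXY';
my @log;

$s =~ /X/g;                            # scalar-context /g match succeeds
push @log, pos($s);                    # 2: the next /g match starts after "X"

my $found_g = ($s =~ /G/g) ? 1 : 0;    # fails: only "Y" is left to search
push @log, $found_g;
push @log, defined pos($s) ? pos($s) : 'undef';   # a failed /g match resets pos

$found_g = ($s =~ /G/g) ? 1 : 0;       # succeeds now, searching from offset 0
push @log, $found_g;

print join(' ', @log), "\n";           # prints: 2 0 undef 1
```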

How to refer to matched part in regex

I am using the following code to search for a substring and print it out with a few characters before and after it. Somehow Perl takes issue with me using $1 and complains about
Use of uninitialized value $1 in concatenation (.) or string.
I cannot figure out why...can you?
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
You have to capture $word into regex group $1 by using parentheses,
while ($lines =~ m/\b($word)\b/g)
When you use $1, you are asking the code to use the first captured group from the regex and since your regex doesn't have any, well, that variable won't exist.
You can either refer to the whole match with $& or you add a capture group to your regex and keep using $1.
i.e. Either:
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$&";
print substr ($lines, max(pos($lines)-length($&)-$context, 0), length($&)+$context); # check: am I possibly violating any boundaries here
}
Or
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/(\b$word\b)/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
Note: It doesn't matter whether you use (\b$word\b) or (\b$word)\b or \b($word\b) or \b($word)\b here, because \b is a zero-width assertion: it matches a position, not any characters.
When you want to refer to a matched part of a regex, put it in parentheses. Then you'll be able to refer to this matched part via the $1 variable (for the first pair of parentheses), $2 (for the second pair) and so on.
The values $1, $2 and so on hold the strings found by capture groups. When a match succeeds, all of these variables are reset, so any that do not correspond to a capture group in the pattern are undefined. The code in the question does not have any capture groups and hence $1 is never given a value; it is undefined.
Running the code below shows the effect. Initially $1, $2 and $3 are not defined. The first match sets $1 and $2 but not $3. The second match sets only $1, but note that $2 is cleared back to undefined. The third match has no capture groups, so all three are undefined.
use strict;
use warnings;
sub show
{
printf "\$1: %s\n", (defined $1 ? $1 : "-undef-");
printf "\$2: %s\n", (defined $2 ? $2 : "-undef-");
printf "\$3: %s\n", (defined $3 ? $3 : "-undef-");
print "\n";
}
my $text = "abcdefghij";
show();
$text =~ m/ab(cd)ef(gh)ij/; # First match
show();
$text =~ m/ab(cd)efghij/; # Second match
show();
$text =~ m/abcdefghij/; # Third match
show();
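One related point the code above does not exercise: a match that fails leaves the capture variables as they were, so it is worth checking that a match succeeded before trusting $1. A minimal sketch (the @seen array and the $ok1/$ok2 flags are only there for illustration):

```perl
use strict;
use warnings;

my $text = "abcdefghij";
my @seen;

my $ok1 = $text =~ /ab(cd)/;     # succeeds and sets $1 to "cd"
push @seen, $1;

my $ok2 = $text =~ /zz(yy)/;     # fails, so $1 is left untouched
push @seen, $1;                  # still "cd", a stale value from the earlier match

print join(",", @seen), "\n";    # prints: cd,cd
print "second match ", ($ok2 ? "succeeded" : "failed"), "\n";
```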
$1 will have no value unless you are actually capturing something.
You can adjust your boundary collection method to using lookahead and lookbehinds.
use strict;
use warnings;
my $lines = "this is just a test to find something out";
my $word = "test";
my $extra = 10;
while ($lines =~ m/(?:(?<=(.{$extra}))|(.{0,$extra}))\b(\Q$word\E)\b(?=(.{0,$extra}))/gs ) {
my $pre = $1 // $2;
my $word = $3;
my $post = $4;
print "'...$pre<$word>$post...'\n";
}
Outputs:
'...is just a <test> to find s...'
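If the lookbehind version feels heavy, the same context can be cut out with substr and the match offsets in @- and @+. This is only a sketch along the lines of the code in the question; the sub name context_snippets and its parameters are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Return each occurrence of $word with up to $context characters on either side.
sub context_snippets {
    my ($text, $word, $context) = @_;
    my @snippets;
    while ($text =~ /\b\Q$word\E\b/g) {
        # $-[0] and $+[0] are the start and end offsets of the whole match.
        my $start = max($-[0] - $context, 0);    # do not run off the front
        my $len   = ($+[0] + $context) - $start; # substr clips at the end of the string
        push @snippets, substr($text, $start, $len);
    }
    return @snippets;
}

print "'$_'\n" for context_snippets("this is just a test to find something out", "test", 3);
# prints: ' a test to'
```

No capture group is needed here, because the offsets already describe the whole match.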

How can I count the amount of spaces at the start of a string in Perl?

How can I count the amount of spaces at the start of a string in Perl?
I now have:
$temp = rtrim($line[0]);
$count = ($temp =~ tr/^ //);
But that gives me the count of all spaces.
$str =~ /^(\s*)/;
my $count = length( $1 );
If you just want actual spaces (instead of whitespace), then that would be:
$str =~ /^( *)/;
Edit: The reason why tr doesn't work is that it's not a regular expression operator. What you're doing with $count = ( $temp =~ tr/^ // ); is replacing all instances of ^ and space with themselves (see comment below by cjm), then counting up how many replacements you've done. tr doesn't see ^ as "hey this is the beginning of the string pseudo-character", it sees it as "hey this is a ^".
You can get the offset of a match using @-. If you search for a non-whitespace character, this will be the number of whitespace characters at the start of the string:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
my $count = $s =~ /\S/ ? $-[0] : length $s;
print "'$s' has $count whitespace characters at its start\n";
}
Or, even better, use @+ to find the end of the whitespace:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
$s =~ /^\s*/;
print "$+[0] '$s'\n";
}
Here's a script that does this for every line of stdin. The relevant snippet of code is the first in the body of the loop.
#!/usr/bin/perl
while ($x = <>) {
$s = length(($x =~ m/^( *)/)[0]); # ( *) rather than ( +), so lines without leading spaces still match and count as 0
print $s, ":", $x, "\n";
}
tr/// is not a regex operator. However, you can use s///:
use strict; use warnings;
my $t = (my $s = " \t\n sdklsdjfkl");
my $n = 0;
++$n while $s =~ s{^\s}{};
print "$n \\s characters were removed from \$s\n";
$n = ( $t =~ s{^(\s*)}{} ) && length $1;
print "$n \\s characters were removed from \$t\n";
Since the regexp matcher returns the parenthesized matches when called in list context, CanSpice's answer can be written in a single statement:
$count = length( ($line[0] =~ /^( *)/)[0] );
This prints the amount of leading whitespace:
echo "   hello" |perl -lane 's/^(\s+)(.*)+$/length($1)/e; print'
3
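The regex approach from the answers above can be wrapped in two small helpers. This is a sketch rather than anyone's posted code; the names leading_spaces and leading_whitespace are hypothetical. Because the patterns use * rather than +, they always match, so the capture is never undefined.

```perl
use strict;
use warnings;

# Count literal space characters at the start of the string.
sub leading_spaces {
    my ($s) = @_;
    my ($lead) = $s =~ /^( *)/;    # always matches, possibly with an empty capture
    return length $lead;
}

# Count any leading whitespace (spaces, tabs, newlines, ...).
sub leading_whitespace {
    my ($s) = @_;
    my ($lead) = $s =~ /^(\s*)/;
    return length $lead;
}

for my $s ("foo bar", "  foo bar", " \tfoo", "   ") {
    printf "%d spaces, %d whitespace\n", leading_spaces($s), leading_whitespace($s);
}
```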

Perl regex: how to know number of matches

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (@{$regexs_ref}) {
LINE: for (@rawfile) {
/@$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?
If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my @list = /$my_regex/g;
if (@list)
{
# do stuff
print "Number of matches: " . scalar @list . "\n";
}
You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my @regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } @regexes;
while (my $line = <DATA>) {
for my $regex (@regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (@regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz
In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special @- array (emphasis added):
@LAST_MATCH_START
@-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my @patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my @rawfile = <DATA>);
foreach my $pattern (@patterns) {
LINE: for (@rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture
How about below code:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.
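The = () = idiom works because the match runs in list context and the outer scalar assignment then takes the number of elements it produced. A hedged sketch of wrapping it up per regex follows; the helper name count_matches is made up. One caveat: if the pattern contains capture groups, a list-context /g match returns every capture, so the count is matches times groups.

```perl
use strict;
use warnings;

# Count how many times $re matches in $text.
# Assumes $re has no capture groups (otherwise each capture is counted).
sub count_matches {
    my ($text, $re) = @_;
    my $count = () = $text =~ /$re/g;   # list-context match, then count the list
    return $count;
}

my $string = "12345yx67hjui89";
print count_matches($string, qr/\d/),  "\n";   # single digits
print count_matches($string, qr/\d+/), "\n";   # runs of digits
print count_matches($string, qr/z/),   "\n";   # no match at all
```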

What is the scope of $1 through $9 in Perl?

What is the scope of $1 through $9 in Perl? For instance, in this code:
sub bla {
my $x = shift;
$x =~ s/(\d*)/$1 $1/;
return $x;
}
my $y;
# some code that manipulates $y
$y =~ /(\w*)\s+(\w*)/;
my $z = &bla($2);
my $w = $1;
print "$1 $2\n";
What will $1 be? Will it be the first \w* from $x or the first \d* from the second \w* in $x?
from perldoc perlre
The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+ , $& , $` , $' , and $^N ) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See ""Compound Statements"" in perlsyn.)
This means that the first time you run a regex or substitution in a scope a new localized copy is created. The original value is restored (à la local) when the scope ends. So, $1 will be 10 up until the regex is run, 20 after the regex, and 10 again when the subroutine is finished.
But I don't use regex variables outside of substitutions. I find it much clearer to say things like
#!/usr/bin/perl
use strict;
use warnings;
sub bla {
my $x = shift;
$x =~ s/(\d*)/$1 $1/;
return $x;
}
my $y = "10 20";
my ($first, $second) = $y =~ /(\w*)\s+(\w*)/;
my $z = &bla($second);
my $w = $first;
print "$first $second\n";
where $first and $second have better names that describe their contents.
By making a couple of small alterations to your example code:
sub bla {
my $x = shift;
print "$1\n";
$x =~ s/(\d+)/$1 $1/;
return $x;
}
my $y = "hello world9";
# some code that manipulates $y
$y =~ /(\w*)\s+(\w*)/;
my $z = &bla($2);
my $w = $1;
print "$1 $2\n$z\n";
we get the following output:
hello
hello world9
world9 9
showing that $1 is dynamically scoped: the $1 assigned within bla ceases to exist at the end of that function, while the $1 assigned by the match on $y is accessible within bla until a successful match there overwrites it.
The variables will be valid until the next time they are written to in the flow of execution.
But really, you should be using something like:
my ($match1, $match2) = $var =~ /(\d+)\D(\d+)/;
Then use $match1 and $match2 instead of $1 and $2, it's much less ambiguous.
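Copying the captures into lexical variables also makes it easy to handle a failed match in the same statement. A minimal sketch, assuming a hypothetical helper parse_pair built around the pattern above:

```perl
use strict;
use warnings;

# Return the two numbers found in $var, or an empty list if the pattern fails.
sub parse_pair {
    my ($var) = @_;
    # In list context the match returns its captures; on failure it returns
    # an empty list, so the assignment evaluates to 0 and we bail out.
    my ($match1, $match2) = $var =~ /(\d+)\D(\d+)/
        or return;
    return ($match1, $match2);
}

if (my ($x, $y) = parse_pair("10 20")) {
    print "got $x and $y\n";
}
print "no numbers found\n" unless parse_pair("nothing here");
```

Because the values live in $match1 and $match2, a later match inside some other subroutine cannot change them, which sidesteps the scoping question entirely.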