Perl regex matches string differently when providing it as variable

I've got a question regarding regex matching in Perl. See following code snippet:
my $r = '^\z';
my $s = "";
$r =~ /$s/ ? print "Match\n" : print "No match\n";
Output: Match
The following snippet, however, prints:
my $r = '^\z';
$r =~ /""/ ? print "Match\n" : print "No match\n";
Output: No match
Why? Is this a Perl syntax thing I do not understand?

my $s = "";
is the same as
my $s = '';
or
my $s = q();
i.e. it assigns an empty string to $s.
In
$r =~ /""/
you're testing whether $r contains two double quotes.
To assign a pair of double quotes to $s, use
my $s = '""';

my $r = '^\z';
$r =~ /""/ ? print "Match\n" : print "No match\n";
Here, $r is the test string. /""/ is the regex.
There is no "" in $r, so it does not match.
Here /""/ is a pattern delimited by the / character, which Perl compiles into a regex.
It is equivalent to:
$str = '""';
$rx = qr/$str/;
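The difference described above can be sketched as follows. The check helper is a hypothetical name introduced only for illustration, and the (?:...) wrapper is an added precaution, not part of the original code: Perl treats a literally empty pattern as "reuse the last successful regex", and the wrapper avoids that special case when the helper is called repeatedly.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether $pattern matches inside $target.
# (?:...) keeps the pattern from being literally empty, which sidesteps
# Perl's rule that an empty pattern reuses the last successful regex.
sub check {
    my ($target, $pattern) = @_;
    return $target =~ /(?:$pattern)/ ? "Match" : "No match";
}

my $r = '^\z';
print check($r, ''),   "\n";          # Match: an empty pattern matches anywhere
print check($r, '""'), "\n";          # No match: $r has no pair of double quotes
print check('say ""', '""'), "\n";    # Match: this target does contain ""
```

The first call corresponds to /$s/ with an empty $s, the second to the literal /""/ from the question.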

The $s variable interpolates to nothing, so the pattern is empty. You can check this yourself: add use re 'debug'; to the code, or add -Mre=debug to the perl invocation:
first example
$ perl -Mre=debug -E '$r=q{^\z}; $s=""; $r =~ /$s/ ? say "Match" : say "No match"'
Compiling REx ""
Final program:
1: NOTHING (2)
2: END (0)
minlen 0
Matching REx "" against "^\z"
0 <> <^\z> | 1:NOTHING(2)
0 <> <^\z> | 2:END(0)
Match successful!
Match
Freeing REx: ""
second example
$ perl -Mre=debug -E '$r=q{^\z}; $r =~ /""/ ? say "Match" : say "No match"'
Compiling REx "%"%""
Final program:
1: EXACT <""> (3)
3: END (0)
anchored "%"%"" at 0 (checking anchored isall) minlen 2
Guessing start of match in sv for REx "%"%"" against "^\z"
Did not find anchored substr "%"%""...
Match rejected by optimizer
No match
Freeing REx: "%"%""

Related

perl regex matching, why is it not finding all matches, why is the order important?

I ran into a problem with Perl's regex matching. I distilled it down to a small example on the command line. Why is the order in which the matches are attempted important here?
1.
$ echo "XYG" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
2.
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } else { print "No match on G\n"; } '
Matches X
Matches Y
No match on G
The first example matches all three letters as expected, but the second example does not match the letter G. Why?
However if I create an intermediate variable, here named $aa:
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; $aa = $_; if ($aa =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
Then the match works again ?
My perl version is:
$ perl -e 'print "$]\n";'
5.022001
On a LM 18.2 machine
$ lsb_release -d
Description: Linux Mint 18.2 Sonya
Ty+BR
Max.
Because if you match a regex in a scalar context like that, and you set the g flag (for global matching) it's iterative - that's to allow you to do things like while ( m/somepattern/g ) { and have it trigger multiple times.
That's because g means:
g - globally match the pattern repeatedly in the string
It wouldn't be particularly useful if it reset each time you tried it. But you can also use it slightly differently in list context:
my @matches = $str =~ m/(some_capture)/g;
And that'll select them all into a list.
But with your code and regex debugging:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
$_ = 'GXY';
if ( $_ =~ m/X/gi ) { print "Matches X\n"; }
if ( $_ =~ m/Y/gi ) { print "Matches Y\n"; }
if ( $_ =~ m/G/gi ) { print "Matches G\n"; }
else { print "No match on G\n"; }
You'll get (snipped for brevity):
Matching REx "X" against "GXY"
Matching REx "Y" against "Y"
Matching REx "G" against ""
The first match 'eats' "GX" to find "X", leaving "Y" for the next match, but nothing at all for the "G" match.
The simple workaround is to omit the g flag, because then you're saying explicitly 'match once' and you'll get:
Matches X
Matches Y
Matches G
Alternatively, you can use the global match with a character class:
$_ = 'GXY';
my @matches = m/([GYX])/g; #implicitly operates on $_
print "Match on $_\n" for @matches;
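The position bookkeeping behind this can be made visible with pos(). The sketch below is only an illustration; the step helper and the @log array are hypothetical names, not part of the original code. It records whether each scalar-context /g match succeeded and where the next search would start.

```perl
use strict;
use warnings;

my $str = 'GXY';
my @log;

# Hypothetical helper: run one scalar-context /g match against $str
# and record the outcome together with the resulting pos($str).
sub step {
    my ($re) = @_;
    my $ok  = $str =~ /$re/g ? 'hit' : 'miss';
    my $pos = defined pos($str) ? pos($str) : 'undef';
    push @log, join(':', $re, $ok, $pos);
}

step('X');   # finds X, next search starts at offset 2
step('Y');   # finds Y, next search starts at offset 3
step('G');   # starts at offset 3, so G is not found; pos is reset
step('G');   # after the reset the search starts at 0 and finds G

print "$_\n" for @log;
```

A failed /g match resets the position (unless the c flag is also given), which is why the second attempt at G succeeds.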

How to refer to matched part in regex

I am using the following code to search for a substring and print it out with a few characters before and after it. Somehow Perl takes issue with me using $1 and complains about
Use of uninitialized value $1 in concatenation (.) or string.
I cannot figure out why...can you?
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
You have to capture $word into regex group $1 by using parentheses,
while ($lines =~ m/\b($word)\b/g)
When you use $1, you are asking the code to use the first captured group from the regex and since your regex doesn't have any, well, that variable won't exist.
You can either refer to the whole match with $& or you add a capture group to your regex and keep using $1.
i.e. Either:
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$&";
print substr ($lines, max(pos($lines)-length($&)-$context, 0), length($&)+$context); # check: am I possibly violating any boundaries here
}
Or
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/(\b$word\b)/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
Note: It doesn't matter whether you use (\b$word\b) or (\b$word)\b or \b($word\b) or \b($word)\b here, because \b is a zero-width assertion: it matches a position, not any characters.
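If the goal is a few characters of context on both sides of the word, one option is to work from the match offsets rather than from pos(). This is a sketch under that assumption; context_of is a hypothetical helper name, and \Q...\E is added so that the word is matched literally.

```perl
use strict;
use warnings;
use List::Util qw(max min);

# Hypothetical helper: return each occurrence of $word in $text together
# with up to $context characters before and after it.
sub context_of {
    my ($text, $word, $context) = @_;
    my @hits;
    while ($text =~ /\b(\Q$word\E)\b/g) {
        # $-[1] and $+[1] are the start and end offsets of capture group 1.
        my $start = max($-[1] - $context, 0);
        my $end   = min($+[1] + $context, length $text);
        push @hits, substr($text, $start, $end - $start);
    }
    return @hits;
}

my $lines = "this is just a test to find something out";
print "'$_'\n" for context_of($lines, "test", 3);   # prints ' a test to'
```

Clamping with max and min keeps the window inside the string when the word sits near either end.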
When you want to refer to a matched part of a regex, put it in parentheses. Then you'll be able to access that matched part via the $1 variable (for the first pair of parentheses), $2 (for the second pair) and so on.
The variables $1, $2 and so on hold the strings matched by capture groups. After a successful match they reflect only that match, so any group that is absent from the pattern, or that did not take part in the match, is undef. The code in the question does not have any capture groups, hence $1 is never given a value and is undefined.
Running the code below shows the effect. Initially $1, $2 and $3 are not defined. The first match sets $1 and $2 but not $3. The second match sets only $1, but note that $2 is cleared back to undefined. The third match has no capture groups and all three are undefined.
use strict;
use warnings;
sub show
{
printf "\$1: %s\n", (defined $1 ? $1 : "-undef-");
printf "\$2: %s\n", (defined $2 ? $2 : "-undef-");
printf "\$3: %s\n", (defined $3 ? $3 : "-undef-");
print "\n";
}
my $text = "abcdefghij";
show();
$text =~ m/ab(cd)ef(gh)ij/; # First match
show();
$text =~ m/ab(cd)efghij/; # Second match
show();
$text =~ m/abcdefghij/; # Third match
show();
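A related point: a failed match leaves $1 from the previous successful match in place, so it is safer to take captures from the match itself. The sketch below illustrates this; first_capture is a hypothetical helper and assumes the pattern contains at least one capture group.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture of $re in $text,
# or undef when the match fails. Assumes $re has a capture group.
sub first_capture {
    my ($text, $re) = @_;
    my ($first) = $text =~ $re;   # list context: captures, or () on failure
    return $first;
}

my $text = "abcdefghij";

my $ok1   = $text =~ /ab(cd)/;    # succeeds and sets $1 to "cd"
my $ok2   = $text =~ /zz(gh)/;    # fails, so $1 is left untouched
my $stale = $1;                   # still "cd" from the earlier match

print "stale \$1: $stale\n";
print "safe: ", (first_capture($text, qr/zz(gh)/) // "-undef-"), "\n";
```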
$1 will have no value unless you are actually capturing something.
You can also collect the surrounding context with lookbehind and lookahead assertions.
use strict;
use warnings;
my $lines = "this is just a test to find something out";
my $word = "test";
my $extra = 10;
while ($lines =~ m/(?:(?<=(.{$extra}))|(.{0,$extra}))\b(\Q$word\E)\b(?=(.{0,$extra}))/gs ) {
my $pre = $1 // $2;
my $word = $3;
my $post = $4;
print "'...$pre<$word>$post...'\n";
}
Outputs:
'...is just a <test> to find s...'

pattern matching in regular expression (Perl)

Make a pattern that will match three consecutive copies of whatever is currently contained in $what. That is, if $what is fred, your pattern should match fredfredfred. If $what is fred|barney, your pattern should match fredfredbarney, barneyfredfred, barneybarneybarney, or many other variations. (Hint: You should set $what at the top of the pattern test program with a statement like my $what = 'fred|barney';)
But my solution to this is just too easy, so I'm assuming it's wrong. My solution is:
#!/usr/bin/perl
use warnings;
use strict;
while (<>){
chomp;
if (/fred|barney/ig) {
print "pattern found! \n";
}
}
It displays what I want, and I didn't even have to save the pattern in a variable. Can someone help me through this? Or enlighten me if I'm doing/understanding the problem wrong?
This example should clear up what was wrong with your solution:
my @tests = qw(xxxfooxx oofoobar bar bax rrrbarrrrr);
my $str = 'foo|bar';
for my $test (@tests) {
my $match = $test =~ /$str/ig ? 'match' : 'not match';
print "$test did $match\n";
}
OUTPUT
xxxfooxx did match
oofoobar did match
bar did match
bax did not match
rrrbarrrrr did match
SOLUTION
#!/usr/bin/perl
use warnings;
use strict;
# notice the example has the `|`. Meaning
# match "fred" or "barney" 3 times.
my $str = 'fred|barney';
my @tests = qw(fred fredfredfred barney barneybarneybarny barneyfredbarney);
for my $test (@tests) {
if( $test =~ /^($str){3}$/ ) {
print "$test matched!\n";
} else {
print "$test did not match!\n";
}
}
OUTPUT
$ ./test.pl
fred did not match!
fredfredfred matched!
barney did not match!
barneybarneybarny did not match!
barneyfredbarney matched!
use strict;
use warnings;
my $s="barney/fred";
my @ra=split("/", $s);
my $test="barneybarneyfred"; #etc, this will work on all permutations
if ($test =~ /^(?:$ra[0]|$ra[1]){3}$/)
{
print "Valid\n";
}
else
{
print "Invalid\n";
}
split breaks your string apart on "/". (?:$ra[0]|$ra[1]) says group, but do not capture, "barney" or "fred"; {3} says exactly three copies. Add an i after the closing "/" if case doesn't matter. The ^ says "begins with," and the $ says "ends with."
EDIT:
If you need the format to be barney\fred, use this:
my $s="barney\\fred";
my @ra=split(/\\/, $s);
If you know that the matching will always be on fred and barney, then you can just replace $ra[0], $ra[1] with fred and barney.
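The grouping is what makes the exercise work when $what contains an alternation. The sketch below contrasts the grouped and ungrouped forms; both helper names are hypothetical, and the pattern is left unanchored on the assumption that the exercise only asks for three consecutive copies somewhere in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: does $text contain three consecutive copies of $what?
sub three_copies {
    my ($what, $text) = @_;
    return $text =~ /(?:$what){3}/ ? 1 : 0;
}

# Hypothetical helper showing the ungrouped pattern for contrast.
# For 'fred|barney' it becomes fred|barney{3}, which means
# "fred", or "barne" followed by three y characters.
sub ungrouped {
    my ($what, $text) = @_;
    my $pattern = $what . '{3}';
    return $text =~ /$pattern/ ? 1 : 0;
}

my $what = 'fred|barney';
print three_copies($what, 'fredbarneyfred'), "\n";   # 1
print three_copies($what, 'fred'),           "\n";   # 0
print ungrouped($what, 'fred'),              "\n";   # 1, which is wrong
```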

How can I count the amount of spaces at the start of a string in Perl?

How can I count the amount of spaces at the start of a string in Perl?
I now have:
$temp = rtrim($line[0]);
$count = ($temp =~ tr/^ //);
But that gives me the count of all spaces.
$str =~ /^(\s*)/;
my $count = length( $1 );
If you just want actual spaces (instead of whitespace), then that would be:
$str =~ /^( *)/;
Edit: The reason tr doesn't work is that it's not a regular expression operator. What you're doing with $count = ( $temp =~ tr/^ // ); is replacing every ^ and every space with itself (see comment below by cjm), then counting how many replacements were made. tr doesn't treat ^ as the beginning-of-string anchor; it treats it as a literal ^ character.
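A small sketch of the difference, using two hypothetical helper names and a made-up sample string:

```perl
use strict;
use warnings;

# Hypothetical helper: count only the spaces at the start of the string.
sub leading_spaces {
    my ($s) = @_;
    my ($lead) = $s =~ /^( *)/;   # always matches, possibly with an empty capture
    return length $lead;
}

# Hypothetical helper: what tr/^ // actually counts, namely every
# caret and every space anywhere in the string.
sub tr_count {
    my ($s) = @_;
    return ($s =~ tr/^ //);
}

my $line = "  a ^b c";
print leading_spaces($line), "\n";   # 2
print tr_count($line),       "\n";   # 5 (four spaces plus one caret)
```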
You can get the offset of a match using @-. If you search for a non-whitespace character, this will be the number of whitespace characters at the start of the string:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
my $count = $s =~ /\S/ ? $-[0] : length $s;
print "'$s' has $count whitespace characters at its start\n";
}
Or, even better, use @+ to find the end of the whitespace:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
$s =~ /^\s*/;
print "$+[0] '$s'\n";
}
Here's a script that does this for every line of stdin. The relevant snippet of code is the first in the body of the loop.
#!/usr/bin/perl
while ($x = <>) {
$s = length(($x =~ m/^( +)/)[0]);
print $s, ":", $x, "\n";
}
tr/// is not a regex operator. However, you can use s///:
use strict; use warnings;
my $t = (my $s = " \t\n sdklsdjfkl");
my $n = 0;
++$n while $s =~ s{^\s}{};
print "$n \\s characters were removed from \$s\n";
$n = ( $t =~ s{^(\s*)}{} ) && length $1;
print "$n \\s characters were removed from \$t\n";
Since the regexp matcher returns the parenthesized matches when called in list context, CanSpice's answer can be written in a single statement:
$count = length( ($line[0] =~ /^( *)/)[0] );
This prints the amount of leading whitespace:
echo "   hello" |perl -lane 's/^(\s+)(.*)+$/length($1)/e; print'
3

Perl regex: how to know number of matches

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (@{$regexs_ref}) {
LINE: for (@rawfile) {
/@$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?
If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my @list = /$my_regex/g;
if (@list)
{
# do stuff
print "Number of matches: " . scalar @list . "\n";
}
You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my @regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } @regexes;
while (my $line = <DATA>) {
for my $regex (@regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (@regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz
In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special @- array (emphasis added):
@LAST_MATCH_START
@-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my @patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my @rawfile = <DATA>);
foreach my $pattern (@patterns) {
LINE: for (@rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture
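The substr equivalences quoted above can be checked directly. In this sketch, rebuild_from_offsets is a hypothetical helper that reconstructs the whole match and each capture purely from the offset arrays:

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild $&, $1, $2, ... from @- and @+ after
# matching $re against $var. Returns an empty list if the match fails.
sub rebuild_from_offsets {
    my ($var, $re) = @_;
    return unless $var =~ $re;
    my @parts;
    # Index 0 is the whole match; 1 .. $#+ are the capture groups.
    for my $n (0 .. $#+) {
        push @parts, defined $-[$n]
            ? substr($var, $-[$n], $+[$n] - $-[$n])
            : undef;   # group did not take part in the match
    }
    return @parts;
}

my @parts = rebuild_from_offsets("xxfoobarbazyy", qr/(foo(bar(baz)))/);
print join(", ", @parts), "\n";   # foobarbaz, foobarbaz, barbaz, baz
```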
How about below code:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.
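The () = in the middle forces the match into list context, and assigning that list assignment to a scalar yields the number of elements the match returned. A minimal sketch wrapping the idiom in a hypothetical count_matches helper:

```perl
use strict;
use warnings;

# Hypothetical helper: count how many items a global match returns.
sub count_matches {
    my ($text, $re) = @_;
    # The empty list forces list context; the scalar assignment then
    # receives the number of elements on the right-hand side.
    my $count = () = $text =~ /$re/g;
    return $count;
}

print count_matches("12345yx67hjui89", qr/\d/), "\n";        # 9
print count_matches("abc", qr/\d/), "\n";                    # 0
# With capture groups the list holds the captures, not the matches:
# two matches with two groups each give four elements.
print count_matches("1234", qr/(\d)(\d)/), "\n";             # 4
```

When the pattern has more than one capture group, the result is the number of captured substrings rather than the number of matches.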