Perl regex: how to know number of matches - regex

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (@{$regexs_ref}) {
LINE: for (@rawfile) {
/@$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?

If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my @list = /$my_regex/g;
if (@list)
{
# do stuff
print "Number of matches: " . scalar(@list) . "\n";
}
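One caveat with the list-context approach: if the pattern contains capture groups, the list holds every capture of every match, so its size is the number of matches times the number of groups. A minimal sketch of the difference, assuming you want the number of matches rather than captures (count_matches and the sample text are illustrative, not from the question):

```perl
use strict;
use warnings;

# Hypothetical helper: number of times $re matches in $text.
# The loop body runs once per match, however many capture groups $re has.
sub count_matches {
    my ($text, $re) = @_;
    my $count = 0;
    $count++ while $text =~ /$re/g;
    return $count;
}

my $text = "a=1 b=2 c=3";

# With two capture groups, list context returns every capture of every match.
my @captures = $text =~ /(\w)=(\d)/g;

print "list elements: ", scalar(@captures), "\n";
print "matches: ", count_matches($text, qr/(\w)=(\d)/), "\n";
```

Here the list has six elements while the pattern matched three times. For a pattern without capture groups the two numbers agree.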

You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my @regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } @regexes;
while (my $line = <DATA>) {
for my $regex (@regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (@regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz

In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special @- array (emphasis added):
@LAST_MATCH_START
@-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my @patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my @rawfile = <DATA>);
foreach my $pattern (@patterns) {
LINE: for (@rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture

How about below code:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.

Related

How can I determine the position a pattern matched in a string?

I want to match a pattern with this format /vX.X.X/ where X a number. For example: /v1.1.1/ and /v1.0.300/. After matching the pattern, how can I get the position in the string where I found the pattern?
@- contains the offsets at which the match and captures were found.
$-[0] is the offset at which the pattern matched.
$-[1] is the offset at which the first capture matched.
$-[2] is the offset at which the second capture matched.
etc.
As such, you can use the following:
if ( $s =~ m{/v\d+\.\d+\.\d+/}a ) {
say "Matched at position $-[0]";
}
The approach you take depends on what you are trying to accomplish and the rest of the stuff around the problem. Since you haven't said anything about this, here's a shotgun blast of different ideas. Not all of them may be appropriate for what you are doing.
The @- special variable has the offsets for the starting position of
the match groups. The first element is the start of the entire match,
the second element (index 1) is the start of the $1 match, and so
on. If your pattern is the entire string you want, then you can use
the first element in that array:
if( $string =~ /\bv\d+\.\d+\.\d+\b/ ) {
my $position = $-[0];
say "Position is $position";
}
If you have other stuff around your pattern and the stuff you want is
in the first match group, you can use the second element (remember
that match groups are numbered by the order of the opening parens):
if( $string =~ /before (v\d+\.\d+\.\d+) after/ ) {
my $position = $-[1];
say "Position is $position";
}
When your pattern changes, you may need to update which element you
use.
There's also @+ that works the same but has the ending position. I
have a bunch of examples of this in the first edition of Mastering
Perl. I save it for that book because
I find that many people get confused on which element corresponds to
which part of the pattern. Consider if you'll remember this later.
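To make the pairing of the two arrays concrete, here is a small sketch that reads the start offset from @- and the end offset from @+ for the first capture group. The helper name capture_span and the sample string are assumptions for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: (start, end) offsets of the first capture group,
# or an empty list if the pattern does not match.
# Assumes $re contains at least one capture group.
sub capture_span {
    my ($string, $re) = @_;
    return unless $string =~ $re;
    # Copy the offsets right away, before any later match can change them.
    my ($start, $end) = ($-[1], $+[1]);
    return ($start, $end);
}

my $string = "tag before v1.0.300 after";
my ($start, $end) = capture_span($string, qr/before (v\d+\.\d+\.\d+) after/);

# $end is the offset just past the capture, so the length is $end - $start.
print "start=$start end=$end text=", substr($string, $start, $end - $start), "\n";
```

With this input the capture starts at offset 11 and ends just before offset 19.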
You can use index to get the position of the matched string:
if( $string =~ /\b(v\d+\.\d+\.\d+)\b/ ) {
my $matched = $1;
my $position = index( $string, $matched );
say "Position is $position";
}
Using the /p flag and ${^PREMATCH} variable from Perl v5.10, count
the positions before the matched part of the string:
use v5.10;
if( $string =~ /\bv\d+\.\d+\.\d+\b/p ) {
my $position = length ${^PREMATCH};
say "Position is $position";
}
Use the /g flag in scalar context and Perl remembers the string
position where the match ended. Subtract the match length to see
where the match started:
if( $string =~ /\b(v\d+\.\d+\.\d+)\b/g ) {
my $matched = $1;
my $position = pos( $string ) - length($1);
say "Position is $position";
}
If there can be multiple matches per string, you'll have to adjust
these. One way uses a while loop, since its condition is still scalar
context:
while( $string =~ /\b(v\d+\.\d+\.\d+)\b/g ) {
my $matched = $1;
my $position = pos( $string ) - length($1);
say "Position is $position";
}
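The same loop can also read the start offset straight from @-, which avoids the capture group and the subtraction. A sketch under that assumption, with match_positions as a made-up helper name and the sample string taken from the question's examples:

```perl
use strict;
use warnings;

# Hypothetical helper: start offsets of every match of $re in $string.
sub match_positions {
    my ($string, $re) = @_;
    my @positions;
    while ($string =~ /$re/g) {
        push @positions, $-[0];   # start of the whole match, no capture needed
    }
    return @positions;
}

my @found = match_positions("/v1.1.1/ and /v1.0.300/", qr{/v\d+\.\d+\.\d+/});
print "Positions: @found\n";
```

For this string the two matches start at offsets 0 and 13.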

Use of uninitialized value $1 in RegEx [duplicate]

Working from an example found elsewhere on stackoverflow.com, I am attempting to replace only the Nth instance of a regex match in a string in Perl. My code is as follows:
#!/usr/bin/env perl
use strict;
use warnings;
my $num_args = $#ARGV +1;
if($num_args != 3) {
print "\nUsage: replace_integer.pl occurance replacement to_replace";
print "\nE.g. `./replace_integer.pl 1 \"INTEGER_PLACEHOLDER\" \"method(0 , 1, 6);\"`";
print "\nWould output: \"method(INTEGER_PLACEMENT , 1, 6);\"\n";
exit;
}
my $string =$ARGV[2];
my $cont =0;
sub replacen {
my ($index,$original,$replacement) = @_;
$cont++;
return $cont == $index ? $replacement: $original;
}
sub replace_quoted {
my ($string, $index,$replacement) = @_;
$cont = 0; # initialize match counter
$string =~ s/[0-9]+/replacen($index,$1,$replacement)/eg;
return $string;
}
my $result = replace_quoted ( $string, $ARGV[0] ,$ARGV[1]);
print "RESULT: $result\n";
For
./replace_integer.pl 3 "INTEGER_PLACEHOLDER" "method(0, 1 ,6);"
I'd expect an output of
RESULT: method(0, 1 ,INTEGER_PLACEHOLDER);
Unfortunately I get an output of
RESULT: method(, ,INTEGER_PLACEHOLDER);
With these warnings/errors
Use of uninitialized value in substitution iterator at ./replace_integer.pl line 26.
Use of uninitialized value in substitution iterator at ./replace_integer.pl line 26.
Line 26 is the following line:
$string =~ s/[0-9]+/replacen($index,$1,$replacement)/eg;
I have determined this is due to $1 being uninitialised. To my understanding $1 should have the value of the last match. Given my very simple regex ([0-9]+) I see no reason why it should be uninitialised.
I am aware there are easier ways to find and replace the Nth instance in sed but I will require Perl's back and forward references once this hurdle is overcome (not supported by sed)
Does anyone know the cause of this error and how to fix it?
I am using Perl v5.18.2 , built for x86_64-linux-gnu-thread-multi
Thank you for your time.
$1 is only set after you capture a pattern, for example:
$foo =~ /([0-9]+)/;
# $1 equals whatever was matched between the parens above
Try wrapping your matching in parens to capture it to $1
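Applied to the code in the question, that suggestion might look like the sketch below. It keeps the s///e approach but folds the counter into a single sub; replace_nth is a made-up name, not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical rewrite of the question's replace_quoted/replacen pair.
sub replace_nth {
    my ($string, $n, $replacement) = @_;
    my $count = 0;
    # The parentheses set $1 for each match; /e evaluates the
    # replacement side as Perl code, /g visits every match.
    $string =~ s/([0-9]+)/++$count == $n ? $replacement : $1/eg;
    return $string;
}

print "RESULT: ", replace_nth('method(0, 1 ,6);', 3, 'INTEGER_PLACEHOLDER'), "\n";
```

Every match other than the Nth is replaced with its own text in $1, so only the Nth number changes.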
I would write it like this
The while loop iterates over occurrences of the \d+ pattern in the string. When the Nth occurrence is found, the last match is replaced using a call to substr with the values in the built-in arrays @- (the start of the last match) and @+ (the end of the last match).
#!/usr/bin/env perl
use strict;
use warnings;
@ARGV = ( 3, 'INTEGER_PLACEHOLDER', 'method(0, 1, 6);' );
if ( @ARGV != 3 ) {
print qq{\nUsage: replace_integer.pl occurrence replacement to_replace};
print qq{\nE.g. `./replace_integer.pl 1 "INTEGER_PLACEHOLDER" "method(0 , 1, 6);"`};
print qq{\nWould output: "method(INTEGER_PLACEMENT , 1, 6);"\n};
exit;
}
my ( $occurrence, $replacement, $string ) = @ARGV;
my $n;
while ( $string =~ /\d+/g ) {
next unless ++$n == $occurrence;
substr $string, $-[0], $+[0]-$-[0], $replacement;
last;
}
print "RESULT: $string\n";
output
$ replace_integer.pl 3 INTEGER_PLACEHOLDER 'method(0, 1, 6);'
RESULT: method(0, 1, INTEGER_PLACEHOLDER);
$ replace_integer.pl 2 INTEGER_PLACEHOLDER 'method(0, 1, 6);'
RESULT: method(0, INTEGER_PLACEHOLDER, 6);
$ replace_integer.pl 1 INTEGER_PLACEHOLDER 'method(0, 1, 6);'
RESULT: method(INTEGER_PLACEHOLDER, 1, 6);

What are the roles of ${$exp}, $-[$exp], and $+[$exp] in this Perl example which extracts the locations where a regular expression matches?

What is the meaning of $#- and $-[$exp]
$x = "Mmm...donut, thought Homer";
$x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
foreach $exp (1..$#-)
{
print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n";
}
OUTPUT:
Match 1: 'Mmm' at position (0,3)
Match 2: 'donut' at position (6,11)
In Perl, $#ary is the index of the last element of the array @ary. Therefore, $#- is the index of the last element of the array @-:
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Therefore, 1 .. $#- is a range of indices into @-.
${$exp} is a symbolic dereference. If $exp is one, you get the text of $1. I would not use this in real code. Instead, use @+ to extract the substring.
Also, while you know this example is going to match, in real life, never use the digit variables without ensuring the previous match succeeded.
#!/usr/bin/env perl
use strict;
use warnings;
my $x = "Mmm...donut, thought Homer";
if ( $x =~ /^(Mmm|Yech) [.]{3} (donut|peas)/x ) {
for my $i (1 .. $#-) {
my ($s, $e) = ($-[$i], $+[$i]);
printf(
"Match %d: '%s' at position [%d,%d)\n",
$i, substr($x, $s, $e - $s), $s, $e
);
}
}
Output:
Match 1: 'Mmm' at position [0,3)
Match 2: 'donut' at position [6,11)
Note that:
$+[1] is the offset past where $1 ends, $+[2] the offset past where $2 ends, and so on. You can use $#+ to determine how many subgroups were in the last successful match.
Hence, the abuse of the half-closed interval notation above.
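The difference between $#- and $#+ only shows up when a trailing group does not take part in the match. A small sketch of that case, where the helper group_counts and the alternation pattern are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($#-, $#+) after a successful match,
# or an empty list when the match fails.
sub group_counts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    # $#- is the last group that matched; $#+ is the number of groups in the pattern.
    my ($last_matched, $total_groups) = ($#-, $#+);
    return ($last_matched, $total_groups);
}

# The second group, (Yech), never takes part in this match.
my ($last_matched, $total_groups) = group_counts("Mmm...donut", qr/^(Mmm)|(Yech)/);
print "last matched group: $last_matched, groups in pattern: $total_groups\n";
```

Here $#- should be 1 and $#+ should be 2, which is why a loop over 1 .. $#- skips the unmatched trailing group.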

How to refer to matched part in regex

I am using the following code to search for a substring and print it out with a few characters before and after it. Somehow Perl takes issue with me using $1 and complains about
Use of uninitialized value $1 in concatenation (.) or string.
I cannot figure out why...can you?
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
You have to capture $word into regex group $1 by using parentheses,
while ($lines =~ m/\b($word)\b/g)
When you use $1, you are asking the code to use the first captured group from the regex and since your regex doesn't have any, well, that variable won't exist.
You can either refer to the whole match with $& or you add a capture group to your regex and keep using $1.
i.e. Either:
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$&";
print substr ($lines, max(pos($lines)-length($&)-$context, 0), length($&)+$context); # check: am I possibly violating any boundaries here
}
Or
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/(\b$word\b)/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
Note: It doesn't matter whether you use (\b$word\b) or (\b$word)\b or \b($word\b) or \b($word)\b here because \b is a 'string' of 0 length.
When you want to address a matched part of a regex, put it in parentheses. Then you'll be able to address this matched part via the $1 variable (for the first pair of parentheses), $2 (for the second pair), and so on.
The values $1, $2 and so on hold the strings found by capture groups. When a match is performed all of these variables are set to undef. The code in the question does not have any capture groups and hence $1 is never given a value, it is undefined.
Running the code below shows the effect. Initially $1, $2 and $3 are not defined. The first match sets $1 and $2 but not $3. The second match sets only $1, but note that $2 is cleared to be undefined. The third match has no capture groups and all three are undefined.
use strict;
use warnings;
sub show
{
printf "\$1: %s\n", (defined $1 ? $1 : "-undef-");
printf "\$2: %s\n", (defined $2 ? $2 : "-undef-");
printf "\$3: %s\n", (defined $3 ? $3 : "-undef-");
print "\n";
}
my $text = "abcdefghij";
show();
$text =~ m/ab(cd)ef(gh)ij/; # First match
show();
$text =~ m/ab(cd)efghij/; # Second match
show();
$text =~ m/abcdefghij/; # Third match
show();
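All three matches above succeed. A failed match behaves differently: it leaves $1 and friends untouched, so an unchecked $1 can hold a value from an earlier match. A minimal sketch of that pitfall and one way around it (first_capture is a hypothetical helper that assumes the pattern has at least one capture group):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture, or undef when the match fails.
sub first_capture {
    my ($text, $re) = @_;
    my ($capture) = $text =~ $re;   # list assignment yields nothing on failure
    return $capture;
}

my $text = "donut";
$text =~ /(donut)/;                 # succeeds: $1 is "donut"
$text = "peas";
$text =~ /(donut)/;                 # fails: $1 keeps its old value
print "unchecked \$1: $1\n";        # still "donut"

my $safe = first_capture($text, qr/(donut)/);
print "checked: ", (defined $safe ? $safe : "-undef-"), "\n";
```

Assigning the match result in list context, or testing the match in an if, avoids reading a stale capture.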
$1 will have no value unless you are actually capturing something.
You can adjust your boundary collection method to using lookahead and lookbehinds.
use strict;
use warnings;
my $lines = "this is just a test to find something out";
my $word = "test";
my $extra = 10;
while ($lines =~ m/(?:(?<=(.{$extra}))|(.{0,$extra}))\b(\Q$word\E)\b(?=(.{0,$extra}))/gs ) {
my $pre = $1 // $2;
my $word = $3;
my $post = $4;
print "'...$pre<$word>$post...'\n";
}
Outputs:
'...is just a <test> to find s...'

How to find the largest repeating string with overlap in a line

I have a series of lines such as
my $string = "home test results results-apr-25 results-apr-251.csv";
@str = $string =~ /(\w+)\1+/i;
print "@str";
How do I find the largest repeating string with overlap which are separated by whitespace?
In this case I'm looking for the output :
results-apr-25
It looks like you need String::LCSS_XS, which calculates longest common substrings. Don't try its Perl-only twin brother String::LCSS, because there are bugs in that one.
use strict;
use warnings;
use String::LCSS_XS;
*lcss = \&String::LCSS_XS::lcss; # Manual import of `lcss`
my $var = 'home test results results-apr-25 results-apr-251.csv';
my @words = split ' ', $var;
my $longest;
my ($first, $second);
for my $i (0 .. $#words) {
for my $j ($i + 1 .. $#words) {
my $lcss = lcss(@words[$i,$j]);
unless ($longest and length $lcss <= length $longest) {
$longest = $lcss;
($first, $second) = @words[$i,$j];
}
}
}
printf qq{Longest common substring is "%s" between "%s" and "%s"\n}, $longest, $first, $second;
output
Longest common substring is "results-apr-25" between "results-apr-25" and "results-apr-251.csv"
use List::Util qw(max); # max() is not a Perl built-in
my $var = "home test results results-apr-25 results-apr-251.csv";
my @str = split " ", $var;
my %h;
my $last = pop @str;
while (my $curr = pop @str ) {
if(($curr =~/^$last/) || $last=~/^$curr/) {
$h{length($curr)}= $curr ;
}
$last = $curr;
}
my $max_key = max(keys %h);
print $h{$max_key},"\n";
If you want to make it without a loop, you will need the /g regex modifier.
This will get you all the repeating string:
my @str = $string =~ /(\S+)(?=\s\1)/ig;
I have replaced \w with \S (in your example, \w doesn't match -), and used a look-ahead: (?=\s\1) means match something that is before \s\1, without matching \s\1 itself. This is required to make sure that the next match attempt starts after the first string, not after the second.
Then, it is simply a matter of extracting the longest string from @str:
my $longest = (sort { length $b <=> length $a } @str)[0];
(Note that this is legible but far from the most efficient way of finding the longest value; that is the subject of a different question.)
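For completeness, a single-pass alternative to sorting can be sketched with reduce from the core List::Util module. The helper name longest is an assumption, and ties go to the earliest string:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical helper: longest string in a list, found in one pass.
# Returns undef for an empty list.
sub longest {
    my @strings = @_;
    return reduce { length($b) > length($a) ? $b : $a } @strings;
}

my $string = "home test results results-apr-25 results-apr-251.csv";
my @str = $string =~ /(\S+)(?=\s\1)/ig;
print longest(@str), "\n";
```

This walks the list once instead of sorting it, which only matters when the list of repeats is large.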
How about:
use feature 'say'; # needed for say() below
my $var = "home test results results-apr-25 results-apr-251.csv";
my $l = length $var;
for (my $i=int($l/2); $i; $i--) {
if ($var =~ /(\S{$i}).*\1/) {
say "found: $1";
last;
}
}
output:
found: results-apr-25