How can I use Perl to validate this data containing balanced text? - regex

I have a text file filled with sentences with unique pattern. The unique pattern is:
NAME [ e_NAME ]
simple rule: the "NAME" must follow after "e_" if the "e_" appearers inside the brackets!
The problem comes out when the string is complicated. I'll show the end point situations that may be hard to analyse:
Lines that won't match the rule:
(1) NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1]
(2) NAME1[blabla] + NAME2[e_BAD2]
(3) NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3]
(4) NAME1[e_NAME1BAD1] -> means it has to be only NAME1
Lines that match the rule:
(1) FOO1[blabla + 1]
(2) [blalbla] + bla
(3) bla + blabla
(4) FOO1[ccc + ddd + FOO2[e_FOO2]] = 123
(5) FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3]
I already asked this question but I couldn't catch this end points...

Edited after requirements were clarified
Either Text::Balanced or Regexp::Common might be useful. I initially posted an answer using the former but didn't like it very much. The following example uses Regexp::Common and seems fairly straightforward.
use strict;
use warnings;
use Regexp::Common;
my $PRE = '[^[]*?';
my $VAR = '\w+';
my $BRACK = $RE{balanced}{-parens=>'[]'};
my $POST = '.*';
while (<DATA>){
my ($bad, $full);
# Brackets, if any, must balance
$bad = 1 unless s/\[/[/g == s/\]/]/g;
$full = $_;
until ($bad){
# Find some bracketed text and store all components.
my ($pre, $var, $brack, $post) =
$full =~ /^($PRE)($VAR)($BRACK)($POST)$/;
last unless defined $brack;
# Create a copy of the bracketed text, removing both the outer
# brackets and all instances of inner-bracketed text.
chop (my $clean = substr $brack, 1);
$clean =~ s/$BRACK/ /g;
# If e_FOO exists, FOO must equal $var.
$bad = 1 if $clean =~ /e_(\w+)/ and $1 ne $var;
# Remove the part of $full we've already checked.
substr($full, 0, length($pre) + length($var) + 1, '');
}
print if $bad;
}
# Your test data, with some trailing comments.
__DATA__
NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1] NOT OK 1
NAME1[blabla] + NAME2[e_BAD2] NOT OK 2
NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3] NOT OK 3
NAME1[e_NAME1BAD1] NOT OK 4
FOO1[blabla + 1] OK 1
[blalbla] + bla OK 2
bla + blabla OK 3
FOO1[ccc + ddd + FOO2[e_FOO2]] = 123 OK 4
FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3] OK 5

Maybe you are looking for something like:
if ($string =~ /(\w+)\[e\\_(\w+)/ && $1 eq $2) {
print "Pattern '$1' contained in string '$string'\n";
}

Based on the accepted answer to your first question, I came up with this:
use strict;
use warnings;
while (<DATA>) {
my $l = $_;
while (s/(\w+)\[([^\[\]]*)\]//) {
my ($n, $chk) = ($1, $2);
unless ($chk =~ /\be_$n\b/) {
warn "Bad line: $l";
last;
}
}
}
The \b checks for a word boundary. This version still doesn't check for unbalanced brackets, but it does seem to catch all the examples you gave, and will also complain when the e_NAME1 is inside another nested block, like so:
NAME1[stuff + NAME2[e_NAME1 + e_NAME2] + morestuff]

use Text::Balanced;
CPAN is wonderful.

Related

perl multiline regex to separate comments within paragraphs

The script below works, but it requires a kludge. By "kludge" I mean a line of code which makes the script do what I want --- but I do not understand why the line is necessary. Evidently, I do not understand exactly what the multiline regex substitution, ending /mg, is doing.
Is there not a more elegant way to accomplish the task?
The script reads through a file by paragraphs. It partitions each paragraph into two subsets: $text and $cmnt. The $text includes the left part of every line, i.e., from the first column up to the first %, if it exists, or to end of the line if it doesn't. The $cmnt includes the rest.
Motivation: The files to be read are LaTeX markup, where % announces the beginning of a comment. We could change the value of $breaker to equal # if we were reading through a perl script. After separating $text from $cmnt, one could perform a match across lines such as
print "match" if ($text =~ /WOLF\s*DOG/s);
Please see the line labeled "kludge."
Without that line, something funny happens after the last % in a record. If there are lines of $text
(material not commented out by %) after the last commented line of the record, those lines are included both at the end of $cmnt and in $text.
In the example below, this means that without the kludge, in record 2, "cat lion" is included both in the $text, where it belongs, and also in $cmnt.
(The kludge causes an unnecessary % to appear at the end of every non-void $cmnt. This is because the kludge-pasted-on % announces a final, fictitious empty comment line.)
According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
Therefore, I would expect the 2nd match in
s/^([^$breaker]*)($breaker.*?)$/$2/mg
to start with the first %, to extend as far of end-of-line, and stop there. So even without the kludge, it should not include the "cat lion" in record 2? But obviously it does, so I am misreading, or missing, some part of the documentation. I suspect it has to do with the /g regex modifier?
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/$breaker/; # kludge
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy
{
$cmnt = $_;
die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy
}
else
{
$cmnt = '';
}
print "\nRECORD $count_record:\n";
print "******** text==";
print "\n|";
print $text;
print "|\n";
print "******** cmnt==|";
print $cmnt;
print "|\n";
}
Example file to run it on:
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA
% what was that?
cat lion
no comments in this line
%The last paragraph of this file is nothing but a single-line comment.
You must also delete the lines that does not contain a comment from $cmnt:
use feature qw(say);
use strict;
use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy
{
$cmnt = $_;
$cmnt =~ s/^[^$breaker]*?$//mg;
die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy
}
else
{
$cmnt = '';
}
print "\nRECORD $count_record:\n";
print "******** text==";
print "\n|";
print $text;
print "|\n";
print "******** cmnt==|";
print $cmnt;
print "|\n";
}
Output:
RECORD 1:
******** text==
|dog wolf
DOG WOLF
DOG WOLLLLLLF
|
******** cmnt==|% flea
% FLEA
% FLLLLLLEA
|
RECORD 2:
******** text==
|
cat lion
|
******** cmnt==|% what was that?
|
RECORD 3:
******** text==
|no comments in this line
|
******** cmnt==||
RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
My main source of confusion was a failure to distinguish between
whether or not an entire record matches -- here, a record is potentially a multi-line paragraph, and
whether or not a line inside a record matches.
The following script incorporates insights from both answers that others offered, and includes extensive explanation.
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<DATA>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
print "RECORD $count_record:";
print "\n|"; print $_; print "|\n";
# https://perldoc.perl.org/perlre.html#Modifiers
# the following regex:
# ^ /m: ^==start of line, not of record
# ([^$breaker]*) zero or more characters that are not $breaker
# ($breaker.*?) non-greedy: the first instance of $breaker, followed by everything after $breaker
# $ /m: $==end of line, not of record
# /g: "globally match the pattern repeatedly in the string"
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)
{
$cmnt = $_;
# In at least one line of this record, the pattern above has matched.
# But this does not mean every line matches. There may be any number of
# lines inside the record that do not match /$breaker/; for these lines,
# in spite of /g, there will be no match, and thus the exclusion of $1 and printing only of $2,
# in the substitution below, will not take place. Thus, those particular lines must be deleted from $cmnt.
# Thus:
$cmnt =~ s/^[^$breaker]*?$/\n/mg; # remove entire line if it does not match /$breaker/
# recall that /m guarantees that ^ and $ match the start and end of the line, not of the record.
die "code error: cmnt does not match this record" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);
if ( $text =~ /\S/ )
{
print "|text|==\n|$text|\n";
}
else
{
print "NO text found\n";
}
print "|cmnt|==\n|$cmnt|\n";
}
else
{
print "NO comment found\n";
}
}
__DATA__
one dogs% one comment %d**n lies %statistics
two %two comment
thuh-ree
fower
fi-yiv % (he means 5)
SIX 66 % ¿666==antichrist?
seven % the seventh seal, the seven days
ate
niner
ten
As Douglass said to Lincoln ...
%Darryl Pinckney
Regular expression modifier mg assumes that a string it applied to includes multiple lines (includes \n in the string). It instructs regular expression to look through all lines in the string.
Please study following code which should simplify solution to your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $breaker = '%';
my #records = do { local $/ = ''; <DATA> };
for( #records ) {
my %hash = ( /(.*?)$breaker(.*)/mg );
next unless %hash;
say Dumper(\%hash);
}
__DATA__
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA
% what was that?
cat lion
no comments in this line
%The last paragraph of this file is nothing but a single-line comment.
Output
$VAR1 = {
'DOG WOLF ' => ' FLEA ',
'dog wolf ' => ' flea ',
'DOG WOLLLLLLF ' => ' FLLLLLLEA '
};
$VAR1 = {
'' => ' what was that?'
};
$VAR1 = {
'' => 'The last paragraph of this file is nothing but a single-line comment.'
};

Matching all characters in a string except one in any position

How to match (preferably in perl) all strings that match the query string except one character?
Query: TLAQLLLDK
Want to match: xLAQLLLDK, TxAQLLLDK, TLxQLLLDK, etc.
Where 'x' is any capital letter '[A-Z]'.
Use alternation operator.
^(?:[A-Z]LAQLLLDK|T[A-Z]AQLLLDK|TL[A-Z]QLLLDK|.....)$
Likewise fill all..
You can do that by writing a terrible regular expression, which will be horribly slow to build and even slower to execute, or you can just don't use regexes for things like these and write a function that just compares both strings character after character, allows for one "mistake" and returns True only if there was exactly one mistake.
How to match (preferably in perl) all strings that match the query string except one character?
Expanding the answer of #Avinash, by generating the required regular expression dynamically at run time:
my $query = 'TLAQLLLDK';
my $re_proto = '(' . join( '|', map { (my$x=$query)=~s/^(.{$_})./$1\[A-Za-z]/; $x; } (0 .. length($query)-1) ) . ')';
my $re = qr/^$re_proto$/;
my #input = qw(xLAQLLLDK TxAQLLLDK TLxQLLLDK);
my #matches = grep { /$re/ } #input;
print "#matches\n";
(I had to include the [a-z] too, since your example input uses the x as the marker.)
If you need to do that very often, I would advise to cache the generated regular expressions.
Is this what you are looking for?
#!/usr/bin/perl
use strict;
my #str = ("ULAQLLLDK","TAAQLLLDK","TLCQLLLDK","TLAQLLLDK");
while(<#str>){
if (/[A-S,U-Z]LAQLLLDK|T[A-K,M-Z]AQLLLDK|TL[B-Z]QLLLDK/ ){
print "$_\n";
}
}
output:
ULAQLLLDK
TAAQLLLDK
TLCQLLLDK
There are only 9 x 25 = 225 such strings, so you may as well generate them all and put them in a hash for comparison
use strict;
use warnings;
use 5.010;
my %matches;
my $s = 'TLAQLLLDK';
for my $i (0 .. length($s) - 1) {
my $c = substr $s, $i, 1;
for my $cc ('A' .. 'Z') {
substr(my $ss = $s, $i, 1) = $cc;
++$matches{$ss} unless $cc eq $c;
}
}
printf "%d matches found\n", scalar keys %matches;
for ( qw/ TLAQLLLDK TLAQLXLDK / ) {
printf "\$matches{%s} = %s\n", $_, $matches{$_} // 'undef';
}
output
225 matches found
$matches{TLAQLLLDK} = undef
$matches{TLAQLXLDK} = 1

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedly. It may find five times, 10 times, etc. I need this number. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use #LAST_MATCH_END and #LAST_MATCH_START
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

When there is a similar pattern in an expression, how to extract the occurence of the last instance in perl?

The value of $s is dynamic. I need to extract the values that occur after the last | in between each [].
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = split(/\]/, $s);
foreach my $part (#parts)
{
# Need to extract the values that occur after the last '|'
# (for example: !, .1iit, 10:48AM, Calculator, Coffee)
# and store each of the values separately in a hash
}
Could someone help me out in this?
Thanks,
Best to transform the string into a more useful data structure, then take the needed elements. Why is this best? Because right now you need the last element, but perhaps next time you will need some other part. Since its not harder to do it right, why not?
#!/usr/bin/perl
use strict;
use warnings;
# Only needed for Dumper
use Data::Dumper;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
# Extract each group between []
# Then transform each group into an array reference by splitting on |
my #groups = map { [ split /\|/ ] } ($s =~ /\[([^\]]*)\]/g);
# Inspect the data structure
print Dumper \#groups;
# Print only the last element of each sub-array
print "$_\n" for map {$_->[-1]} #groups;
If needed the third elements of the sub-arrays could be transformed into hashrefs quite easily too. ,however since that wasn't needed, I leave that as an exercise for the reader (I always love saying that when I get the chance!).
Edit: since I found it interesting I ended up creating these hashrefs, here is the code that would replace the my #groups line:
my #groups = map { [ map { /\{([^\}]*)\}/ ? { split /(?:=|,)/, $1 } : $_ } (split /\|/) ] } ($s =~ /\[([^\]]*)\]/g);
or more properly commented (map commands are read from the back, so the comments start at the bottom and follow by number, comments like #/N pair with those like #N)
my #groups = map { #/1
[ #/2
map { #/3
/\{([^\}]*)\}/ #4 ... and if any element (separated by pipes in #3)
# is surrounded by curly braces
? { #5 ... then return a hash ref
split /(?:=|,)/, $1 #6 ... whose elements are given
# pairwise between '=' or ',' signs
} #/5
: $_ #7 ... otherwise (from 'if' in #4 ) return the element as is
} (split /\|/) #3 ... where each element is separated by pipes (i.e. |)
] #2 ... return an array ref
} ($s =~ /\[([^\]]*)\]/g); #1 For each element between sqr braces (i.e. [])
The generic way:
#subparts = split /\|/, $part;
$tail = $subparts[$#subparts];
If you only ever need the last part separately:
$part =~ /([^\|]*)$/ and $tail = $1;
my ($value) = $part =~ m/[^|]\|(.+)$/;
print "$part => $value\n";
and another way:
my $s =
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = $s =~ m/\|([^|]+)]/g;
print join( "\n", #parts );
Since you insist on a regex:
#matches = $s =~ /\|([^|]+?)]/g
Using /g will dump all matches into the array #matches
You really don't need a regex... just use split(). The results are stored in %results
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
foreach my $part (split(/\]/, $s))
{
#pieces = split(/\|/, $part);
$results{$pieces[-1]} = $pieces[-1];
}
With regexes, when you think “I want the last of,” you should immediately think of the pattern .* because regex greed does just what you want.
For example, matching /^(.*)a(.*)$/ chops up "abababab" into
ababab in $1
a matched by the literal in the pattern
b in $2
Let's think through the process of the match. Imagine .* as Augustus Gloop.
Augustus: Ausgezeichnet! The ^ anchor means I get to start at the beginning. From there, I shall eat all the candies!
Willie Wonka: But, my dear Augustus, you must share with the other children.
Augustus: Fine, I get "abababa" and they get "b". Happy?
Willie Wonka: But the next child in line doesn't like b candies.
Augustus: Then I shall keep "ababab" for myself and leave "ab" for the others.
At this point, Augustus has his big pile, humble little Charlie Bucket gets his single a, and Veruca Salt—although scowling about the meager quantity—gets at least something now.
In other words, $2 contains everything after the last a. To be persnickety, the ^ and $ anchors are redundant, but I like keeping them for added emphasis.
Putting this into action, you could write
#! /usr/bin/env perl
use strict;
use warnings;
sub last_fields {
local($_) = #_;
my #last;
push #last, $1 =~ /^.*\|(.+)$/ ? $1 : undef
while /\[(.*?)\]/g;
wantarray ? #last : \#last;
}
The outer while breaks up the string into [...] chunks and assumes that right square-bracket cannot occur inside a chunk. Within each chunk, we use /^.*\|(.+)$/ to capture in $1 everything after the last pipe.
Testing it with your example looks like
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!]" .
"[0|0|{A=167,B=2,C=67,D=17}|.1iit]" .
"[196|0|{A=244,B=6,C=67,D=12}|10:48AM]" .
"[204|0|{A=9,B=201,C=61,D=11}|Calculator]" .
"[66|0|{A=145,B=450,C=49,D=14}|Coffee]";
use Test::More tests => 6;
my #lasts = last_fields $s;
# yes, is_deeply could do this in a single call,
# but it's laid out explicitly here for expository benefit
is $lasts[0], "!";
is $lasts[1], ".1iit";
is $lasts[2], "10:48AM";
is $lasts[3], "Calculator";
is $lasts[4], "Coffee";
is scalar #lasts, 5;
All the tests pass:
$ ./match-last-of
1..6
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
The output of prove is nicer. Run it yourself to see the color coding.
$ prove ./match-last-of
./match-last-of .. ok
All tests successful.
Files=1, Tests=6, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.05 CPU)
Result: PASS

Perl regex: how to know number of matches

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (#{$regexs_ref}) {
LINE: for (#rawfile) {
/#$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?
If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my #list = /$my_regex/g;
if (#list)
{
# do stuff
print "Number of matches: " . scalar #list . "\n";
}
You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my #regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } #regexes;
while (my $line = <DATA>) {
for my $regex (#regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (#regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz
In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special #- array (emphasis added):
#LAST_MATCH_START
#-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with #+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my #patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my #rawfile = <DATA>);
foreach my $pattern (#patterns) {
LINE: for (#rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture
How about below code:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.