Pattern matching in perl (Lookahead and Condition on word Index) - regex

I have a long string, containing alphabetic words and each delimited by one single character ";" . The whole string also starts and ends with a ";" .
How do I count the number of occurrences of a pattern (started with ";") if index of a success match is divisible by 5.
Example:
$String = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;"
$Pattern = ";the(?=;f)"
OUTPUT: 1
Since:
Note 1: In above case, the $Pattern ;the(?=;f) exists as the 1st and 10th words in the $String; however; the output result would be 1, since only the index of second match (10) is divisible by 5.
Note 2: Every word delimited by ";" counts toward the index set.
Index of the = 1 -> this does not match since 1 is not divisible by 5
Index of fox = 2
Index of jumped = 3
Index of over = 4
Index of the = 5 -> this does not match since the next word (dog) starts with "d" not "f"
Index of dog = 6
Index of the = 7 -> this does not match since 7 is not divisible by 5
Index of duck = 8
Index of and = 9
Index of the = 10 -> this does match since 10 is divisible by 5 and the next word (frog) starts with "f"
Index of frog = 11
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.

Use Backtracking control verbs to process the string 5 words at a time
One solution is to add a boundary condition that the pattern is preceded by 4 other words.
Then setup an alteration so that if your pattern is not matched, the 5th word is gobbled and then skipped using backtracking control verbs.
The following demonstrates:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;";
my $pattern = qr{;the(?=;f)};
my #matches = $string =~ m{
(?: ;[^;]* ){4} # Preceded by 4 words
(
$pattern # Match Pattern
|
;(*SKIP)(*FAIL) # Or consume 5th word and skip to next part of string.
)
}xg;
print "Number of Matches = " . #matches . "\n";
Outputs:
Number of Matches = 1
Live Demo
Supplemental Example using Numbers 1 through 100 in words
For additional testing, the following constructs a string of all numbers in word format from 1 to 100 using Lingua::EN::Numbers.
For the pattern it looks for a number that's a single word with the next number that begins with the letter S.
use Lingua::EN::Numbers qw(num2en);
my $string = ';' . join( ';', map { num2en($_) } ( 1 .. 100 ) ) . ';';
my $pattern = qr{;\w+(?=;s)};
my #matches = $string =~ m{(?:;[^;]*){4}($pattern|;(*SKIP)(*FAIL))}g;
print "#matches\n";
Outputs:
;five ;fifteen ;sixty ;seventy
Reference for more techniques
The following question from last month is a very similar problem. However, I provided 5 different solutions in addition to the one demonstrated here:
In Perl, how to count the number of occurences of successful matches based on a condition on their absolute positions

You can count the number of semicolons in each substring up to the matching position. For a million-word string, it takes 150 seconds.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = join ';', q(),
map { qw( the fox jumped over the dog the duck and the frog)[int rand 11] }
1 .. 1000;
$string .= ';';
my $pattern = qr/;the(?=;f)/;
while ($string =~ /$pattern/g) {
my $count = substr($string, 0, pos $string) =~ tr/;//;
say $count if 0 == $count % 5;
}

Revised Answer
One relatively simple way to achieve what you want is by replacing the delimiters in the original text that occur on a 5-word-index boundary:
$text =~ s/;/state $idx++ % 5 ? ',' : ';'/eg;
Now you just need to trivially adjust your $pattern to look for ;the,f instead of ;the;f. You can use the =()= pseudo-operator to return the count:
my $count =()= $text =~ /;the(?=,f)/g;
Original answer after the break. (Thanks to #choroba for pointing out the correct interpretation of the question.)
Character-Based Answer
This uses the /g regex modifier in combination with pos() to look at matching words. For illustration, I print out all matches (not just those on 5-character boundaries), but I print (match) beside those on 5-char boundaries. The output is:
;the;fox;jumped;over;the;dog;the;duck;and;the;frog
^....^....^....^....^....^....^....^....^....^....
`the' #0 (match)
`the' #41
And the code is:
#!/usr/bin/env perl
use 5.010;
my $text = ';the;fox;jumped;over;the;dog;the;duck;and;the;frog';
say $text;
say '^....^....' x 5;
my $pat = qr/;(the)(?=;f)/;
#$pat = qr/;([^;]+)/;
while ($text =~ /$pat/g) {
my $pos = pos($text) - length($1) - 1;
say "`$1' \#$pos". ($pos % 5 ? '' : ' (match)');
}

First of, pos is also possible as a left hand side expression. You could make use of the \G assertion in combination with index (since speed is of concern for you). I expanded your example to showcase that it only "matches" for divisibles of 5 (your example also allowed for indices not divisible by 5 to be 1 a solution, too). Since you only wanted the number of matches, I only used a $count variable and incremented. If you want something more, use the normal if {} clause and do something in the block.
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;or;the;fish";
my $pattern = qr/;the(?=;f)/;
my ($index,$count, $position) = (0,0,0);
while(0 <= ($position = index $string, ';',$position)){
pos $string = $position++; #add one to $position, to terminate the loop
++$count if (!(++$index % 5) and $string =~/\G$pattern/);
}
say $count; # says 1, not 2
You could use the experimental features of regexes to solve you problem (especially the (?{}) blocks). Before you do, you really should read the corresponding section in the perldocs.
my ($index, $count) = (0,0);
while ($string =~ /; # the `;'
(?(?{not ++$index % 5}) # if with a code condition
the(?=;f) # almost your pattern, but we'll have to count
|(*FAIL)) # else fail
/gx) {
$count++;
}

Related

Perl: Matching 3 pairs of numbers from 4 consecutive numbers

I am writing some code and I need to do the following:
Given a 4 digit number like "1234" I need to get 3 pairs of numbers (the first 2, the 2 in the middle, and the last 2), in this example I need to get "12" "23" and "34".
I am new to perl and don't know anything about regex. In fact, I am writing a script for personal use and I've started reading about Perl some days ago because I figured it was going to be a better language for the task at hand (need to do some statistics with the numbers and find patterns)
I have the following code but when testing I processed 6 digit numbers, because I "forgot" that the numbers I would be processing are 4 digits, so it failed with the real data, of course
foreach $item (#totaldata)
{
my $match;
$match = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
if ($match)
{
($arr1[$i], $arr2[$i], $arr3[$i]) = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
$processednums++;
$i++;
}
}
Thank you.
You can move last matching position with pos()
pos directly accesses the location used by the regexp engine to store the offset, so assigning to pos will change that offset..
my $item = 1234;
my #arr;
while ($item =~ /(\d\d)/g) {
push #arr, $1;
pos($item)--;
}
print "#arr\n"; # 12 23 34
The simplest way would be to use a global regex pattern search
It is nearly always best to separate verificaton of the input data from processing, so the program below first rejects any values that are not four characters long or that contain a non-digit character
Then the regex pattern finds all points in the string that are followed by two digits, and captures them
use strict;
use warnings 'all';
for my $val ( qw/ 1234 6572 / ) {
next if length($val) != 4 or $val =~ /\D/;
my #pairs = $val =~ /(?=(\d\d))/g;
print "#pairs\n";
}
output
12 23 34
65 57 72
Here's a pretty loud example demonstrating how you can use substr() to fetch out the portions of the number, while ensuring that what you're dealing with is in fact exactly a four-digit number.
use warnings;
use strict;
my ($one, $two, $three);
while (my $item = <DATA>){
if ($item =~ /^\d{4}$/){
$one = substr $item, 0, 2;
$two = substr $item, 1, 2;
$three = substr $item, 2, 2;
print "one: $one, two: $two, three: $three\n";
}
}
__DATA__
1234
abcd
a1b2c3
4567
891011
Output:
one: 12, two: 23, three: 34
one: 45, two: 56, three: 67
foreach $item (#totaldata) {
if ( my #match = $item =~ m/(?=(\d\d))/ ) {
($heads[$i], $middles[$i], $tails[$i]) = #match;
$processednums++;
$i++;
}
}

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedly. It may find five times, 10 times, etc. I need this number. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use #LAST_MATCH_END and #LAST_MATCH_START
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

When there is a similar pattern in an expression, how to extract the occurence of the last instance in perl?

The value of $s is dynamic. I need to extract the values that occur after the last | in between each [].
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = split(/\]/, $s);
foreach my $part (#parts)
{
# Need to extract the values that occur after the last '|'
# (for example: !, .1iit, 10:48AM, Calculator, Coffee)
# and store each of the values separately in a hash
}
Could someone help me out in this?
Thanks,
Best to transform the string into a more useful data structure, then take the needed elements. Why is this best? Because right now you need the last element, but perhaps next time you will need some other part. Since its not harder to do it right, why not?
#!/usr/bin/perl
use strict;
use warnings;
# Only needed for Dumper
use Data::Dumper;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
# Extract each group between []
# Then transform each group into an array reference by splitting on |
my #groups = map { [ split /\|/ ] } ($s =~ /\[([^\]]*)\]/g);
# Inspect the data structure
print Dumper \#groups;
# Print only the last element of each sub-array
print "$_\n" for map {$_->[-1]} #groups;
If needed the third elements of the sub-arrays could be transformed into hashrefs quite easily too. ,however since that wasn't needed, I leave that as an exercise for the reader (I always love saying that when I get the chance!).
Edit: since I found it interesting I ended up creating these hashrefs, here is the code that would replace the my #groups line:
my #groups = map { [ map { /\{([^\}]*)\}/ ? { split /(?:=|,)/, $1 } : $_ } (split /\|/) ] } ($s =~ /\[([^\]]*)\]/g);
or more properly commented (map commands are read from the back, so the comments start at the bottom and follow by number, comments like #/N pair with those like #N)
my #groups = map { #/1
[ #/2
map { #/3
/\{([^\}]*)\}/ #4 ... and if any element (separated by pipes in #3)
# is surrounded by curly braces
? { #5 ... then return a hash ref
split /(?:=|,)/, $1 #6 ... whose elements are given
# pairwise between '=' or ',' signs
} #/5
: $_ #7 ... otherwise (from 'if' in #4 ) return the element as is
} (split /\|/) #3 ... where each element is separated by pipes (i.e. |)
] #2 ... return an array ref
} ($s =~ /\[([^\]]*)\]/g); #1 For each element between sqr braces (i.e. [])
The generic way:
#subparts = split /\|/, $part;
$tail = $subparts[$#subparts];
If you only ever need the last part separately:
$part =~ /([^\|]*)$/ and $tail = $1;
my ($value) = $part =~ m/[^|]\|(.+)$/;
print "$part => $value\n";
and another way:
my $s =
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = $s =~ m/\|([^|]+)]/g;
print join( "\n", #parts );
Since you insist on a regex:
#matches = $s =~ /\|([^|]+?)]/g
Using /g will dump all matches into the array #matches
You really don't need a regex... just use split(). The results are stored in %results
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
foreach my $part (split(/\]/, $s))
{
#pieces = split(/\|/, $part);
$results{$pieces[-1]} = $pieces[-1];
}
With regexes, when you think “I want the last of,” you should immediately think of the pattern .* because regex greed does just what you want.
For example, matching /^(.*)a(.*)$/ chops up "abababab" into
ababab in $1
a matched by the literal in the pattern
b in $2
Let's think through the process of the match. Imagine .* as Augustus Gloop.
Augustus: Ausgezeichnet! The ^ anchor means I get to start at the beginning. From there, I shall eat all the candies!
Willie Wonka: But, my dear Augustus, you must share with the other children.
Augustus: Fine, I get "abababa" and they get "b". Happy?
Willie Wonka: But the next child in line doesn't like b candies.
Augustus: Then I shall keep "ababab" for myself and leave "ab" for the others.
At this point, Augustus has his big pile, humble little Charlie Bucket gets his single a, and Veruca Salt—although scowling about the meager quantity—gets at least something now.
In other words, $2 contains everything after the last a. To be persnickety, the ^ and $ anchors are redundant, but I like keeping them for added emphasis.
Putting this into action, you could write
#! /usr/bin/env perl
use strict;
use warnings;
sub last_fields {
local($_) = #_;
my #last;
push #last, $1 =~ /^.*\|(.+)$/ ? $1 : undef
while /\[(.*?)\]/g;
wantarray ? #last : \#last;
}
The outer while breaks up the string into [...] chunks and assumes that right square-bracket cannot occur inside a chunk. Within each chunk, we use /^.*\|(.+)$/ to capture in $1 everything after the last pipe.
Testing it with your example looks like
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!]" .
"[0|0|{A=167,B=2,C=67,D=17}|.1iit]" .
"[196|0|{A=244,B=6,C=67,D=12}|10:48AM]" .
"[204|0|{A=9,B=201,C=61,D=11}|Calculator]" .
"[66|0|{A=145,B=450,C=49,D=14}|Coffee]";
use Test::More tests => 6;
my #lasts = last_fields $s;
# yes, is_deeply could do this in a single call,
# but it's laid out explicitly here for expository benefit
is $lasts[0], "!";
is $lasts[1], ".1iit";
is $lasts[2], "10:48AM";
is $lasts[3], "Calculator";
is $lasts[4], "Coffee";
is scalar #lasts, 5;
All the tests pass:
$ ./match-last-of
1..6
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
The output of prove is nicer. Run it yourself to see the color coding.
$ prove ./match-last-of
./match-last-of .. ok
All tests successful.
Files=1, Tests=6, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.05 CPU)
Result: PASS

Perl regex: how to know number of matches

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (#{$regexs_ref}) {
LINE: for (#rawfile) {
/#$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?
If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my #list = /$my_regex/g;
if (#list)
{
# do stuff
print "Number of matches: " . scalar #list . "\n";
}
You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my #regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } #regexes;
while (my $line = <DATA>) {
for my $regex (#regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (#regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz
In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special #- array (emphasis added):
#LAST_MATCH_START
#-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with #+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my #patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my #rawfile = <DATA>);
foreach my $pattern (#patterns) {
LINE: for (#rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture
How about below code:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.

How can I find the first occurrence of a pattern in a string from some starting position?

I have a string of arbitrary length, and starting at position p0, I need to find the first occurrence of one of three 3-letter patterns.
Assume the string contain only letters. I need to find the count of triplets starting at position p0 and jumping forward in triplets until the first occurrence of either 'aaa' or 'bbb' or 'ccc'.
Is this even possible using just a regex?
Moritz says this might be faster than a regex. Even if it's a little slower, it's easier to understand at 5 am. :)
#0123456789.123456789.123456789.
my $string = "alsdhfaaasccclaaaagalkfgblkgbklfs";
my $pos = 9;
my $length = 3;
my $regex = qr/^(aaa|bbb|ccc)/;
while( $pos < length $string )
{
print "Checking $pos\n";
if( substr( $string, $pos, $length ) =~ /$regex/ )
{
print "Found $1 at $pos\n";
last;
}
$pos += $length;
}
$string=~/^ # from the start of the string
(?:.{$p0}) # skip (don't capture) "$p0" occurrences of any character
(?:...)*? # skip 3 characters at a time,
# as few times as possible (non-greedy)
(aaa|bbb|ccc) # capture aaa or bbb or ccc as $1
/x;
(Assuming p0 is 0-based).
Of course, it's probably more efficient to use substr on the string to skip forward:
substr($string, $p0)=~/^(?:...)*?(aaa|bbb|ccc)/;
You can't really count with regexes, but you can do something like this:
pos $string = $start_from;
$string =~ m/\G # anchor to previous pos()
((?:...)*?) # capture everything up to the match
(aaa|bbb|ccc)
/xs or die "No match"
my $result = length($1) / 3;
But I think it's a bit faster to use substr() and unpack() to split into triple and walk the triples in a for-loop.
(edit: it's length(), not lenght() ;-)
The main part of this is split /(...)/. But at the end of this, you'll have your positions and occurrence data.
my #expected_triplets = qw<aaa bbb ccc>;
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
my $place = 0;
my #triplets = grep { length } split /(...)/, $data_string;
my %occurrence_for = map { $_, [] } #expected_triplets;
foreach my $i ( 0..#triplets ) {
my $triplet = $triplets[$i];
push( #{$occurrence_for{$triplet}}, $i ) if exists $occurrence_for{$triplet};
}
Or for simple counting by regex (it uses Experimental (??{}))
my ( $count, %count );
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
$data_string =~ m/(aaa|bbb|ccc)(??{ $count++; $count{$^N}++ })/g;
If speed is a serious concern, you can, depending on what the 3 strings are, get really fancy by creating a tree (e.g. Aho-Corasick algorithm or similar).
A map for every possible state is possible, e.g. state[0]['a'] = 0 if no strings begin with 'a'.