Perl pattern matching with pattern arithmetic

Perl pattern matching with pattern arithmetic - regex

I want to understand how can I do arithmetic on matched sub-patterns in perl regex.
This is just a sample code and I want to understand how can I use \1 (already matched sub-pattern. in this case - 7) to match pattern+1 (8)
my $y = 77668;
if($y =~ /(\d)\1(\d)\2\1+1/) #How to increment a previously
#matched sub-pattern and form a pattern?
{
print $y;
}
EDIT
From the answers, I see that pattern arithmetic is not possible.
This is what I want to achieve.
I want to form a regex which will match this pattern:
N-3N-2N-1NNN+1N+2N+3 (N = 3,4,5,6

Its possible via regex code blocks:
my $y = 77668;
if($y =~ /(\d)\1(\d)\2(??{$1+1})/ ) {
print $y;
}
In this snippet (??{ CODE }) returns another regex that must match, so this regex looks like "8" ($1+1). As a result, whole regex will match only if 5th digit is greather and 1st by 1. But drawback with 1st digit is 9, this code block will return "10", so possible its wrong behavior, but you said nothing about what must be done in this case.
Now about N-3N-2N-1NNN+1N+2N+3 question, you can match it with this regex:
my $n = 5;
if( $y =~ /(??{ ($n-3).($n-2).($n-1).$n.($n+1).($n+2).($n+3) })/ ){
Or more "scalable" way:
my $n = 5;
if( $y =~ /(??{ $s=''; $s .= $n+$_ foreach(-3..3); $s; })/ ){
Again, what we must do if $n == 2 ?? $n-3 will be -1. Its not a simply digit cus it have sign, so you should think about this cases.
One another way. Match what we have and then check it.
if( $y =~ /(\d)(\d)(\d)(\d)(\d)(\d)(\d)/ ) {
if( $1 == ($4-3) && $2 == ($4-2) && $3 == ($4-1) && $6 == ($4+1) && $7 == ($4+2) && $7 == ($4+3) ) {
#...
Seems this method litle bit clumsy, but its obivious to everyone (i hope).
Also, you can optimize your regex since 7 ascending digits streak is not so frequent combination, plus get some lulz from co-workers xD:
sub check_number {
my $i;
for($i=1; $i<length($^N); $i++) {
last if substr($^N, $i, 1)<=substr($^N, $i-1, 1);
}
return $i<length($^N) ? "(*FAIL)" : "(*ACCEPT)";
}
if( $y =~ /[0123][1234][2345][3456][4567][5678][6789](??{ check_number() })/ ) {
Or... Maybe most human-friendly method:
if( $y =~ /0123456|1234567|2345678|3456789/ ) {
Seems last variant is bingo xD Its good example about not searching regex when things are so simple)

Of course this is possible. We are talking about Perl regexes after all. But it will be rather ugly:
say "55336"=~m{(\d)\1(\d)\2(\d)(?(?{$1+1==$3})|(*F))}?"match":"fail";
or pretty-printed:
say "55336" =~ m{ (\d)\1 (\d)\2 (\d)
(? (?{$1+1==$3}) # true-branch: nothing
|(*FAIL)
)
}x
? "match" : "fail";
What does this do? We collect the digits in ordinary captures. At the end, we use an if-else pattern:
(? (CONDITION) TRUE | FALSE )
We can embed code into a regex with (?{ code }). The return value of this code can be used as a condition. The (*FAIL) (short: (*F)) verb causes the match to fail. Use (*PRUNE) if you only want a branch, not the whole pattern to fail.
Embedded code is also great for debugging. However, older perls cannot use regexes inside this regex code :-(
So we can match lots of stuff and test it for validity inside the pattern itself. However, it might be a better idea to do that outside of the pattern like:
"string" =~ /regex/ and (conditions)
Now to your main pattern N-3N-2N-1NNN+1N+2N+3 (I hope I parsed it correctly):
my $super_regex = qr{
# N -3 N-2 N-1 N N N+1 N+2 N+3
(\d)-3\1-2\1-1\1\1(\d)(\d)(\d)
(?(?{$1==$2-1 and $1==$3-2 and $1==$4-3})|(*F))
}x;
say "4-34-24-144567" =~ $super_regex ? "match" : "fail";
Or did you mean
my $super_regex = qr{
#N-3 N-2 N-1 N N N+1 N+2 N+3
(\d)(\d)(\d) (\d)\4 (\d)(\d)(\d)
(? (?{$1==$4-3 and $2==$4-2 and $3==$4-1 and
$5==$4+1 and $6==$4+2 and $7==$4+3})|(*F))
}x;
say "123445678" =~ $super_regex ? "match" : "fail";
The scary thing is that these even works (with perl 5.12).
We could also generate parts of the pattern at match-time with the (??{ code }) construct — the return value of this code is used as a pattern:
my $super_regex = qr{(\d)(??{$1+1})(??{$1+2})}x;
say "234"=~$super_regex ? "match":"fail"
et cetera. However, I think readability suffers more this way.
If you need more than nine captures, you can use named captures with the
(?<named>pattern) ... \k<named>
constructs. The contents are also available in the %+ hash, see perlvar for that.
To dive further into the secrets of Perl regexes, I recommend reading perlre a few times.

Related

Pattern matching in perl (Lookahead and Condition on word Index)

I have a long string, containing alphabetic words and each delimited by one single character ";" . The whole string also starts and ends with a ";" .
How do I count the number of occurrences of a pattern (started with ";") if index of a success match is divisible by 5.
Example:
$String = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;"
$Pattern = ";the(?=;f)"
OUTPUT: 1
Since:
Note 1: In above case, the $Pattern ;the(?=;f) exists as the 1st and 10th words in the $String; however; the output result would be 1, since only the index of second match (10) is divisible by 5.
Note 2: Every word delimited by ";" counts toward the index set.
Index of the = 1 -> this does not match since 1 is not divisible by 5
Index of fox = 2
Index of jumped = 3
Index of over = 4
Index of the = 5 -> this does not match since the next word (dog) starts with "d" not "f"
Index of dog = 6
Index of the = 7 -> this does not match since 7 is not divisible by 5
Index of duck = 8
Index of and = 9
Index of the = 10 -> this does match since 10 is divisible by 5 and the next word (frog) starts with "f"
Index of frog = 11
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.

Use Backtracking control verbs to process the string 5 words at a time
One solution is to add a boundary condition that the pattern is preceded by 4 other words.
Then setup an alteration so that if your pattern is not matched, the 5th word is gobbled and then skipped using backtracking control verbs.
The following demonstrates:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;";
my $pattern = qr{;the(?=;f)};
my #matches = $string =~ m{
(?: ;[^;]* ){4} # Preceded by 4 words
(
$pattern # Match Pattern
|
;(*SKIP)(*FAIL) # Or consume 5th word and skip to next part of string.
)
}xg;
print "Number of Matches = " . #matches . "\n";
Outputs:
Number of Matches = 1
Live Demo
Supplemental Example using Numbers 1 through 100 in words
For additional testing, the following constructs a string of all numbers in word format from 1 to 100 using Lingua::EN::Numbers.
For the pattern it looks for a number that's a single word with the next number that begins with the letter S.
use Lingua::EN::Numbers qw(num2en);
my $string = ';' . join( ';', map { num2en($_) } ( 1 .. 100 ) ) . ';';
my $pattern = qr{;\w+(?=;s)};
my #matches = $string =~ m{(?:;[^;]*){4}($pattern|;(*SKIP)(*FAIL))}g;
print "#matches\n";
Outputs:
;five ;fifteen ;sixty ;seventy
Reference for more techniques
The following question from last month is a very similar problem. However, I provided 5 different solutions in addition to the one demonstrated here:
In Perl, how to count the number of occurences of successful matches based on a condition on their absolute positions

You can count the number of semicolons in each substring up to the matching position. For a million-word string, it takes 150 seconds.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = join ';', q(),
map { qw( the fox jumped over the dog the duck and the frog)[int rand 11] }
1 .. 1000;
$string .= ';';
my $pattern = qr/;the(?=;f)/;
while ($string =~ /$pattern/g) {
my $count = substr($string, 0, pos $string) =~ tr/;//;
say $count if 0 == $count % 5;
}

Revised Answer
One relatively simple way to achieve what you want is by replacing the delimiters in the original text that occur on a 5-word-index boundary:
$text =~ s/;/state $idx++ % 5 ? ',' : ';'/eg;
Now you just need to trivially adjust your $pattern to look for ;the,f instead of ;the;f. You can use the =()= pseudo-operator to return the count:
my $count =()= $text =~ /;the(?=,f)/g;
Original answer after the break. (Thanks to #choroba for pointing out the correct interpretation of the question.)
Character-Based Answer
This uses the /g regex modifier in combination with pos() to look at matching words. For illustration, I print out all matches (not just those on 5-character boundaries), but I print (match) beside those on 5-char boundaries. The output is:
;the;fox;jumped;over;the;dog;the;duck;and;the;frog
^....^....^....^....^....^....^....^....^....^....
`the' #0 (match)
`the' #41
And the code is:
#!/usr/bin/env perl
use 5.010;
my $text = ';the;fox;jumped;over;the;dog;the;duck;and;the;frog';
say $text;
say '^....^....' x 5;
my $pat = qr/;(the)(?=;f)/;
#$pat = qr/;([^;]+)/;
while ($text =~ /$pat/g) {
my $pos = pos($text) - length($1) - 1;
say "`$1' \#$pos". ($pos % 5 ? '' : ' (match)');
}

First of, pos is also possible as a left hand side expression. You could make use of the \G assertion in combination with index (since speed is of concern for you). I expanded your example to showcase that it only "matches" for divisibles of 5 (your example also allowed for indices not divisible by 5 to be 1 a solution, too). Since you only wanted the number of matches, I only used a $count variable and incremented. If you want something more, use the normal if {} clause and do something in the block.
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;or;the;fish";
my $pattern = qr/;the(?=;f)/;
my ($index,$count, $position) = (0,0,0);
while(0 <= ($position = index $string, ';',$position)){
pos $string = $position++; #add one to $position, to terminate the loop
++$count if (!(++$index % 5) and $string =~/\G$pattern/);
}
say $count; # says 1, not 2
You could use the experimental features of regexes to solve you problem (especially the (?{}) blocks). Before you do, you really should read the corresponding section in the perldocs.
my ($index, $count) = (0,0);
while ($string =~ /; # the `;'
(?(?{not ++$index % 5}) # if with a code condition
the(?=;f) # almost your pattern, but we'll have to count
|(*FAIL)) # else fail
/gx) {
$count++;
}

Perl all matches of a regexp in a given string

Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.

local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedly. It may find five times, 10 times, etc. I need this number. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?

Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}

Use #LAST_MATCH_END and #LAST_MATCH_START
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.

Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.

If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

Break from regex loop in Perl

In Perl regex, how can I break from /ge loop..?
Let's say the code is:
s/\G(foo)(bar)(;|$)/{ break if $3 ne ';'; print "$1\n"; '' }/ge;
...break here doesn't work, but it should illustrate what I mean.

Generally, I would write this as a while statement:
while( s/(foo)(bar)/$1/ ) {
# my code to determine if I should stop
if(something) {
last;
}
}
The caveat with this method is that your search/replace will start at the beginning each time, which may matter depending on your regex.
If you really wanted to do it in the regex, you could write a function that returns an unmodified string if you reached your end point, such as a count in this case:
my $count=0;
sub myfunc {
my ($string, $a, $b) = #_;
$count++;
if($count > 3) {
return $string;
}
return $a;
}
$mystring = "foobar foobar, foobar + foobar and foobar";
$mystring =~ s/((foo)(bar))/myfunc($1,$2,$3)/ge;
# result: $mystring => "foo foo, foo + foobar and foobar"
If I knew your specific case, I could probably provide a more helpful example.

You can use some experimental features to emulate a break statement, the Perl documentation for some of these features warn that they may change in future versions of Perl.
my $str = "abcdef";
my $stop = 0;
$str =~ s/(?(?{ $stop })(?!))(.)/ $stop = 1 if $1 ge "c"; "X" /ge;
print "$str\n";
This will print XXXdef.
A piece wise explanation:
(?(condition)yes-pattern) if the pattern in in condition matches then match yes-pattern, otherwise don't match anything.
(?{ code }) execute code, inside a conditional if the code is true execute the yes-pattern
(?!) will always fail to match, it's meaning is something like "Don't match nothing" and since 'nothing' can be matched at any point in a string it will fail.
So when $stop is true the pattern can never match, and when $stop is false it matches.

How can I find the first occurrence of a pattern in a string from some starting position?

I have a string of arbitrary length, and starting at position p0, I need to find the first occurrence of one of three 3-letter patterns.
Assume the string contain only letters. I need to find the count of triplets starting at position p0 and jumping forward in triplets until the first occurrence of either 'aaa' or 'bbb' or 'ccc'.
Is this even possible using just a regex?

Moritz says this might be faster than a regex. Even if it's a little slower, it's easier to understand at 5 am. :)
#0123456789.123456789.123456789.
my $string = "alsdhfaaasccclaaaagalkfgblkgbklfs";
my $pos = 9;
my $length = 3;
my $regex = qr/^(aaa|bbb|ccc)/;
while( $pos < length $string )
{
print "Checking $pos\n";
if( substr( $string, $pos, $length ) =~ /$regex/ )
{
print "Found $1 at $pos\n";
last;
}
$pos += $length;
}

$string=~/^ # from the start of the string
(?:.{$p0}) # skip (don't capture) "$p0" occurrences of any character
(?:...)*? # skip 3 characters at a time,
# as few times as possible (non-greedy)
(aaa|bbb|ccc) # capture aaa or bbb or ccc as $1
/x;
(Assuming p0 is 0-based).
Of course, it's probably more efficient to use substr on the string to skip forward:
substr($string, $p0)=~/^(?:...)*?(aaa|bbb|ccc)/;

You can't really count with regexes, but you can do something like this:
pos $string = $start_from;
$string =~ m/\G # anchor to previous pos()
((?:...)*?) # capture everything up to the match
(aaa|bbb|ccc)
/xs or die "No match"
my $result = length($1) / 3;
But I think it's a bit faster to use substr() and unpack() to split into triple and walk the triples in a for-loop.
(edit: it's length(), not lenght() ;-)

The main part of this is split /(...)/. But at the end of this, you'll have your positions and occurrence data.
my #expected_triplets = qw<aaa bbb ccc>;
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
my $place = 0;
my #triplets = grep { length } split /(...)/, $data_string;
my %occurrence_for = map { $_, [] } #expected_triplets;
foreach my $i ( 0..#triplets ) {
my $triplet = $triplets[$i];
push( #{$occurrence_for{$triplet}}, $i ) if exists $occurrence_for{$triplet};
}
Or for simple counting by regex (it uses Experimental (??{}))
my ( $count, %count );
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
$data_string =~ m/(aaa|bbb|ccc)(??{ $count++; $count{$^N}++ })/g;

If speed is a serious concern, you can, depending on what the 3 strings are, get really fancy by creating a tree (e.g. Aho-Corasick algorithm or similar).
A map for every possible state is possible, e.g. state[0]['a'] = 0 if no strings begin with 'a'.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Perl pattern matching with pattern arithmetic - regex

Related

Pattern matching in perl (Lookahead and Condition on word Index)

Perl all matches of a regexp in a given string

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

Break from regex loop in Perl

How can I find the first occurrence of a pattern in a string from some starting position?

Categories

Resources