How can I determine the position a pattern matched in a string? - regex

I want to match a pattern with the format /vX.X.X/, where X is a number. For example: /v1.1.1/ and /v1.0.300/. After matching the pattern, how can I get the position in the string where the pattern was found?

@- contains the offsets at which the match and captures were found.
$-[0] is the offset at which the pattern matched.
$-[1] is the offset at which the first capture matched.
$-[2] is the offset at which the second capture matched.
etc.
As such, you can use the following:
if ( $s =~ m{/v\d+\.\d+\.\d+/}a ) {
say "Matched at position $-[0]";
}

The approach you take depends on what you are trying to accomplish and the rest of the stuff around the problem. Since you haven't said anything about this, here's a shotgun blast of different ideas. Not all of them may be appropriate for what you are doing.
The @- special variable has the offsets for the starting position of
the match groups. The first element is the start of the entire match,
the second element (index 1) is the start of the $1 match, and so
on. If your pattern is the entire string you want, then you can use
the first element in that array:
if( $string =~ /\bv\d+\.\d+\.\d+\b/ ) {
my $position = $-[0];
say "Position is $position";
}
If you have other stuff around your pattern and the stuff you want is
in the first match group, you can use the second element (remember
that match groups are numbered by the order of the opening parens):
if( $string =~ /before (v\d+\.\d+\.\d+) after/ ) {
my $position = $-[1];
say "Position is $position";
}
When your pattern changes, you may need to update which element you
use.
There's also @+ that works the same but has the ending position. I
have a bunch of examples of this in the first edition of Mastering
Perl. I save it for that book because
I find that many people get confused about which element corresponds to
which part of the pattern. Consider whether you'll remember this later.
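As a minimal sketch of how @- and @+ pair up, the following uses a hypothetical helper, match_span, and a made-up input string (neither comes from the question). It assumes the pattern has no capture groups, so only element 0 of each array is needed:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start, end, text) for the first
# /vX.X.X/ tag in the string, or an empty list if there is none.
sub match_span {
    my ($string) = @_;
    if ( $string =~ m{/v\d+\.\d+\.\d+/} ) {
        # $-[0] is where the match starts; $+[0] is one past its last character.
        my ( $start, $end ) = ( $-[0], $+[0] );
        return ( $start, $end, substr( $string, $start, $end - $start ) );
    }
    return;
}

my ( $start, $end, $text ) = match_span('download/v1.0.300/file.tar');
print "start=$start end=$end text=$text\n";    # prints start=8 end=18 text=/v1.0.300/
```

Because $+[0] is one past the end, the difference $+[0] - $-[0] is the length of the match.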
You can use index to get the position of the matched string:
if( $string =~ /\b(v\d+\.\d+\.\d+)\b/ ) {
my $matched = $1;
my $position = index( $string, $matched );
say "Position is $position";
}
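One caveat with index, shown here as a hedged sketch: index reports the first place the matched text occurs, which is not necessarily where the regex matched. The helper name positions and the sample string are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (offset reported by @-, offset reported by index).
sub positions {
    my ($string) = @_;
    if ( $string =~ /\b(v\d+\.\d+\.\d+)\b/ ) {
        my $matched = $1;
        return ( $-[1], index( $string, $matched ) );
    }
    return;
}

# The first "v1.2.3" is glued to "x", so \b cannot match in front of it.
# The regex matches the second copy, but index() finds the first one.
my ( $by_offset, $by_index ) = positions('xv1.2.3 v1.2.3');
print "regex matched at $by_offset, index() found $by_index\n";
# prints: regex matched at 8, index() found 1
```

If the matched text can only appear once in the string, the two values agree.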
Using the /p flag and ${^PREMATCH} variable from Perl v5.10, count
the positions before the matched part of the string:
use v5.10;
if( $string =~ /\bv\d+\.\d+\.\d+\b/p ) {
my $position = length ${^PREMATCH};
say "Position is $position";
}
Use the /g flag in scalar context and Perl remembers the string
position where the match ended. Subtract the match length to see
where the match started:
if( $string =~ /\b(v\d+\.\d+\.\d+)\b/g ) {
my $matched = $1;
my $position = pos( $string ) - length($1);
say "Position is $position";
}
If there can be multiple matches per string, you'll have to adjust
these. One way uses a while loop, since its condition is still a scalar
context:
while( $string =~ /\b(v\d+\.\d+\.\d+)\b/g ) {
my $matched = $1;
my $position = pos( $string ) - length($1);
say "Position is $position";
}
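The same loop can read the start offset straight from @- instead of subtracting lengths. This is only a sketch; the helper name all_positions and the sample text are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: start offsets of every version string in $string.
sub all_positions {
    my ($string) = @_;
    my @positions;
    while ( $string =~ /\bv\d+\.\d+\.\d+\b/g ) {
        push @positions, $-[0];    # start of the current match
    }
    return @positions;
}

print join( ', ', all_positions('v1.0.0 then v2.10.3 and v3.0.1') ), "\n";
# prints: 0, 12, 24
```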

Related

How to verify if a variable value contains a character and ends with a number using Perl

I am trying to check whether a variable contains the character "C" and ends with a number in the minor version. I have:
my $str1 = "1.0.99.10C9";
my $str2 = "1.0.99.10C10";
my $str3 = "1.0.999.101C9";
my $str4 = "1.0.995.511";
my $str5 = "1.0.995.AC";
I would like a regex that prints a message if the variable has C in the 4th field and ends with a number, so for str1, str2, and str3 it should print "matches". I have tried the regexes below, but none of them work. Can you help me correct them?
my $str1 = "1.0.99.10C9";
if ( $str1 =~ /\D+\d+$/ ) {
print "Candy match1\n";
}
if ( $str1 =~ /\D+C\d+$/ ) {
print "Candy match2\n";
}
if ($str1 =~ /\D+"C"+\d+$/) {
print "candy match3";
}
if ($str1 =~ /\D+[Cc]+\d+$/) {
print "candy match4";
}
if ($str1 =~ /\D+\\C\d+$/) {
print "candy match5";
}
if ($str1 =~ /C[^.]*\d$/)
C matches the letter C.
[^.]* matches any number of characters that aren't a literal dot (.). This ensures that the match won't go across multiple fields of the version number; it will only match within the last field.
\d matches a digit.
$ matches the end of the string. So the digit has to be at the end.
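As a quick check of the pattern above against the five strings from the question, here is a small sketch. The helper name has_c_and_trailing_digit is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the last field contains C and the string ends in a digit.
sub has_c_and_trailing_digit {
    my ($version) = @_;
    return $version =~ /C[^.]*\d$/ ? 1 : 0;
}

my @versions = qw( 1.0.99.10C9 1.0.99.10C10 1.0.999.101C9 1.0.995.511 1.0.995.AC );
for my $version (@versions) {
    printf "%-14s %s\n", $version,
        has_c_and_trailing_digit($version) ? 'matches' : 'no match';
}
```

The first three strings match; 1.0.995.511 has no C, and 1.0.995.AC does not end in a digit.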
I found it really helpful to use https://www.regextester.com/109925 to test and analyse my regex strings.
Let me know if this regex works for you:
((.*\.){3}(.*C\d{1}))
Following your format, this regex assumes 3 . with characters between, and then after the third . it checks if the rest of the string contains a C.
EDIT:
If you want to make sure the string ends in a digit, and don't want to use it to check longer strings containing the formula, use:
^((.*\.){3}(.*C\d{1}))$
Let's look at what the regex should look like:
start{digit}.{digit}.{2-3 digits}.{2-3 digits}C{1-2 digits}end
very very strict qr/^1\.0\.9{2,3}\.101?C\d+\z/ - must start with 1.0.99[9]?.
very strict qr/^1\.0\.\d{2,3}\.\d{2,3}C\d{1,2}\z/ - must start with 1.0.
strict qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d{1,2}\z/
relaxed qr/^\d\.\d\.\d+\.\d+C\d+\z/
very relaxed qr/\.\d+C\d+\z/
use strict;
use warnings;
use feature 'say';
my @data = qw/1.0.99.10C9 1.0.99.10C10 1.0.999.101C9 1.0.995.511 1.0.995.AC/;
#my $re = qr/^\d\.\d\.\d+\.\d+C\d+\z/;
my $re = qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d+\z/;
say '--- Input Data ---';
say for @data;
say '--- Matching -----';
for( @data ) {
say 'match ' . $_ if /$re/;
}
Output
--- Input Data ---
1.0.99.10C9
1.0.99.10C10
1.0.999.101C9
1.0.995.511
1.0.995.AC
--- Matching -----
match 1.0.99.10C9
match 1.0.99.10C10
match 1.0.999.101C9

Matching a variable in a string in Perl from the end

I want to match a variable character in a given string, but from the end.
Ideas on how to do this action?
for example:
sub removeCharFromEnd {
my $string = shift;
my $char = shift;
if($string =~ m/$char/){ # I want to match the char, searching from the end; $ doesn't work
print "success";
}
}
Thank you for your assistance.
There is no regex modifier that would force the Perl regex engine to parse the string from right to left. Thus, the most convenient way to achieve that is via a negative lookahead:
m/$char(?!.*$char)/
The (?!.*$char) negative lookahead will require the absence (=will fail the match if found) of a $char after any 0+ chars other than linebreak chars (use s modifier if you are running the regex against a multiline string input).
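A minimal sketch of the lookahead approach follows. The helper name last_occurrence is hypothetical, and the quotemeta call is an addition that assumes $char should be treated as literal text rather than as a pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the last occurrence of $char in $string, or -1.
sub last_occurrence {
    my ( $string, $char ) = @_;
    my $quoted = quotemeta $char;    # assumption: $char is literal text
    # Match $char only where no further $char follows anywhere in the string.
    return $-[0] if $string =~ /$quoted(?!.*$quoted)/s;
    return -1;
}

print last_occurrence( 'abcabc', 'b' ), "\n";    # prints 4
print last_occurrence( 'a.b.c',  '.' ), "\n";    # prints 3
print last_occurrence( 'abc',    'z' ), "\n";    # prints -1
```

Without quotemeta, a $char such as . would match any character, so the second call would report the last character of the string instead.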
The regex engine works from left to right.
You can use the natural greediness of quantifiers to reach the end of the string and find the last char with the backtracking mechanism:
if($string =~ m/.*\K$char/s) { ...
\K marks the position of the match result beginning.
Other ways:
you can also reverse the string and use your previous pattern.
you can search all occurrences and take the last item in the list
I'm having trouble understanding what you want. Your subroutine is called removeCharFromEnd, so perhaps you want to remove $char from $string if it appears at the end of the string
You can do that like this
sub removeCharFromEnd {
my ( $string, $char ) = @_;
if ( $string =~ s/$char\z// ) {
print "success";
}
$string;
}
Or perhaps you want to remove the last occurrence of $char wherever it is. You can do that with
s/.*\K$char//
The subroutine I have written returns the modified string, so you would have to assign the result to a variable to save it. You can write
my $s = 'abc';
$s = removeCharFromEnd($s, 'c');
say $s;
output
ab
If you just want to modify the string in place then you should write
$_[0] =~ s/$char\z//
using whichever substitution you choose. Then you can do this
my $s = 'abc';
removeCharFromEnd($s, 'c');
say $s;
This produces the same output.
To get Perl to search from the end of a string, reverse the string.
sub removeCharFromEnd {
my $string = reverse shift @_;
my $char = quotemeta reverse shift @_;
$string =~ s/$char//;
$string = reverse $string;
return $string;
}
print removeCharFromEnd(qw( abcabc b )), "\n";
print removeCharFromEnd(qw( abcdefabcdef c )), "\n";
print removeCharFromEnd(qw( !"/$%?&*!"/$%?&* $ )), "\n";

What are the roles of ${$exp}, $-[$exp], and $+[$exp] in this Perl example which extracts the locations where a regular expression matches?

What is the meaning of $#- and $-[$exp]?
$x = "Mmm...donut, thought Homer";
$x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
foreach $exp (1..$#-)
{
print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n";
}
OUTPUT:
Match 1: 'Mmm' at position (0,3)
Match 2: 'donut' at position (6,11)
In Perl, $#ary is the index of the last element of the array @ary. Therefore, $#- is the index of the last element of the array @-:
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Therefore, 1 .. $#- is a range of indices into @-.
${$exp} is a symbolic dereference. If $exp is 1, you get the text of $1. I would not use this in real code. Instead, use @- and @+ to extract the substring.
Also, while you know this example is going to match, in real life, never use the digit variables without ensuring the previous match succeeded.
#!/usr/bin/env perl
use strict;
use warnings;
my $x = "Mmm...donut, thought Homer";
if ( $x =~ /^(Mmm|Yech) [.]{3} (donut|peas)/x ) {
for my $i (1 .. $#-) {
my ($s, $e) = ($-[$i], $+[$i]);
printf(
"Match %d: '%s' at position [%d,%d)\n",
$i, substr($x, $s, $e - $s), $s, $e
);
}
}
Output:
Match 1: 'Mmm' at position [0,3)
Match 2: 'donut' at position [6,11)
Note that:
$+[1] is the offset past where $1 ends, $+[2] the offset past where $2 ends, and so on. You can use $#+ to determine how many subgroups were in the last successful match.
Hence, the abuse of the half-closed interval notation above.
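The warning above about digit variables can be illustrated with a short sketch. Both helper names and the sample strings are made up; the point is only that a failed match leaves $1 from an earlier successful match untouched:

```perl
use strict;
use warnings;

# Unsafe: returns whatever $1 holds, even when the second match fails.
sub unchecked_id {
    my ( $first, $second ) = @_;
    $first  =~ /id=(\d+)/;
    $second =~ /id=(\d+)/;
    return $1;
}

# Safer: read $1 only when the match succeeded.
sub checked_id {
    my ($string) = @_;
    return $string =~ /id=(\d+)/ ? $1 : undef;
}

# The second string has no id, yet the stale 42 from the first match comes back.
print unchecked_id( 'id=42', 'no id here' ), "\n";    # prints 42

my $id = checked_id('no id here');
print defined $id ? "id is $id\n" : "no id found\n";  # prints no id found
```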

Perl regex: how to know number of matches

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (@{$regexs_ref}) {
LINE: for (@rawfile) {
/@$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?
If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my @list = /$my_regex/g;
if (@list)
{
# do stuff
print "Number of matches: " . scalar @list . "\n";
}
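Note that the list also contains the groupings, so the element count equals the number of matches only when the pattern has no capture groups. A small sketch, with a hypothetical helper count_list and a made-up input:

```perl
use strict;
use warnings;

# Hypothetical helper: number of elements a /g match returns in list context.
sub count_list {
    my ( $string, $re ) = @_;
    my @list = $string =~ /$re/g;
    return scalar @list;
}

# No capture groups: one element per match.
print count_list( 'a1b2c3', qr/[a-z]\d/ ), "\n";        # prints 3

# Two capture groups: two elements per match.
print count_list( 'a1b2c3', qr/([a-z])(\d)/ ), "\n";    # prints 6
```

With capture groups in the pattern, dividing the element count by the number of groups recovers the number of matches.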
You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my @regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } @regexes;
while (my $line = <DATA>) {
for my $regex (@regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (@regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz
In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special @- array (emphasis added):
@LAST_MATCH_START
@-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my @patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my @rawfile = <DATA>);
foreach my $pattern (@patterns) {
LINE: for (@rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture
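The substr equivalences listed above can also be used to rebuild every capture from the offset arrays. This is a sketch under the assumption that all groups took part in the match (otherwise $-[n] is undef); the helper name captures_by_offset and the sample string are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: rebuilds $1, $2, ... from @- and @+ after a successful match.
sub captures_by_offset {
    my ( $var, $re ) = @_;
    return unless $var =~ $re;
    # For each group n, the text runs from $-[n] up to, but not including, $+[n].
    return map { substr( $var, $-[$_], $+[$_] - $-[$_] ) } 1 .. $#-;
}

my @captures = captures_by_offset( 'say foobarbaz now', qr/(foo(bar(baz)))/ );
print join( ' | ', @captures ), "\n";    # prints foobarbaz | barbaz | baz
```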
How about the code below:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.

How can I find the first occurrence of a pattern in a string from some starting position?

I have a string of arbitrary length, and starting at position p0, I need to find the first occurrence of one of three 3-letter patterns.
Assume the string contain only letters. I need to find the count of triplets starting at position p0 and jumping forward in triplets until the first occurrence of either 'aaa' or 'bbb' or 'ccc'.
Is this even possible using just a regex?
Moritz says this might be faster than a regex. Even if it's a little slower, it's easier to understand at 5 am. :)
#0123456789.123456789.123456789.
my $string = "alsdhfaaasccclaaaagalkfgblkgbklfs";
my $pos = 9;
my $length = 3;
my $regex = qr/^(aaa|bbb|ccc)/;
while( $pos < length $string )
{
print "Checking $pos\n";
if( substr( $string, $pos, $length ) =~ /$regex/ )
{
print "Found $1 at $pos\n";
last;
}
$pos += $length;
}
$string=~/^ # from the start of the string
(?:.{$p0}) # skip (don't capture) "$p0" occurrences of any character
(?:...)*? # skip 3 characters at a time,
# as few times as possible (non-greedy)
(aaa|bbb|ccc) # capture aaa or bbb or ccc as $1
/x;
(Assuming p0 is 0-based).
Of course, it's probably more efficient to use substr on the string to skip forward:
substr($string, $p0)=~/^(?:...)*?(aaa|bbb|ccc)/;
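Since the question asks for the count of triplets, the substr form can be combined with @- to get it. A hedged sketch: the helper name triplet_count_from is made up, and it assumes a 0-based $p0 that does not exceed the string length:

```perl
use strict;
use warnings;

# Hypothetical helper: number of triplets skipped from $p0 before the first
# aaa, bbb or ccc that sits on a triplet boundary; undef if there is none.
sub triplet_count_from {
    my ( $string, $p0 ) = @_;
    return if $p0 > length $string;
    if ( substr( $string, $p0 ) =~ /^(?:...)*?(aaa|bbb|ccc)/s ) {
        # $-[1] is the offset of $1 inside the substring, always a multiple of 3.
        return $-[1] / 3;
    }
    return;
}

print triplet_count_from( 'alsdhfaaasccclaaaagalkfgblkgbklfs', 9 ), "\n";    # prints 2
```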
You can't really count with regexes, but you can do something like this:
pos $string = $start_from;
$string =~ m/\G # anchor to previous pos()
((?:...)*?) # capture everything up to the match
(aaa|bbb|ccc)
/xs or die "No match";
my $result = length($1) / 3;
But I think it's a bit faster to use substr() and unpack() to split into triples and walk the triples in a for-loop.
(edit: it's length(), not lenght() ;-)
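The substr() and unpack() idea mentioned above might look roughly like this. It is only a sketch, not a benchmarked implementation; the helper name triplets_before_hit is made up, and the three target strings are hard-coded:

```perl
use strict;
use warnings;

# Hypothetical helper: number of whole triplets walked from $p0 before the
# first aaa/bbb/ccc triplet, or undef if no such triplet is found.
sub triplets_before_hit {
    my ( $string, $p0 ) = @_;
    return if $p0 > length $string;
    # '(a3)*' cuts the string into 3-character chunks (the last may be shorter).
    my @triples = unpack '(a3)*', substr( $string, $p0 );
    for my $i ( 0 .. $#triples ) {
        return $i if $triples[$i] =~ /\A(?:aaa|bbb|ccc)\z/;
    }
    return;
}

print triplets_before_hit( 'alsdhfaaasccclaaaagalkfgblkgbklfs', 9 ), "\n";    # prints 2
```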
The main part of this is split /(...)/. But at the end of this, you'll have your positions and occurrence data.
my @expected_triplets = qw<aaa bbb ccc>;
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
my $place = 0;
my @triplets = grep { length } split /(...)/, $data_string;
my %occurrence_for = map { $_, [] } @expected_triplets;
foreach my $i ( 0..$#triplets ) {
my $triplet = $triplets[$i];
push( @{$occurrence_for{$triplet}}, $i ) if exists $occurrence_for{$triplet};
}
Or for simple counting by regex (it uses the experimental (?{ }) code block):
my ( $count, %count );
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
1 while $data_string =~ m/(aaa|bbb|ccc)(?{ $count++; $count{$^N}++ })/g;
If speed is a serious concern, you can, depending on what the 3 strings are, get really fancy by creating a tree (e.g. Aho-Corasick algorithm or similar).
A map for every possible state is possible, e.g. state[0]['a'] = 0 if no strings begin with 'a'.