How to get the expression of matching capture in Perl

In Perl, how can I get the expression of a capture that has matched in a regex?
$s = 'aaazzz';
$s =~ s/(a+)|(b+)|(c+)/.../;
$s =~ s/(?<one>a+)|(?<two>b+)|(?<three>c+)/.../;
I mean the expression (e.g. a+), not the string aaa.
I need the expression of both numbered and named captures.

I'd do something like:
use strict;
use warnings;
my @regexes = (
    qr/(a+)/,
    qr/(b+)/,
    qr/(c+)/,
);
my $string = 'aaazzz';
foreach my $re (@regexes) {
    if ($string =~ $re) {
        print "Used regex is $re\n";
    }
}
Output:
Used regex is (?^:(a+))

You could assemble your regex from components and then test which groups matched. For demo purposes, I have only used a plain match rather than the substitution operator s///, but the same principle applies.
$s = 'aaazzz';
$part1 = '(a+)';
if ( $s =~ /$part1|(b+)|(c+)/ ) {
if ($1) {
print("$part1 matched\n");
}
else {
print("$part1 did not match\n");
}
}
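Building on the idea of assembling the regex from components, here is a minimal sketch for the named-capture case. It assumes you keep the sub-expressions in a table of your own: the @parts array and the matched_expression helper are hypothetical names made up for this example, not part of any module. After a successful match, %+ only contains the names of groups that took part in the match, so the first defined entry tells you which expression was used.

```perl
use strict;
use warnings;

# Hypothetical table: capture name => sub-expression (an assumption for this sketch).
my @parts = (
    [ one   => 'a+' ],
    [ two   => 'b+' ],
    [ three => 'c+' ],
);

# Assemble (?<one>a+)|(?<two>b+)|(?<three>c+) from the table.
my $re = do {
    my $alternation = join '|', map { sprintf( '(?<%s>%s)', @$_ ) } @parts;
    qr/$alternation/;
};

# Return (name, expression, matched text) for the alternative that matched,
# or an empty list if nothing matched.
sub matched_expression {
    my ($string) = @_;
    return unless $string =~ $re;
    for my $part (@parts) {
        my ( $name, $expr ) = @$part;
        return ( $name, $expr, $+{$name} ) if defined $+{$name};
    }
    return;
}

my ( $name, $expr, $text ) = matched_expression('aaazzz');
print "$name: /$expr/ matched '$text'\n";    # one: /a+/ matched 'aaa'
```

For numbered captures the same table should work if you treat the group number as the table index plus one and test defined $-[$n] after the match, since @- holds undef for groups that did not participate.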

Related

Dynamically capture regular expression match in Perl

I'm trying to dynamically capture regex matches in Perl. I know that eval can help me do this, but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?

I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", @_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}

I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are using eval with a BLOCK of code. That works like a try block: if the code dies inside the eval block, the error is caught. It will not run your string as if it were code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the format string says "take the first argument (1$) and format it as a string (%s)". You don't need to localize or assign to $_ to do a match; the =~ operator lets you match against any variable. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because {} is the delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
    qr{(\d+)/(\d+)/(\d+)}         => '%1$s.%2$s.%3$s',
    qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
    qr{(\d+)/(\d+)/(\d\d\d\d)}    => '%3$4d-%2$02d-%1$02d',
    qr{\d}                        => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
    my @captures = ( $str =~ $pattern );
    say "pattern: $pattern";
    if ($#+ == 0) {
        say " no capture groups";
        next;
    }
    unless (@captures) {
        say " no match";
        next;
    }
    # debug-output
    for my $i ( 1 .. $#- ) {
        say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
    }
    say sprintf $testHash{$pattern}, @captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in @captures by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an @ sigil) in @+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
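If you want the same order on every run, sorting the keys is usually enough. A small sketch, using a made-up %format_for hash with plain string keys (the hash name and its contents are assumptions for illustration only):

```perl
use strict;
use warnings;

# Hypothetical data for this sketch: pattern string => sprintf format.
my %format_for = (
    '(\d+)/(\d+)/(\d\d\d\d)' => '%3$4d-%2$02d-%1$02d',
    '(\d+)/(\d+)/(\d+)'      => '%1$s.%2$s.%3$s',
);

# Sorting the keys makes the iteration order repeatable between runs.
my @ordered = sort keys %format_for;

for my $pattern (@ordered) {
    print "pattern: $pattern\n";
}
```

If the order itself matters (for example, most specific pattern first), an array of pattern/format pairs is probably a better fit than a hash.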

Apologies! I realized that both my question and my sample code were vague. But after reading your suggestions I came up with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
    if ( $str =~ $key ) {
        my @matchArr = ($str =~ $key); # Capture all matches
        # Search and replace (limited from $1 to $9)
        for ( my $i = 0; $i < @matchArr; $i++ ) {
            my $num = $i+1;
            $testHash{$key} =~ s/\$$num/$matchArr[$i]/;
        }
        $result = $testHash{$key};
        last;
    }
}
print "$result\n";
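Note that the loop above overwrites the template stored in %testHash and replaces only the first occurrence of each placeholder. A possible tidier variant is sketched below. It assumes templates contain only placeholders of the form $1, $2, and so on, and fill_template is a hypothetical helper name invented for this example. It expands every placeholder in a copy of the template with a single substitution:

```perl
use strict;
use warnings;

# Hypothetical helper: expand $1, $2, ... in $template using the captures
# obtained by matching $str against $pattern. Returns nothing on no match.
sub fill_template {
    my ( $str, $pattern, $template ) = @_;
    my @captures = $str =~ $pattern
        or return;

    # Work on a copy so the caller's template is left untouched.
    # Placeholders without a corresponding capture become empty strings.
    ( my $out = $template ) =~ s{\$([1-9]\d*)}{
        defined $captures[ $1 - 1 ] ? $captures[ $1 - 1 ] : ''
    }ge;
    return $out;
}

print fill_template( '1/12/2016', qr{(\d+)/(\d+)/(\d+)}, '$3-$2-$1' ), "\n";
# prints: 2016-12-1
```

Because the placeholder number is matched with [1-9]\d*, this sketch is not limited to $1 through $9.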

Evaluating the regexp in list context returns the matches. So in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
    my @a = ($str =~ /$pattern/);
    print Dumper(\@a);
}
would do the job.
HTH
Georg

Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my @patterns = map qr{$_}, qw{
    (\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( @patterns ) {
    my @captured = ($str =~ $pattern)
        or next;
    print "'$_'\n" for @captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my @tests = (
    {
        pattern => qr{(\d+)/(\d+)/(\d+)},
        ncaptures => 3,
    }
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
    my @captured = ($str =~ $test->{pattern})
        or next;
    unless (@captured == $test->{ncaptures}) {
        # handle failure
    }
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my @tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
    (\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
    my @captured = ($str =~ $test->{pattern});
    ok @captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
    "" =~ /|$_[0]/;
    return $#+;
}
Output:
ok 1
1..1

Perl parse second instance with regex

I have code which gets an exchange rate:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use POSIX qw(strftime);
use Math::Round;
use CGI qw(header start_html end_html);
use DBI;
sub isfloat {
my $val = shift;
return $val =~ m/^\d+.\d+$/;
}
.....
my $content = get('URL PAGE');
$content =~ /\s+(\d,\d{4})/gi;
my $dolar = $1;
$dolar =~ s/\,/./g;
if (!isfloat($dolar)) {
error("Error USD!");
}
How can I grab the second instance of /\s+(\d,\d{4})/gi?
I tried a solution from the Perl Cookbook like this:
$content =~ /(?:\s+(\d,\d{4})) {2} \s+(\d,\d{4})/i;
but I have errors:
Use of uninitialized value $val in pattern match (m//)
Use of uninitialized value $dolar in substitution (s///)

Assign the pattern match operator result to an array. The array will contain all capture groups from all matches:
my $content = "abc 1,2345 def 0,9876 5,6789";
my @dollars = $content =~ /\s+(\d,\d{4})/g;
# Now, use the captures in @dollars this way:
foreach my $dollar (@dollars[0,1]) {
    # process the $dollar items in a loop
}
# ... or this way:
my $dollar1 = shift @dollars;
# process the $dollar1
my $dollar2 = shift @dollars;
# process the $dollar2
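If only the second rate is needed, a list slice on the match avoids the intermediate array. This is a minimal sketch: nth_rate is a hypothetical helper name, and the sample string is the made-up one from above rather than real page content.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n-th (1-based) rate found in $content,
# with the decimal comma converted to a dot, or nothing if there is none.
sub nth_rate {
    my ( $content, $n ) = @_;

    # The match in list context returns every capture; the slice picks one.
    my $raw = ( $content =~ /\s+(\d,\d{4})/g )[ $n - 1 ];
    return unless defined $raw;

    ( my $rate = $raw ) =~ tr/,/./;
    return $rate;
}

my $content = "abc 1,2345 def 0,9876 5,6789";
print nth_rate( $content, 2 ), "\n";    # prints: 0.9876
```

Checking the return value for definedness before using it should also avoid the "uninitialized value" warnings from the question.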

A non-greedy Perl regular expression

I need to write a script which does the following:
$ cat testdata.txt
this is my file containing data
for checking pattern matching with a patt on the back!
only one line contains the p word.
$ ./mygrep5 pat th testdata.txt
this is my file containing data
for checking PATTERN MATCHING WITH a PATT ON THe back!
only one line contains the p word.
I have been able to print the line which is amended with the "a" capitalized as well. I have no idea how to only take what is needed.
I have been messing around (below is my script so far) and all I manage to return is the "PATT ON TH" part.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Data::Dump 'pp';
my ($f, $s, $t) = @ARGV;
my @output_lines;
open(my $fh, '<', $t);
while (my $line = <$fh>) {
    if ($line =~ /$f/ && $line =~ /$s/) {
        $line =~ s/($f.+?$s)/$1/g;
        my $sub_phrase = uc $1;
        $line =~ s/$1/$sub_phrase/g;
        print $line;
    }
    #else {
    # print $line;
    #}
}
close($fh);
which returns: "for checking pattern matching with a PATT ON THe back!"
How can I fix this problem?

It sounds like you want to capitalize from pat to th except for instances of a surrounded by spaces. The easiest way is to uppercase the whole thing, and then fix any instances of A surrounded by spaces.
sub capitalize {
    my $s = shift;
    my $uc = uc($s);
    $uc =~ s/ \s \K A (?=\s) /a/xg;
    return $uc;
}
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
The downside is that this will replace any existing A surrounded by spaces with a. The following is more complicated, but it doesn't suffer from that problem:
sub capitalize {
    my $s = shift;
    my @parts = $s =~ m{ \G ( \s+ | \S+ ) }xg;
    for (@parts) {
        $_ = uc($_) if $_ ne "a";
    }
    return join('', @parts);
}
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
The rest of the code can be simplified:
#!/usr/bin/perl
use strict;
use warnings;
sub capitalize { ... }
my $f = shift;
my $s = shift;
while (<>) {
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
print;
}

So, if you want to match each sequence that starts with pat and ends with th, non-greedily, and uppercase that sequence, you can simply use an expression on the right side of your substitution:
$line =~ s/($f.+?$s)/uc($1)/eg;
And that's it.
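One caveat with the one-liner above: $f and $s are interpolated as patterns, so any regex metacharacters in the command-line arguments would be interpreted. A sketch that quotes them with \Q...\E, wrapped in a hypothetical upcase_between helper so it can be tried on the sample line from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase every shortest stretch of $line that
# starts with $from and ends with $to, treating both as literal text.
sub upcase_between {
    my ( $line, $from, $to ) = @_;
    $line =~ s/(\Q$from\E.+?\Q$to\E)/uc($1)/eg;
    return $line;
}

print upcase_between(
    'for checking pattern matching with a patt on the back!', 'pat', 'th'
), "\n";
# prints: for checking PATTERN MATCHING WITH a PATT ON THe back!
```

With the non-greedy .+? the standalone "a" falls between two separate matches, so it stays lowercase without any special handling.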

string capture after pattern match

my $s = '>P1;MOREWORDS';
if ($s =~ m/^>.{2};.*/) {
print "jjjjj\n";
my $or = $s =~ /^>.{2};(.*)/;
}
When I try to print $or, I get 1 instead of MOREWORDS.
I am trying to capture using (.*), but failing to do so.
It correctly prints jjjjj after the match.

Match returns a boolean in scalar context. Force list context to make it return the captured strings:
my ($or) = $s =~ /^>.{2};(.*)/;

How can I escape meta-characters when I interpolate a variable in Perl's match operator?

Suppose I have a file containing lines I'm trying to match against:
foo
quux
bar
In my code, I have another array:
foo
baz
quux
Let's say we iterate through the file, calling each element $word, and the internal list we are checking against, @arr.
if( grep {$_ =~ m/^$word$/i} @arr)
This works correctly, but in the somewhat possible case where we have a test case of fo. in the file, the . operates as a wildcard in the regex, and fo. then matches foo, which is not acceptable.
This is of course because Perl is interpolating the variable into a regex.
The question:
How do I force Perl to use the variable literally?

Use \Q...\E to escape regex metacharacters in the interpolated value of the variable:
if( grep {$_ =~ m/^\Q$word\E$/i} @arr)
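A quick sketch of the difference, using the list from the question and fo. as the word read from the file (the @unquoted and @quoted names are just for this example):

```perl
use strict;
use warnings;

my @arr  = qw(foo baz quux);
my $word = 'fo.';

# Without quoting, the dot is a wildcard and "foo" matches.
my @unquoted = grep { $_ =~ m/^$word$/i } @arr;

# With \Q...\E, the dot is a literal dot and nothing matches.
my @quoted = grep { $_ =~ m/^\Q$word\E$/i } @arr;

print "unquoted: @unquoted\n";                        # unquoted: foo
print "quoted: ", scalar(@quoted), " match(es)\n";    # quoted: 0 match(es)
```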

From perlfaq6's answer to How do I match a regular expression that's in a variable?:
We don't have to hard-code patterns into the match operator (or anything else that works with regular expressions). We can put the pattern in a variable for later use.
The match operator is a double quote context, so you can interpolate your variable just like a double quoted string. In this case, you read the regular expression as user input and store it in $regex. Once you have the pattern in $regex, you use that variable in the match operator.
chomp( my $regex = <STDIN> );
if( $string =~ m/$regex/ ) { ... }
Any regular expression special characters in $regex are still special, and the pattern still has to be valid or Perl will complain. For instance, in this pattern there is an unpaired parenthesis.
my $regex = "Unmatched ( paren";
"Two parens to bind them all" =~ m/$regex/;
When Perl compiles the regular expression, it treats the parenthesis as the start of a memory match. When it doesn't find the closing parenthesis, it complains:
Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE paren/ at script line 3.
You can get around this in several ways depending on our situation. First, if you don't want any of the characters in the string to be special, you can escape them with quotemeta before you use the string.
chomp( my $regex = <STDIN> );
$regex = quotemeta( $regex );
if( $string =~ m/$regex/ ) { ... }
You can also do this directly in the match operator using the \Q and \E sequences. The \Q tells Perl where to start escaping special characters, and the \E tells it where to stop (see perlop for more details).
chomp( my $regex = <STDIN> );
if( $string =~ m/\Q$regex\E/ ) { ... }
Alternately, you can use qr//, the regular expression quote operator (see perlop for more details). It quotes and perhaps compiles the pattern, and you can apply regular expression flags to the pattern.
chomp( my $input = <STDIN> );
my $regex = qr/$input/is;
$string =~ m/$regex/ # same as m/$input/is;
You might also want to trap any errors by wrapping an eval block around the whole thing.
chomp( my $input = <STDIN> );
eval {
if( $string =~ m/\Q$input\E/ ) { ... }
};
warn $@ if $@;
Or...
my $regex = eval { qr/$input/is };
if( defined $regex ) {
$string =~ m/$regex/;
}
else {
warn $@;
}

The correct answer is: don't use regexps. I'm not saying regexps are bad, but using them for what amounts to a simple equality check is overkill.
Use: grep { lc($_) eq lc($word) } @arr and be happy.

Quotemeta
Returns the value of EXPR with all non-"word" characters backslashed.
http://perldoc.perl.org/functions/quotemeta.html

I don't think you want a regex in this case since you aren't matching a pattern. You're looking for a literal sequence of characters that you already know. Build a hash with the values to match and use that to filter @arr:
open my $fh, '<', $filename or die "...";
my %hash = map { chomp; lc($_), 1 } <$fh>;
foreach my $item ( @arr )
{
    next unless exists $hash{ lc($item) };
    print "I matched [$item]\n";
}
