Regex $1 into variable interferes with another variable - regex

I have been struggling with a section of my code for a while now and can't figure it out. Seems to have something to do with how $1 is handled but I cannot find anything relevant.
The regex finds the 16640021 and assigns it to a position in the arrays.
my @one;
my @two;
my $articleregex = qr/\s*\d*\/\s*\d*\|\s*(.*?)\|/p; # $1 = article number
my $row = " 7/ 1| 16640021|Taats 3 IP10 |14-03-03| | | 1,0000|st | | 01| | N| 0|";
if ($row =~ /$articleregex/g) {
$one[0] = $1;
}
if ($row =~ /$articleregex/g) {
$two[0] = $1;
}
print $one[0];
print $two[0];
Which outputs
Use of uninitialized value in print at perltest3.pl line 13.
16640021
It appears that the designation of $one[0] somehow interferes with that of $two[0]. This seems strange to me as the two variables and their designations should not be interacting in any way

It's because you used if (//g) instead of if (//).
//g in scalar context sets pos($_) [1] to where the match left off, or unsets pos($_) [1] if the match was unsuccessful [2].
//g in scalar context starts matching at position pos($_) [1].
For example,
$_ = "ab";
say /(.)/g ? $1 : "no match"; # a
say /(.)/g ? $1 : "no match"; # b
say /(.)/g ? $1 : "no match"; # no match
say /(.)/g ? $1 : "no match"; # a
This allows the following to iterate through the matches:
while (/(.)/g) {
say $1;
}
Don't use if (//g)[3]!
[1] $_ is being used to represent the variable being matched against.
[2] Unless /c is also used.
[3] Unless you're unrolling a while (//g), or unless you're using if (//gc) to tokenize.
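Applied to the code in the question, the fix can be sketched as follows. This is a minimal sketch, assuming a shortened copy of the same row and the same pattern; the helper name first_article is made up for illustration. Without /g, each match starts at the beginning of the string, so both arrays receive the article number.

```perl
use strict;
use warnings;

my $articleregex = qr/\s*\d*\/\s*\d*\|\s*(.*?)\|/;   # $1 = article number
my $row = " 7/ 1| 16640021|Taats 3 IP10 |14-03-03|";

# Hypothetical helper: returns the article number, or undef if there is no match.
sub first_article {
    my ($line) = @_;
    return $line =~ $articleregex ? $1 : undef;   # no /g, so pos() plays no part
}

my (@one, @two);
$one[0] = first_article($row);
$two[0] = first_article($row);
print "$one[0]\n$two[0]\n";
```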

Related

Perl Grepping from an Array

I need to grep a value from an array.
For example i have a values
@a = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl');
@Array = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl', 'branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
Now I need to loop over @a and check whether each value matches an element of @Array.
It works for me with grep. You'd do it the exact same way as in the List::MoreUtils example below, except for having grep instead of any. You can also shorten it to
my $got_it = grep { /$str/ } @paths;
my @matches = grep { /$str/ } @paths;
By default the match operator m// tests against $_, which grep sets to each element of the list in turn. The $str and @paths are the same as below.
You can use the module List::MoreUtils as well. Its function any returns true/false depending on whether the condition in the block is satisfied for any element in the list, i.e. whether there was a match in this case.
use warnings;
use strict;
use List::MoreUtils qw(any);
my $str = 'branches/Soft/a.txt';
my @paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
my $got_match = any { $_ =~ m/$str/ } @paths;
With the list above, containing the $str, the $got_match is 1.
Or you can roll it by hand and catch the match as well
foreach my $p (@paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
This does print out the match.
Note that the strings you show in your example do not contain the one to match. I added it to my list for a test. Without it in the list no match is found in either of the examples.
To test for more than one string, with the added sample
my @strings = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl');
my @paths = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl',
'branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
foreach my $str (@strings) {
foreach my $p (@paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
# Or, instead of the foreach loop above use
# my $match = grep { /$str/ } @paths;
# print "Matched for $str\n" if $match;
}
This prints
Found it: branches/Soft/a.txt
Found it: branches/Soft/h.cpp
Found it: branches/Main/utils.pl
When the lines with grep are uncommented and foreach ones commented out I get the corresponding prints for the same strings.
The dot in $a is a regex metacharacter (and the slashes would need care in a literal pattern), so you either have to escape the string when doing a regex match or use a simple eq to find the matches:
Regex match with $a escaped:
my @matches = grep { /\Q$a\E/ } @array;
Simple comparison with "equals":
my @matches = grep { $_ eq $a } @array;
With your sample data both will give an empty array @matches because there is no match.
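The difference between the three approaches can be sketched as below. This is only an illustration; the path list is made up (it is not the data from the question) and is chosen so that each approach gives a different count.

```perl
use strict;
use warnings;

my $needle = 'branches/Soft/a.txt';
# Made-up sample paths, chosen to show the differences.
my @paths = (
    'branches/Soft/a.txt',
    'branches/Soft/aXtxt',            # differs only where the "." is
    'old/branches/Soft/a.txt.bak',    # contains the needle as a substring
);

my @loose  = grep { /$needle/ }     @paths;   # "." matches any character
my @quoted = grep { /\Q$needle\E/ } @paths;   # literal substring match
my @exact  = grep { $_ eq $needle } @paths;   # whole-string equality

printf "loose=%d quoted=%d exact=%d\n",
    scalar(@loose), scalar(@quoted), scalar(@exact);
```

Note that \Q...\E still matches anywhere inside the string; anchor the pattern with ^ and $ (or use eq) if the whole path must be identical.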
This solved my question. Thanks to all, especially @zdim, for the valuable time and support.
my @SVNFILES = ('branches/Soft/a.txt', 'branches/Soft/b.txt');
my @paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
foreach my $svn (@SVNFILES)
{
chomp ($svn);
my $m = grep { /$svn/ } (@paths);
if ( $m eq '0' ) {
print "Files Mismatch\n";
exit 1;
}
}
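In a regex you should escape characters like '.' when you need them as literal characters, and '/' when it is also the pattern delimiter.
For example (single quotes, so the backslashes survive in the string):
$a = 'branches\/Soft\/a\.txt';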
Retry whatever you did with either grep or perl with that. If it still doesn't work, tell us precisely what you tried.

How to refer to matched part in regex

I am using the following code to search for a substring and print it out with a few characters before and after it. Somehow Perl takes issue with me using $1 and complains about
Use of uninitialized value $1 in concatenation (.) or string.
I cannot figure out why...can you?
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
You have to capture $word into regex group $1 by using parentheses,
while ($lines =~ m/\b($word)\b/g)
When you use $1, you are asking the code to use the first captured group from the regex, and since your regex doesn't have any, that variable is undefined.
You can either refer to the whole match with $& or you add a capture group to your regex and keep using $1.
i.e. Either:
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/\b$word\b/g ) { # as long as pattern is found...
print "$word\ ";
print "$&";
print substr ($lines, max(pos($lines)-length($&)-$context, 0), length($&)+$context); # check: am I possibly violating any boundaries here
}
Or
use List::Util qw[min max];
my $word = "test";
my $lines = "this is just a test to find something out";
my $context = 3;
while ($lines =~ m/(\b$word\b)/g ) { # as long as pattern is found...
print "$word\ ";
print "$1";
print substr ($lines, max(pos($lines)-length($1)-$context, 0), length($1)+$context); # check: am I possibly violating any boundaries here
}
Note: It doesn't matter whether you use (\b$word\b), (\b$word)\b, \b($word\b) or \b($word)\b here, because \b is a zero-width assertion and adds nothing to the captured text.
When you want to refer to a matched part of a regex, put it in parentheses. Then you'll be able to refer to this matched part via the $1 variable (for the first pair of parentheses), $2 (for the second pair) and so on.
The variables $1, $2 and so on hold the strings found by capture groups. After a successful match they are all reset, so any group that did not take part in that match is undef. The code in the question does not have any capture groups, hence $1 is never given a value and it is undefined.
Running the code below shows the effect. Initially $1, $2 and $3 are not defined. The first match sets $1 and $2 but not $3. The second match sets only $1, but note that $2 is cleared to be undefined. The third match has no capture groups and all three are undefined.
use strict;
use warnings;
sub show
{
printf "\$1: %s\n", (defined $1 ? $1 : "-undef-");
printf "\$2: %s\n", (defined $2 ? $2 : "-undef-");
printf "\$3: %s\n", (defined $3 ? $3 : "-undef-");
print "\n";
}
my $text = "abcdefghij";
show();
$text =~ m/ab(cd)ef(gh)ij/; # First match
show();
$text =~ m/ab(cd)efghij/; # Second match
show();
$text =~ m/abcdefghij/; # Third match
show();
$1 will have no value unless you are actually capturing something.
You can also collect the surrounding context in the regex itself, using lookahead and lookbehind.
use strict;
use warnings;
my $lines = "this is just a test to find something out";
my $word = "test";
my $extra = 10;
while ($lines =~ m/(?:(?<=(.{$extra}))|(.{0,$extra}))\b(\Q$word\E)\b(?=(.{0,$extra}))/gs ) {
my $pre = $1 // $2;
my $word = $3;
my $post = $4;
print "'...$pre<$word>$post...'\n";
}
Outputs:
'...is just a <test> to find s...'

$1 and multiple groupings

If I have a regular expression and if I am expecting one or the other terms to match, say
ab*cc or ab*dd
I have a RegEx like
while($line =~ /(ab*cc)|(ab*dd)/g)
{
# print match whether its abcc or abdd
# print $1?
}
But I am unsure how $1 will work.
Is there such a thing as $2 meaning that it will be $1 if it matches abcc or $2 if it matches abdd? How can I extend this if I have say 3 groupings or so, meaning it could either be X or Y or Z?
You could use:
while ($line =~ m/((ab*cc)|(ab*dd))/g)
Now $1 will be whichever of the two terms matches, and $2 will be the same if the first term matches but undefined otherwise, while $3 will be the same if the second term matches but undefined otherwise. The extension to three or more terms should be obvious.
The m// notation is a marginally more explicit notation equivalent to //. Otherwise, it does not alter things. The $1 etc values are determined by the order of the open parentheses (. The outer pair wrap everything that's matched; the two inner pairs capture the terms. Note that if you had m/((ab*cc)+|(ab*dd)+)/g, the contents of $2 or $3 would be the last of the repeated terms, not the complete set of the repeated terms.
Example 1
$ cat example1.pl
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <>)
{
chomp $line;
print "Line: <<$line>>\n";
while ($line =~ m/((ab*cc)|(ab*dd))/g)
{
printf "\$1 = <<%s>>; \$2 = <<%s>>; \$3 = <<%s>>\n",
$1 // "undef", $2 // "undef", $3 // "undef";
}
}
$ perl example1.pl
abbccabccaccaddabddabbdddabbbdddd
Line: <<abbccabccaccaddabddabbdddabbbdddd>>
$1 = <<abbcc>>; $2 = <<abbcc>>; $3 = <<undef>>
$1 = <<abcc>>; $2 = <<abcc>>; $3 = <<undef>>
$1 = <<acc>>; $2 = <<acc>>; $3 = <<undef>>
$1 = <<add>>; $2 = <<undef>>; $3 = <<add>>
$1 = <<abdd>>; $2 = <<undef>>; $3 = <<abdd>>
$1 = <<abbdd>>; $2 = <<undef>>; $3 = <<abbdd>>
$1 = <<abbbdd>>; $2 = <<undef>>; $3 = <<abbbdd>>
$
Example 2
$ cat example2.pl
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <>)
{
chomp $line;
print "Line: <<$line>>\n";
while ($line =~ m/((ab*cc)+|(ab*dd)+)/g)
{
printf "\$1 = <<%s>>; \$2 = <<%s>>; \$3 = <<%s>>\n",
$1 // "undef", $2 // "undef", $3 // "undef";
}
}
$ perl example2.pl
abbccabccacc
Line: <<abbccabccacc>>
$1 = <<abbccabccacc>>; $2 = <<acc>>; $3 = <<undef>>
$
Capture groups are numbered in the order in which their opening parentheses appear. In an alternation like that, only one of the groups will match. To test which one matched, simply use defined:
while($line =~ /(ab*cc)|(ab*dd)/g)
{
if (defined $1) {
print "first group matched: $1";
} elsif (defined $2) {
print "second group matched: $2";
}
}
If you do not care about distinguishing which group matched, just use a single pair of parentheses around the entire expression
while($line =~ /(ab*cc|ab*dd)/g)
{
print "Will hold whichever matched: $1";
}
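For three or more alternatives, the defined checks can be folded into a loop. The following is a minimal sketch, not taken from the answers above: the third term ab*ee and the helper name which_group are invented for illustration. It relies on a match without /g returning ($1, $2, $3) in list context.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (group number, matched text), or an empty list.
sub which_group {
    my ($text) = @_;
    # In list context a successful match returns ($1, $2, $3);
    # groups that did not take part are undef.
    my @caps = $text =~ /(ab*cc)|(ab*dd)|(ab*ee)/ or return;
    for my $i (0 .. $#caps) {
        return ($i + 1, $caps[$i]) if defined $caps[$i];
    }
    return;
}

my ($n, $match) = which_group('xxabdd');
print "group $n matched: $match\n";
```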

Perl conditional regex extraction

This conditional must match either telco_imac_city or telco_hier_city. When it succeeds I need to extract up to the second underscore of the value that was matched.
I can make it work with this code
if ( ($value =~ /(telco_imac_)city/) || ($value =~ /(telco_hier_)city/) ) {
print "value is: \"$1\"\n";
}
But if possible I would rather use a single regex like this
$value = $ARGV[0];
if ( $value =~ /(telco_imac_)city|(telco_hier_)city/ ) {
print "value is: \"$1\"\n";
}
But if I pass the value telco_hier_city I get this output on testing the second value
Use of uninitialized value $1 in concatenation (.) or string at ./test.pl line 19.
value is: ""
What am I doing wrong?
while (<$input>){
chomp;
print "$1\n" if /(telco_hier|telco_imac)_city/;
}
Perl capture groups are numbered by the position of their opening parenthesis within a single regex, not by which alternative happened to match. Your input, telco_hier_city, matches the second capture group of that regex (/(telco_imac_)city|(telco_hier_)city/), meaning you'd need to use $2:
my $value = $ARGV[0];
if ( $value =~ /(telco_imac_)city|(telco_hier_)city/ ) {
print "value is: \"$2\"\n";
}
Output:
$> ./conditionalIfRegex.pl telco_hier_city
value is: "telco_hier_"
Because there was no match in your first capture group ((telco_imac_)), $1 is uninitialized, as expected.
To fix your original code, use FlyingFrog's regex:
my $value = $ARGV[0];
if ( $value =~ /(telco_hier_|telco_imac_)city/ ) {
print "value is: \"$1\"\n";
}
Output:
$> ./conditionalIfRegex.pl telco_hier_city
value is: "telco_hier_"
$> ./conditionalIfRegex.pl telco_imac_city
value is: "telco_imac_"
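Another option, if you want to keep one capture per alternative, is the branch-reset group (?|...), available since Perl 5.10. It restarts group numbering in each alternative, so whichever branch matches fills $1. A minimal sketch, with telco_prefix as a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured prefix, or undef if nothing matched.
sub telco_prefix {
    my ($value) = @_;
    # (?|...) makes the capture in each alternative group number 1.
    return $value =~ /(?|(telco_imac_)city|(telco_hier_)city)/ ? $1 : undef;
}

for my $value ('telco_hier_city', 'telco_imac_city') {
    print 'value is: "', telco_prefix($value), qq{"\n};
}
```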

Perl regex: how to know number of matches

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (@{$regexs_ref}) {
LINE: for (@rawfile) {
/@$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?
If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my @list = /$my_regex/g;
if (@list)
{
# do stuff
print "Number of matches: " . scalar(@list) . "\n";
}
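One caveat with the list-context approach: if the regex contains capture groups, the list holds every capture from every match rather than one element per match. A small sketch with made-up sample data:

```perl
use strict;
use warnings;

my $text = 'a=1 b=2 c=3';   # made-up sample data

my @no_groups   = $text =~ /\w=\d/g;       # one element per match
my @with_groups = $text =~ /(\w)=(\d)/g;   # one element per capture, per match

# Count matches without keeping them: assign to an empty list, then to a scalar.
my $match_count = () = $text =~ /\w=\d/g;

printf "%d matches, %d captures, count=%d\n",
    scalar(@no_groups), scalar(@with_groups), $match_count;
```

With two groups in the pattern, the captures list is twice as long as the number of matches, so divide by the number of groups or count with a pattern that has no captures.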
You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my @regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } @regexes;
while (my $line = <DATA>) {
for my $regex (@regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (@regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz
In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special @- array (emphasis added):
@LAST_MATCH_START
@-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
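These equivalences can be checked with a short script. This is a sketch only; the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $var = 'xx foobarbaz yy';   # made-up sample string
my ($whole, $first, $second);

if ($var =~ /(foo)(bar)/) {
    $whole  = substr($var, $-[0], $+[0] - $-[0]);   # same text as $&
    $first  = substr($var, $-[1], $+[1] - $-[1]);   # same text as $1
    $second = substr($var, $-[2], $+[2] - $-[2]);   # same text as $2
    print "$whole $first $second\n";
}
```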
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my @patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my @rawfile = <DATA>);
foreach my $pattern (@patterns) {
LINE: for (@rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture
How about the code below:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.