perl regex matching failed - regex

I want to match two different strings, with the results captured in $1 and $2.
In this example, if $a is 'xy abc', I expect $1 to be 'xy abc' and $2 to be 'abc', but the 'abc' part ends up in $3.
Can you please help me write a regex in which $1 holds the whole string and $2 holds the second part?
I am using Perl 5.8.5.
my @data=('abc xy','xy abc');
foreach my $a (@data) {
print "\nPattern= $a\n";
if($a=~/(abc (xy)|xy (abc))/) {
print "\nMatch: \$1>$1< \$2>$2< \$3>$3<\n";
}
}
Output:
perl test_reg.pl
Pattern= abc xy
Match: $1>abc xy< $2>xy< $3><
Pattern= xy abc
Match: $1>xy abc< $2>< $3>abc<

This can be done with a branch reset group, (?|...), which requires Perl 5.10 or later:
(?|(abc (xy))|(xy (abc)))
Why even bother with capturing the whole thing? You can use $& for that.
my @data = ('abc xy', 'xy abc');
for (@data) {
print "String: '$_'\n";
if(/(?|abc (xy)|xy (abc))/) {
print "Match: \$&='$&', \$1='$1'\n";
}
}
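
A minimal sketch of why the branch reset group helps, assuming Perl 5.10 or later (so it would not run on the asker's 5.8.5). The helper names plain_captures and reset_captures are made up for illustration; each returns the list of captures that a list-context match produces.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the capture list for one string.
# Without branch reset, groups are numbered 1..3 across both alternatives.
sub plain_captures { my ($s) = @_; return $s =~ /(abc (xy)|xy (abc))/ }
# With (?|...), numbering restarts in each alternative, so only $1 and $2 exist.
sub reset_captures { my ($s) = @_; return $s =~ /(?|(abc (xy))|(xy (abc)))/ }

for my $s ('abc xy', 'xy abc') {
    my @plain = map { defined $_ ? $_ : 'undef' } plain_captures($s);
    my @reset = map { defined $_ ? $_ : 'undef' } reset_captures($s);
    print "$s: plain=[@plain] reset=[@reset]\n";
}
```

For 'xy abc' the plain pattern leaves the second capture undefined and puts 'abc' in the third, while the branch reset version yields exactly two captures.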

Because only one of captures $2 and $3 can be defined, you can write
foreach my $item (@data) {
print "\nPattern= $item\n";
if ($item=~/(abc (xy)|xy (abc))/) {
printf "Match: whole>%s< part>%s<\n", $1, $2 || $3;
}
}
which gives the output
Pattern= abc xy
Match: whole>abc xy< part>xy<
Pattern= xy abc
Match: whole>xy abc< part>abc<

If you can live with allowing more capture variables than $1 and $2, then use the substrings from the branch of the alternative that matched.
for ('abc xy', 'xy abc') {
print "[$_]:\n";
if (/(abc (xy))|(xy (abc))/) {
print " - match: ", defined $1 ? "1: [$1], 2: [$2]\n"
: "1: [$3], 2: [$4]\n";
}
else {
print " - no match\n";
}
}
Output:
[abc xy]:
- match: 1: [abc xy], 2: [xy]
[xy abc]:
- match: 1: [xy abc], 2: [abc]


perl regex matching, why is it not finding all matches, why is the order important?

I ran into a problem with Perl's regex matching. I distilled it down to a small example on the command line. Why is the order in which the matches are attempted important here?
1.
$ echo "XYG" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
2.
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } else { print "No match on G\n"; } '
Matches X
Matches Y
No match on G
The first example matches all three letters as expected, but the second example does not match the letter G. Why?
However if I create an intermediate variable, here named $aa:
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; $aa = $_; if ($aa =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
Then the match works again ?
My perl version is:
$ perl -e 'print "$]\n";'
5.022001
On a LM 18.2 machine
$ lsb_release -d
Description: Linux Mint 18.2 Sonya
Ty+BR
Max.
Because if you match a regex in a scalar context like that, and you set the g flag (for global matching) it's iterative - that's to allow you to do things like while ( m/somepattern/g ) { and have it trigger multiple times.
That's because g means:
g - globally match the pattern repeatedly in the string
It would not be particularly useful if it reset each time you tried it. But you can also use it slightly differently, in list context:
my @matches = $str =~ m/(some_capture)/g;
And that'll select them all into a list.
But with your code and regex debugging:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
$_ = 'GXY';
if ( $_ =~ m/X/gi ) { print "Matches X\n"; }
if ( $_ =~ m/Y/gi ) { print "Matches Y\n"; }
if ( $_ =~ m/G/gi ) { print "Matches G\n"; }
else { print "No match on G\n"; }
You'll get (snipped for brevity):
Matching REx "X" against "GXY"
Matching REx "Y" against "Y"
Matching REx "G" against ""
The first match 'eats' "GX" to find "X", leaving "Y" for the next match, but nothing at all for the "G" match.
The simple workaround is to omit the g flag, because then you're saying explicitly 'match once' and you'll get:
Matches X
Matches Y
Matches G
Alternatively, you can use the global match with a character class:
$_ = 'GXY';
my @matches = m/([GYX])/g; # implicitly operates on $_
print "Match on $_\n" for @matches;
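
The position that a scalar-context /g match remembers can be inspected with pos(). The following is a minimal sketch of the behaviour described above (the variable names are only illustrative): each successful match advances pos(), and a failed /g match resets it, so retrying the same pattern then succeeds.

```perl
use strict;
use warnings;

my $str = 'GXY';
my @log;

# Scalar-context /g matches start where the previous one stopped.
push @log, ($str =~ /X/g ? "X, pos now " . pos($str) : "no X");
push @log, ($str =~ /Y/g ? "Y, pos now " . pos($str) : "no Y");
push @log, ($str =~ /G/g ? "G, pos now " . pos($str) : "no G");

# The failed match above reset pos(), so this attempt starts from offset 0.
push @log, ($str =~ /G/g ? "G, pos now " . pos($str) : "no G");

print "$_\n" for @log;
```

Assigning pos($str) = undef (or 0) before the third match would be another way to restart the search from the beginning of the string.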

grep an expression in an array of strings with Perl

I'm trying to grep a special pattern in an array of strings.
My array of strings is like this :
@list=("First phrase with \"blabla - Word\" and other things "
,"A phrase without... "
,"Second phrase with \"truc - Word\" and etc... "
,"Another phrase without... "
,"Another phrase with \"thing - Word\" and etc... ");
and I tried to grep the pattern "..... - Word" with this function :
@keyw = grep { /\".* Word\"/ } (@list);
and I have the following result :
print(Dumper(@keyw));
$VAR1 = 'First phrase with "blabla - Word" and other things ';
$VAR2 = 'Second phrase with "truc - Word" and etc... ';
$VAR3 = 'Another phrase with "thing - Word" and etc... ';
My grep function is ok to grep the phrase but I would like to grep just the pattern and get the following result :
$VAR1 = '"blabla - Word"';
$VAR2 = '"truc - Word"';
$VAR3 = '"thing - Word"';
Do you know how to reach this result ?
Use map instead of grep:
my @keyw = map { /\"[^"]*? Word\"/ ? $& : () } @list;
It just returns $& (whole match) if the pattern matches and () (empty list) if it does not.
Little caveat: don't use $& with Perl < 5.18.0, where its mere presence slows down every regex match in the program.
Here's Casimir et Hippolyte's simpler solution:
my @keyw = map { /(\"[^"]*? Word\")/ } @list;
It works since m// returns a list of captured groups.
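
A small sketch of that list-context behaviour, using made-up sample strings: a successful match returns its captures, and a failed match returns the empty list, so map silently drops the non-matching elements.

```perl
use strict;
use warnings;

# Illustrative data, not the asker's list.
my @list = ('say "foo - Word" now', 'nothing here', 'and "bar - Word" too');

# Inside map the match is in list context: it returns ($1) or ().
my @keyw = map { /("[^"]* Word")/ } @list;

print "$_\n" for @keyw;
```

Note that the capturing parentheses matter: a list-context match with no capture groups returns the list (1) on success, which would fill the result with 1s instead of the matched text.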
Why can't we just use a normal loop?
use strict;
use warnings;
use Data::Dump;
my @phrases = (
"First phrase with \"blabla - Word\" and other things ",
"A phrase without... ",
"Second phrase with \"truc - Word\" and etc... ",
"Another phrase without... ",
"Another phrase with \"thing - Word\" and etc... ",
);
my @matches;
for (@phrases) {
next unless /("[^"]+Word")/;
push(@matches, $1);
}
# prints ["\"blabla - Word\"", "\"truc - Word\"", "\"thing - Word\""]
dd(\@matches);

Perl regex matches string differently when providing it as variable

I've got a question regarding regex matching in Perl. See following code snippet:
my $r = '^\z';
my $s = "";
$r =~ /$s/ ? print "Match\n" : print "No match\n";
Output: Match
The following snippet, however, prints:
my $r = '^\z';
$r =~ /""/ ? print "Match\n" : print "No match\n";
Output: No match
Why? Is this a Perl syntax thing I do not understand?
my $s = "";
is the same as
my $s = '';
or
my $s = q();
i.e. it assigns an empty string to $s.
In
$r =~ /""/
you're testing whether $r contains two double quotes.
To assign a pair of double quotes to $s, use
my $s = '""';
my $r = '^\z';
$r =~ /""/ ? print "Match\n" : print "No match\n";
Here, $r is the test string. /""/ is the regex.
There is no "" in $r, so it does not match.
Here /""/ is a pattern delimited by / characters; everything between the delimiters, including the two double quotes, is part of the regex.
It is equivalent to:
$str = '""';
$rx = qr/$str/;
The $s variable interpolates to an empty pattern. You can check this yourself: add use re 'debug'; to the code, or add -Mre=debug to the perl invocation:
first example
$ perl -Mre=debug -E '$r=q{^\z}; $s=""; $r =~ /$s/ ? say "Match" : say "No match"'
Compiling REx ""
Final program:
1: NOTHING (2)
2: END (0)
minlen 0
Matching REx "" against "^\z"
0 <> <^\z> | 1:NOTHING(2)
0 <> <^\z> | 2:END(0)
Match successful!
Match
Freeing REx: ""
second example
$ perl -Mre=debug -E '$r=q{^\z}; $r =~ /""/ ? say "Match" : say "No match"'
Compiling REx "%"%""
Final program:
1: EXACT <""> (3)
3: END (0)
anchored "%"%"" at 0 (checking anchored isall) minlen 2
Guessing start of match in sv for REx "%"%"" against "^\z"
Did not find anchored substr "%"%""...
Match rejected by optimizer
No match
Freeing REx: "%"%""
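
The difference can also be checked without the debug output. This is a minimal sketch with illustrative variable names; it assumes no earlier successful match in the program, since Perl reuses the last successfully matched pattern when a pattern evaluates to the empty string.

```perl
use strict;
use warnings;

my $r      = '^\z';
my $empty  = "";      # empty string: the double quotes are Perl syntax
my $quotes = '""';    # two literal double-quote characters

# An empty pattern matches any string (no earlier successful match here).
my $empty_hit  = $r =~ /$empty/  ? 1 : 0;
# Two literal quotes do not occur in '^\z', so this fails, just like /""/.
my $quotes_hit = $r =~ /$quotes/ ? 1 : 0;
# The same pattern succeeds on a string that really contains "".
my $quoted_str = 'a "" b' =~ /$quotes/ ? 1 : 0;

print "empty: $empty_hit, quotes: $quotes_hit, quoted string: $quoted_str\n";
```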

$1 and multiple groupings

If I have a regular expression and if I am expecting one or the other terms to match, say
ab*cc or ab*dd
I have a RegEx like
while($line =~ /(ab*cc)|(ab*dd)/g)
{
# print match whether its abcc or abdd
# print $1?
}
But I am unsure how $1 will work.
Is there such a thing as $2 meaning that it will be $1 if it matches abcc or $2 if it matches abdd? How can I extend this if I have say 3 groupings or so, meaning it could either be X or Y or Z?
You could use:
while ($line =~ m/((ab*cc)|(ab*dd))/g)
Now $1 will be whichever of the two terms matches, and $2 will be the same if the first term matches but undefined otherwise, while $3 will be the same if the second term matches but undefined otherwise. The extension to three or more terms should be obvious.
The m// notation is a marginally more explicit notation equivalent to //. Otherwise, it does not alter things. The $1 etc. values are determined by the order of the open parentheses. The outer pair wraps everything that's matched; the two inner pairs capture the terms. Note that if you had m/((ab*cc)+|(ab*dd)+)/g, the contents of $2 or $3 would be the last of the repeated terms, not the complete set of the repeated terms.
Example 1
$ cat example1.pl
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <>)
{
chomp $line;
print "Line: <<$line>>\n";
while ($line =~ m/((ab*cc)|(ab*dd))/g)
{
printf "\$1 = <<%s>>; \$2 = <<%s>>; \$3 = <<%s>>\n",
$1 // "undef", $2 // "undef", $3 // "undef";
}
}
$ perl example1.pl
abbccabccaccaddabddabbdddabbbdddd
Line: <<abbccabccaccaddabddabbdddabbbdddd>>
$1 = <<abbcc>>; $2 = <<abbcc>>; $3 = <<undef>>
$1 = <<abcc>>; $2 = <<abcc>>; $3 = <<undef>>
$1 = <<acc>>; $2 = <<acc>>; $3 = <<undef>>
$1 = <<add>>; $2 = <<undef>>; $3 = <<add>>
$1 = <<abdd>>; $2 = <<undef>>; $3 = <<abdd>>
$1 = <<abbdd>>; $2 = <<undef>>; $3 = <<abbdd>>
$1 = <<abbbdd>>; $2 = <<undef>>; $3 = <<abbbdd>>
$
Example 2
$ cat example2.pl
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <>)
{
chomp $line;
print "Line: <<$line>>\n";
while ($line =~ m/((ab*cc)+|(ab*dd)+)/g)
{
printf "\$1 = <<%s>>; \$2 = <<%s>>; \$3 = <<%s>>\n",
$1 // "undef", $2 // "undef", $3 // "undef";
}
}
$ perl example2.pl
abbccabccacc
Line: <<abbccabccacc>>
$1 = <<abbccabccacc>>; $2 = <<acc>>; $3 = <<undef>>
$
Capturing parentheses are numbered in the order of their opening parenthesis. In an or'ed group like that, either one or the other will match. To test which matched, simply use defined:
while($line =~ /(ab*cc)|(ab*dd)/g)
{
if (defined $1) {
print "first group matched: $1";
} elsif (defined $2) {
print "second group matched: $2";
}
}
If you do not care about distinguishing which group matched, just use a single pair of parentheses around the entire expression:
while($line =~ /(ab*cc|ab*dd)/g)
{
print "Will hold whichever matched: $1";
}
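
For three or more alternatives where each term keeps its own group, one option is the special variable $+, which holds the text of the highest-numbered capture group that has a defined value. A minimal sketch, with a third term ab*ee and sample input invented for illustration:

```perl
use strict;
use warnings;

# Illustrative input containing one instance of each term.
my $line = 'abbcc abdd abee';
my @found;

# Only one of $1, $2, $3 is defined per match; $+ picks whichever it is.
while ($line =~ /(ab*cc)|(ab*dd)|(ab*ee)/g) {
    push @found, $+;
}

print join(',', @found), "\n";
```

This only works as a shortcut because the alternatives contain no nested groups; with nested captures, $+ would refer to the highest-numbered defined group, which may not be the outer one.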

Perl conditional regex extraction

This conditional must match either telco_imac_city or telco_hier_city. When it succeeds I need to extract up to the second underscore of the value that was matched.
I can make it work with this code
if ( ($value =~ /(telco_imac_)city/) || ($value =~ /(telco_hier_)city/) ) {
print "value is: \"$1\"\n";
}
But if possible I would rather use a single regex like this
$value = $ARGV[0];
if ( $value =~ /(telco_imac_)city|(telco_hier_)city/ ) {
print "value is: \"$1\"\n";
}
But if I pass the value telco_hier_city I get this output on testing the second value
Use of uninitialized value $1 in concatenation (.) or string at ./test.pl line 19.
value is: ""
What am I doing wrong?
while (<$input>){
chomp;
print "$1\n" if /(telco_hier|telco_imac)_city/;
}
Perl capture groups are numbered by the position of their opening parenthesis in the pattern, regardless of which alternative matches. Your input, telco_hier_city, matches the second capture group of that single regex (/(telco_imac_)city|(telco_hier_)city/), meaning you'd need to use $2:
my $value = $ARGV[0];
if ( $value =~ /(telco_imac_)city|(telco_hier_)city/ ) {
print "value is: \"$2\"\n";
}
Output:
$> ./conditionalIfRegex.pl telco_hier_city
value is: "telco_hier_"
Because there was no match in your first capture group ((telco_imac_)), $1 is uninitialized, as expected.
To fix your original code, use FlyingFrog's regex:
my $value = $ARGV[0];
if ( $value =~ /(telco_hier_|telco_imac_)city/ ) {
print "value is: \"$1\"\n";
}
Output:
$> ./conditionalIfRegex.pl telco_hier_city
value is: "telco_hier_"
$> ./conditionalIfRegex.pl telco_imac_city
value is: "telco_imac_"