Regex to match a line in a multi-lined string in Perl - regex

I have the following code:
use Capture::Tiny qw(capture);
my $cmd = $SOME_CMD;
my ($stdout, $stderr, $exit_status) = capture { system($cmd); };
unless ($exit_status && $stdout =~ /^Repository:\s+(.*)/) {
my $name = $1;
}
It runs $cmd and tries to parse the output. The output looks like:
Information for package perl-base:
Repository: #System
Name: perl-base
Version: 5.10.0-64.81.13.1
For some reason $name is empty, probably because the capture fails on the multi-line string. I also tried /^Repository:\s+(.*)/s and /^Repository:\s+(.*)$/, but those didn't work either.
I want $name to contain #System. How can I do it?

I believe you want the multiline m flag:
use strict;
use warnings;
my $s = 'Information for package perl-base:
Repository: #System
Name: perl-base
Version: 5.10.0-64.81.13.1';
$s =~ /^Repository:\s+(.*)/m;
print $1; # => #System
You can make the regex stricter by anchoring the end of the line with $ and by using a literal space followed by + instead of \s+, since \s also matches newlines: /^Repository: +(.*)$/m.
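To see what each flag changes, here is a minimal sketch that runs the three variants against the sample output from the question. The helper name first_capture is made up for illustration.

```perl
use strict;
use warnings;

my $stdout = join "\n",
    'Information for package perl-base:',
    'Repository: #System',
    'Name: perl-base',
    'Version: 5.10.0-64.81.13.1';

# Hypothetical helper: return the first capture group, or undef if no match.
sub first_capture {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

my $plain = first_capture($stdout, qr/^Repository:\s+(.*)/);   # ^ matches at start of string only
my $dot_s = first_capture($stdout, qr/^Repository:\s+(.*)/s);  # /s only changes what . matches
my $multi = first_capture($stdout, qr/^Repository:\s+(.*)/m);  # /m lets ^ match at each line start

for my $case (['no flag', $plain], ['/s', $dot_s], ['/m', $multi]) {
    my ($label, $value) = @$case;
    printf "%-8s %s\n", $label, defined $value ? $value : '(no match)';
}
```

Only the /m variant should report a capture, because the Repository line is not the first line of the string.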

$name is empty because it is declared inside a block, which means it is out of scope outside that block. You would know this if you had used use strict, which does not allow you to access undeclared variables.
What you need to do is to declare the variable outside the block:
my $name; # declared outside block
unless ($exit_status && $stdout =~ /^Repository:\s+(.*)/m) {
$name = $1;
}
print "Name is: $name\n"; # accessible outside the block
Also, you need to remove the beginning of line anchor ^, or add the /m modifier.

First, the logic of that unless statement is broken, as it short-circuits on success:
unless ($exit_status && $stdout =~ /^Repository:\s+(.*)/) { ... }
is just a syntactic "convenience" for
if (not ($exit_status && $stdout =~ /^Repository:\s+(.*)/) ) { ... }
So if the command ran successfully and $exit_status is falsey (0 for success) then the &&-ed condition is false right there, and so it short-circuits since it is already decided.† Thus the regex never runs and $1 stays undef.
But it gets worse: if $exit_status were a positive number and the regex matched (quite possible), then the &&-ed condition would be true, and with not the whole if would be false, so its block would not run, even though the command produced valid output (since the regex matched).
So I'd suggest to disentangle those double-negatives, for something like
if ( $exit_status==0 and $stdout =~ /.../m ) { ... } # but see text
Then there must be an elsif ($exit_status) to interrogate further. But a command may return an exit code as it pleases, and some return non-zero merely to communicate specifics even when they ran successfully! So better break that up, to get to see everything, like
if ($exit_status) { ... } # interrogate
if ($stdout =~ /.../m) { ... } # may still have run fine even with exit > 0
The moral here, if I may emphasize, is about dangers of convoluted code, combined logical negatives, meaningful evaluations inside composite conditions, and all that.
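As a rough sketch of that separation, the following checks the exit status and the output independently. The helper name repository_from and the sample text are assumptions for illustration; how a non-zero status should really be handled depends on the command.

```perl
use strict;
use warnings;

# Hypothetical helper: inspect the exit status and the output independently.
# Returns the repository name, or undef when the output has no such line.
sub repository_from {
    my ($exit_status, $stdout) = @_;

    # Report a non-zero status, but do not let it block parsing.
    warn "command exited with status $exit_status\n" if $exit_status;

    # A list-context match returns the captures; /m anchors ^ at each line start.
    my ($name) = $stdout =~ /^Repository:\s+(.*)/m;
    return $name;
}

my $out  = "Information for package perl-base:\nRepository: #System\nName: perl-base\n";
my $name = repository_from(0, $out);
print defined $name ? "Name is: $name\n" : "No Repository line found\n";
```

Declaring $name outside any conditional block also keeps it in scope for the code that follows.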
Next, as mentioned, the regex attempts to match a pattern in a multiline string while it uses the anchor ^ -- which anchors the pattern to the beginning of the whole string, not to a line within, as clearly intended; so it would not match the shown text.
With the /m modifier added, the behavior of the anchor ^ changes so that it matches at the beginning of each line within the string.
† If this gets one's head spinning consider the equivalent
if ( (not $exit_status) or (not $stdout =~ /^Repository:\s+(.*)/) ) { ...
With falsey $exit_status the first (not $exit_status) is true, so the whole if is true right there and the second expression need not be evaluated, and so it isn't (in Perl).
Try it with a one-liner
perl -wE'if ( 0 and do { say "hi" } ) { say "bye" }'
This doesn't print anything; no hi nor bye. With 0 the whole condition is certainly false so the do block isn't evaluated, and the if's block isn't either.
If we change and to or though (or 0 to 1), then the first condition (0) doesn't decide yet and the second condition is evaluated, so hi is printed. That condition is true (printing statements normally return 1) and so bye prints, too.

Related

Best way to deal with "Unescaped braces in regex" inside Perl regex

I recently started learning Perl to automate some mindless data tasks. I work on Windows machines, but prefer to use Cygwin. I wrote a Perl script that did everything I wanted fine in Cygwin, but when I tried to run it with Strawberry Perl on Windows via CMD I got the "Unescaped left brace in regex is illegal here in regex" error.
After some reading, I am guessing my Cygwin has an earlier version of Perl and modern versions of Perl which Strawberry is using don't allow for this. I am familiar with escaping characters in regex, but I am getting this error when using a capture group from a previous regex match to do a substitution.
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
my $fileContents = do { local $/; <$fh> };
my $i = 0;
while ($fileContents =~ /(.*Part[^\}]*\})/) {
$defParts[$i] = $1;
$i = $i + 1;
$fileContents =~ s/$1//;
}
Basically I am searching through a file for matches that look like:
Part
{
Somedata
}
Then storing those matches in an array. Then purging the match from the $fileContents so I avoid repeats.
I am certain there are better and more efficient ways of doing any number of these things, but I am surprised that when using a capture group it's complaining about unescaped characters.
I can imagine storing the capture group, manually escaping the braces, then using that for the substitution, but is there a quicker or more efficient way to avoid this error without rewriting the whole block? (I'd like to avoid special packages if possible so that this script is easily portable.)
All of the answers I found related to this error were with specific cases where it was more straightforward or practical to edit the source with the curly braces.
Thank you!
I would just bypass the whole problem and at the same time simplify the code:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i = $i + 1;
}
Here we simply do the substitution first. If it succeeds, it will still set $1 and return true (just like plain /.../), so there's no need to mess around with s/$1// later.
Using $1 (or any variable) as the pattern would mean you have to escape all regex metacharacters (e.g. *, +, {, (, |, etc.) if you want it to match literally. You can do that pretty easily with quotemeta or inline (s/\Q$1//), but it's still an extra step and thus error prone.
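A small sketch of the \Q approach, using a made-up string in place of the file contents:

```perl
use strict;
use warnings;

my $text = "keep Part { Somedata } tail";

# Capture a chunk that contains braces.
my ($chunk) = $text =~ /(Part[^}]*\})/;

# \Q...\E (like quotemeta) escapes metacharacters, so the captured text
# is matched literally instead of being parsed as a pattern.
(my $stripped = $text) =~ s/\Q$chunk\E//;

print "chunk:    '$chunk'\n";
print "stripped: '$stripped'\n";
```

Without \Q, the braces in $chunk would be interpreted as regex syntax, which is what triggers the error in newer Perls.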
Alternatively, you could keep your original code and not use s///. I mean, you already found the match. Why use s/// to search for it again?
while ($fileContents =~ /(.*Part[^\}]*\})/) {
...
substr($fileContents, $-[0], $+[0] - $-[0], "");
}
We already know where the match is in the string. $-[0] is the position of the start and $+[0] the position of the end of the last regex match (thus $+[0] - $-[0] is the length of the matched string). We can then use substr to replace that chunk by "".
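Filled in as a complete loop, that idea might look like the sketch below. The input string is invented for illustration; the pattern is the one from the question.

```perl
use strict;
use warnings;

my $fileContents = "Part\n{\nA\n}\nPart\n{\nB\n}\n";
my @defParts;

while ($fileContents =~ /(.*Part[^\}]*\})/) {
    push @defParts, $1;
    # $-[0] and $+[0] are the start and end offsets of the whole match;
    # 4-argument substr replaces that span with the empty string.
    substr($fileContents, $-[0], $+[0] - $-[0], "");
}

print scalar(@defParts), " parts extracted\n";
```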
But let's keep going with s///:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i++;
}
$i = $i + 1; can be reduced to $i++; ("increment $i").
my @defParts;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
push @defParts, $1;
}
The only reason we need $i is to add elements to the @defParts array. We can do that by using push, so there's no need for maintaining an extra variable. This saves us another line.
Now we probably don't need to destroy $fileContents. If the substitution exists only for the benefit of this loop (so it doesn't re-match already extracted content), we can do better:
my @defParts;
while ($fileContents =~ /(.*Part[^\}]*\})/g) {
push @defParts, $1;
}
Using /g in scalar context attaches a "current position" to $fileContents, so the next match attempt starts where the previous match left off. This is probably more efficient because it doesn't have to keep rewriting $fileContents.
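The "current position" can be inspected with pos. A minimal sketch, using a made-up single-line string and a simplified pattern:

```perl
use strict;
use warnings;

my $fileContents = "Part {a} x Part {b} y";
my @ends;

while ($fileContents =~ /Part[^}]*\}/g) {
    # pos() reports where the next /g match attempt on this string will start.
    push @ends, pos($fileContents);
}

print "match ends at offsets: @ends\n";
```

Once the match fails and the loop ends, the position is reset, so a later /g match on the same string starts from the beginning again.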
my @defParts = $fileContents =~ /(.*Part[^\}]*\})/g;
... Or we could just use //g in list context, where it returns a list of all captured groups of all matches, and assign that to @defParts.
my @defParts = $fileContents =~ /.*Part[^\}]*\}/g;
If there are no capture groups in the regex, //g in list context returns the list of all matched strings (as if there had been ( ) around the whole regex).
Feel free to choose any of these. :-)
As for the question of escaping, that's what quotemeta is for,
my $needs_escaping = q(some { data } here);
say quotemeta $needs_escaping;
which prints (on v5.16)
some\ \{\ data\ \}\ here
and works on $1 as well. See the quotemeta documentation in perlfunc for details. Also see \Q in perlre (search for \Q), which is how this is used inside a regex, say s/\Q$1//;. The \E stops escaping (which you don't need here).
Some comments.
Relying on deletion so that the regex keeps finding further such patterns may be a risky design. If it isn't and you do use it there is no need for indices, since we have push
my @defParts;
while ($fileContents =~ /($pattern)/) {
push @defParts, $1;
$fileContents =~ s/\Q$1//;
}
where \Q is added in the regex. Better yet, as explained in melpomene's answer the substitution can be done in the while condition itself
push @defParts, $1 while $fileContents =~ s/($pattern)//;
where I used the statement modifier form (postfix syntax) for conciseness.
With the /g modifier in scalar context, as in while (/($pattern)/g) { .. }, the search continues from the position of the previous match in each iteration, and this is a usual way to iterate over all instances of a pattern in a string. Please read up on use of /g in scalar context as there are details in its behavior that one should be aware of.
However, this is tricky here (even as it works) as the string changes underneath the regex. If efficiency is not a concern, you can capture all matches with /g in list context and then remove them
my @all_matches = $fileContents =~ /$patt/g;
$fileContents =~ s/$patt//g;
While inefficient, as it makes two passes, this is much simpler and clearer.
I expect that Somedata cannot possibly, ever, contain }, for instance as nested { ... }, correct? If it does, you have a problem of balanced delimiters, which is far more involved. One approach is to use the core Text::Balanced module. Search for SO posts with examples.
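As a minimal sketch of what Text::Balanced offers for the nested case, extract_bracketed pulls a balanced chunk off the front of a string. The sample text is invented, and the chunk is assumed to start at the opening brace.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "{ outer { inner } more } trailing";

# In list context extract_bracketed returns the balanced chunk and the remainder.
my ($block, $rest) = extract_bracketed($text, '{}');

print "block: $block\n";
print "rest:  $rest\n";
```

A character class such as [^}]* would stop at the first closing brace, which is why nested data needs this kind of parser.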

Perl, Assign regex match to scalar

There's an example snippet in Mail::POP3Client with a piece of syntax that I don't understand; I can't tell why or how it works:
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
The regex bit in particular. $_ remains the same after that line but only the match is printed.
An additional question: how could I assign the match of that regex to a scalar of my own, so I can use it instead of just printing it?
This is actually pretty tricky. It uses Perl's short-circuit evaluation to make a conditional statement. It is the same as saying this:
if (/^(From|Subject):\s+/i) {
print $_;
}
It works because Perl stops evaluating an and expression as soon as its left operand is false. Unless otherwise specified, a regex written as /regex/ rather than $somevar =~ /regex/ is applied to the default variable, $_.
You can store it like this:
my $var;
if (/^(From|Subject):\s+/i) {
$var = $_;
}
Or you could use a capture group:
/^((?:From|Subject):\s+)/i
which stores the matched prefix (the header name, the colon, and the whitespace after it) in $1.
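If the goal is to keep the header values rather than print them, a match in list context returns its capture groups directly. A small sketch, with a made-up @head array standing in for the lines returned by $pop->Head($i):

```perl
use strict;
use warnings;

# Stand-in for the lines returned by $pop->Head($i); the data is made up.
my @head = (
    'From: alice@example.com',
    'To: bob@example.com',
    'Subject: Hello',
);

my %header;
for (@head) {
    # In list context the match returns its capture groups.
    if (my ($field, $value) = /^(From|Subject):\s+(.*)/i) {
        $header{lc $field} = $value;
    }
}

print "$_: $header{$_}\n" for sort keys %header;
```

The if condition is false when the match fails, because the list assignment then copies zero elements.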

Perl regex strange behaviour

Method 1:
$C_HOME = "$ENV{EO_HOME}\\common\\";
print $C_HOME;
gives C:\work\System11R1\common\
i.e., the environment variable is expanded.
Method 2:
Parse properties file having
C_HOME = $ENV{EO_HOME}\common\
while(<IN>) {
if(m/(.*)\s+=\s+(.*)/)
{
$o{$1}=$2;
}
}
$C_HOME = $o{"C_HOME"};
print $C_HOME;
This gives an output of $ENV{EO_HOME}\common\
i.e., the environment variable is not expanded.
How do I make sure that the environment variable gets expanded in the second case as well?
The problem is in the line:
$o{$1}=$2;
Of course, Perl will not evaluate the contents of $2 automatically as it reads them.
If you want, you can evaluate it manually:
$o{$1}=eval($2);
But you must be sure that it is ok from security point of view.
The value of $o{C_HOME} contains the literal string $ENV{EO_HOME}\common\. To get the $ENV value evaluated, use eval:
$C_HOME = eval $o{"C_HOME"};
I leave it to you to find out why that will fail, however...
Expression must be evaluated:
$C_HOME = eval($o{"C_HOME"});
Perl expands variables in double-quote-like code strings, not in data.
You have to eval a string to explicitly interpolate variables inside it, but doing so without checking what you are passing to eval is dangerous.
Instead, look for everything you may want to interpolate inside the string and eval those using a regex substitution with the /ee modifier.
This program looks for all references to elements of the %ENV hash in the config value and replaces them. You may want to add support for whitespace wherever Perl allows it ($ ENV { EO_HOME } compiles just fine). It also assigns test values for %ENV which you will need to remove.
use strict;
use warnings;
my %data;
%ENV = ( EO_HOME => 'C:\work\System11R1' );
while (<DATA>) {
if ( my ($key, $val) = m/ (.*) \s+ = \s* (.*) /x ) {
$val =~ s/ ( \$ENV \{ \w+ \} ) / $1 /gxee;
$data{$key} = $val;
}
}
print $data{C_HOME};
__DATA__
C_HOME = $ENV{EO_HOME}\common\
output
C:\work\System11R1\common\
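If only %ENV references need expanding, a plain hash lookup inside a single /e substitution avoids string eval altogether. This is a different technique from the /ee program above, shown only as a sketch; the helper name expand_env and the test hash are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: expand $ENV{NAME} references by hash lookup, no eval.
sub expand_env {
    my ($value, $env) = @_;
    # Unknown names are left untouched rather than replaced with nothing.
    $value =~ s/ ( \$ENV \{ (\w+) \} ) / exists $env->{$2} ? $env->{$2} : $1 /gxe;
    return $value;
}

# Test values only; pass \%ENV to use the real environment.
my %test_env = ( EO_HOME => 'C:\work\System11R1' );
print expand_env('$ENV{EO_HOME}\common\\', \%test_env), "\n";
```

Because nothing from the config file is ever compiled as Perl code, a malicious or malformed value can at worst stay unexpanded.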

Why does this Perl program fail to print "Success" when the regex match succeeds?

I am still a beginner with Perl, so please bear with me. This is the situation I'm experiencing:
$var = "AB1234567";
$num = "1234567";
next if $var =~ /$num/;
print "Success!\n";
Now, my understanding is that it should print "Success!\n", but in reality it doesn't. However, if I change the regex to next if $var =~ /"$num"/;, this will actually print "Success!\n".
However, if I change it to $var = "AB123456";, the original regex will work fine.
I understand that when enclosing strings using double quotations, it will interpolate the variable. However, should it not be true in the case of regex? I've done regex using variables without quotations and it worked fine.
Thanks for your help!
EDIT:
I left out semi-colons in my example, but my original problem still stands.
EDIT:
I really should've just copied/pasted. I made a typo and used continue if instead of next if. Again, my problem still exists.
I think you are confused on what continue does. I'm not really sure of what you are trying to do here, but generally continue will break out of an if-block or go to the next iteration of a for-loop, while-loop, etc. When the pattern does match, continue is actually skipping over the print statement. When you change the value of $var, it no longer matches and the print statement is reached.
Try this:
$var = "AB1234567";
$num = "1234567";
print "Success!\n" if $var =~ /$num/;
Randy
I don't think so. This works for me:
$var = "AB1234567";
$num = "1234567";
print "Success!\n" if $var =~ /$num/;
It is either something with your continue statement or the fact that you are missing semicolons in the variable assignments (your example doesn't even run for me....).
Let's look at your code:
$var = "AB1234567";
$num = "1234567";
next if $var =~ /$num/;
print "Success!\n";
I am assuming this is in the body of some loop. When you say
next if $var =~ /$num/;
You are saying "go back and execute the loop body for the next iteration, skipping any following statements in the loop body if $var matches /$num/". When $var does indeed match /$num/, the program does exactly what you asked for and skips the print statement.
On the other hand, when you use $var =~ /"$num"/, the pattern on the right hand side becomes /"1234567"/. Now, the conditional in next if $var =~ /"$num"/ cannot succeed. And, your program prints "Success" because the match failed.
#!/usr/bin/perl
use strict; use warnings;
use re 'debug';
my $num = "1234567";
for my $var ( qw( AB1234567 ) ) {
next if $var =~ /$num/;
print "Success!\n";
}
Output:
Compiling REx "1234567"
Final program:
1: EXACT (4)
4: END (0)
anchored "1234567" at 0 (checking anchored isall) minlen 7
Guessing start of match in sv for REx "1234567" against "AB1234567"
Found anchored substr "1234567" at offset 2...
Starting position does not contradict /^/m...
Guessed: match at offset 2
Freeing REx: "1234567"
Now, let's use /"$num"/ as the pattern:
#!/usr/bin/perl
use strict; use warnings;
use re 'debug';
my $num = "1234567";
for my $var ( qw( AB1234567 ) ) {
next if $var =~ /"$num"/;
print "Success!\n";
}
Output:
Compiling REx "%"1234567%""
Final program:
1: EXACT (5)
5: END (0)
anchored "%"1234567%"" at 0 (checking anchored isall) minlen 9
Guessing start of match in sv for REx "%"1234567%"" against "AB1234567"
Did not find anchored substr "%"1234567%""...
Match rejected by optimizer
Success!
Freeing REx: "%"1234567%""
Your code prints Success! when the match fails.
See also perldoc perlsyn:
The next command starts the next iteration of the loop:
LINE: while (<STDIN>) {
next LINE if /^#/; # discard comments
...
}
http://perldoc.perl.org/functions/index.html
C:>perl -e "$x='abc'; $y='bc'; print 'yes' if index($x,$y)>=0;"
yes
C:>perl -e "$x='abc'; $y='bc'; print 'yes' if $x =~ /$y/"
yes
If (as is implied) the code is in a loop, you might be interested in qr{} in http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
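A short sketch of the qr idea: the pattern is built once from $num and then reused for every candidate string. The list of candidates is made up, and \Q is added on the assumption that $num should be matched literally.

```perl
use strict;
use warnings;

my $num = "1234567";

# Build the pattern once; \Q...\E keeps any metacharacters in $num literal.
my $re = qr/\Q$num\E/;

my @matched = grep { $_ =~ $re } qw(AB1234567 AB123456 XY1234567Z);
print "matched: @matched\n";
```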
Your code says "If this variable matches this regular expression, do nothing. Otherwise, print Success!.". I doubt that's what you meant for it to do.
Adding the quotation marks does not affect whether the variable is interpolated at all. What it does do is cause the regular expression match to fail. Since the match fails, the next[1] is not executing and execution reaches the print statement and Success! is printed.
The problem with your code is not at all related to regular expressions or to variable interpolation into regular expressions or to quotation marks appearing inside regular expressions. It is a simple logic error of an inverted conditional.
Try replacing the third line with
next unless $var =~ /$num/;
to see what happens when you simply invert the logic of your code to what you probably intended.
[1]: or continue, depending on which version of your code one is looking at

How can I find out what was replaced in a Perl substitution?

Is there any way to find out what was substituted for (the "old" text) after applying the s/// operator? I tried doing:
if (s/(\w+)/new/) {
my $oldTxt = $1;
# ...
}
But that doesn't work. $1 is undefined.
Your code works for me. Copied and pasted from a real terminal window:
$ perl -le '$_ = "*X*"; if (s/(\w+)/new/) { print $1 }'
X
Your problem must be something else.
If you're using 5.10 or later, you don't have to use the potentially performance-killing $&. The ${^MATCH} variable from the /p flag does the same thing but only for the specified regex:
use 5.010;
if( s/abc(\w+)123/new/p ) {
say "I replaced ${^MATCH}"
}
$& does what you want, but see the health warning in perlvar:
The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches.
If you can find a way to do this without using $&, try that. You could run the regex twice:
my ($match) = /(\w+)/;
if (s/(\w+)/new/) {
my $oldTxt = $match;
# ...
}
You could make the replacement an eval expression:
if (s/(\w+)/$var=$1; "new"/e) { .. do something with $var .. }
You should be able to use the Perl match variables:
$& Contains the string matched by the last pattern match