Perl, Assign regex match to scalar - regex

There's an example snippet in Mail::POP3Client containing a piece of syntax that I don't understand; I can't see why or how it works:
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
The regex bit in particular. $_ remains the same after that line but only the match is printed.
An additional question: how could I assign the match of that regex to a scalar of my own, so I can use it instead of just printing it?

This is actually pretty tricky. What it's doing is making use of Perl's short-circuiting feature to make a conditional statement. It is the same as saying this:
if (/^(From|Subject):\s+/i) {
print $_;
}
It works because Perl stops evaluating an and expression as soon as its left operand is false, so the print only runs when the match succeeds. Unless otherwise specified, a regex written as /regex/ rather than $somevar =~ /regex/ is applied to the default variable, $_.
You can store it like this:
my $var;
if (/^(From|Subject):\s+/i) {
$var = $_;
}
or you could use a capture group
/^((?:From|Subject):\s+)/i
which stores the whole matched text (for example "From: ", not the rest of the line) in $1.
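To pull the matched pieces into scalars of your own, a list assignment from the match is one option. This is only a minimal sketch: the @head array is made-up sample data standing in for what $pop->Head($i) would return, and the second capture group (.*) is an addition that is not in the original snippet.

```perl
use strict;
use warnings;

# Hypothetical header lines standing in for $pop->Head($i).
my @head = (
    'Received: from example.org',
    'From: alice@example.org',
    'Subject: Hello there',
);

my %field;
foreach (@head) {
    # In list context the match returns its capture groups;
    # on failure it returns an empty list, so the condition is false.
    if ( my ($name, $value) = /^(From|Subject):\s+(.*)/i ) {
        $field{ lc $name } = $value;
    }
}

print "$field{from}\n";
print "$field{subject}\n";
```

Because the captures are copied into $name and $value straight away, a later regex cannot overwrite them the way it can overwrite $1.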

Related

Best way to deal with "Unescaped braces in regex" inside Perl regex

I recently started learning Perl to automate some mindless data tasks. I work on windows machines, but prefer to use Cygwin. Wrote a Perl script that did everything I wanted fine in Cygwin, but when I tried to run it with Strawberry Perl on Windows via CMD I got the "Unescaped left brace in regex is illegal here in regex," error.
After some reading, I am guessing my Cygwin has an earlier version of Perl and modern versions of Perl which Strawberry is using don't allow for this. I am familiar with escaping characters in regex, but I am getting this error when using a capture group from a previous regex match to do a substitution.
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
my $fileContents = do { local $/; <$fh> };
my $i = 0;
while ($fileContents =~ /(.*Part[^\}]*\})/) {
$defParts[$i] = $1;
$i = $i + 1;
$fileContents =~ s/$1//;
}
Basically I am searching through a file for matches that look like:
Part
{
Somedata
}
Then storing those matches in an array. Then purging the match from the $fileContents so I avoid repeats.
I am certain there are better and more efficient ways of doing any number of these things, but I am surprised that when using a capture group it's complaining about unescaped characters.
I can imagine storing the capture group, manually escaping the braces, then using that for the substitution, but is there a quicker or more efficient way to avoid this error without rewriting the whole block? (I'd like to avoid special packages if possible so that this script is easily portable.)
All of the answers I found related to this error were with specific cases where it was more straightforward or practical to edit the source with the curly braces.
Thank you!
I would just bypass the whole problem and at the same time simplify the code:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i = $i + 1;
}
Here we simply do the substitution first. If it succeeds, it will still set $1 and return true (just like plain /.../), so there's no need to mess around with s/$1// later.
Using $1 (or any variable) as the pattern would mean you have to escape all regex metacharacters (e.g. *, +, {, (, |, etc.) if you want it to match literally. You can do that pretty easily with quotemeta or inline (s/\Q$1//), but it's still an extra step and thus error prone.
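As a small illustration of that escaping step, the sketch below uses made-up sample strings ($text and $found are hypothetical names) to show both quotemeta and the inline \Q...\E form.

```perl
use strict;
use warnings;

# Made-up sample data: $found plays the role of a previously captured $1.
my $text  = "keep Part { a } tail";
my $found = "Part { a }";

# quotemeta backslash-escapes every non-word character.
my $escaped = quotemeta $found;

# \Q...\E applies the same escaping inline, so the braces match literally.
(my $copy = $text) =~ s/\Q$found\E//;

print "$escaped\n";
print "$copy\n";
```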
Alternatively, you could keep your original code and not use s///. I mean, you already found the match. Why use s/// to search for it again?
while ($fileContents =~ /(.*Part[^\}]*\})/) {
...
substr($fileContents, $-[0], $+[0] - $-[0], "");
}
We already know where the match is in the string. $-[0] is the position of the start and $+[0] the position of the end of the last regex match (thus $+[0] - $-[0] is the length of the matched string). We can then use substr to replace that chunk by "".
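Here is a self-contained sketch of the substr approach on a made-up string. The sample text and the variable names $s and $removed are assumptions for illustration only.

```perl
use strict;
use warnings;

my $s = "aaa Part { x } bbb";   # made-up sample input
my $removed;

if ($s =~ /Part[^}]*\}/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    # Four-argument substr replaces that range and returns what was there.
    $removed = substr($s, $-[0], $+[0] - $-[0], "");
}

print "removed: $removed\n";
print "left:    $s\n";
```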
But let's keep going with s///:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i++;
}
$i = $i + 1; can be reduced to $i++; ("increment $i").
my @defParts;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
push @defParts, $1;
}
The only reason we need $i is to add elements to the @defParts array. We can do that by using push, so there's no need for maintaining an extra variable. This saves us another line.
Now we probably don't need to destroy $fileContents. If the substitution exists only for the benefit of this loop (so it doesn't re-match already extracted content), we can do better:
my @defParts;
while ($fileContents =~ /(.*Part[^\}]*\})/g) {
push @defParts, $1;
}
Using /g in scalar context attaches a "current position" to $fileContents, so the next match attempt starts where the previous match left off. This is probably more efficient because it doesn't have to keep rewriting $fileContents.
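The "current position" can be inspected with pos(). The following sketch uses an unrelated toy string and pattern, chosen only to make the positions easy to follow.

```perl
use strict;
use warnings;

my $s = "a1b22c333";   # toy input, not the file contents from the question
my @seen;

# Each iteration resumes where the previous match ended.
while ($s =~ /(\d+)/g) {
    push @seen, "$1 ends at " . pos($s);
}

print "$_\n" for @seen;

# After the final failed attempt the position is reset (pos is undef).
print defined pos($s) ? "pos still set\n" : "pos reset\n";
```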
my @defParts = $fileContents =~ /(.*Part[^\}]*\})/g;
... Or we could just use //g in list context, where it returns a list of all captured groups of all matches, and assign that to @defParts.
my @defParts = $fileContents =~ /.*Part[^\}]*\}/g;
If there are no capture groups in the regex, //g in list context returns the list of all matched strings (as if there had been ( ) around the whole regex).
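To see the list-context form end to end, here is a sketch with a small in-memory string in place of the file. The sample contents are made up to mimic the Part { ... } layout from the question.

```perl
use strict;
use warnings;

# Made-up stand-in for the slurped file.
my $fileContents = "Part\n{\nA\n}\nPart\n{\nB\n}\n";

# No capture groups, so //g in list context returns every whole match.
my @defParts = $fileContents =~ /.*Part[^\}]*\}/g;

print scalar(@defParts), " parts found\n";
```

$fileContents is left untouched, which is handy if the rest of the script still needs it.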
Feel free to choose any of these. :-)
As for the question of escaping, that's what quotemeta is for,
my $needs_escaping = q(some { data } here);
say quotemeta $needs_escaping;
which prints (on v5.16)
some\ \{\ data\ \}\ here
and works on $1 as well. See linked docs for details. Also see \Q in perlre (search for \Q), which is how this is used inside a regex, say s/\Q$1//;. The \E stops the escaping (which you don't need here).
Some comments.
Relying on deletion so that the regex keeps finding further such patterns may be a risky design. If it isn't, and you do use it, there is no need for indices, since we have push
my @defParts;
while ($fileContents =~ /($pattern)/) {
push @defParts, $1;
$fileContents =~ s/\Q$1//;
}
where \Q is added in the regex. Better yet, as explained in melpomene's answer the substitution can be done in the while condition itself
push @defParts, $1 while $fileContents =~ s/($pattern)//;
where I used the statement modifier form (postfix syntax) for conciseness.
With the /g modifier in scalar context, as in while (/($pattern)/g) { .. }, the search continues from the position of the previous match in each iteration, and this is a usual way to iterate over all instances of a pattern in a string. Please read up on use of /g in scalar context as there are details in its behavior that one should be aware of.
However, this is tricky here (even as it works) as the string changes underneath the regex. If efficiency is not a concern, you can capture all matches with /g in list context and then remove them
my @all_matches = $fileContents =~ /$pattern/g;
$fileContents =~ s/$pattern//g;
While inefficient, as it makes two passes, this is much simpler and clearer.
I expect that Somedata cannot possibly, ever, contain }, for instance as nested { ... }, correct? If it does, you have a problem of balanced delimiters, which is far more involved. One approach is to use the core Text::Balanced module. Search for SO posts with examples.

extract string between two dots

I have a string of the following format:
word1.word2.word3
What are the ways to extract word2 from that string in perl?
I tried the following expression but it assigns 1 to sub:
#perleval $vars{sub} = $vars{string} =~ /.(.*)./; 0#
EDIT:
I have tried several suggestions, but still get the value of 1. I suspect that the entire expression above has a problem in addition to parsing. However, when I do simple assignment, I get the correct result:
#perleval $vars{sub} = $vars{string} ; 0#
assigns word1.word2.word3 to variable sub
. has a special meaning in regular expressions, so it needs to be escaped.
.* could match more than intended. [^.]* is safer.
The match operator (//) simply returns true/false in scalar context.
You can use any of the following:
$vars{sub} = $vars{string} =~ /\.([^.]*)\./ ? $1 : undef;
$vars{sub} = ( $vars{string} =~ /\.([^.]*)\./ )[0];
( $vars{sub} ) = $vars{string} =~ /\.([^.]*)\./;
The first one allows you to provide a default if there's no match.
Try:
/\.([^\.]+)\./
. has a special meaning and would need to be escaped. Then you would want to capture the values between the dots, so use a negated character class like ([^\.]+), meaning at least one non-dot. If you use (.*) instead, then for the input
word1.stuff1.stuff2.stuff3.word2
you will get:
stuff1.stuff2.stuff3
But maybe you want that?
Here is my little example, I do find the perl one liners a little harder to read at times so I break it out:
use strict;
use warnings;
if ("stuff1.stuff2.stuff3" =~ m/\.([^.]+)\./) {
my $value = $1;
print $value;
}
else {
print "no match";
}
result
stuff2
. has a special meaning: any character (see the expression between your parentheses)
Therefore you have to escape it (\.) if you search a literal dot:
/\.(.*)\./
You've got to make sure you're asking for a list when you do the search.
my $x= $string =~ /look for (pattern)/ ;
sets $x to 1
my ($x)= $string =~ /look for (pattern)/ ;
sets $x to pattern.
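The difference between the two contexts can be checked with a short script. This is a sketch using the word1.word2.word3 string from the question; the variable names and the 'n/a' fallback are illustrative choices.

```perl
use strict;
use warnings;

my $string = 'word1.word2.word3';

# Scalar context: the match only reports success, hence the 1.
my $flag = $string =~ /\.([^.]*)\./;

# List context (note the parentheses): the match returns its captures.
my ($word) = $string =~ /\.([^.]*)\./;

# The ternary form lets you supply a fallback when nothing matches.
my $fallback = 'nodots' =~ /\.([^.]*)\./ ? $1 : 'n/a';

print "flag=$flag word=$word fallback=$fallback\n";
```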

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parentheses around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing the null_ prefix with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my @text;
if ($text =~ /,/) {
@text = split /,(?:null_)?/, $text;
}
else {
@text = $text;
}
for (@text) {
print "1 is : $_\n";
push (@matching_lines, $_);
}
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
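One possible way to handle that leftover line is sketched below. It runs on a few sample lines copied from the question (joined into $output rather than read from a command), strips the null_ prefix after splitting, and simply skips a bare null value. Skipping it is an assumption about what is wanted here, based on the desired output in the question.

```perl
use strict;
use warnings;

# A few lines from the question, standing in for the real command output.
my $output = join "\n",
    'CMD_ID=null',
    'CMD_USES_ASSET_ENV=null_jdk1.7.0_80',
    ' CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80',
    'CMD_ID=msdotnet_VS2013_x64',
    'CMD_FILE=path/to/abcI.vc12.sln';

my $cmd_pattern = 'CMD_ID=|CMD_USES_ASSET_ENV=';
my @matching_lines;

while ($output =~ /^\s*(?:$cmd_pattern)(.+)/gm) {
    for my $item (split /,/, $1) {
        $item =~ s/^null_//;          # drop the optional prefix
        next if $item eq 'null';      # assumption: a bare "null" is not wanted
        push @matching_lines, $item;
    }
}

print "$_\n" for @matching_lines;
```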
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)

Global matching in regex in variable

I have a perl script that contains a few regexes in variables, such as this:
my $velar_velar = qr/([a-zA-Z']*(?:[^n\s]g|[^n\s]k))\s+((?:g|k|c[^ieyh])[a-zA-Z']*)/;
Later, I use these in an if (and elsif) statements, but I want this regex to be able to match more than once per line:
$text = "tack go pack go";
if ($text =~ /$velar_velar/g) {
print "Yes";
}
Where it would print "Yes" twice. I have tried the code I have here but it doesn't seem to work. I've also tried putting /g at the end of the regex variable but that does not work either.
How do I get my regex to match more than once when it is a variable?
Change if ($text =~ /$velar_velar/g) { to while ($text =~ /$velar_velar/g) {.
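A runnable sketch of that change, using the regex and sample text from the question. The @pairs array is an addition here, used only to show what each iteration captured.

```perl
use strict;
use warnings;

my $velar_velar = qr/([a-zA-Z']*(?:[^n\s]g|[^n\s]k))\s+((?:g|k|c[^ieyh])[a-zA-Z']*)/;
my $text = "tack go pack go";

my @pairs;
# With while, /g in scalar context finds the next match on every pass.
while ($text =~ /$velar_velar/g) {
    print "Yes\n";
    push @pairs, "$1 $2";
}

print scalar(@pairs), " matches\n";
```

The /g belongs on the match operator rather than on qr//, since qr// only compiles the pattern and does not perform a match.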

Why can't I store a regexp in a variable?

Given the following code,
my $string = "foo";
my $regex = s/foo/bar/;
$string =~ $regex;
print $string, "\n";
I would have expected the output to be bar, however it is foo. Why is that the case, and how can I solve that problem?
Note that in my actual case, the regex is more complicated, and I actually want to store several of them in a hash (so I can write something like $string =~ $rules{$key}).
You're looking for a substitution, not just the regex part, so I guess a compiled regex (qr//) is not what you need. You can wrap the substitution in a subroutine instead:
use strict;
use warnings;
my $string = "foo";
my $regex = sub { $_[0] =~ s/foo/bar/ };
$regex->($string);
print $string, "\n";
Your statement
my $regex = s/foo/bar/
is equivalent to
my $regex = $_ =~ s/foo/bar/
s/// returns the number of substitutions made, or it returns false (specifically, the empty string). So $regex is now '' or 1 (it could be more if the /g modifier was in effect) and
$string =~ $regex
is doing 'foo' =~ // or 'foo' =~ /1/ depending on what $_ contained originally.
You can store a regex pattern in a variable but, in your example, the regex is just foo, and there is a lot more going on than just that pattern
The statement s/foo/bar/ is more complex than it seems -- it is a fully-fledged statement that applies a regex pattern to a target string and substitutes a replacement string if the pattern is found. In this case the target string is the default variable $_ and the replacement string is bar. You could think of it as a call to a subroutine
substitute($_, 'foo', 'bar')
and the regex pattern is only the second parameter
What you can do is store a regex pattern. The regex part of that substitution is foo, and you can say
my $pattern = qr/foo/;
s/$pattern/bar/;
But you really should explain the problem that you're trying to solve so that we can help you better
In the assignment, you need to tell Perl not to evaluate the regular expression but just to keep it. This is what qr is for.
But you can't do this with whole substitutions, which is why Сухой27 suggests using a subroutine.
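For the hash-of-rules use case mentioned in the question, the subroutine idea could be extended roughly as follows. This is a sketch: the rule names and the second rule are made up for illustration.

```perl
use strict;
use warnings;

# Each rule is a code reference that edits its argument in place;
# $_[0] is an alias to the caller's variable, so the change sticks.
my %rules = (
    foo_to_bar => sub { $_[0] =~ s/foo/bar/ },
    trim       => sub { $_[0] =~ s/^\s+|\s+$//g },
);

my $string = "  foo  ";
$rules{$_}->($string) for qw(foo_to_bar trim);

print "[$string]\n";
```

Instead of writing $string =~ $rules{$key}, the call becomes $rules{$key}->($string).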