Perl String Regular Expression - Need Explanation

Perl String Regular Expression - Need Explanation - regex

I am pretty new to Perl. I have the following code fragment that works just fine, but I don't fully understand it:
for ($i = 1; $i <= $pop->Count(); $i++) {
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
}
$pop->Head is a string or an array of strings returned by the function Mail::POP3Client, and it is the headers of a bunch of emails. Line 3 is some kind of regular expression that extracts the FROM and the SUBJECT from the header.
My question is how does the print function only print the From and the Subject without all the other stuff in the header? What does "and" mean - this surely can't be a boolean and can it? Most important, I want to put the From string into its own variable (my $fromline). How do I do this?
I am hoping that this will be easy for some Perl professional, it has got me baffled!
Thanks in advance.

ARGHHH... The question was edited while I was typing the answer. OK, throwing out the part of my answer that's no longer relevant, and focusing on the specific questions:
The outer loop iterates over all the messages in the mailbox.
The inner loop doesn't specify a loop variable, so the special variable $_ is used.
In each iteration through the inner loop, $_ is one header line from message number $i.
/^(From|Subject):\s+/i and print $_, "\n";
The first part of this line, up to the and is a pattern. We didn't specify what to do with the pattern, so it's implicitly matched against $_. (That's one of the things that makes $_ special.) This gives us a yes/no test: does the pattern match the header line or not?
The pattern tests whether that item begins with (<) either of the words "From" or "Subject", followed immediately by a colon and one or more whitespace characters. (This not the correct pattern to match an RFC 822 header. Whitespace is optional on both sides of the colon. The pattern should more properly be /^(From|Subject)\s*:\s*/i. But that's a separate issue.) the i at the end of the pattern says to ignore case, so from or SUBJECT would be OK.
The and says to continue evaluating (i.e., executing) the expression if there is a match. If there's no match, whatever follows and is ignored.
The rest of the expression prints the header line ($_) and a newline ("\n").
In perl, and and or are boolean operators. They're synonyms for && and ||, except that they have much lower precedence, making it easier to write short-ciruit expressions without clutter from lots of parentheses.
The smallest change that captures the From line into a separate variable would be to add the following line to the inner loop:
/^From\s*:\s*(.*)$/i and $fromline = $1;
You should probably also put
$fromline = undef
before the loop so you can test, after the loop, whether there was a From: line.
There are other ways to do it. In fact, that's one of the mantras of perl: "There's more than one way to do it." I've stripped out the "From: " from the beginning of the line before storing the balance in $fromline, but I don't know your needs.

It's a logical and with short-circuiting. If the left side evaluates to true -- say, if that regular expression matches -- it'll evaluate the right side, the print.
If the expression on the left is false, it doesn't need to evaluate the right hand side, because the net result would still be false, so it skips it.
See also: perldoc perlop

Related

Error while compiling regex function, why am I getting this issue?

My RAKU Code:
sub comments {
if ($DEBUG) { say "<filtering comments>\n"; }
my #filteredtitles = ();
# This loops through each track
for #tracks -> $title {
##########################
# LAB 1 TASK 2 #
##########################
## Add regex substitutions to remove superflous comments and all that follows them
## Assign to $_ with smartmatcher (~~)
##########################
$_ = $title;
if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
# Repeat for the other symbols
########################## End Task 2
# Add the edited $title to the new array of titles
#filteredtitles.push: $_;
}
}
# Updates #tracks
return #filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '#_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.

So, in contrast with #raiph's answer, here's what I have:
my #tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that will match one character, as long as it is an open parenthesis (represented by the \() or a caret (^). When they go inside the angle brackets/square brackets combo, it means that is an Enumerated character class.
Then, the: S introduces the non-destructive substitution, i.e., a quoting construct that will make regex-based substitutions over the topic variable $_ but will not modify it, just return its value with the modifications requested. In the code above, S:g brings the adverb :g or :global (see the global adverb in the adverbs section of the documentation) to play, meaning (in the case of the substitution) "please make as many as possible of this substitution" and the final / marks the end of the substitution text, and as it is adjacent to the second /, that means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the: .map method, documented here, will be applied to any Iterable (List, Array and all their alikes) and will return a sequence based on each element of the Iterable, altered by a Code passed to it. So, something like:
#x.map({ S:g / foo /bar/ })
essencially means "please return a Sequence of every item on #x, modified by substituting any appearance of the substring foo for bar" (nothing will be altered on #x). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my #tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable #tracks (that has my scope)

Here's what I ended up with:
my #tracks = <Foo Ba(r B^az>;
sub comments {
my #filteredtitles;
for #tracks -> $_ is copy {
s:g / <[\(^]> //;
#filteredtitles.push: $_;
}
return #filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { #_ }
But there is no way the code you've shared could generate that error because it requires that there is an #_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)

How to fix 'Bareword found' issue in perl eval()

The following code returns "Bareword found where operator expected at (eval 1) line 1, near "*,out" (Missing operator before out?)"
$val = 0;
$name = "abc";
$myStr = '$val = ($name =~ in.*,out [)';
eval($myStr);
As per my understanding, I can resolve this issue by wrapping "in.*,out [" block with '//'s.
But that "in.*,out [" can be varied. (eg: user inputs). and users may miss giving '//'s. therefore, is there any other way to handle this issue.? (eg : return 0 if eval() is trying to return that 'Bareword found where ...')

The magic of (string) eval -- and the danger -- is that it turns a heap of dummy characters into code, compiles and runs it. So can one then use '$x = ,hi'? Well, no, of course, when that string is considered code then that's a loose comma operator there, a syntax eror; and a "bareword" hi.† The string must yield valid code
In a string eval, the value of the expression (which is itself determined within scalar context) is first parsed, and if there were no errors, executed as a block within the lexical context of the current Perl program.
So that string in the question as it stands would be just (badly) invalid code, which won't compile, period. If the in.*,out [ part of the string is in quotes of some sort, then that is legitimate and the =~ operator will take it as a pattern and you have a regex. But then of course why not use regex's normal pattern delimiters, like // (or m{}, etc).
And whichever way that string gets acquired it'll be in a variable, no? So you can have /$input/ in the eval and populate that $input beforehand.
But, above all, are you certain that there is no other way? There always is. The string-eval is complex and tricky and hard to use right and nigh impossible to justify -- and dangerous. It runs arbitrary code! That can break things badly even without any bad intent.
I'd strongly suggest to consider other solutions. Also, it is unclear why there'd be need for eval in the first place -- as you only need the regex pattern as user input (not code) you can have that very regex in normal code with a pattern in a variable, which is populated earlier when the user input is supplied. (Note that taking a pattern from the user may lead to trouble as well.)
† A problem if you're into warnings, and we all are.

The following isn't valid Perl code:
$val = ($name =~ in.*,out [)
You want the following:
$val = $name =~ /in.*,out \[/
(The parens weren't harmful, but didn't help either.)
If the pattern is user-supplied, you can use the following:
$val = $name =~ /$pattern/
(No eval EXPR needed!)
Note from the correction that the pattern in the question isn't correct. You can catch such errors using eval BLOCK
eval { $val = $name =~ /$pattern/ };
die("Bad pattern \"$pattern\" provided: $#") if $#;
A note about user-provided patterns: The above won't let the user execute arbitrary code, but it won't protect you from patterns that would take longer than the lifespan of the universe to complete.

value of binding operator expression in perl

I have some doubt about the outcome of a binding operator expression in perl. I mean expression like
string =~ /pattern/
I have done some simple test
$ss="a1b2c3";
say $ss=~/a/; # 1
say $ss=~/[a-z]/g; # abc
#aa=$ss=~/[a-z]/g;say #aa; # abc
$aa=#aa;say $aa; # 3
$aa=$ss=~/[a-z]/g;say $aa; # 1
note the comment part above is the running result.
So here comes the question, what on earth is returned by $ss=~/[a-z]/g, it seems that it returned an array according to code line 3,4,5. But what about the last line, why it gives 1 instead of 3 which is the length of array?

The return of the match operator depends on the context: in list context it returns all captured matches, in scalar context the true/false. The say imposes list context, but in the first example nothing is captured in the regex so you only get "success."
Next, the behavior of /g modifier also differs across contexts. In list context, with it the string keeps being scanned with the given pattern until all matches are found, and a list with them is returned. These are your second and third examples.
But in scalar context its behavior is a bit specific: with it the search will continue from the position of the last match, the next time round. One typical use is in the loop condition
while (/(\w+)/g) { ... }
This is a bit of a tokenizer: after the body of the loop runs the next word is found, etc.
Then the last example doesn't really make sense; you are getting the "normal" scalar-context matching success/fail, and /g doesn't do anything -- until you match on $ss the next time
perl -wE'
$s=shift||q(abc);
for (1..2) { $m = $s=~/(.)/g; say "$m: $1"; }
'
prints lines 1:a and then 1:b.
Outside of iterative structures (like while condition) the /g in scalar context is usually an error, pointless at best or a quiet bug.
See "Global matching" under "Using regular expressions" in perlretut for /g.
See regex operators in perlop in general, and about /g as well. A useful tool to explore /g workings is pos.

I need to remove a segment till a delimiter ");" after two or three lines of a search keyword in file using perl script

Iam trying to remove a test segment frequently occurs in a file. I have a matching keyword let suppose
FIND_WORD(---------
-----------
-----------);
is terminating after 2 or 3 lines at " );" but i need to remove next few lines also till ");"
I am using
if ($line =~ m/FIND_WORD/) {return 0;}
which removes one line
I need to remove till ");"

Sounds like this is exactly what the flip-flop operator is for.
while (<>) {
print unless /FIND_WORD\(/ .. /\);/;
}
From perlop:
In scalar context, ".." returns a boolean value. The operator is
bistable, like a flip-flop, and emulates the line-range (comma)
operator of sed, awk, and various editors. Each ".." operator
maintains its own boolean state, even across calls to a subroutine
that contains it. It is false as long as its left operand is false.
Once the left operand is true, the range operator stays true until the
right operand is true, AFTER which the range operator becomes false
again.
See this recent GeekUni blog post for more details.

You need to use multi-line matching. You will need to slurp the whole file into a variable and then substitute that.
$all_lines =~ s/FIND_WORD.*?\)\;//sg;
This matches from FIND_WORD to the nearest ");" and deletes it.

Perl - Regexp to manipulate .csv

I've got a function in Perl that reads the last modified .csv in a folder, and parses it's values into variables.
I'm finding some problems with the regular expressions.
My .csv look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second line (that are the column names from the .csv) to be ignored;
The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
How can I do this?

When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};

open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while($my line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle CSV format

1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;

You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}-\d{2}-d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}-\d{2}-d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}-\d{2}-d{4}",/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation Mark.
\d{2} - Followed by two digits.
- - Followed by a dash.
\d{2] - Followed by two more digits.
- - Followed by a dash.
\d{4} - Followed by four more digits
This should be describing the first part of your line which is the date in MM-DD-YYYY format surrounded by quotes and followed by a comma. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, and is one of the reasons why Perl has such a reputation of being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions is an extremely powerful tool, and worth the effort to learn. And with some experience, you'll be able to easily decode them.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}-\d{2}-d{4}",/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what do you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a file handle value of zero. The or states that if open doesn't return a non-zero file handle, execute the line that follows which kills your program.
So, look through your program, and see any place where you make an assumption that something works as expected and think what happens if it didn't. Then, add checks in your program to something if you get that exception. It could be that you want to report the error or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. What ever you do, check for possible errors (especially from user input) and handle possible errors.
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js