What's the best way to clear regex matching variables? - regex

What's the best way to clear/reset all regex matching variables?
Example how $1 isn't reset between regex operations and uses the most recent match:
$_="this is the man that made the new year rumble";
/ (is) /;
/ (isnt) /;
say $1; # outputs "is"
Example how this may be problematic when working with loops:
foreach (...){
/($some_value)/;
&doSomething($1) if $1;
}
Update: I didn't think I'd need to do this, but Example-2 is only an example. This question is about resetting matching variables, not the best way to implement them.
Regardless, my coding style was originally more in line with being explicit and using if-blocks. Coming back to this now, Example 2 is much more concise, and when reading many lines of code I find this syntax faster to comprehend.

You should use the return from the match, not the state of the group vars.
foreach (...) {
doSomething($1) if /($some_value)/;
}
$1, etc. are only guaranteed to reflect the most recent match if the match succeeds. You shouldn't be looking at them other than right after a successful match.

Regex captures* are reset by a successful match. To reset regex captures, one would use a trivial match operation that's guaranteed to match.
"a" =~ /a/; # Reset captures to undef.
Yeah, it looks weird, but you asked to do something weird.
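A small sketch of that reset, assuming a plain Perl 5 interpreter; the names $before and $after are only for illustration. It shows the stale value surviving a failed match and then disappearing after a trivial successful one:

```perl
use strict;
use warnings;

my $text = "this is the man that made the new year rumble";

$text =~ / (is) /;      # succeeds, so $1 is "is"
$text =~ / (isnt) /;    # fails, so $1 keeps the stale value
my $before = $1;

"a" =~ /a/;             # succeeds and has no capture group
my $after = $1;         # $1 is now undef

print "before reset: $before\n";
print "after reset: ", (defined $after ? $after : "undef"), "\n";
```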
If you fix your code, you don't need weird-looking workarounds. Fixing your code even reveals a bug!
Fixes:
$_ = "this is the man that made the new year rumble";
if (/ (is) / || / (isnt) /) {
say $1;
} else{
... # You're currently printing something random.
}
and
for (...) {
if (/($some_pattern)/) {
do_something($1);
}
}
* — Backrefs are regex patterns that match previously captured text. e.g. \1, \k<foo>. You're actually talking about "regex capture buffers".

You should test whether the match succeeded. For example:
foreach (...){
/($some_value)/ or next;
doSomething($1) if $1;
}
foreach (...){
doSomething($1) if /($some_value)/ and $1;
}
foreach (...){
if (/($some_value)/) {
doSomething($1) if $1;
}
}
Depending on what $some_value is, and how you want to handle matching the empty string and/or 0, you may or may not need to test $1 at all.
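As a rough illustration of the "0" case, here is a sketch with made-up input lines. It tests the match itself rather than the truth of $1, so a captured "0" is not silently dropped:

```perl
use strict;
use warnings;

my @seen;
for my $line ("count=0", "count=7", "no number here") {
    # Test the match itself; "0" is a valid capture but a false value.
    if ($line =~ /count=(\d+)/) {
        push @seen, $1;
    }
}
print "@seen\n";
```

With `doSomething($1) if $1;` the first line would have been skipped, because the string "0" is false in Perl.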

To complement the existing, helpful answers (notwithstanding the sensible recommendation to normally test the result of a matching operation in a Boolean context and act only if the test succeeds):
Depending on your scenario, you can approach the problem differently:
Disclaimer: I'm not an experienced Perl programmer; do let me know if there are problems with this approach.
Enclosing the matching operation in a do { ... } block scopes all regex-related special variables ($&, $1, ...) to that block.
Thus, you can use a do { ... } to prevent these special variables from getting set in the first place (although the ones from a previous regex operation outside the block will obviously remain in effect); for instance:
$_="this is the man that made the new year rumble";
# Match in current scope; -> $&, $1, ... *are* set.
/ (is) /;
# Match inside a `do` block; the *new* $&, $1, ... values
# are set only *inside* the block;
# `&& $1` passes out the block's version of `$1`.
$do1 = do { / (made) / && $1 };
print "\$1 == '$1'; \$do1 == '$do1'\n"; # -> $1 == 'is'; $do1 == 'made'
The advantage of this approach is that none of the current scope's special regex variables are set or altered; the accepted answer, by contrast, alters variables such as $&, and $'.
The disadvantage is that you must explicitly pass out variables of interest; you do get the result of the matching operation by default, however, and if you're only interested in the contents of capture buffers, that will suffice.

You should do it this way:
foreach (...) {
someFnc($1) if /.../;
}
But if you want to stick with your style, then check this as an idea:
$_ = "this is the man that made the new year rumble";
$m = /(is)/ ? $1 : undef;
$m = /(isnt)/ ? $1 : undef;
print $m, "\n" if defined $m;

Assigning captures to a list behaves closer to what it sounds like you want.
for ("match", "fail") {
my ($fake_1) = /(m.+)/;
doSomething($fake_1) if $fake_1;
}

Related

How to fix 'Bareword found' issue in perl eval()

The following code returns "Bareword found where operator expected at (eval 1) line 1, near "*,out" (Missing operator before out?)"
$val = 0;
$name = "abc";
$myStr = '$val = ($name =~ in.*,out [)';
eval($myStr);
As per my understanding, I can resolve this issue by wrapping "in.*,out [" block with '//'s.
But that "in.*,out [" part can vary (e.g., user input), and users may forget to add the '//'s. Is there any other way to handle this issue (e.g., return 0 if eval() hits that 'Bareword found where ...' error)?
The magic of (string) eval -- and the danger -- is that it turns a heap of dummy characters into code, compiles and runs it. So can one then use '$x = ,hi'? Well, no, of course, when that string is considered code then that's a loose comma operator there, a syntax error; and a "bareword" hi.† The string must yield valid code.
In a string eval, the value of the expression (which is itself determined within scalar context) is first parsed, and if there were no errors, executed as a block within the lexical context of the current Perl program.
So that string in the question as it stands would be just (badly) invalid code, which won't compile, period. If the in.*,out [ part of the string is in quotes of some sort, then that is legitimate and the =~ operator will take it as a pattern and you have a regex. But then of course why not use regex's normal pattern delimiters, like // (or m{}, etc).
And whichever way that string gets acquired it'll be in a variable, no? So you can have /$input/ in the eval and populate that $input beforehand.
But, above all, are you certain that there is no other way? There always is. The string-eval is complex and tricky and hard to use right and nigh impossible to justify -- and dangerous. It runs arbitrary code! That can break things badly even without any bad intent.
I'd strongly suggest to consider other solutions. Also, it is unclear why there'd be need for eval in the first place -- as you only need the regex pattern as user input (not code) you can have that very regex in normal code with a pattern in a variable, which is populated earlier when the user input is supplied. (Note that taking a pattern from the user may lead to trouble as well.)
† A problem if you're into warnings, and we all are.
The following isn't valid Perl code:
$val = ($name =~ in.*,out [)
You want the following:
$val = $name =~ /in.*,out \[/
(The parens weren't harmful, but didn't help either.)
If the pattern is user-supplied, you can use the following:
$val = $name =~ /$pattern/
(No eval EXPR needed!)
Note from the correction that the pattern in the question isn't correct. You can catch such errors using eval BLOCK
eval { $val = $name =~ /$pattern/ };
die("Bad pattern \"$pattern\" provided: $@") if $@;
A note about user-provided patterns: The above won't let the user execute arbitrary code, but it won't protect you from patterns that would take longer than the lifespan of the universe to complete.
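To connect this back to the original request (return 0 instead of failing on a bad pattern), here is a minimal sketch. The helper name match_user_pattern is hypothetical, and it assumes the user supplies only the pattern text, not code:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 on a match, 0 on no match
# or when the user-supplied pattern does not compile.
sub match_user_pattern {
    my ($name, $pattern) = @_;

    # qr// compiles the pattern at run time; eval BLOCK traps the
    # fatal error raised by an invalid pattern such as 'in.*,out ['.
    my $re = eval { qr/$pattern/ };
    return 0 unless defined $re;

    return $name =~ $re ? 1 : 0;
}

print match_user_pattern("in a,out [b", 'in.*,out \['), "\n";
print match_user_pattern("abc",         'in.*,out ['),  "\n";
```

This still does nothing about patterns that compile but take an extremely long time to run.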

Perl switch/case Fails on Literal Regex String Containing Non-Capturing Group '?'

I have text files containing lines like:
2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
I am matching them with /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/ But I also have some files with lines in a completely different format, which I match with a different regex. When I open a file I determine which format and assign $pat = '<regex-string>'; in a switch/case block:
$pat = '/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/'
But the ? character that introduces the non-capturing group I use to match repeats after the date and before the first currency amount causes the Perl interpreter to fail to compile the script, reporting on abort:
syntax error at ./report-dates-amounts line 28, near "}continue "
If I delete the ? character, or replace ? with \? escaped character, or first assign $q = '?' then replace ? with $q inside a " string assignment (ie. $pat = "/^\s*(\S+)\s+($q:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/"; ) the script compiles and runs. If I assign the regex string outside the switch/case block that also works OK. Perl v5.26.1 .
My code also doesn't have any }continue in it, which as reported in the compilation failure is probably some kind of transformation of the switch/case code by Switch.pm into something native the compiler chokes on. Is this some kind of bug in Switch.pm? It fails even when I use given/when in exactly the same way.
#!/usr/local/bin/perl
use Switch;
# Edited for demo
switch($format)
{
# Format A eg:
# 2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
# 3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
# 3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
# 4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
#
case /^(?:april|snow)$/i
{ # This is where the ? character breaks compilation:
$pat = '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$';
# WORKS:
# $pat = '^\s*(\S+)\s+(' .$q. ':[0-9|\)| ]+)+\s+\D' .$q. '(\S+)\s+\$';
}
# Format B
case /^(?:umberto|petro)$/i
{
$pat = '^(\S+)\s+.*Think 1\s+(\S+)\s+';
}
}
Don't use Switch. As mentioned by @choroba in the comments, Switch uses a source filter, which leads to mysterious and hard-to-debug errors, as you found.
The module's documentation itself says:
In general, use given/when instead. It were introduced in perl 5.10.0. Perl 5.10.0 was released in 2007.
However, given/when is not necessarily a good option as it is experimental and likely to change in the future (it seems that this feature was almost removed from Perl v5.28; so you definitely don't want to start using it now if you can avoid it). A good alternative is to use for:
for ($format) {
if (/^(?:april|snow)$/i) {
...
}
elsif (/^(?:umberto|petro)$/i) {
...
}
}
It might look weird at first, but once you get used to it, it's actually reasonable in my opinion. Or, of course, you can use none of these options and just do:
sub pattern_from_format {
my $format = shift;
if ($format =~ /^(?:april|snow)$/i) {
return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/;
}
elsif ($format =~ /^(?:umberto|petro)$/i) {
return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/;
}
# Some error handling here maybe
}
If, for some reason, you still want to use Switch: use m/.../ instead of /.../.
I have no idea why this bug is happening; however, the documentation says:
Also, the presence of regexes specified with raw ?...? delimiters may cause mysterious errors. The workaround is to use m?...? instead.
Which I misread at first, and therefore tried to use m/../ instead of /../, which fixed the issue.
Another option instead of an if/elsif chain would be to loop over a hash which maps your regular expressions to the values which should be assigned to $pat:
#!/usr/local/bin/perl
my %switch = (
'^(?:april|snow)$' => '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$',
'^(?:umberto|petro)$' => '^(\S+)\s+.*Think 1\s+(\S+)\s+',
);
for my $re (keys %switch) {
if ($format =~ /$re/i) {
$pat = $switch{$re};
last;
}
}
For a more general case (i.e., if you're doing more than just assigning a string to a scalar) you could use the same general technique, but use coderefs as the values of your hash, thus allowing it to execute an arbitrary sub based on the match.
This approach can cover a pretty wide range of the functionality usually associated with switch/case constructs, but note that, because the conditions are pulled from the keys of a hash, they'll be evaluated in a random order. If you have data which could match more than one condition, you'll need to take extra precautions to handle that, such as having a parallel array with the conditions in the proper order or using Tie::IxHash instead of a regular hash.
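One way to sketch the ordered, coderef-based variant is an array of [regex, coderef] pairs instead of a hash. The helper name handle_format and the returned labels are made up for illustration:

```perl
use strict;
use warnings;

# Conditions are checked in array order, so overlapping patterns
# resolve predictably (first match wins).
my @dispatch = (
    [ qr/^(?:april|snow)$/i,    sub { return 'format A' } ],
    [ qr/^(?:umberto|petro)$/i, sub { return 'format B' } ],
    [ qr/./,                    sub { return 'unknown'  } ],
);

sub handle_format {    # hypothetical helper name
    my ($format) = @_;
    for my $entry (@dispatch) {
        my ($re, $code) = @$entry;
        return $code->($format) if $format =~ $re;
    }
    return;
}

print handle_format('Snow'),  "\n";
print handle_format('petro'), "\n";
print handle_format('other'), "\n";
```

Because the order lives in the array, no parallel list or Tie::IxHash is needed for this particular shape.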

Use regular expressions to find host name

I have a small problem: I need to print the host name that sits between "(?#" and ")", for example:
Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu
And I need to print "researchscan425.eecs.umich.edu".
I tried something like:
if(my ($test) = $linelist =~ /\b\(\?\#(\S*)/)
{
print "$test\n";
}
But it doesn't print anything.
You can use this regex:
\(\?#(.*?)\)
researchscan425.eecs.umich.edu will be captured into Group 1.
Sample code:
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?#(.*?)\)/)
{
print "$test\n";
}
How about:
if(my ($test) = $linelist =~ /\(\?\#([^\s)]+)/)
You need to remove the \b before the (. There is no word boundary between the preceding space and the (, because both are non-word characters.
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?\#([^)]*)/)
{
print "$test\n";
}
The problem here is the definition of \b.
It's "word boundary" - on regex101 that means:
(^\w|\w$|\W\w|\w\W)
Now, why this is causing you problems - ( is not a word character. So the transition from space to bracket doesn't trigger this pattern.
Switch your pattern to:
\s\(\?\#(\S+)
And it'll work. (Note - I've changed * to + because you probably want one or more, not zero or more).
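The effect of \b can be checked side by side. This is only a sketch using a shortened copy of the sample log line; the variable names are illustrative:

```perl
use strict;
use warnings;

my $line = 'Apr 17 23:39:02 test pure-ftpd: '
         . '(?#researchscan425.eecs.umich.edu) [INFO] New connection';

# Space and "(" are both non-word characters, so \b cannot match here.
my ($with_b) = $line =~ /\b\(\?\#(\S*)/;

# Without \b, and stopping at ")" so the parenthesis is not captured.
my ($without_b) = $line =~ /\(\?\#([^\s)]+)/;

print defined $with_b ? "with \\b: $with_b\n" : "with \\b: no match\n";
print "without \\b: $without_b\n";
```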
It's amazing what you can do with logging tools or with perl as part of the logging service itself (c.f. Ubic), but even if you're just writing a "quick script" to parse logs for reporting (i.e. something you or someone else won't look at again for months or years) it helps to make them easy to maintain.
One approach to doing this is to process the lines of your log file with Regexp::Common. One advantage is that Regexp::Common matches practically "self-document" what you are doing. For example, to match on specific "RFC compliant" definitions of what constitutes a "domain" using the $linelist you posted:
use Regexp::Common qw /net/;
if ( $line =~ /\?\#$RE{net}{domain}{-keep}/ ) { say $1 }
Then, later, if you need you can add other matches e.g "numeric" IPv4 or IPv6 addresses, assign them for use later in the script, etc. (Perl6::Form and IO::All used for demonstration purposes only - try them out!):
use IO::All ;
use Regexp::Common qw/net/;
use Perl6::Form;
my $purelog = io 'logfile.lines.txt' ;
sub _get_ftphost_names {
my @hosts = () ;
while ($_ = $purelog->getline) {
/\(\?\#$RE{net}{IPv6}{-sep => ":" }{-keep}/ ||
/\(\?\#$RE{net}{IPv4}{-keep}/ ||
/\(\?\#$RE{net}{domain}{-keep}/ and push @hosts , $1 ;
}
return \@hosts ;
}
sub _get_bytes_transfered {
... ;
}
my @host_list = _get_ftphost_names ;
print form
"{[[[[[[[[[[(30+)[[[[[[[[[[[[[}", @host_list ;
One of the great things about Regexp::Common (besides stealing regexp ideas from the source) is that it also makes it fairly easy to roll your own matches. You can use those to capture other parts of the file in an easily understandable way, adding them piece by piece. Then, as what was supposed to be your four-line script grows and transforms itself into an ITIL-compliant corporate reporting tool, you and your career can advance apace :-)

Match a pattern not inside of BEGIN and END markers

I'm implementing a script that will check a file (or files) for a given regex pattern and alert the user if the file contains any matches. However, I'd like to be able to allow users to specify exceptions inside the file (i.e. portions of the file that will not be checked). The way I was thinking of implementing this was with BEGIN:EXCEPTION and END:EXCEPTION markers within the file. The way the script works now is as follows:
(assuming file contents in $_)
my $re_dirty = /hello world/; # Simple example
if($re_dirty) {
# alert that the pattern was found in the file
}
I've tried changing this to the following:
my $re_dirty = /hello world/; # Simple example
my $begin_token = 'BEGIN:EXCEPTION';
my $end_token = 'END:EXCEPTION';
if($re_dirty && $_ !~ /${begin_token}.*${re_dirty}.*${end_token}/) {
# alert that the pattern was found and was not in an exception block
}
However, this has obvious problems:
1. It will match if there is an exception before and after the pattern but the pattern itself is not inside an exception.
2. It will not match if the pattern is in the file twice, but only one of them is in an exception block.
3. Possibly more problems??
A couple of clarifying notes:
1. Exceptions could span multiple lines.
2. There can be more than one exception block per file.
You can use a flip-flop (range operator) in scalar context:
if (/$begin/ .. /$end/) {
if (/$re_dirty/) {
# do stuff
}
}
This particular usage of the range operator will return false (as a statement) until the left hand side returns true, after which it will return true until the right hand side returns true.
Of course, with this approach, you should read the file in line-by-line mode. But that is a better approach overall, with regard to memory usage.
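For the goal in the question (flagging matches that are not inside an exception block), the flip-flop can also be used to skip lines. The following is a sketch with made-up sample lines held in an array rather than read from a file:

```perl
use strict;
use warnings;

my $begin    = qr/BEGIN:EXCEPTION/;
my $end      = qr/END:EXCEPTION/;
my $re_dirty = qr/hello world/;

# Sample data standing in for the lines of a file.
my @lines = (
    "hello world\n",
    "BEGIN:EXCEPTION\n",
    "hello world\n",
    "END:EXCEPTION\n",
    "ok\n",
    "hello world again\n",
);

my @hits;
my $line_no = 0;
for (@lines) {
    $line_no++;
    # True from the BEGIN line through the END line, inclusive.
    next if /$begin/ .. /$end/;
    push @hits, $line_no if /$re_dirty/;
}
print "dirty lines: @hits\n";
```

Only the matches on lines 1 and 6 are reported; the one on line 3 sits inside the exception block and is skipped.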
Edit:
If you wanted to match against multiline matches outside of such blocks, you would first have to gather the relevant lines as multiline strings:
my @outside;
my $content;
while (<$file>) {
if ( /$begin/ .. /$end/ ) { # if inside tags
if (defined $content) { # if not empty
push @outside, $content; # store the scalar into array
undef $content; # reset variable
}
} else {
$content .= $_; # store into scalar
}
}
push @outside, $content if defined $content;
for my $portion (@outside) {
if ($portion =~ /$re_dirty/) { # check for multiline matches
# do stuff
}
}
I would do something like this:
(my $portion = $_) =~ s/${begin}.*?${end}//gs; # reject anything inside begin/end blocks
if ($portion =~ $re_dirty) {
# do stuff
}
This way you get in $portion only the relevant parts of your file (those outside the BEGIN/END tokens). Then you can perform a standard regexp match on the relevant part...
Note the use of the non-greedy '?' modifier on .*, which avoids matching from the first begin token to the last end token.
You could add a boolean to your logic:
my $begin_token = 'BEGIN:EXCEPTION';
my $end_token = 'END:EXCEPTION';
my $bool = 0;
$bool = 1 if /$begin_token/; # entering an exception block
$bool = 0 if /$end_token/; # leaving an exception block
Then, while reading the file line by line, you can test whether $bool is 1 or 0 to decide whether to skip the current line.

evaluate pattern stored in variable perl regexp

I am trying to find out if basket has apple [simplified version of a big problem]
$check_fruit = "\$fruit =~ \/has\/apple\/";
$fruit="basket/has/mango/";
if ($check_fruit) {
print "apple found\n";
}
The check_fruit variable holds the statement that evaluates the regexp.
However, the check_fruit variable is always true and the script prints "apple found" :(
Can somebody tell me if I am missing something?
Goal to accomplish:
Okay so let me explain:
1. I have a file with a pattern clause defined on each line, similar to:
Line1: $fruit_origin=~/europe\\/finland/ && $fruit_taste=~/sweet/
Line2: similar stuff that can contain ~10 pattern checks separated by && or ||, with metacharacters too
2. I have a list of fruit attributes from a Perl hash containing many such fruits.
3. I want to categorize each fruit to see how many fruits fall into the category defined by each line of the file separately.
A sort of fruit count/profile per line. Is there an easier way to accomplish this? Thanks a lot.
if ($check_fruit) returns true because $check_fruit is defined, not empty and not zero. If you want to evaluate its content, use eval. But a subroutine would serve better:
sub check_fruit {
my $fruit = shift;
return $fruit =~ m(has/apple);
}
if (check_fruit($fruit)) {
print "Apple found\n";
}
Why is there a need to store the statement in a variable? If you're sure the value isn't set by a user, then you can do
if (eval $check_fruit) {
but this isn't safe if the user can set anything in that expression.
Put the pattern (and only the pattern) into the variable, use the variable inside the regular expression matching delimiters m/.../. If you don't know the pattern in advance then use quotemeta for escaping any meta characters.
It should look like this:
my $check_fruit = '/has/apple/'; # here no quotemeta is needed
my $fruit = 'basket/has/mango/';
if ($fruit =~ m/$check_fruit/) {
# do stuff!
}
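For the case where the pattern is not known in advance and should be matched literally, a short sketch of quotemeta follows. The input string is made up for illustration:

```perl
use strict;
use warnings;

my $literal = 'a.b(c)';             # hypothetical user input, meant literally
my $safe    = quotemeta $literal;   # metacharacters are backslash-escaped

# With the escaped pattern, "." and "(" only match themselves.
my $hit  = 'xx a.b(c) yy' =~ /$safe/ ? 1 : 0;
my $miss = 'xx aXb(c) yy' =~ /$safe/ ? 1 : 0;

print "literal text found: $hit\n";
print "look-alike found: $miss\n";
```

Writing /\Q$literal\E/ inside the pattern has the same effect as calling quotemeta first.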
$check_fruit is nothing but a variable holding string data. If you want to execute the code it contains, you have to use eval.
There were also some other errors in your code related to string quoting/escaping. This fixes that as well:
use strict;
use warnings;
my $check_fruit = '$apple =~ m|/has/mango|';
my $apple="basket/has/mango/";
if (eval $check_fruit) {
print "apple found\n";
}
However, this is not usually a good design. At the very least, it makes for confusing code. It is also a huge security hole if $check_fruit is coming from the user. You can put a regex into a variable, which is preferable:
Edit: note that a regex that comes from user input can be a security problem as well, but it is more limited in scope.
my $check_fruit = qr|/has/mango|;
my $apple="basket/has/mango/";
if ($apple =~ /$check_fruit/) {
print "apple found\n";
}
There are other things you can do to make your Perl code more dynamic, as well. The best approach would depend on what you are trying to accomplish.
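For the stated goal (counting how many fruits fall into each category), one eval-free possibility is to store each category as data: a hash mapping an attribute name to a compiled regex. This is only a sketch under the assumption that every clause in a category is joined by &&; it does not handle || clauses, and the fruit data and attribute names are invented:

```perl
use strict;
use warnings;

# Hypothetical fruit attributes.
my %fruits = (
    lingonberry => { origin => 'europe/finland', taste => 'sour'  },
    strawberry  => { origin => 'europe/finland', taste => 'sweet' },
    mango       => { origin => 'asia/india',     taste => 'sweet' },
);

# One entry per category; all patterns in an entry must match (AND).
my @categories = (
    { origin => qr{europe/finland}, taste => qr/sweet/ },
    { taste  => qr/sweet/ },
);

my @counts;
for my $category (@categories) {
    my $count = 0;
    for my $fruit (values %fruits) {
        my $all_match = 1;
        for my $attr (keys %$category) {
            # A missing attribute is treated as an empty string.
            $all_match = 0 unless ($fruit->{$attr} // '') =~ $category->{$attr};
        }
        $count++ if $all_match;
    }
    push @counts, $count;
}
print "fruits per category: @counts\n";
```

The patterns could be read from a file and compiled with qr// at run time, which keeps user-supplied text as patterns rather than as executable code.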