Matching condition in Perl using double exclamation marks - regex

if ($a =~ m!^$var/!)
$var is a key in a two dimensional hash and $a is a key in another hash.
What is the meaning of this expression?

This is a regular expression ("regex"), where the ! character is used as the delimiter for the pattern that is to be matched in the string that it binds to via the =~ operator (the $a† here).
It may clear it up to consider the same regex with the usual delimiter instead, $a =~ /^$var\// (then m may be omitted); but now any / used in the pattern clearly must be escaped. To avoid that unsightly and noisy \/ combo one often uses another character for the delimiter, as nearly any character may be used (my favorite is the curlies, m{^$var/}). ‡ §
This regex in the question tests whether the value in the variable $a begins with (by ^ anchor) the value of the variable $var followed by / (variables are evaluated and the result used). §
† Not a good choice for a variable name since $a and $b are used by the builtin sort
‡ With the pattern prepared ahead of time the delimiter isn't even needed
my $re = qr{^$var/};
if ($string =~ $re) ...
(but I do like to still use // then, finding it clearer altogether)
Above I use qr but a simple q() would work just fine (while I absolutely recommend qr). These take nearly any characters for the delimiter, as well.
§ Inside a pattern the evaluated variables are used as regex patterns, which is wrong in general (when this is intended they should be compiled using qr and thus used as subpatterns).
An unimaginative example: a variable $var = q(\s) (literal backslash followed by letter s) evaluated inside a pattern yields the \s sequence which is then treated as a regex pattern, for whitespace. (Presumably unintended; we just wanted \ and s.)
This is remedied by using quotemeta, /\Q$var\E/, so that possible metacharacters in $var are escaped; this results in the correct pattern for the literal characters, \\s. So a correct way to write the pattern is m{^\Q$var\E/}.
Failure to do this also allows the injection bug. Thanks to ikegami for commenting on this.
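A small sketch of the points above. The value of $var and the test keys are made up for illustration and do not come from the question; the %result hash is only there to collect the outcomes.

```perl
use strict;
use warnings;

# Hypothetical values, chosen only for illustration
my $var  = 'a.b';
my @keys = ('a.b/x', 'aXb/x', 'a.b');

my %result;
for my $key (@keys) {
    # The same test with three delimiters; only the first needs \/
    my $slash = $key =~ /^$var\// ? 1 : 0;
    my $bang  = $key =~ m!^$var/! ? 1 : 0;
    my $curly = $key =~ m{^$var/} ? 1 : 0;

    # \Q...\E escapes metacharacters, so the dot in $var is literal
    my $quoted = $key =~ m{^\Q$var\E/} ? 1 : 0;

    $result{$key} = [ $slash, $bang, $curly, $quoted ];
    print "$key: slash=$slash bang=$bang curly=$curly quoted=$quoted\n";
}
```

The three delimiter variants always agree with each other. Only the \Q...\E version rejects aXb/x, because there the dot is no longer a wildcard.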

The match operator (m/.../) is one of Perl's "quote-like" operators. The standard usage is to use slashes before and after the regex that goes in the middle of the operator (and if you use slashes, then you can omit the m from the start of the operator). But if the regex itself contains a slash then it is convenient to use a different delimiter instead to avoid having to escape the embedded slash. In your example, the author has decided to use exclamation marks, but any non-whitespace character can be used.
Many Perl operators work like this - m/.../, s/.../.../, tr/.../.../, q/.../, qq/.../, qr/.../, qw/.../, qx/.../ (I've probably forgotten some).

Related

Perl: Regex with Variables inside?

Is there a more elegant way of bringing a variable into a pattern than this (putting the pattern in a string beforehand instead of using it directly in //)?
my $z = "1"; # variable
my $x = "foo1bar";
my $pat = "^foo".$z."bar\$";
if ($x =~ /$pat/)
{
print "\nok\n";
}
The qr operator does it
my $pat = qr/^foo${z}bar$/;
unless the delimiters are ', in which case it doesn't interpolate variables.
This operator is the best way to build patterns ahead of time, as it builds a proper regex pattern, accepting everything that one can use in a pattern in a regex. It also takes modifiers, so for one the above can be written as
my $pat = qr/^foo $z bar$/x;
for a little extra clarity (but careful with omitting those {}).†
The initial description from the above perlop link (examples and discussion follow):
qr/STRING/msixpodualn
This operator quotes (and possibly compiles) its STRING as a regular expression. STRING is interpolated the same way as PATTERN in m/PATTERN/. If "'" is used as the delimiter, no variable interpolation is done. Returns a Perl value which may be used instead of the corresponding /STRING/msixpodualn expression. The returned value is a normalized version of the original pattern. It magically differs from a string containing the same characters: ref(qr/x/) returns "Regexp"; however, dereferencing it is not well defined (you currently get the normalized version of the original pattern, but this may change).
† Once a variable is getting interpolated it may be a good idea to escape special characters that may be hiding in it, that could throw off the regex, using quotemeta based \Q ... \E
my $pat = qr/^foo \Q$z\E bar$/x;
If this is used then the variable name need not be delimited either
my $pat = qr/^foo\Q$z\Ebar$/;
Thanks to Håkon Hægland for bringing this up.
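A short sketch of both forms side by side. $z is the value from the question; $dot and the test strings are hypothetical, added only to show what \Q...\E changes.

```perl
use strict;
use warnings;

my $z   = '1';                    # value from the question
my $pat = qr/^foo${z}bar$/;       # interpolates $z when the pattern is built

my @hits = grep { $_ =~ $pat } ('foo1bar', 'foo2bar', 'xfoo1bar');
print "hits: @hits\n";

# Hypothetical value containing a regex metacharacter
my $dot     = '.';
my $loose   = qr/^foo${dot}bar$/;     # the dot matches any character
my $literal = qr/^foo\Q$dot\Ebar$/;   # the dot matches only a dot

print "loose matches fooXbar\n"   if 'fooXbar' =~ $loose;
print "literal matches foo.bar\n" if 'foo.bar' =~ $literal;
print "literal rejects fooXbar\n" if 'fooXbar' !~ $literal;
```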

What is the difference between qr/ and m/ in Perl?

From Perldoc:
qr/STRING/msixpodualn
This operator quotes (and possibly compiles) its STRING as a regular
expression. STRING is interpolated the same way as PATTERN in
m/PATTERN/.
m/PATTERN/msixpodualngc
/PATTERN/msixpodualngc
Searches a string for a pattern match, and in scalar context returns
true if it succeeds, false if it fails. If no string is specified via
the =~ or !~ operator, the $_ string is searched. (The string
specified with =~ need not be an lvalue--it may be the result of an
expression evaluation, but remember the =~ binds rather tightly.) See
also perlre.
Options are as described in qr// above
I'm sure I'm missing something obvious, but it's not clear at all to me how these options are different - they seem basically synonymous. When would you use qr// instead of m//, or vice versa?
The m// operator is for matching, whereas qr// produces a compiled pattern (a Regexp object that stringifies to the pattern text) that you can stick in a variable and store for later. It's a quoted regular expression pattern.
Pre-compiling the pattern this way is useful for optimising your run-time cost, e.g. if you are using a fixed pattern in a loop with millions of iterations, or you want to pass patterns around between function calls or use them in a dispatch table.
# match now
if ( $foo =~ m/pattern/ ) { ... }
# compile and use later
my $pattern = qr/pattern/i;
print $pattern; # (?^ui:pattern)
if ($foo =~ m/$pattern/) { ... }
The structure of the string (in this example, (?^ui:pattern)) is explained in perlre. Basically (?:) creates a sub-pattern with built-in flags, and the ^ resets the flags to their defaults before the listed ones (here u and i) are switched on. You can use this inside other patterns too, to turn on and off case-insensitivity for parts of your pattern for example.
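To make the "store now, use later" idea concrete, here is a minimal sketch. The names $word and $line_re and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

my $word = qr/pattern/i;          # compiled once, case-insensitive

# ref() reports a Regexp object rather than a plain string
print ref($word), "\n";

# Interpolated into a larger pattern, the /i stays local to $word,
# so "ID: " is still matched case-sensitively
my $line_re = qr/^ID: $word$/;

for my $line ('ID: PATTERN', 'id: pattern') {
    my $verdict = $line =~ $line_re ? 'match' : 'no match';
    print "$line => $verdict\n";
}
```

The flags travel with the compiled pattern, which is why only the embedded part ignores case.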

How to use Regular expression in perl if both regular expression and strings are variables

I have two variables coming from some user inputs. One is a string that needs to be checked and other one is a regular expression as below.
Following code doesn't work.
my $pattern = "/^current.*$/";
my $name = "currentStateVector";
if($name =~ $pattern) {
print "matches \n";
} else {
print "doesn't match \n";
}
And following does.
if($name =~ /^current.*$/) {
print "matches \n";
} else {
print "doesn't match \n";
}
What's the reason for this. I've the regular expression stored in a variable. Is there another way to store this variable or modify it?
The double-quotes that you use interpolate -- they first evaluate what's inside them (variables, escapes, etc) and return a string built with evaluations' results and remaining literals. See Gory details of parsing quoting constructs for an illuminating discussion, with lots of detail.
And your example string happens to have a $/ there, which is one of Perl's global variables (see perlvar) so $pattern is different than expected; print it to see. (In this case the / is erroneous as discussed below but the point stands.)
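A quick way to see the difference is to compare the two quoting styles directly. This sketch assumes $/ still has its default value (a newline); the variable names are only for illustration.

```perl
use strict;
use warnings;

my $double = "/^current.*$/";    # $/ is interpolated here
my $single = '/^current.*$/';    # kept exactly as typed

printf "double-quoted: %d characters\n", length($double);
printf "single-quoted: %d characters\n", length($single);

# The double-quoted string lost its "$/" and gained the value of $/
print "double-quoted string ends with a newline\n" if $double =~ /\n\z/;
```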
Instead, either use single quotes to avoid interpretation of characters like $ and \ (etc) so that they are used in regex as such
my $pattern = q(^current.*$);
or, better, use the regex-specific qr operator
my $pattern = qr/^current.*$/;
which builds from its string a proper regex pattern (a special type of Perl value), and allows use of modifiers. In this case you need to escape characters that have a special meaning in regex if you want them to be treated as literals.
Note that there's no need for // for the regex, and they wouldn't be a part of the pattern anyway -- having them around the actual pattern is wrong.
Also, carefully consider all circumstances under which user input may end up being used.
It is brought up in a comment that users may submit a "pattern" with extra /'s. That'd be wrong, as mentioned above; only the pattern itself should be given (surrounded on the command-line by ', so that the shell doesn't interpret particular characters in it). More detail follows.
The /'s are clearly not meant as a part of the pattern, but are rather intended to come with the match operator, to delimit (quote) the regex pattern itself (in the larger expression) so that one can use string literals in the pattern. Or they are used for clarity, and/or to be able to specify global modifiers (even though those can be specified inside patterns as well).
But then if users still type them around the pattern the regex will use those characters as a part of the pattern and will try to match a leading /, etc; it will fail, quietly. Make sure that users know that they need to give a pattern alone, with no delimiters.
If this is likely to be a problem I'd check for delimiters and if found carry on with a "loud" (clear) warning. What makes this tricky is the fact that a pattern starting and ending with a slash is legitimate -- it is possible, if somewhat unlikely, that a user may want actual /'s in their pattern. So you can only ask, or raise a warning, not abort.
Note that with a pattern given in a variable, or with an expression yielding a pattern at runtime, the explicit match operator and delimiters aren't needed for matching; the variable or the expression's return is taken as a search pattern and used for matching. See The basics (perlre) and Binding Operators (perlop).
So you can do simply $name =~ $pattern. Of course $name =~ /$pattern/ is fine as well, where you can then give global modifiers after the closing /
The slashes are part of the matching operator m//, not part of the regex.
When I populate the regex from user input
my $pattern = shift;
and run the script as
58663971.pl '^current.*$'
it matches.
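One possible way to put the advice from this thread together is sketched below. compile_pattern is a hypothetical helper name, and the list of sample inputs is made up; the idea is to warn about slash delimiters (without aborting) and to catch patterns that fail to compile.

```perl
use strict;
use warnings;

# Compile a user-supplied pattern; returns undef if it is not a valid regex.
# compile_pattern is a hypothetical helper, not part of any library.
sub compile_pattern {
    my ($input) = @_;
    warn "pattern looks delimited by slashes: $input\n"
        if $input =~ m{^/.*/\z};
    my $re = eval { qr/$input/ };   # qr dies on a malformed pattern
    return $re;
}

my $name = 'currentStateVector';
for my $input ('^current.*$', '/^current.*$/', '(unclosed') {
    my $re = compile_pattern($input);
    if    (!defined $re) { print "$input: invalid pattern\n" }
    elsif ($name =~ $re) { print "$input: matches\n" }
    else                 { print "$input: doesn't match\n" }
}
```

The slash-delimited input still compiles, since slashes are ordinary characters inside a pattern, but it quietly fails to match, which is the situation the warning is meant to flag.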

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
sub repl {
my ($r) = @_;
my @terms = grep { $_ } split /(\$\d)/, $r;
return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other, in their original order and with everything (other than trailing empty strings) kept. This includes possible (leading) empty strings so those need be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
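Here is the whole thing as a runnable sketch, using the string, pattern, and replacement from the question. Two details differ from the snippet above and are my own choices: grep { length } is used so that a literal "0" term is not dropped, and the map test is tightened to /^\$\d$/. It also assumes the literal parts of $r contain no single quotes, since those would break the generated code.

```perl
use strict;
use warnings;

# Turn a replacement template such as '_$1$1_' into Perl source:
#   '_' . $1 . $1 . '_'
sub repl {
    my ($r) = @_;
    my @terms = grep { length } split /(\$\d)/, $r;
    return join ' . ', map { /^\$\d$/ ? $_ : q(') . $_ . q(') } @terms;
}

my $str = 'abcdeabCde';
my $p   = qr/b(.)d/;
my $r   = '_$1$1_';

print repl($r), "\n";               # the code that /ee will evaluate

(my $out = $str) =~ s/$p/repl($r)/gee;
print "$out\n";
```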
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings directly deny many patterns since interpolation is done -- for example, "\w" is just an escaped w (which is unrecognized). The single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we are getting a true regex. Then all modifiers may be used as well.

How can I match everything that is after the last occurrence of some char in a perl regular expression?

For example, return the part of the string that is after the last x in axxxghdfx445 (should return 445).
my($substr) = $string =~ /.*x(.*)/;
From perldoc perlre:
By default, a quantified subpattern is "greedy", that is, it will match
as many times as possible (given a particular starting location) while
still allowing the rest of the pattern to match.
That's why .*x will match up to the last occurrence of x.
The simplest way would be to use /([^x]*)$/
The first answer is a good one,
but when talking about "something that does not contain"...
I like to use the regex that "matches" it
my ($substr) = $string =~ /.*x([^x]*)$/;
Very useful in some cases.
the simplest way is not regular expression, but a simple split() and getting the last element.
$string="axxxghdfx445";
@s = split /x/ , $string;
print $s[-1];
Yet another way to do it. It's not as simple as a single regular expression, but if you're optimizing for speed, this approach will probably be faster than anything using regex, including split.
my $s = 'axxxghdfx445';
my $p = rindex $s, 'x';
my $match = $p < 0 ? undef : substr($s, $p + 1);
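For comparison, a small sketch that wraps the rindex approach in a helper and runs it next to the greedy regex from the first answer. after_last and show are hypothetical helper names, and the two extra test strings are made up to cover the edge cases (x at the very end, and no x at all).

```perl
use strict;
use warnings;

# Everything after the last occurrence of $ch, or undef if $ch is absent
sub after_last {
    my ($s, $ch) = @_;
    my $p = rindex $s, $ch;
    return $p < 0 ? undef : substr($s, $p + 1);
}

# Make undef and the empty string visible when printing
sub show {
    my ($v) = @_;
    return defined $v ? "'$v'" : 'undef';
}

for my $s ('axxxghdfx445', 'trailingx', 'none') {
    my $by_index   = after_last($s, 'x');
    my ($by_regex) = $s =~ /.*x(.*)/s;
    print "$s: rindex=", show($by_index), " regex=", show($by_regex), "\n";
}
```

Both approaches give an empty string when x is the last character and undef when there is no x.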
I'm surprised no one has mentioned the special variable that does this, $': "$'" returns everything after the matched string. (perldoc perlre)
my $str = 'axxxghdfx445';
$str =~ /.*x/; # greedy .* so the match ends at the last x (a bare /x/ would stop at the first)
# $' contains '445';
print $';
However, there is a cost (emphasis mine):
WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere
in the program, it has to provide them for every pattern match. This
may substantially slow your program. Perl uses the same mechanism to
produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parentheses. (To avoid this cost while retaining
the grouping behaviour, use the extended regular expression "(?: ... )"
instead.) But if you never use $&, "$`" or "$'", then patterns without
capturing parentheses will not be penalized. So avoid $&, "$'", and
"$`" if you can, but if you can't (and some algorithms really
appreciate them), once you've used them once, use them at will, because
you've already paid the price. As of 5.005, $& is not so costly as the
other two.
But wait, there's more! You get two operators for the price of one, act NOW!
As a workaround for this problem, Perl 5.10.0 introduces
"${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}", which are equivalent
to "$`", $& and "$'", except that they are only guaranteed to be
defined after a successful match that was executed with the "/p"
(preserve) modifier. The use of these variables incurs no global
performance penalty, unlike their punctuation char equivalents, however
at the trade-off that you have to tell perl when you want to use them.
my $str = 'axxxghdfx445';
$str =~ /.*x/p; # greedy .* so the match ends at the last x
# ${^POSTMATCH} contains '445';
print ${^POSTMATCH};
I would humbly submit that this route is the best and most straight-forward
approach in most cases, since it does not require that you do special things
with your pattern construction in order to retrieve the postmatch portion, and there
is no performance penalty.
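One thing to keep in mind with this route: ${^POSTMATCH} holds whatever follows the match, so where the match ends matters. A minimal sketch with the string from the question (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $str = 'axxxghdfx445';

# A bare /x/ matches the first x, so the postmatch is everything after it
$str =~ /x/p;
my $after_first = ${^POSTMATCH};

# A greedy .* pushes the end of the match to the last x
$str =~ /.*x/p;
my $after_last = ${^POSTMATCH};

print "after first x: $after_first\n";
print "after last x:  $after_last\n";
```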
Regular Expression : /([^x]+)$/ #assuming x is not last element of the string.