What is the difference between qr/ and m/ in Perl?

From Perldoc:
qr/STRING/msixpodualn
This operator quotes (and possibly compiles) its STRING as a regular
expression. STRING is interpolated the same way as PATTERN in
m/PATTERN/.
m/PATTERN/msixpodualngc
/PATTERN/msixpodualngc
Searches a string for a pattern match, and in scalar context returns
true if it succeeds, false if it fails. If no string is specified via
the =~ or !~ operator, the $_ string is searched. (The string
specified with =~ need not be an lvalue--it may be the result of an
expression evaluation, but remember the =~ binds rather tightly.) See
also perlre.
Options are as described in qr// above
I'm sure I'm missing something obvious, but it's not clear at all to me how these options are different - they seem basically synonymous. When would you use qr// instead of m//, or vice versa?

The m// operator is for matching, whereas qr// produces a compiled pattern (a regex object that also stringifies to a pattern) that you can stick in a variable and store for later. It's a quoted regular expression pattern.
Pre-compiling the pattern this way is useful for optimising your run-time cost, e.g. if you are using a fixed pattern in a loop with millions of iterations, or you want to pass patterns around between function calls or use them in a dispatch table.
# match now
if ( $foo =~ m/pattern/ ) { ... }
# compile and use later
my $pattern = qr/pattern/i;
print $pattern; # (?^ui:pattern)
if ($foo =~ m/$pattern/) { ... }
The structure of the string (in this example, (?^ui:pattern)) is explained in perlre. Basically (?flags:...) creates a non-capturing sub-pattern with built-in flags, and the ^ resets the flags to their defaults (it is shorthand for d-imnsx) before the flags listed after it are switched on. You can use this inside other patterns too, for example to turn case-insensitivity on and off for parts of your pattern.
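As a rough illustration of the points above (the sample strings and variable names are made up for this sketch), a qr// value can be inspected with ref and embedded in a larger pattern, where it keeps its own flags:

```perl
use strict;
use warnings;

# Compile once; the /i flag travels with the compiled object.
my $word = qr/perl/i;

# A qr// value is a reference of type 'Regexp'.
my $type = ref $word;

# Embedded in a larger pattern, only the qr// part stays case-insensitive.
my $phrase = qr{^I like $word\z};

my @report;
for my $s ('I like PERL', 'i like perl') {
    push @report, sprintf '%-12s %s', $s, ($s =~ $phrase ? 'match' : 'no match');
}

print "$type\n";
print "$_\n" for @report;
```

The first string should match and the second should not, because the leading "I like" part is still case-sensitive.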

Related

Regular expression chaining/mixing in perl

Consider the following:
my $p1 = "(a|e|o)";
my $p2 = "(x|y|z)";
$text =~ s/($p1)( )[$p2]([0-9])/some action to be done/g;
Is the regular expression pattern in string form equal to the concatenation of the elements in the above? That is, can the above be written as
$text =~ s/((a|e|o))( )[(x|y|z)]([0-9])/ some action to be done/g;
Well, yes, variables in a pattern get interpolated into the pattern, in a double-quoted context, and the expressions you show are equivalent. See discussion in perlretut tutorial.
(I suggest using qr operator for that instead. See discussion of that in perlretut as well.)
But that pattern clearly isn't right.
Why those double parentheses, ((a|e|o))? Either have the alternation in the variable and capture it in the regex
my $p1 = 'a|e|o'; # then use in a regex as
/($p1)/ # gets interpolated into: /(a|e|o)/
or indicate capture in the variable but then drop parens in the regex
my $p1 = '(a|e|o)'; # use as
/$p1/ # (same as above)
The capturing parentheses do their job either way, and in both cases a match (a or e or o) is captured into a suitable variable, $1 in your expression since this is the first capture.
A pattern of [(x|y|z)] matches any one of the characters (, x, |, ... (etc) -- that [...] is a character class, which matches any single one of the characters inside (a few have a special meaning). So, again, either use the alternation and capture in your variable
my $p2 = '(x|y|z)'; # then use as
/$p2/
or do it using the character class
my $p2 = 'xyz'; # and use as
/([$p2])/ # --> /([xyz])/
So altogether you'd have something like
use warnings;
use strict;
use feature 'say';
my $text = shift // q(e z7);
my $p1 = 'a|e|o';
my $p2 = 'xyz';
$text =~ s/($p1)(\s)([$p2])([0-9])/replacement/g;
say $_ // 'undef' for $1, $2, $3, $4;
I added \s instead of a literal single space, and I capture the character-class match with () (the pattern from the question doesn't), since that seems to be wanted.
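To see the difference between the character class and the alternation directly, here is a small sketch; the test characters are arbitrary and the variable names are only for illustration:

```perl
use strict;
use warnings;

my $class = qr/^[(x|y|z)]$/;   # character class: any ONE of ( x | y z )
my $alt   = qr/^(?:x|y|z)$/;   # alternation: x or y or z

my %seen;
for my $ch ('x', '|', '(', 'a') {
    # Record [class matched?, alternation matched?] for each character
    $seen{$ch} = [ ($ch =~ $class ? 1 : 0), ($ch =~ $alt ? 1 : 0) ];
    printf "%-3s class:%d alt:%d\n", $ch, @{ $seen{$ch} };
}
```

The class accepts "|" and "(" as ordinary characters, while the alternation accepts only x, y or z.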
Neither snippet is valid Perl code. They are therefore equivalent, but only in the sense that neither will compile.
But say you have a valid m//, s/// or qr// operator. Then yes, variables in the pattern would be handled as you describe.
For example,
my $p1 = "(a|e|o)";
my $p2 = "(x|y|z)";
$text =~ /($p1)( )[$p2]([0-9])/g;
is equivalent to
$text =~ /((a|e|o))( )[(x|y|z)]([0-9])/g;
As mentioned in an answer to a previous question of yours, (x|y|z) is surely a bug, and should be xyz.
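One way to check the equivalence yourself is to compile both forms with qr// and compare how they stringify. This is only a sketch; the names $interp and $literal are not from the question:

```perl
use strict;
use warnings;

my $p1 = "(a|e|o)";
my $p2 = "(x|y|z)";

# Same pattern, once with interpolated variables and once written out
my $interp  = qr/($p1)( )[$p2]([0-9])/;
my $literal = qr/((a|e|o))( )[(x|y|z)]([0-9])/;

print "interpolated: $interp\n";
print "literal:      $literal\n";
print "identical\n" if "$interp" eq "$literal";
```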

matching cond in perl using double exclamation

if ($a =~ m!^$var/!)
$var is a key in a two dimensional hash and $a is a key in another hash.
What is the meaning of this expression?
This is a regular expression ("regex"), where the ! character is used as the delimiter for the pattern that is to be matched in the string that it binds to via the =~ operator (the $a† here).
It may clear it up to consider the same regex with the usual delimiter instead, $a =~ /^$var\// (then m may be omitted); but now any / used in the pattern clearly must be escaped. To avoid that unsightly and noisy \/ combo one often uses another character for the delimiter, as nearly any character may be used (my favorite is the curlies, m{^$var/}). ‡ §
This regex in the question tests whether the value in the variable $a begins with (by ^ anchor) the value of the variable $var followed by / (variables are evaluated and the result used). §
† Not a good choice for a variable name since $a and $b are used by the builtin sort
‡ With the pattern prepared ahead of time the delimiter isn't even needed
my $re = qr{^$var/};
if ($string =~ $re) ...
(but I do like to still use // then, finding it clearer altogether)
Above I use qr but a simple q() would work just fine (while I absolutely recommend qr). These take nearly any characters for the delimiter, as well.
§ Inside a pattern the evaluated variables are used as regex patterns, which is wrong in general (when this is intended they should be compiled using qr and thus used as subpatterns).
An unimaginative example: a variable $var = q(\s) (literal backslash followed by letter s) evaluated inside a pattern yields the \s sequence which is then treated as a regex pattern, for whitespace. (Presumably unintended; we just wanted \ and s.)
This is remedied by using quotemeta, /\Q$var\E/, so that possible metacharacters in $var are escaped; this results in the correct pattern for the literal characters, \\s. So a correct way to write the pattern is m{^\Q$var\E/}.
Failure to do this also opens the door to injection bugs. Thanks to ikegami for commenting on this.
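A minimal sketch of the quotemeta point, using a made-up key that contains a regex metacharacter (the values of $var and $path are assumptions for illustration only):

```perl
use strict;
use warnings;

my $var  = 'a.b';        # hypothetical key containing a metacharacter
my $path = 'axb/file';   # does NOT literally start with "a.b/"

# Without \Q...\E the dot matches any character, so this wrongly succeeds
my $unsafe = $path =~ m{^$var/}     ? 1 : 0;

# With \Q...\E the dot is escaped and only a literal "a.b/" prefix matches
my $safe   = $path =~ m{^\Q$var\E/} ? 1 : 0;

print "unsafe: $unsafe, safe: $safe\n";
```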
The match operator (m/.../) is one of Perl's "quote-like" operators. The standard usage is to use slashes before and after the regex that goes in the middle of the operator (and if you use slashes, then you can omit the m from the start of the operator). But if the regex itself contains a slash then it is convenient to use a different delimiter instead to avoid having to escape the embedded slash. In your example, the author has decided to use exclamation marks, but any non-whitespace character can be used.
Many Perl operators work like this - m/.../, s/.../.../, tr/.../.../, q/.../, qq/.../, qr/.../, qw/.../, qx/.../ (I've probably forgotten some).
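For comparison, here is the same test written with three different delimiters; the sample values are invented for this sketch. All three forms should behave identically, and only the slash-delimited one needs the escaped \/:

```perl
use strict;
use warnings;

my $var = 'usr';
my $str = 'usr/local';

my @results = (
    ($str =~ /^$var\//  ? 1 : 0),   # default delimiter, slash must be escaped
    ($str =~ m!^$var/!  ? 1 : 0),   # exclamation marks, as in the question
    ($str =~ m{^$var/}  ? 1 : 0),   # braces
);

print "@results\n";
```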

How to use Regular expression in perl if both regular expression and strings are variables

I have two variables coming from some user inputs. One is a string that needs to be checked and other one is a regular expression as below.
Following code doesn't work.
my $pattern = "/^current.*$/";
my $name = "currentStateVector";
if($name =~ $pattern) {
print "matches \n";
} else {
print "doesn't match \n";
}
And following does.
if($name =~ /^current.*$/) {
print "matches \n";
} else {
print "doesn't match \n";
}
What's the reason for this? I have the regular expression stored in a variable. Is there another way to store this variable or modify it?
The double-quotes that you use interpolate -- they first evaluate what's inside them (variables, escapes, etc) and return a string built with evaluations' results and remaining literals. See Gory details of parsing quoting constructs for an illuminating discussion, with lots of detail.
And your example string happens to have a $/ there, which is one of Perl's global variables (see perlvar) so $pattern is different than expected; print it to see. (In this case the / is erroneous as discussed below but the point stands.)
Instead, either use single quotes to avoid interpretation of characters like $ and \ (etc) so that they are used in regex as such
my $pattern = q(^current.*$);
or, better, use the regex-specific qr operator
my $pattern = qr/^current.*$/;
which builds from its string a proper regex pattern (a special type of Perl value), and allows use of modifiers. In this case you need to escape characters that have a special meaning in regex if you want them to be treated as literals.
Note that there's no need for // for the regex, and they wouldn't be a part of the pattern anyway -- having them around the actual pattern is wrong.
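To see what the double quotes did to the string in the question, something along these lines can be run (a sketch; the names $double and $single are only for illustration, and $/ is assumed to still hold its default value of a newline):

```perl
use strict;
use warnings;

my $double = "/^current.*$/";   # "$/" is replaced by the input record separator
my $single = '^current.*$';     # single quotes: taken literally
my $name   = 'currentStateVector';

# Show the newline that ended up in the double-quoted string
(my $visible = $double) =~ s/\n/\\n/g;
print "double-quoted string is: $visible\n";

print "single-quoted pattern matches\n"        if $name =~ $single;
print "double-quoted pattern does not match\n" if $name !~ $double;
```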
Also, carefully consider all circumstances under which user input may end up being used.
It is brought up in a comment that users may submit a "pattern" with extra /'s. That'd be wrong, as mentioned above; only the pattern itself should be given (surrounded on the command-line by ', so that the shell doesn't interpret particular characters in it). More detail follows.
The /'s are clearly not meant as a part of the pattern, but are rather intended to come with the match operator, to delimit (quote) the regex pattern itself (in the larger expression) so that one can use string literals in the pattern. Or they are used for clarity, and/or to be able to specify global modifiers (even though those can be specified inside patterns as well).
But then if users still type them around the pattern the regex will use those characters as a part of the pattern and will try to match a leading /, etc; it will fail, quietly. Make sure that users know that they need to give a pattern alone, with no delimiters.
If this is likely to be a problem I'd check for delimiters and if found carry on with a "loud" (clear) warning. What makes this tricky is the fact that a pattern starting and ending with a slash is legitimate -- it is possible, if somewhat unlikely, that a user may want actual /'s in their pattern. So you can only ask, or raise a warning, not abort.
Note that with a pattern given in a variable, or with an expression yielding a pattern at runtime, the explicit match operator and delimiters aren't needed for matching; the variable or the expression's return is taken as a search pattern and used for matching. See The basics (perlre) and Binding Operators (perlop).
So you can do simply $name =~ $pattern. Of course $name =~ /$pattern/ is fine as well, where you can then give global modifiers after the closing /
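Putting these suggestions together, a user-supplied pattern could be handled roughly as follows. This is only a sketch: compile_pattern is a hypothetical helper, not something from a module, and the wording of the warning is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern.
# Returns a compiled regex, or undef if the pattern is not valid.
sub compile_pattern {
    my ($input) = @_;

    # A pattern wrapped in slashes is legal but probably a mistake, so only warn
    warn "Note: pattern is wrapped in slashes; they will be matched literally\n"
        if $input =~ m{\A/.*/\z}s;

    my $re = eval { qr/$input/ };   # a broken pattern dies here and leaves $re undef
    return $re;
}

my $name = 'currentStateVector';
my $good = compile_pattern('^current.*$');
my $bad  = compile_pattern('(');    # unbalanced parenthesis

print "matches\n"         if defined $good && $name =~ $good;
print "invalid pattern\n" if !defined $bad;
```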
The slashes are part of the matching operator m//, not part of the regex.
When I populate the regex from user input
my $pattern = shift;
and run the script as
58663971.pl '^current.*$'
it matches.

What does =~ m/\#F/ mean in Perl?

open DMLOG, "<dmlog.txt" or &error("Can't open log file: $!");
chomp(@entirelog=<DMLOG>);
close DMLOG;
for $line (@entirelog)
{
if ($line =~ m/\#F/)
{
$titlecolumn = $line;
last;
}
}
I found that =~ is a regular expression I think, but I don't quite understand what its doing here.
It assigns to $titlecolumn the first line that contains a # followed by an F.
The =~ is the binding operator and applies a regex to a string. That regex would usually be written as /#F/. The m prefix can be used to emphasize that the following literal is a regex (important when other delimiters are used).
Do you understand what regular expressions are? Or, is the =~ throwing you off?
In most programming languages, you would see something like this:
if ( regexp(line, "/#F/") ) {
...
}
However, in Perl, regular expressions are inspired by Awk's syntax. Thus:
if ( $line =~ /#F/ ) {
...
}
The =~ means the regular expression will act upon the variable name on the left. If the pattern #F is found in $line, the if statement is true.
You might want to look at the Regular Expression Tutorial if you're not familiar with them. Regular expressions are extremely powerful and very commonly used in Perl. In fact, they are used so heavily in Perl that this is one of the reasons developers in other languages claim that Perl is a write-only language.
It is called the binding operator. It is used to match the pattern on the RHS against the variable on the LHS. Similarly you have !~, which negates the match.
For your particular case:
$line =~ m/\#F/
This tests whether $line matches the pattern /#F/.
Yes, =~ is the binding operator binding an expression to the pattern match m//.
The if statement checks whether a line matches the given regular expression. In this case, it checks whether there is a hash sign followed by a capital F.
The backslash has just been added (maybe) to avoid treating # as a comment sign (which isn't needed).
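A short sketch of the loop on an in-memory list instead of a file (the log lines are invented), which also shows the one case I know of where the backslash does matter, namely under the /x flag, where an unescaped # starts a comment:

```perl
use strict;
use warnings;

my @entirelog = ('some header', 'ID #F title', 'later #F line');

# Same logic as the question: keep the first line containing "#F"
my $titlecolumn;
for my $line (@entirelog) {
    if ($line =~ m/\#F/) {
        $titlecolumn = $line;
        last;
    }
}
print "title column: $titlecolumn\n";

# Without /x the escaped and unescaped forms behave the same
my $same = ('a#F' =~ /#F/ ? 1 : 0) == ('a#F' =~ /\#F/ ? 1 : 0) ? 1 : 0;

# With /x, "#F" is a comment, so this pattern is really just "ab"
my $comment = 'ab' =~ /ab #F/x ? 1 : 0;

print "same: $same, comment under /x: $comment\n";
```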

How to have a variable as regex in Perl

I think this question is repeated, but searching wasn't helpful for me.
my $pattern = "javascript:window.open\('([^']+)'\);";
$mech->content =~ m/($pattern)/;
print $1;
I want to have an external $pattern in the regular expression. How can I do this? The current one returns:
Use of uninitialized value $1 in print at main.pm line 20.
$1 was empty, so the match did not succeed. I'll make up a constant string in my example of which I know that it will match the pattern.
Declare your regular expression with qr, not as a simple string. Also, you're capturing twice, once in $pattern for the open call's parentheses, once in the m operator for the whole thing, therefore you get two results. Instead of $1, $2 etc. I prefer to assign the results to an array.
my $pattern = qr"javascript:window.open\('([^']+)'\);";
my $content = "javascript:window.open('something');";
my @results = $content =~ m/($pattern)/;
# the match returns the list
# (
# q{javascript:window.open('something');},
# 'something'
# )
When I compile that string into a regex, like so:
my $pattern = "javascript:window.open\('([^']+)'\);";
my $regex = qr/$pattern/;
I get just what I think I should get, following regex:
(?-xism:javascript:window.open('([^']+)');)
Notice that it is looking for a capture group and not an open paren at the end of 'open'. And in that capture group, the first thing it expects is a single quote. So it will match
javascript:window.open'fum';
but not
javascript:window.open('fum');
One thing you have to learn is that in Perl, "\(" is the same thing as "(": you're just telling Perl that you want a literal '(' in the string. In order to get lasting escapes, you need to double them.
my $pattern = "javascript:window.open\\('([^']+)'\\);";
my $regex = qr/$pattern/;
Actually preserves the literal ( and yields:
(?-xism:javascript:window.open\('([^']+)'\);)
Which is what I think you want.
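The effect of the quoting can be checked with a cut-down version of the pattern. This is a sketch with shortened, made-up strings rather than the full window.open pattern:

```perl
use strict;
use warnings;

my $double = "open\(x\)";     # double quotes consume the backslashes
my $single = 'open\(x\)';     # single quotes keep them
my $regex  = qr/open\(x\)/;   # no intermediate string at all

print "double-quoted: $double\n";
print "single-quoted: $single\n";

my $target = 'open(x)';
printf "double: %s\n", ($target =~ /$double/ ? 'match' : 'no match');
printf "single: %s\n", ($target =~ /$single/ ? 'match' : 'no match');
printf "qr:     %s\n", ($target =~ $regex    ? 'match' : 'no match');
```

Only the double-quoted version fails, because its parentheses have become a capture group and the pattern now looks for "openx".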
As for your question, you should always test the results of a match before using it.
if ( $mech->content =~ m/($pattern)/ ) {
print $1;
}
makes much more sense. And if you want to see it regardless, then it's already implicit in that idea that it might not have a value. i.e., you might not have matched anything. In that case it's best to put alternatives
$mech->content =~ m/($pattern)/;
print $1 || 'UNDEF!';
However, I prefer to grab my captures in the same statement, like so:
my ( $open_arg ) = $mech->content =~ m/($pattern)/;
print $open_arg || 'UNDEF!';
The parens around $open_arg put the match into "list context", so it returns the captures as a list. Here I'm only expecting one value, so that's all I'm providing for.
Finally, one of the root causes of your problems is that you do not need to specify your expression in a string in order for your regex to be "portable". You can get perl to pre-compile your expression. That way, you only care what instructions the characters are to a regex and not whether or not you'll save your escapes until it is compiled into an expression.
A compiled regex will interpolate itself into other regexes properly. Thus, you get a portable expression that interpolates just as well as a string--and specifically correctly handles instructions that could be lost in a string.
my $pattern = qr/javascript:window.open\('([^']+)'\);/;
Is all that you need. Then you can use it, just as you did. Although, putting parens around the whole thing, would return the whole matched expression (and not just what's between the quotes).
You do not need the parentheses in the match pattern. With them, it will match the whole pattern and return that as $1, which I am guessing is not what you want, but I am only guessing.
$mech->content =~ m/$pattern/;
or
$mech->content =~ m/(?:$pattern)/;
These are the clustering, non-capturing parentheses.
The way you are doing it is correct.
The solutions have been already given, I'd like to point out that the window.open call might have multiple parameters included in "" and grouped by comma like:
javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");
There might be spaces between the function name and parentheses, so I'd use a slightly different regex for that:
my $pattern = qr{
javascript:window.open\s*
\(
([^)]+)
\)
}x;
print $1 if $text =~ /$pattern/;
Now you have all parameters in $1 and can process them afterwards with split /,/, $stuff and so on.
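A possible way to carry on from there, sketched with an invented sample string. Note that a plain split on commas would also break up an argument that itself contains commas, such as "status=1,toolbar=1", so this is only a starting point:

```perl
use strict;
use warnings;

my $pattern = qr{
    javascript:window\.open \s*   # function name, optional whitespace
    \(
    ( [^)]+ )                     # everything up to the closing paren
    \)
}x;

# Made-up sample input
my $text = q{javascript:window.open ("page.html", "mywindow");};

# List context returns the capture; $args stays undef if nothing matched
my ($args) = $text =~ $pattern;
my @params = defined $args ? split(/\s*,\s*/, $args) : ();

print "$_\n" for @params;
```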
It reports an uninitialized value because $1 is undefined. $1 is undefined because you have created a nested matching group by wrapping a second set of parentheses around the pattern. It will also be undefined if nothing matches your pattern.