Match returned as array instead of variable - regex

I am pulling a simple string from between two XML-like tags, but the match is being returned as an array instead of a variable. I am using the following code:
$finishState = $inFileLine =~ m(<State>(.*?)<\/State>)g;
And the value of $inFileLine is:
<recordNum>SW001</recordNum><state>Assigned</state><title>Fix Something</title>
When I run this code, a "1" is stored in $finishState. When I change $finishState to @finishState the value "Assigned" is stored properly.
I'm unsure why and how to fix this. I'm absolutely not able to use an XML parser.
While having the value I need in an array doesn't kill me I would like to find out why this is happening and modify my regexp to correctly populate the variable. I also considered using grep, sed, awk, etc. but a match seems like a concise and clean way to do this.

This is called context. Perl is a context-sensitive language: the result an operator gives depends on the context in which you evaluate it.
There are two main types of context in Perl:
Scalar context.
List context.
Lists are collections of scalars. We use arrays and hashes to hold them.
my $finishState = $inFileLine =~ m(<State>(.*?)<\/State>)g;
In this case you are evaluating the expression in scalar context, which gives you a boolean value saying whether it matched, i.e. 1 (matched) in your case.
my @finishState = $inFileLine =~ m(<State>(.*?)<\/State>)g;
In this case you are evaluating the expression in list context, so it gives you all the matches in the array.
So, if you know there is only one match and you want to store it in a scalar, put parentheses around the variable to evaluate the match in list context,
i.e.
my ($finishState) = $inFileLine =~ m(<State>(.*?)<\/State>)g;
Now $finishState will contain your match.
If there is more than one match, then $finishState will contain the first match. Check this and this node for more information on contexts.
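As a minimal sketch, the three forms can be compared side by side on the sample line from the question. The /i flag is an addition here so that <State> matches the lowercase <state> tags in the sample data, and the variable names $matched and @all are only illustrative.

```perl
use strict;
use warnings;

# Sample line from the question.
my $inFileLine = '<recordNum>SW001</recordNum><state>Assigned</state><title>Fix Something</title>';

# Scalar context: the result says only whether the match succeeded.
my $matched = $inFileLine =~ m{<State>(.*?)</State>}i;

# List context (array): every capture found by the global match is returned.
my @all = $inFileLine =~ m{<State>(.*?)</State>}gi;

# List context (parenthesized scalar): only the first capture is kept.
my ($finishState) = $inFileLine =~ m{<State>(.*?)</State>}i;

print "matched:     $matched\n";       # 1
print "all:         @all\n";           # Assigned
print "finishState: $finishState\n";   # Assigned
```

The only difference between the last two assignments is how many elements of the returned list the left-hand side has room for.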

$finishState = $inFileLine =~ m(<State>(.*?)<\/State>)g;
evaluates the regular expression in scalar context, and populates $finishState with a true (1) or false ("") value.
@finishState = $inFileLine =~ m(<State>(.*?)<\/State>)g;
or even
($finishState) = $inFileLine =~ m(<State>(.*?)<\/State>)g;
evaluate the regular expression in list context. The distinction between scalar context and list context is important in Perl, and one of the greatest sources of confusion to new Perl programmers. Many functions and operations behave differently in the two different contexts, and often the only way to be sure what an operation is supposed to do in a particular context is to read the docs.
In this case, @finishState will be populated by a list of all strings matching the capture group in the regular expression (i.e., all strings of length 0 or more enclosed by <State> and </State> tags), which in your example is a list of one element with the value Assigned.

Usually you would refer to $1 to see the content of the first matching parentheses:
$inFileLine = '<recordNum>SW001</recordNum><state>Assigned</state><title>Fix Something</title>';
$inFileLine =~ m(<State>(.*?)<\/State>)i;
$finishState = $1;
print $finishState;
outputs
Assigned
perlrequick states that
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex.
But the usual way would be to check the return value of the regex to find out whether there is any match, and to refer to $1, $2 etc. to see the matches.
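A short sketch of that check-then-use pattern, assuming the same sample line. The guard matters because $1 is not reset by a failed match, so reading it unconditionally can pick up a value left over from an earlier successful match.

```perl
use strict;
use warnings;

my $inFileLine = '<recordNum>SW001</recordNum><state>Assigned</state><title>Fix Something</title>';

my $finishState;
if ($inFileLine =~ m{<state>(.*?)</state>}i) {
    # $1 is only trustworthy right after a successful match.
    $finishState = $1;
}

print defined $finishState ? "$finishState\n" : "no <state> tag found\n";
```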

Related

What is the difference between qr/ and m/ in Perl?

From Perldoc:
qr/STRING/msixpodualn
This operator quotes (and possibly compiles) its STRING as a regular
expression. STRING is interpolated the same way as PATTERN in
m/PATTERN/.
m/PATTERN/msixpodualngc
/PATTERN/msixpodualngc
Searches a string for a pattern match, and in scalar context returns
true if it succeeds, false if it fails. If no string is specified via
the =~ or !~ operator, the $_ string is searched. (The string
specified with =~ need not be an lvalue--it may be the result of an
expression evaluation, but remember the =~ binds rather tightly.) See
also perlre.
Options are as described in qr// above
I'm sure I'm missing something obvious, but it's not clear at all to me how these options are different - they seem basically synonymous. When would you use qr// instead of m//, or vice versa?
The m// operator is for matching, whereas qr// produces a pattern (a compiled regex object that stringifies to the pattern text) that you can stick in a variable and store for later. It's a quoted regular expression pattern.
Pre-compiling the pattern this way is useful for optimising your run-time cost, e.g. if you are using a fixed pattern in a loop with millions of iterations, or you want to pass patterns around between function calls or use them in a dispatch table.
# match now
if ( $foo =~ m/pattern/ ) { ... }
# compile and use later
my $pattern = qr/pattern/i;
print $pattern; # (?^i:pattern), or (?^ui:pattern) when Unicode string semantics are in effect
if ($foo =~ m/$pattern/) { ... }
The structure of the string (in this example, (?^ui:pattern)) is explained in perlre. Basically (?:) creates a sub-pattern with built-in flags, and the ^ resets the flags to their defaults before the ones that follow it are applied. You can use this inside other patterns too, for example to turn case-insensitivity on and off for parts of your pattern.
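To illustrate both uses, here is a small sketch that compiles a pattern once, reuses it on several strings, and embeds it in a larger pattern. The input lines and the variable names ($word, @hits, @strict) are made up for the example.

```perl
use strict;
use warnings;

# Compile once; the /i flag travels with the pattern object.
my $word = qr/perl/i;

# Hypothetical input lines, for illustration only.
my @lines = ('I like Perl', 'Python too', 'PERL 5 rules');

# Reuse the compiled pattern directly ...
my @hits = grep { $_ =~ $word } @lines;

# ... or embed it in a larger pattern; the outer part stays case-sensitive.
my @strict = grep { m{^I like $word\z} } @lines;

print scalar(@hits), " lines mention perl\n";       # 2 lines mention perl
print scalar(@strict), " line matches exactly\n";   # 1 line matches exactly
```

Because the flags are stored inside the qr// object, only the embedded part of the second pattern is case-insensitive.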

How to use Regular expression in perl if both regular expression and strings are variables

I have two variables coming from some user inputs. One is a string that needs to be checked and other one is a regular expression as below.
Following code doesn't work.
my $pattern = "/^current.*$/";
my $name = "currentStateVector";
if($name =~ $pattern) {
print "matches \n";
} else {
print "doesn't match \n";
}
And following does.
if($name =~ /^current.*$/) {
print "matches \n";
} else {
print "doesn't match \n";
}
What's the reason for this? I have the regular expression stored in a variable. Is there another way to store this variable or modify it?
The double-quotes that you use interpolate -- they first evaluate what's inside them (variables, escapes, etc) and return a string built with evaluations' results and remaining literals. See Gory details of parsing quoting constructs for an illuminating discussion, with lots of detail.
And your example string happens to have a $/ there, which is one of Perl's global variables (see perlvar) so $pattern is different than expected; print it to see. (In this case the / is erroneous as discussed below but the point stands.)
Instead, either use single quotes to avoid interpretation of characters like $ and \ (etc) so that they are used in regex as such
my $pattern = q(^current.*$);
or, better, use the regex-specific qr operator
my $pattern = qr/^current.*$/;
which builds from its string a proper regex pattern (a special type of Perl value), and allows use of modifiers. In this case you need to escape characters that have a special meaning in regex if you want them to be treated as literals.
Note that there's no need for // for the regex, and they wouldn't be a part of the pattern anyway -- having them around the actual pattern is wrong.
Also, carefully consider all circumstances under which user input may end up being used.
It is brought up in a comment that users may submit a "pattern" with extra /'s. That'd be wrong, as mentioned above; only the pattern itself should be given (surrounded on the command-line by ', so that the shell doesn't interpret particular characters in it). More detail follows.
The /'s are clearly not meant as a part of the pattern, but are rather intended to come with the match operator, to delimit (quote) the regex pattern itself (in the larger expression) so that one can use string literals in the pattern. Or they are used for clarity, and/or to be able to specify global modifiers (even though those can be specified inside patterns as well).
But then if users still type them around the pattern the regex will use those characters as a part of the pattern and will try to match a leading /, etc; it will fail, quietly. Make sure that users know that they need to give a pattern alone, with no delimiters.
If this is likely to be a problem I'd check for delimiters and if found carry on with a "loud" (clear) warning. What makes this tricky is the fact that a pattern starting and ending with a slash is legitimate -- it is possible, if somewhat unlikely, that a user may want actual /'s in their pattern. So you can only ask, or raise a warning, not abort.
Note that with a pattern given in a variable, or with an expression yielding a pattern at runtime, the explicit match operator and delimiters aren't needed for matching; the variable or the expression's return is taken as a search pattern and used for matching. See The basics (perlre) and Binding Operators (perlop).
So you can simply do $name =~ $pattern. Of course $name =~ /$pattern/ is fine as well, and there you can give global modifiers after the closing /.
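One way the advice above could be put together is sketched below. compile_user_pattern is a hypothetical helper name, not anything from Perl itself; it warns about slash-wrapped input rather than rejecting it, and traps the error that qr// raises for an invalid pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: turn user-supplied text into a compiled pattern.
sub compile_user_pattern {
    my ($text) = @_;

    # A leading and trailing slash is legal in a pattern, so warn, don't abort.
    if ($text =~ m{\A/.*/\z}s) {
        warn "Pattern '$text' is wrapped in slashes; they will be matched literally\n";
    }

    # qr// dies on an invalid pattern, so trap the error.
    my $re = eval { qr/$text/ };
    return $re;    # undef if the pattern did not compile
}

my $name    = 'currentStateVector';
my $pattern = compile_user_pattern('^current.*$');

if (defined $pattern && $name =~ $pattern) {
    print "matches\n";
} else {
    print "doesn't match\n";
}
```

In a real script the pattern text would come from shift or another input source instead of a literal; whether a warning is the right reaction to slash-wrapped input depends on who the users are.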
The slashes are part of the matching operator m//, not part of the regex.
When I populate the regex from user input
my $pattern = shift;
and run the script as
58663971.pl '^current.*$'
it matches.

Why do you need parenthesis to save the result of a regular expression in Perl?

Why does this return the string between foo and bar:
my ($var) = $my_string =~ /foo(.*)bar/;
And this return the number 1 if there is a hit and nothing if there is no match:
my $var = $my_string =~ /foo(.*)bar/;
Specifically, what are the parentheses around the variable doing?
As with many things in Perl, the result of that expression differs based on the context it is put in.
The pattern match operator returns a list of captures in list context (your first example). If the operator is evaluated in scalar context (your second example), a boolean indicating whether it matched is returned instead.
To answer your specific question, placing parens around $var forces it into list context and assigns the first element of the list returned from the pattern match to $var.
This is effectively the same as these statements:
my @matches = $my_string =~ /foo(.*)bar/;
my $var = $matches[0];
Because in the first case you are doing regex matching in list context, and http://perldoc.perl.org/perlrequick.html says of that case:
In list context, a match /regex/ with groupings will return the list of matched values ($1,$2,...)
In the second case, the match is in scalar context and returns true (1).
Because the regex match operator returns a list, and you must provide a list context if you want the list values. In the parentheses case you provide a list context, so the variable gets set... but only the first value returned would be saved. If your regex returned multiple sub-matches you'd need to provide more than one variable, as in:
($a,$b) = $string =~ /\s+(\S+)\s+(\S+)/;
In the bare (non-parentheses) case, you provide scalar context, and the match operator in scalar context returns a boolean indicating whether the pattern matched or not.
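A brief sketch of the multi-variable form, using a made-up input string. It also shows a related idiom: a list assignment evaluated in scalar context yields the number of values on its right-hand side, so it can serve as the condition of an if.

```perl
use strict;
use warnings;

my $string = 'id   alpha   beta';   # hypothetical input

# Two groups, two variables: list assignment fills them in order.
my ($first, $second) = $string =~ /\s+(\S+)\s+(\S+)/;

# A failed match returns an empty list, so the assignment counts 0 values
# and the condition is false.
if (my ($number) = $string =~ /(\d+)/) {
    print "number: $number\n";
} else {
    print "no digits; first=$first second=$second\n";
}
```

With this input the else branch runs, because the string contains no digits.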

Perl match only returning "1". Booleans? Why?

This has got to be obvious but I'm just not seeing it.
I have a document containing thousands of records just like the one below:
Row:1 DATA:
[0]37755442
[1]DDG00000010
[2]FALLS
[3]IMAGE
[4]Defect
[5]3
[6]CLOSED
I've managed to get each record separated and I'm now trying to parse out each field.
I'm trying to match the numbered headers so that I can pull out the data that succeeds them but the problem is that my matches are only returning me "1" when they succeed and nothing if they don't. This is happening for any match I try to apply.
For instance, applied to a simple word within each record:
my($foo) = $record=~ /Defect/;
print STDOUT $foo;
prints out a "1" for each record if it contains "Defect" and nothing if it contains something else.
Alternatively:
$record =~ /Defect/;
print STDOUT $1;
prints absolutely nothing.
$record =~ s/Defect/Blefect/
will replace "Defect" with "Blefect" perfectly fine on the other hand.
I'm really confused as to why the returns on my matches are so screwy.
Any help would be much appreciated.
You need to use capturing parentheses to actually capture:
if ($record =~ /(Defect)/ ) {
print "$1\n";
}
I think what you really want is to wrap the regex in parentheses:
my($foo) = $record=~ /(Defect)/;
In list context, the groups are returned, not the match itself. And your original code has no groups.
The =~ Perl operator takes a string (left operand) and a regular expression (right operand) and matches the string against the RE, returning a boolean value (true or false) depending on whether the RE matches.
Perl doesn't really have a boolean type. Instead, every value (of any type) is treated as either true or false in a boolean context. Most things are true, but the empty string, the number 0, the string '0', and the special undef value are false. So when returning a boolean, Perl generally uses 1 for true and '' (the empty string) for false.
Now as to your last question, where trying to print $1 prints nothing: whenever a regular expression matches, Perl sets $1, $2, ... to the values of the parenthesized subexpressions within the RE. In your example, however, there are no parenthesized subexpressions, so $1 is never set by this match. If you change it to
$record =~ /(Defect)/;
print STDOUT $1;
You'll get something more like what you expect (Defect if it matches and nothing if it doesn't).
The most common idiom for regexp matching I generally see is something like:
if ($string =~ /regexp with () subexpressions/) {
... code that uses $1 etc for the subexpressions matched
} else {
... code for when the expression doesn't match at all
}
From perlop, Quote and Quote-Like operators [bits in brackets added by me]:
/PATTERN/msixpodualgc
Searches a string for a pattern match, and in scalar context returns true [1] if it succeeds, false [the empty string] if it fails.
(Looking at the section on s/// will also be useful ;-)
Perl just doesn't have a discrete boolean type or true/false aliases, so 1 and the empty string are what the match operator returns in practice; however, it could very well be other values without making the documentation incorrect.
$1 will never be defined because there is no capture group: perhaps $& (aka $MATCH) is desired? (Or better, change the regular expression to have a capture group ;-)
Happy coding.
my($foo) = $record=~ /Defect/;
print STDOUT $foo;
Rather than this you should do
$record =~ /Defect/;
my $foo = $&; # Matched portion of the $record.
As your goal seems to be to get the matched portion.
The return value is true/false indicating if match was successful or not.
You may find http://perldoc.perl.org/perlreref.html handy.
If you want the result of a match as "true" or "false", do the pattern match in scalar context. Your first example is close to that: my($foo) is actually a list assignment, but because the pattern has no capture groups, a successful match in list context returns the list (1). So $foo got 1 on success and nothing on failure.
But if you want to capture the text that matched a part of your pattern, use grouping parentheses and then check the corresponding $ variable. For example, consider the expression:
$record =~ /(.*)ing/
A match on the word "speaking" will assign "speak" to $1, "listening" will assign "listen" to $1, etc. That's what you are trying to do in your second example. The trouble is that you need to add in the grouping parentheses. "$record =~ /Defect/" will assign nothing to $1 because there are no grouping parentheses in the pattern.
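Applied to the record layout in the question, capture groups can pull out every numbered field in one pass. This is only a sketch: it assumes each field sits on its own line as "[index]value", and the hash name %field is made up for the example.

```perl
use strict;
use warnings;

# One record in the layout shown in the question.
my $record = join "\n",
    'Row:1 DATA:', '[0]37755442', '[1]DDG00000010', '[2]FALLS',
    '[3]IMAGE', '[4]Defect', '[5]3', '[6]CLOSED';

# Each match yields two captures (index, value); with /g in list context
# they come back as one flat list, which fits a hash assignment.
my %field = $record =~ /^\[(\d+)\](.*)$/mg;

print "$field{4}\n";   # Defect
print "$field{6}\n";   # CLOSED
```

The /m flag lets ^ and $ match at each line boundary inside the record, so the header line is skipped simply because it does not start with a bracket.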

How can I extract a varying number of groups of digits from a Perl string?

I am attempting to parse a string in Perl with the format:
Messages pushed to the Order Book queues 123691 121574 146343 103046 161253
I want to access the numbers at the end of the string so intend to do a match like
/(\d+)/s
My issue is that the number of values at the end of the string varies.
What is the best way to format the regexp to be able to access each of those numbers individually? I'm a C++ developer and am just learning Perl, so am trying to find the cleanest Perl way to accomplish this.
Thanks for your help.
Just use the /g flag to make the match operator perform a global match. In list context, the match operator returns all of the results as a list:
@result = $string =~ /(\d+)/g;
This works if there are no other numbers than the trailing ones.
You can use the match operator in a list context with the global flag to get a list of all your parenthetical captures. Example:
@list = ($string =~ /(\d+)/g);
Your list should now have all the digit groups in your string.
See the documentation on the match operator for more info.
"In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match" --(From perldoc perlop)
So you should be able to make a global regex loop, like so:
while ($string =~ /(\d+)/g) {
push @queuelist, $1;
}
I'd do something like this:
my @numbers;
if (m/Messages pushed to the Order Book queues ([\d\s]+)/) {
@numbers = split(/\s+/, $1);
}
}
No need to cram it into one regex.
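The caveat about other numbers in the string can be seen in a small sketch. The input below is a hypothetical variant of the log line with an extra digit in the prefix; it compares the plain global match with an approach that captures only the trailing run of numbers and then splits it.

```perl
use strict;
use warnings;

# Hypothetical variant of the log line with a digit in the prefix.
my $string = 'Messages pushed to the Order Book 2 queues 123691 121574 146343';

# Global match in list context: picks up every run of digits, prefix included.
my @every = $string =~ /(\d+)/g;

# Capture only the trailing run of digits and spaces, then split it.
my @trailing;
if ($string =~ /((?:\s+\d+)+)\s*$/) {
    @trailing = split ' ', $1;
}

print "@every\n";      # 2 123691 121574 146343
print "@trailing\n";   # 123691 121574 146343
```

split with a single-space string as its first argument discards leading whitespace, so the captured text does not need trimming first.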