How can I have optional matches in a Perl regex? - regex

I have a string I read from a configuration file. The structure of the string is as follows:
(long_string)long_string(long_string)
Any item in brackets, including the brackets themselves, is optional. I have the following regular expression, which matches the whole string, but I could not figure out how to make some parts of it optional with "?".
Here are a few valid strings for input
(a)like(1)
like(very long string here)
like
Here is my regexp, which only matches the first one:
^\((?<short>.*)\)(?<text>.*)\((?<return>.*)\)$
How can I convert my regexp to make brackets optional for a match?

Surround the two sub-patterns with non-capturing groups (?:expr) and make them optional:
^(?:\((?<short>.*)\))?(?<text>.*)(?:\((?<return>.*)\))?$
And if possible make the universal expression .* more specific, maybe with [^()]+:
^(?:\((?<short>[^()]+)\))?(?<text>[^()]+)(?:\((?<return>[^()]+)\))?$
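As a rough sketch of how the last pattern behaves on the three sample inputs: the helper name parse_entry is made up for illustration, and named captures with %+ assume Perl 5.10 or later.

```perl
use strict;
use warnings;

# Optional "(short)", required text, optional "(return)".
my $re = qr/^(?:\((?<short>[^()]+)\))?(?<text>[^()]+)(?:\((?<return>[^()]+)\))?$/;

# Hypothetical helper: returns a hash ref of the named captures,
# or undef when the line does not match at all.
sub parse_entry {
    my ($line) = @_;
    return undef unless $line =~ $re;
    return {
        'short'  => $+{'short'},
        'text'   => $+{'text'},
        'return' => $+{'return'},
    };
}

for my $line ('(a)like(1)', 'like(very long string here)', 'like') {
    my $m = parse_entry($line) or next;
    # Groups that did not take part in the match come back as undef.
    printf "short=%s text=%s return=%s\n",
        map { defined $_ ? $_ : 'undef' } @{$m}{qw(short text return)};
}
```

An optional group that did not participate simply has no entry in %+, so the caller can test it with defined.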

Using the code below, you will always get a @matches array consisting of three elements. If one of the optional parts did not match, the corresponding entry will be undef.
#!/usr/bin/perl
use strict;
use warnings;
my $optional = qr/(?:\(([^)]+?)\))?/;
my $required = qr/([^()]+)/;
while ( my $line = <DATA> ) {
    chomp $line;
    last unless $line =~ /\S/;
    if ( my @matches = ($line =~ /$optional$required$optional/) ) {
        no warnings 'uninitialized';
        print "---$_---\n" for @matches;
    }
}
__DATA__
(a)like(1)
like(very long string here)
like

What I would do is move the ( and ) inside your capture groups, so instead of
\((?<short>.*)\)
change it to:
(?<short>\(.*\))
That way the group will capture the ()'s along with the inner text. Then, if they are present, use another regular expression to strip the parentheses.
I'm not very familiar with the named-capture syntax, so the group syntax might be off, but you should get the idea.
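A minimal sketch of that two-pass idea, using numbered groups and [^()]* in place of .* (both are my own choices, not part of the answer above): the first match keeps the parentheses, and a second substitution removes them from whichever optional parts were present.

```perl
use strict;
use warnings;

my $line = '(a)like(1)';
my ($short, $text, $return);

# Pass 1: capture the optional parts together with their parentheses.
if ($line =~ /^(\([^()]*\))?([^()]*)(\([^()]*\))?$/) {
    ($short, $text, $return) = ($1, $2, $3);

    # Pass 2: strip the surrounding parentheses from the parts that matched.
    for my $part ($short, $return) {
        next unless defined $part;
        $part =~ s/^\(//;
        $part =~ s/\)$//;
    }
}

print join(',', map { defined $_ ? $_ : 'undef' } $short, $text, $return), "\n";
```

The loop variable aliases $short and $return, so the substitutions modify the original variables in place.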

Give this a try...
string[] strings = new string[] { "(a)like(1)", "like(very long string here)", "like" };
foreach (string s in strings)
{
    System.Text.RegularExpressions.Match match = System.Text.RegularExpressions.Regex.Match(s, @"^(\((?<short>.)\))?(?<text>.+)?(\((?<return>.+)\))?$");
    if (match.Success)
    {
        // do logic to handle the match
    }
}

Well, just make them optional, then:
^(?<short>\(.*\))?(?<text>.*)(?<return>\(.*\))?$
I'm no big fan of named captures, they tend to make it look more complicated than it is (at least for me). Also, I recommend against using ".*". My suggestion:
^(\([^)]*\))?([^(]*)(\([^)]*\))?$
and go for match group 2. But if you insist on using named captures:
^(?<short>\([^)]*\))?(?<text>[^(]*)(?<return>\([^)]*\))?$

Related

Regular expression chaining/mixing in perl

Consider the following:
my $p1 = "(a|e|o)";
my $p2 = "(x|y|z)";
$text =~ s/($p1)( )[$p2]([0-9])/some action to be done/g;
Is the regular expression pattern in string form equal to the concatenation of the elements in the above? That is, can the above be written as
$text =~ s/((a|e|o))( )[(x|y|z)]([0-9])/ some action to be done/g;
Well, yes, variables in a pattern get interpolated into the pattern, in a double-quoted context, and the expressions you show are equivalent. See discussion in perlretut tutorial.
(I suggest using qr operator for that instead. See discussion of that in perlretut as well.)
But that pattern clearly isn't right.
Why those double parentheses, ((a|e|o))? Either have the alternation in the variable and capture it in the regex
my $p1 = 'a|e|o'; # then use in a regex as
/($p1)/ # gets interpolated into: /(a|e|o)/
or indicate capture in the variable but then drop parens in the regex
my $p1 = '(a|e|o)'; # use as
/$p1/ # (same as above)
The capturing parentheses do their job either way, and in both cases a match (a or e or o) is captured into a suitable variable, $1 in your expression since this is the first capture.
A pattern of [(x|y|z)] matches any one of the characters (, x, |, ... (etc), because [...] is a character class, which matches any single character listed inside it (a few have a special meaning). So, again, either use the alternation and capture in your variable
my $p2 = '(x|y|z)'; # then use as
/$p2/
or do it using the character class
my $p2 = 'xyz'; # and use as
/([$p2])/ # --> /([xyz])/
So altogether you'd have something like
use warnings;
use strict;
use feature 'say';
my $text = shift // q(e z7);
my $p1 = 'a|e|o';
my $p2 = 'xyz';
$text =~ s/($p1)(\s)([$p2])([0-9])/replacement/g;
say $_ // 'undef' for $1, $2, $3, $4;
I added \s instead of a literal single space, and I capture the character-class match with () (the pattern from the question doesn't), since that seems to be wanted.
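The qr suggestion above could look roughly like this. It is only a sketch: the sample text and the <...> replacement are invented, and it assumes the alternatives are single characters so that character classes can stand in for the alternations.

```perl
use strict;
use warnings;

# Pre-compiled sub-patterns; each one interpolates as a self-contained unit.
my $p1 = qr/[aeo]/;
my $p2 = qr/[xyz]/;

my $text = 'e z7 and a x3';
my $copy = $text;

# The capture groups live in the final regex, not in the variables.
$copy =~ s/($p1)(\s)($p2)([0-9])/<$1$3$4>/g;

print "$copy\n";
```

Because a qr// object carries its own grouping, there is no need to wrap it in extra parentheses just to keep an alternation from leaking into the surrounding pattern.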
Neither snippet is valid Perl code. They are therefore equivalent, but only in the sense that neither will compile.
But say you have a valid m//, s/// or qr// operator. Then yes, variables in the pattern would be handled as you describe.
For example,
my $p1 = "(a|e|o)";
my $p2 = "(x|y|z)";
$text =~ /($p1)( )[$p2]([0-9])/g;
is equivalent to
$text =~ /((a|e|o))( )[(x|y|z)]([0-9])/g;
As mentioned in an answer to a previous question of yours, (x|y|z) is surely a bug, and should be xyz.

Pre-compiled regex with special characters matching

I'm trying to match if a word such as *FOO (* as a normal character) is in a line. My input is a C++ source code. I need to use a pre-compiled regex for this due to program flow requirements, so I tried the following:
$pattern = qr/[^a-zA-Z](\*FOO)[^a-zA-Z]|^\s*(\*FOO)[^a-zA-Z]/;
And I use it like this:
if ($line =~ m/$pattern/) { ... }
It works and catches lines containing *FOO such as hey *FOO.BAR but also matches lines such as:
//FOO programming using stuff and things
which I want to ignore. What am I missing? Is \* not the right way to escape * in a pre-compiled regex in perl? If *FOO is stored in $word and the pattern looks like this:
$pattern = qr/[^a-zA-Z](\\$word)[^a-zA-Z]|^\s*(\\$word)[^a-zA-Z]/;
Is that different from the previous pattern? Because I tried both and the result seems to be the same.
I found a way to bypass this problem by removing the first char of $word and escaping * in the pattern, but if $word = "**.?FOO" for example, how do I create a qr// with $word so that all the meta-characters are escaped?
You do need to escape the *. One way to do it is by the quotemeta \Q operator:
use warnings;
use strict;
my $qr = qr/\Q*FOO/;
while (<DATA>) { print if /$qr/ }
__DATA__
//FOO programming using stuff and things
hey *FOO.BAR
Note that this escapes all ASCII non-"word" characters through the rest of the pattern. If you need to limit its action to only a part of the pattern then stop it using \E. Please see linked docs.
The above determines whether *FOO is in the line, regardless of whether it is a word or a part of one. It is not clear to me which is needed. Once that is specified the pattern can be adjusted.
Note that /\*FOO/ works, too. What you tried probably failed because of all the rest that you are trying to match, whose purpose I do not understand. If you only need to detect whether the pattern is present, the above does it. If there is a more specific requirement, please clarify.
As for the examples: for me that string //FOO... is not matched by the main (first) $pattern you show. The second one does interpolate $word, but the \\ in front of it is an escaped backslash, so the * from $word stays unescaped and acts as a quantifier; it is also much too convoluted. A regex can really tie one in nasty knots when pushed; I suggest keeping it as simple as possible.
Question 1:
my $word = '*FOO';
my $pattern = qr/\\$word/;
is equivalent to
my $pattern = qr/\\*FOO/; # zero or more '\' followed by 'FOO'
The $word is simply interpolated as is.
To get something equivalent to
my $pattern = qr/\*FOO/;
you should use
my $word = '*FOO';
my $pattern = qr/\Q$word\E/;
By default, an interpolated variable is treated as a mini regular expression: metacharacters in the variable such as *, +, ? are still interpreted as metacharacters. \Q...\E adds a backslash before every ASCII character not matching /[A-Za-z_0-9]/, so any metacharacters in the interpolated variable are interpreted as literals. Refer to perldoc.
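For the "**.?FOO" case from the question, a sketch along these lines may help. The sample lines are invented, and the lookarounds are my substitution for the question's [^a-zA-Z] tests (unlike those, they also accept the word at the very start or end of a line).

```perl
use strict;
use warnings;

my $word = '**.?FOO';

# \Q...\E escapes every metacharacter in $word;
# the lookarounds reject a letter directly before or after it.
my $pattern = qr/(?<![A-Za-z])\Q$word\E(?![A-Za-z])/;

my @lines = (
    'x = **.?FOO.bar;',          # literal word, no adjacent letters
    '//FOO programming stuff',   # no literal "**.?FOO" here
    'a**.?FOOD',                 # letters on both sides
);

my @hits = grep { $_ =~ $pattern } @lines;
print "$_\n" for @hits;
```

Only the first sample line is kept, since the other two either lack the literal text or have letters touching it.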
Question 2
I tried
my $pattern = qr/[^a-zA-Z](\*FOO)[^a-zA-Z]|^\s*(\*FOO)[^a-zA-Z]/;
my $line = '//FOO programming using stuff and things';
if($line =~ m/$pattern/){
print "$&\n";
}
else{
print "No match!";
}
and it printed "No match!". I can't explain how you get it matched.

Metaquoting patterns in a variable list

I have a list of patterns I want to look for in a string. These patterns are numerous and contain numerous metacharacters that I want to just match literally. So this is the perfect application for metaquoting with \Q..\E. The complication is that I need to join the variable list of patterns into a regular expression.
use strict;
use warnings;
# sample string to represent my problem
my $string = "{{a|!}} Abra\n{{b|!!}} {{b}} Hocus {{s|?}} Kedabra\n{{b|+?}} {{b|??}} Pocus\n {{s|?}}Alakazam\n";
# sample patterns to look for
my @patterns = qw({{a|!}} {{s|?}} {{s|+?}} {{b|?}});
# since these patterns can be anything, I join the resulting array into a variable-length regex
my $regex = join("|",@patterns);
my @matched = $string =~ /$regex(\s\w+\s)/; # Error in matching regex due to unquoted metacharacters
print join("", @matched); # intended result: Hocus\n Pocus\n
When I attempt to introduce metaquoting into the joining operation, they appear to have no effect.
# quote all patterns so that they match literally, but make sure the alternating metacharacter works as intended
my $qmregex = "\Q".join("\E|\Q", @patterns)."\E";
my @matched = $string =~ /$qmregex(\s\w+\s)/; # The same error
For some reason the metaquoting has no effect when it is included in the string I use as the regular expression. For me, they only work when they are added directly to a regex as in /\Q$anexpression\E/ but as far as I can tell this isn't an option for me. How do I get around this?
I don't understand your expected result, as Abra and Kedabra are the only strings preceded by any of the patterns.
To solve your problem you must escape each component of the regex separately, because \Q and \E affect only the value of the string in which they appear, so "\Q" and "\E" are just the null string "" and "\E|\Q" is just "|". You could write
my $qmregex = join '|', map "\Q$_\E", @patterns;
but it is simpler to call the quotemeta function.
You must also enclose the list in non-capturing parentheses (?:...) to isolate the alternation, and apply the /g modifier to the regex match to find all occurrences within the string.
Try
use strict;
use warnings;
my $string = "{{a|!}} Abra\n{{b|!!}} {{b}} Hocus {{s|?}} Kedabra\n{{b|+?}} {{b|??}} Pocus\n {{s|?}}Alakazam\n";
my @patterns = qw( {{a|!}} {{s|?}} {{s|+?}} {{b|?}} );
my $regex = join '|', map quotemeta, @patterns;
my @matched = $string =~ /(?:$regex)(\s\w+\s)/g;
print @matched;
output
Abra
Kedabra
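To see why the "\Q" . join(...) . "\E" attempt has no effect, it can help to print both strings side by side. This is just an illustrative sketch with two of the patterns; the variable names $broken and $quoted are mine.

```perl
use strict;
use warnings;

my @patterns = ('{{a|!}}', '{{s|?}}');

# "\Q" and "\E" are empty strings on their own, so nothing gets quoted here.
my $broken = "\Q" . join("\E|\Q", @patterns) . "\E";

# quotemeta escapes each pattern before the alternation bars are added.
my $quoted = join '|', map { quotemeta($_) } @patterns;

print "broken: $broken\n";
print "quoted: $quoted\n";
```

The first string is identical to a plain join with "|", while in the second every non-word character carries a backslash and only the joining bars remain live metacharacters.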

How to have a variable as regex in Perl

I think this question is repeated, but searching wasn't helpful for me.
my $pattern = "javascript:window.open\('([^']+)'\);";
$mech->content =~ m/($pattern)/;
print $1;
I want to have an external $pattern in the regular expression. How can I do this? The current one returns:
Use of uninitialized value $1 in print at main.pm line 20.
$1 was empty, so the match did not succeed. I'll make up a constant string in my example of which I know that it will match the pattern.
Declare your regular expression with qr, not as a simple string. Also, you're capturing twice, once in $pattern for the open call's parentheses, once in the m operator for the whole thing, therefore you get two results. Instead of $1, $2 etc. I prefer to assign the results to an array.
my $pattern = qr"javascript:window.open\('([^']+)'\);";
my $content = "javascript:window.open('something');";
my @results = $content =~ m/($pattern)/;
# the expression returns the list
# (
# q{javascript:window.open('something');},
# 'something'
# )
When I compile that string into a regex, like so:
my $pattern = "javascript:window.open\('([^']+)'\);";
my $regex = qr/$pattern/;
I get just what I think I should get, the following regex:
(?-xism:javascript:window.open('([^']+)');)
Notice that it is looking for a capture group and not an open paren at the end of 'open'. And in that capture group, the first thing it expects is a single quote. So it will match
javascript:window.open'fum';
but not
javascript:window.open('fum');
One thing you have to learn is that in a double-quoted Perl string, "\(" is the same thing as "(": you're just telling Perl that you want a literal '(' in the string. In order to get escapes that survive into the regex, you need to double the backslashes.
my $pattern = "javascript:window.open\\('([^']+)'\\);";
my $regex = qr/$pattern/;
Actually preserves the literal ( and yields:
(?-xism:javascript:window.open\('([^']+)'\);)
Which is what I think you want.
As for your question, you should always test the results of a match before using it.
if ( $mech->content =~ m/($pattern)/ ) {
print $1;
}
makes much more sense. And if you want to see it regardless, then it's already implicit in that idea that it might not have a value. i.e., you might not have matched anything. In that case it's best to put alternatives
$mech->content =~ m/($pattern)/;
print $1 || 'UNDEF!';
However, I prefer to grab my captures in the same statement, like so:
my ( $open_arg ) = $mech->content =~ m/($pattern)/;
print $open_arg || 'UNDEF!';
The parens around $open_arg puts the match into a "list context" and returns the captures in a list. Here I'm only expecting one value, so that's all I'm providing for.
Finally, one of the root causes of your problems is that you do not need to specify your expression in a string in order for your regex to be "portable". You can get perl to pre-compile your expression. That way, you only care what instructions the characters are to a regex and not whether or not you'll save your escapes until it is compiled into an expression.
A compiled regex will interpolate itself into other regexes properly. Thus, you get a portable expression that interpolates just as well as a string--and specifically correctly handles instructions that could be lost in a string.
my $pattern = qr/javascript:window.open\('([^']+)'\);/;
Is all that you need. Then you can use it, just as you did. Although, putting parens around the whole thing, would return the whole matched expression (and not just what's between the quotes).
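Putting the qr// pattern together with the list-context capture described earlier might look like this. The sample $content string is made up, and I have escaped the dot in window.open, which the pattern above leaves as a wildcard.

```perl
use strict;
use warnings;

# Compiled once; the backslashes are regex escapes, not string escapes.
my $pattern = qr/javascript:window\.open\('([^']+)'\);/;

# Made-up sample content standing in for $mech->content.
my $content = q{<a href="javascript:window.open('popup.html');">open</a>};

# List context: the match returns its captures, or an empty list on failure.
my ($open_arg) = $content =~ $pattern;

print defined $open_arg ? "$open_arg\n" : "no match\n";
```

Since the capture is assigned directly, there is no stale $1 to worry about when the match fails.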
You do not need the parentheses in the match pattern. They capture the whole pattern and return that as $1, which I am guessing is not matching, but I am only guessing.
$mech->content =~ m/$pattern/;
or
$mech->content =~ m/(?:$pattern)/;
These are the clustering, non-capturing parentheses.
The way you are doing it is correct.
The solutions have already been given. I'd like to point out that the window.open call might have multiple parameters, each enclosed in "" and separated by commas, like:
javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");
There might be spaces between the function name and parentheses, so I'd use a slightly different regex for that:
my $pattern = qr{
javascript:window.open\s*
\(
([^)]+)
\)
}x;
print $1 if $text =~ /$pattern/;
Now you have all parameters in $1 and can process them afterwards with split /,/, $stuff and so on.
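One caveat with a plain split /,/ is that a quoted argument may itself contain commas, as the third one in the example does. A hedged sketch of one alternative, pulling out the double-quoted strings instead (it assumes the arguments contain no escaped quotes, and the variable names are mine):

```perl
use strict;
use warnings;

my $text = q{javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");};

my $pattern = qr{
    javascript:window\.open \s*
    \( ([^)]+) \)
}x;

my (@naive, @args);
if ($text =~ $pattern) {
    my $inside = $1;                   # copy $1 before running another match
    @naive = split /,/, $inside;       # also splits inside "status=1,toolbar=1"
    @args  = $inside =~ /"([^"]*)"/g;  # one element per quoted argument
}

print scalar(@naive), " pieces from split, ", scalar(@args), " quoted arguments\n";
```

Copying $1 into a lexical first avoids it being overwritten by the second match.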
It reports an uninitialized value because $1 is undefined. $1 is undefined because you have created a nested capturing group by wrapping a second set of parentheses around the pattern. It will also be undefined if nothing matches your pattern.

How can I capture multiple matches from the same Perl regex?

I'm trying to parse a single string and get multiple chunks of data out from the same string with the same regex conditions. I'm parsing a single HTML doc that is static (For an undisclosed reason, I can't use an HTML parser to do the job.) I have an expression that looks like:
$string =~ /\<img\ssrc\="(.*)"/;
and I want to get the value of $1. However, in the one string, there are many img tags like this, so I need something like an array returned (@1?). Is this possible?
As in Jim's answer, use the /g modifier (in list context or in a loop).
But beware of greediness: you don't want the .* to match more than necessary (and don't escape < and =, they are not special).
while($string =~ /<img\s+src="(.*?)"/g ) {
...
}
@list = ($string =~ m/\<img\ssrc\="(.*)"/g);
The g modifier matches all occurrences in the string. List context returns all of the matches. See the m// operator in perlop.
You just need the global modifier /g at the end of the match. Then loop through until there are no matches remaining:
my @matches;
while ($string =~ /\<img\ssrc\="(.*)"/g) {
    push(@matches, $1);
}
Use the /g modifier and list context on the left, as in
@result = $string =~ /\<img\ssrc\="(.*)"/g;
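Combining the /g list-context form with the greediness warning from the first answer, a small self-contained sketch could be as follows. The HTML sample is invented, and [^"]* is used as an alternative to the non-greedy .*? shown earlier.

```perl
use strict;
use warnings;

# Invented sample with two img tags on one line.
my $string = '<img src="a.png" alt="A"><p>text</p><img src="b.png">';

# /g in list context returns the capture from every match.
my @srcs = $string =~ /<img\s+src="([^"]*)"/g;

# For comparison: the greedy form runs on to the last double quote in the string.
my @greedy = $string =~ /<img\s+src="(.*)"/g;

print "$_\n" for @srcs;
print scalar(@greedy), " greedy match(es)\n";
```

With the negated class each capture stops at the closing quote of its own attribute, whereas the greedy version swallows both tags in a single match.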