Can Regular Expression Capture Groups Be Modular or Variable? - regex

Can a regular expression capture group be made variable?
Can regex captures be made modular so that they can be reused throughout a more complete regex?
Can you simplify a regex that has many captures of the same format?

There are some tricks you can use to make it modular.
You can make reusable patterns with qr//. This operator compiles a regular expression and returns it so that it can be saved to a variable. Capturing parentheses are no different from any other regex term, so they can be part of the saved pattern.
Those regex variables can be placed into larger regexes as if you had typed them out fully. A variable can also be plain text which will be compiled into the regex when it is run. This allows you to use the same code for several matching tasks. It makes your logic flow simpler and easier to understand and debug.
$string = "AM[740];FM[89.5];SW[200]";
$bracket_contents = qr/\[([^\]]*)\]/;
$thing = "AM";
$am_value = $string =~ m/\b$thing$bracket_contents/ ? $1 : undef;
$thing = "SW";
($sw_value) = $string =~ m/\b$thing$bracket_contents/;
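The same idea can be pushed a bit further by looping over the labels. This is only a minimal sketch: the %value hash and the loop are illustrative names that are not part of the answer above, and \Q...\E is added on the assumption that the plain-text label should be matched literally.

```perl
use strict;
use warnings;

my $string = "AM[740];FM[89.5];SW[200]";

# Reusable fragment: captures whatever sits between square brackets.
my $bracket_contents = qr/\[([^\]]*)\]/;

# %value is a hypothetical name used only for this illustration.
my %value;
for my $band (qw(AM FM SW)) {
    # \Q...\E quotes any regex metacharacters in the plain-text variable.
    # In list context the match returns the captured group.
    ($value{$band}) = $string =~ /\b\Q$band\E$bracket_contents/;
}

print "$_ => $value{$_}\n" for sort keys %value;
```

Because the capture lives inside $bracket_contents, every regex that interpolates it gets the same capture group without retyping it.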

Related

What does =~ m/\#F/ mean in Perl?

open DMLOG, "<dmlog.txt" or &error("Can't open log file: $!");
chomp(@entirelog=<DMLOG>);
close DMLOG;
for $line (@entirelog)
{
if ($line =~ m/\#F/)
{
$titlecolumn = $line;
last;
}
}
I found that =~ is a regular expression I think, but I don't quite understand what it's doing here.
It assigns the first line that contains a # followed by an F to $titlecolumn.
The =~ is the binding operator; it applies a regex to a string. That regex would usually be written as /#F/. The m prefix can be used to emphasize that the following literal is a regex (it is required when delimiters other than / are used).
Do you understand what regular expressions are? Or, is the =~ throwing you off?
In most programming languages, you would see something like this:
if ( regexp(line, "/#F/") ) {
...
}
However, in Perl, regular expressions are inspired by Awk's syntax. Thus:
if ( $line =~ /#F/ ) {
...
}
The =~ means the regular expression will act upon the variable name on the left. If the pattern #F is found in $line, the if statement is true.
You might want to look at the Regular Expression Tutorial if you're not familiar with them. Regular expressions are extremely powerful and very commonly used in Perl. In fact, they are used so heavily that they are one of the reasons developers in other languages claim that Perl is a write-only language.
It is called the binding operator. It is used to match the pattern on the RHS against the variable on the LHS. Similarly, there is !~, which negates the match.
For your particular case:
$line =~ m/\#F/
This tests whether $line matches the pattern /#F/.
Yes, =~ is the binding operator binding an expression to the pattern match m//.
The if statement checks whether a line matches the given regular expression. In this case, it checks whether there is a hash sign followed by a capital F.
The backslash has just been added (maybe) to avoid treating # as a comment sign (which isn't needed).
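To see both binding operators side by side, here is a small sketch. The @lines data is made up to stand in for the log file, and @without is an illustrative name; neither comes from the question.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the log file contents.
my @lines = ('header', 'ID #F name', 'later #F row');

my $titlecolumn;
for my $line (@lines) {
    if ($line =~ /#F/) {      # true when the line contains "#F"
        $titlecolumn = $line;
        last;                 # stop at the first matching line
    }
}

# !~ is the negated form: keep only the lines that do NOT contain "#F".
my @without = grep { $_ !~ /#F/ } @lines;

print "$titlecolumn\n";
print scalar(@without), "\n";
```

Note that the # needs no backslash here, since inside the regex delimiters it is not treated as a comment (unless the /x modifier is used).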

Regex position string

For example, I have this string.
$string = 'test***bas';
How can I display text before the stars with Regex?
You could use a regular expression which makes use of Capture Groups. Once that you have matched your input, you could then access the captured group and print the output.
The following pattern
^(.+?)\*\*\*
will create a capture group using the parentheses. See http://gskinner.com/RegExr/ for testing your regular expressions (there are many ways of testing online)
The language you use around your regular expression will have different ways of capturing groups so you will need to better explain what language you are using for any further advice.
Example for the text before and after the asterisks:
^(.+?)\*\*\*(.+)$
If you also want what is located after the ***, you can use the following:
$string = 'test***bas';
$pattern = '/(.+)\*{3}(.+)/';
preg_match($pattern, $string, $matches);
$matches will contain the results:
$matches[1] will be "test"
$matches[2] will be "bas"
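Since the question does not name a language, here is how the same capture might look in Perl. It is a sketch only; $before and $after are names chosen for the illustration.

```perl
use strict;
use warnings;

my $string = 'test***bas';

# In list context the match returns the two captured groups.
my ($before, $after) = $string =~ /^(.+?)\*{3}(.+)$/;

if (defined $before) {
    print "before: $before\n";
    print "after:  $after\n";
}
```

The defined check matters because a failed match returns an empty list and leaves both variables undefined.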

Using regex to fetch a value

I have a string:
set a "ODUCTP-1-1-1-2P1"
regexp {.*?\-(.*)} $a match sub
I expect the value of sub to be 1-1-1-2P1
But I'm getting empty string. Can any one tell me how to properly use the regex?
The problem is that the non-greediness of the .*? is leaking over to the .* later on, which is a feature of the RE engine being used (automata-theoretic instead of stack-based).
The simplest fix is to write the regular expression differently.
Because Tcl has unanchored regular expressions (by default) and starts matches as soon as it can, a greedy match from the first - to the end of the string is perfect (with sub being assigned everything after the -). That's a very simple RE: -(.*). To use that, you do this:
regexp -- {-(.*)} $a match sub
Note the --; it's needed here because the regular expression starts with a - symbol and would otherwise be mistaken for a weird (and unsupported) option. Apart from that one niggle, it's all entirely straightforward.
$str = "ODUCTP-1-1-1-2P1";
$str =~ s/^.*?-//;
print $str;
or:
$str =~ /^.*?-(.*)$/;
print $1;

How can I match everything that is after the last occurrence of some char in a perl regular expression?

For example, return the part of the string that is after the last x in axxxghdfx445 (should return 445).
my($substr) = $string =~ /.*x(.*)/;
From perldoc perlre:
By default, a quantified subpattern is "greedy", that is, it will match
as many times as possible (given a particular starting location) while
still allowing the rest of the pattern to match.
That's why .*x will match up to the last occurrence of x.
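A quick way to convince yourself is to compare the greedy quantifier with its non-greedy counterpart on the same input. This is just an illustrative sketch; $greedy and $lazy are names made up for it.

```perl
use strict;
use warnings;

my $string = 'axxxghdfx445';

# Greedy .* grabs as much as it can, so the literal x is the LAST x.
my ($greedy) = $string =~ /.*x(.*)/;

# Non-greedy .*? grabs as little as it can, so the literal x is the FIRST x.
my ($lazy) = $string =~ /.*?x(.*)/;

print "greedy: $greedy\n";   # 445
print "lazy:   $lazy\n";     # xxghdfx445
```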
The simplest way would be to use /([^x]*)$/
the first answer is a good one,
but when talking about "something that does not contain"...
i like to use the regex that "matches" it
my ($substr) = $string =~ /.*x([^x]*)$/;
very useful in some cases
the simplest way is not regular expression, but a simple split() and getting the last element.
$string="axxxghdfx445";
@s = split /x/ , $string;
print $s[-1];
Yet another way to do it. It's not as simple as a single regular expression, but if you're optimizing for speed, this approach will probably be faster than anything using regex, including split.
my $s = 'axxxghdfx445';
my $p = rindex $s, 'x';
my $match = $p < 0 ? undef : substr($s, $p + 1);
I'm surprised no one has mentioned the special variable that does this, $': "$'" returns everything after the matched string. (perldoc perlre)
my $str = 'axxxghdfx445';
$str =~ /.*x/;   # greedy .* makes the match end at the last x
# $' contains '445';
print $';
However, there is a cost (emphasis mine):
WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere
in the program, it has to provide them for every pattern match. This
may substantially slow your program. Perl uses the same mechanism to
produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parentheses. (To avoid this cost while retaining
the grouping behaviour, use the extended regular expression "(?: ... )"
instead.) But if you never use $&, "$`" or "$'", then patterns without
capturing parentheses will not be penalized. So avoid $&, "$'", and
"$`" if you can, but if you can't (and some algorithms really
appreciate them), once you've used them once, use them at will, because
you've already paid the price. As of 5.005, $& is not so costly as the
other two.
But wait, there's more! You get two operators for the price of one, act NOW!
As a workaround for this problem, Perl 5.10.0 introduces
"${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}", which are equivalent
to "$`", $& and "$'", except that they are only guaranteed to be
defined after a successful match that was executed with the "/p"
(preserve) modifier. The use of these variables incurs no global
performance penalty, unlike their punctuation char equivalents, however
at the trade-off that you have to tell perl when you want to use them.
my $str = 'axxxghdfx445';
$str =~ /.*x/p;   # greedy .* makes the match end at the last x
# ${^POSTMATCH} contains '445';
print ${^POSTMATCH};
I would humbly submit that this route is the best and most straight-forward
approach in most cases, since it does not require that you do special things
with your pattern construction in order to retrieve the postmatch portion, and there
is no performance penalty.
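One thing to keep in mind with this route: ${^POSTMATCH} holds whatever follows the match, so which x the pattern ends on decides what you get back. A small sketch (the variable names are made up for the illustration):

```perl
use strict;
use warnings;
use 5.010;   # ${^POSTMATCH} and /p need Perl 5.10 or later

my $str = 'axxxghdfx445';

my ($after_first, $after_last);

# A bare /x/ matches the first x, so the postmatch is everything after it.
$after_first = ${^POSTMATCH} if $str =~ /x/p;

# A greedy .* pushes the match to the last x.
$after_last = ${^POSTMATCH} if $str =~ /.*x/p;

print "$after_first\n";   # xxghdfx445
print "$after_last\n";    # 445
```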
Regular Expression : /([^x]+)$/ #assuming x is not last element of the string.

How to have a variable as regex in Perl

I think this question is repeated, but searching wasn't helpful for me.
my $pattern = "javascript:window.open\('([^']+)'\);";
$mech->content =~ m/($pattern)/;
print $1;
I want to have an external $pattern in the regular expression. How can I do this? The current one returns:
Use of uninitialized value $1 in print at main.pm line 20.
$1 was empty, so the match did not succeed. I'll make up a constant string in my example of which I know that it will match the pattern.
Declare your regular expression with qr, not as a simple string. Also, you're capturing twice, once in $pattern for the open call's parentheses, once in the m operator for the whole thing, therefore you get two results. Instead of $1, $2 etc. I prefer to assign the results to an array.
my $pattern = qr"javascript:window.open\('([^']+)'\);";
my $content = "javascript:window.open('something');";
my @results = $content =~ m/($pattern)/;
# the expression returns the list
# (
# q{javascript:window.open('something');},
# 'something'
# )
When I compile that string into a regex, like so:
my $pattern = "javascript:window.open\('([^']+)'\);";
my $regex = qr/$pattern/;
I get just what I think I should get, following regex:
(?-xism:javascript:window.open('([^']+)');)
Notice that it is looking for a capture group and not an open paren at the end of 'open'. And in that capture group, the first thing it expects is a single quote. So it will match
javascript:window.open'fum';
but not
javascript:window.open('fum');
One thing you have to learn is that in a Perl double-quoted string, "\(" is the same thing as "(": you're just telling Perl that you want a literal '(' in the string, and the backslash is gone before the regex engine ever sees it. In order to get lasting escapes, you need to double them.
my $pattern = "javascript:window.open\\('([^']+)'\\);";
my $regex = qr/$pattern/;
Actually preserves the literal ( and yields:
(?-xism:javascript:window.open\('([^']+)'\);)
Which is what I think you want.
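The difference is easy to check by printing both strings and trying them against some sample text. This is a sketch; $text, $lost and $kept are names invented for the illustration.

```perl
use strict;
use warnings;

my $text = "javascript:window.open('something');";

# Double-quoted string: "\(" collapses to "(", so the regex sees a capture group.
my $lost = "javascript:window.open\('([^']+)'\);";

# Doubled backslashes survive as a single backslash in the string.
my $kept = "javascript:window.open\\('([^']+)'\\);";

print "$lost\n";   # javascript:window.open('([^']+)');
print "$kept\n";   # javascript:window.open\('([^']+)'\);

my ($arg_lost) = $text =~ /$lost/;   # no match, so this stays undef
my ($arg_kept) = $text =~ /$kept/;   # matches and captures the argument

print defined $arg_lost ? "lost: $arg_lost\n" : "lost: no match\n";
print "kept: $arg_kept\n";
```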
As for your question, you should always test the results of a match before using it.
if ( $mech->content =~ m/($pattern)/ ) {
print $1;
}
makes much more sense. And if you want to see it regardless, then it's already implicit in that idea that it might not have a value. i.e., you might not have matched anything. In that case it's best to put alternatives
$mech->content =~ m/($pattern)/;
print $1 || 'UNDEF!';
However, I prefer to grab my captures in the same statement, like so:
my ( $open_arg ) = $mech->content =~ m/($pattern)/;
print $open_arg || 'UNDEF!';
The parens around $open_arg puts the match into a "list context" and returns the captures in a list. Here I'm only expecting one value, so that's all I'm providing for.
Finally, one of the root causes of your problems is that you do not need to specify your expression in a string in order for your regex to be "portable". You can get perl to pre-compile your expression. That way, you only care what instructions the characters are to a regex and not whether or not you'll save your escapes until it is compiled into an expression.
A compiled regex will interpolate itself into other regexes properly. Thus, you get a portable expression that interpolates just as well as a string--and specifically correctly handles instructions that could be lost in a string.
my $pattern = qr/javascript:window.open\('([^']+)'\);/;
Is all that you need. Then you can use it, just as you did. Although, putting parens around the whole thing, would return the whole matched expression (and not just what's between the quotes).
You do not need the parentheses in the match pattern. With them, the whole pattern is captured and returned as $1. My guess is that the pattern is not matching, but I am only guessing.
$mech->content =~ m/$pattern/;
or
$mech->content =~ m/(?:$pattern)/;
These are the clustering, non-capturing parentheses.
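To make the effect of the outer parentheses concrete, here is a sketch that runs the same compiled pattern both ways. The $content string and the @with/@without arrays are made up for the illustration, and the dot in window.open is escaped here as a small tightening.

```perl
use strict;
use warnings;

my $pattern = qr/javascript:window\.open\('([^']+)'\);/;
my $content = "javascript:window.open('something');";

# Outer parentheses add a group: $1 is the whole match, $2 the argument.
my @with = $content =~ /($pattern)/;

# Without them, the only group is the one inside $pattern.
my @without = $content =~ /$pattern/;

print scalar(@with), " captures: @with\n";
print scalar(@without), " capture: @without\n";
```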
The way you are doing it is correct.
The solutions have been already given, I'd like to point out that the window.open call might have multiple parameters included in "" and grouped by comma like:
javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");
There might be spaces between the function name and the parentheses, so I'd use a slightly different regex for that:
my $pattern = qr{
javascript:window.open\s*
\(
([^)]+)
\)
}x;
print $1 if $text =~ /$pattern/;
Now you have all parameters in $1 and can process them afterwards with split /,/, $stuff and so on.
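One caveat with a plain split /,/ is that it would also split inside an argument such as "status=1,toolbar=1". A possible alternative, sketched here on the assumption that every argument is double-quoted and contains no escaped quotes, is to pull out the quoted strings with a global match. The @params name is made up for the illustration.

```perl
use strict;
use warnings;

my $text = q{javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");};

my $pattern = qr{
    javascript:window\.open \s*
    \(
    ([^)]+)      # everything between the parentheses
    \)
}x;

my @params;
if ($text =~ $pattern) {
    my $args = $1;
    # Collect each double-quoted argument; assumes no escaped quotes inside.
    @params = $args =~ /"([^"]*)"/g;
}

print "$_\n" for @params;
```

This keeps "status=1,toolbar=1" together as a single parameter, whereas splitting on commas would yield four pieces.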
It reports an uninitialized value because $1 is undefined. $1 is undefined because you have created a nested matching group by wrapping a second set of parentheses around the pattern. It will also be undefined if nothing matches your pattern.