Embedding evaluations in Perl regex

So I'm writing a quick Perl script that cleans up some HTML code and runs it through an HTML -> PDF program. I want to lose as little information as possible, so I'd like to extend my textareas to fit all the text that is currently in them. This means, in my case, setting the number of rows to a value calculated from the length of the string inside the textarea.
This is the regex I'm currently using:
$file=~s/<textarea rows="(.+?)"(.*?)>(.*?)<\/textarea>/<textarea rows="(?{ length($3)/80 })"$2>$3<\/textarea>/gis;
Unfortunately, Perl doesn't seem to recognize what I was told was the syntax for embedding Perl code inside search-and-replace regexes.
Are there any Perl junkies out there willing to tell me what I'm doing wrong?
Regards,
Zach

The (?{...}) pattern is an experimental feature for executing code on the match side, but you want to execute code on the replacement side. Use the /e regular-expression switch for that:
#! /usr/bin/perl
use warnings;
use strict;
use POSIX qw/ ceil /;
while (<DATA>) {
s[<textarea rows="(.+?)"(.*?)>(.*?)</textarea>] {
my $rows = ceil(length($3) / 80);
qq[<textarea rows="$rows"$2>$3</textarea>];
}egis;
print;
}
__DATA__
<textarea rows="123" bar="baz">howdy</textarea>
Output:
<textarea rows="1" bar="baz">howdy</textarea>

The syntax you are using to embed code is only valid in the "match" portion of the substitution (the left hand side). To embed code in the right hand side (which is a normal Perl double quoted string), you can do this:
$file =~ s{<textarea rows="(.+?)"(.*?)>(.*?)</textarea>}
{<textarea rows="@{[ length($3)/80 ]}"$2>$3</textarea>}gis;
This uses the Perl idiom of "some string @{[ embedded_perl_code() ]} more string".
But if you are working with a very complex statement, it may be easier to put the substitution into "eval" mode, where it treats the replacement string as Perl code:
$file =~ s{<textarea rows="(.+?)"(.*?)>(.*?)</textarea>}
{'<textarea rows="' . (length($3)/80) . qq{"$2>$3</textarea>}}gise;
Note that in both examples the regex is structured as s{}{}. This not only eliminates the need to escape the slashes, but also allows you to spread the expression over multiple lines for readability.

Must this be done with regex? Parsing any markup language (or even CSV) with regex is fraught with error. If you can, try to utilize a standard library:
http://search.cpan.org/dist/HTML-Parser/Parser.pm
Otherwise you risk the revenge of Cthulhu:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
(Yes, the article leaves room for some simple string-manipulation, so I think your soul is safe, though. :-)

I believe your problem is an unescaped /
If it's not the problem, it certainly is a problem.
Try this instead, note the \/80
$file=~s/<textarea rows="(.+?)"(.*?)>(.*?)<\/textarea>/<textarea rows="(?{ length($3)\/80 })"$2>$3<\/textarea>/gis;
The basic pattern for this code is:
$file =~ s/some_search/some_replace/gis;
The gis are options, which I'd have to look up. I think g = global, i = case insensitive, s = nothing comes to mind right now.

First, you need to quote the / inside the expression in the replacement text (otherwise perl will see a s/// operator followed by the number 80 and so on). Or you can use a different delimiter; for complex substitutions, matching brackets are a good idea.
Then you get to the main problem, which is that (?{...}) is only available in patterns. The replacement text is not a pattern, it's (almost) an ordinary string.
Instead, there is the e modifier to the s/// operator, which lets you write a replacement expression rather than replacement string.
$file =~ s(<textarea rows="(.+?)"(.*?)>(.*?)</textarea>)
("<textarea rows=\"" . (length($3)/80) . "\"$2>$3</textarea>")egis;
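A plain length($3)/80 yields fractional values such as 0.0625, which is not a useful rows attribute. The following is a minimal sketch of the same /e technique wrapped in a helper that rounds up and never emits rows="0". The name resize_textareas and the $cols parameter are hypothetical, and 80 characters per row is the question's own assumption.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: rewrite the rows attribute of every textarea
# so that it fits the text, assuming $cols characters per row.
sub resize_textareas {
    my ($html, $cols) = @_;
    $cols ||= 80;
    $html =~ s{<textarea rows="[^"]*"(.*?)>(.*?)</textarea>}{
        # The replacement side is Perl code because of the /e modifier.
        my $rows = ceil(length($2) / $cols) || 1;    # at least one row
        qq{<textarea rows="$rows"$1>$2</textarea>};
    }egis;
    return $html;
}

print resize_textareas('<textarea rows="123" bar="baz">howdy</textarea>'), "\n";
```

As the other answers note, this only holds up for simple, regular markup; an HTML parser is the safer route for anything more complex.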

As per http://perldoc.perl.org/perlrequick.html#Search-and-replace, this can be accomplished with the "evaluation modifier s///e", i.e., your gis must have an extra e in it.
The evaluation modifier s///e wraps an eval{...} around the replacement string and the evaluated result is substituted for the matched substring. Some examples:
# convert percentage to decimal
$x = "A 39% hit rate";
$x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"

Related

Raku Regex to capture and modify the LFM code blocks

Update: Corrected code added below
I have a Leanpub-flavored Markdown* file named sample.md. I'd like to convert its code blocks into GitHub-flavored Markdown style using a Raku regex.
Here's a sample **ruby** code, which
prints the elements of an array:
{:lang="ruby"}
    ['Ian','Rich','Jon'].each {|x| puts x}
Here's a sample **shell** code, which
removes the ending commas and
finds all folders in the current path:
{:lang="shell"}
    sed s/,$//g
    find . -type d
In order to capture the lang value, e.g. ruby from the {:lang="ruby"} and convert it into
```ruby
I use this code
my @in="sample.md".IO.lines;
my @out;
for @in.kv -> $key,$val {
    if $val.starts-with("\{:lang") {
        if $val ~~ /^{:lang="([a-z]+)"}$/ { # capture lang
            @out[$key]="```$0"; # convert it into ```ruby
            $key++;
            while @in[$key].starts-with(" ") {
                @out[$key]=@in[$key].trim-leading;
                $key++;
            }
            @out[$key]="```";
        }
    }
    @out[$key]=$val;
}
The line containing the Regex gives
Cannot modify an immutable Pair (lang => True) error.
I've just started out using Regexes. Instead of ([a-z]+) I've tried (\w) and it gave the Unrecognized backslash sequence: '\w' error, among other things.
How can I correctly capture and modify the lang value using a regex?
* The LFM format shown here is only approximated.
Corrected code:
my @in="sample.md".IO.lines;
my \len=@in.elems;
my @out;
my $k = 0;
while ($k < len) {
    if @in[$k] ~~ / ^ '{:lang="' (\w+) '"}' $ / {
        push @out, "```$0";
        $k++;
        while @in[$k].starts-with(" ") {
            push @out, @in[$k].trim-leading;
            $k++;
        }
        push @out, "```";
    }
    push @out, @in[$k];
    $k++;
}
for @out {print "$_\n"}
TL;DR
TL? Then read @jjemerelo's excellent answer, which not only provides a one-line solution but much more in a compact form.
DR? Aw, imo you're missing some good stuff in this answer that JJ (reasonably!) ignores. Though, again, JJ's is the bomb. Go read it first. :)
Using a Perl regex
There are many dialects of regex. The regex pattern you've used is a Perl regex but you haven't told Raku that. So it's interpreting your regex as a Raku regex, not a Perl regex. It's like feeding Python code to perl. So the error message is useless.
One option is to switch to Perl regex handling. To do that, this code:
/^{:lang="([a-z]+)"}$/
needs m :P5 at the start:
m :P5 /^{:lang="([a-z]+)"}$/
The m is implicit when you use /.../ in a context where it is presumed you mean to immediately match, but because the :P5 "adverb" is being added to modify how Raku interprets the pattern in the regex, one has to also add the m.
:P5 only supports a limited set of Perl's regex patterns. That said, it should be enough for the regex you've written in your question.
Using a Raku regex
If you want to use a Raku regex you have to learn the Raku regex language.
The "spirit" of the Raku regex language is the same as Perl's, and some of the absolute basic syntax is the same as Perl's, but it's different enough that you should view it as yet another dialect of regex, just one that's generally "powered up" relative to Perl's regexes.
To rewrite the regex in Raku format I think it would be:
/ ^ '{:lang="' (<[a..z]>+) '"}' $ /
(Taking advantage of the fact whitespace in Raku regexes is ignored.)
Other problems in your code
After fixing the regex, one encounters other problems in your code.
The first problem I encountered is that $key is read-only, so $key++ fails. One option is to make it writable, by writing -> $key is copy ..., which makes $key a read-write copy of the index passed by the .kv.
But fixing that leads to another problem. And the code is so complex I've concluded I'd best not chase things further. I've addressed your immediate obstacle and hope that helps.
This one-liner seems to solve the problem:
say S:g /\{\: "lang" \= \" (\w+) \" \} /```$0/ given "text.md".IO.slurp;
Let's try to explain what was going on, however. The error was a regular expression grammar error, caused by a : followed by a name, all inside curly braces; {} runs code inside a regex. Raiph's answer is (obviously) correct in changing it to a Perl regular expression. What I've done here instead is to use Raku's non-destructive substitution, S///, with the :g global flag so that it acts on the whole file (slurped at the end of the line; I've saved it to a file called text.md). So this slurps your target file, given stores it in the $_ topic variable, and the result is printed once the substitution has been made. The good thing is that if you want to make more substitutions you can put another such expression in front, and it will act on the output.
Using this kind of expression is always going to be conceptually simpler, and possibly faster, than dealing with a text line by line.

How to use Regular expression in perl if both regular expression and strings are variables

I have two variables coming from some user inputs. One is a string that needs to be checked and other one is a regular expression as below.
Following code doesn't work.
my $pattern = "/^current.*$/";
my $name = "currentStateVector";
if($name =~ $pattern) {
print "matches \n";
} else {
print "doesn't match \n";
}
And following does.
if($name =~ /^current.*$/) {
print "matches \n";
} else {
print "doesn't match \n";
}
What's the reason for this? I have the regular expression stored in a variable. Is there another way to store this variable or modify it?
The double-quotes that you use interpolate -- they first evaluate what's inside them (variables, escapes, etc) and return a string built with evaluations' results and remaining literals. See Gory details of parsing quoting constructs for an illuminating discussion, with lots of detail.
And your example string happens to have a $/ there, which is one of Perl's global variables (see perlvar) so $pattern is different than expected; print it to see. (In this case the / is erroneous as discussed below but the point stands.)
Instead, either use single quotes to avoid interpretation of characters like $ and \ (etc) so that they are used in regex as such
my $pattern = q(^current.*$);
or, better, use the regex-specific qr operator
my $pattern = qr/^current.*$/;
which builds from its string a proper regex pattern (a special type of Perl value), and allows use of modifiers. In this case you need to escape characters that have a special meaning in regex if you want them to be treated as literals.
Note that there's no need for // for the regex, and they wouldn't be a part of the pattern anyway -- having them around the actual pattern is wrong.
Also, carefully consider all circumstances under which user input may end up being used.
It is brought up in a comment that users may submit a "pattern" with extra /'s. That'd be wrong, as mentioned above; only the pattern itself should be given (surrounded on the command-line by ', so that the shell doesn't interpret particular characters in it). More detail follows.
The /'s are clearly not meant as a part of the pattern, but are rather intended to come with the match operator, to delimit (quote) the regex pattern itself (in the larger expression) so that one can use string literals in the pattern. Or they are used for clarity, and/or to be able to specify global modifiers (even though those can be specified inside patterns as well).
But then if users still type them around the pattern the regex will use those characters as a part of the pattern and will try to match a leading /, etc; it will fail, quietly. Make sure that users know that they need to give a pattern alone, with no delimiters.
If this is likely to be a problem I'd check for delimiters and if found carry on with a "loud" (clear) warning. What makes this tricky is the fact that a pattern starting and ending with a slash is legitimate -- it is possible, if somewhat unlikely, that a user may want actual /'s in their pattern. So you can only ask, or raise a warning, not abort.
Note that with a pattern given in a variable, or with an expression yielding a pattern at runtime, the explicit match operator and delimiters aren't needed for matching; the variable or the expression's return is taken as a search pattern and used for matching. See The basics (perlre) and Binding Operators (perlop).
So you can do simply $name =~ $pattern. Of course $name =~ /$pattern/ is fine as well, where you can then give global modifiers after the closing /
The slashes are part of the matching operator m//, not part of the regex.
When I populate the regex from user input
my $pattern = shift;
and run the script as
58663971.pl '^current.*$'
it matches.
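Since the pattern comes from user input, it may also be worth checking that it compiles at all before using it. The following is a minimal sketch, assuming a hypothetical helper named compile_pattern; it relies only on qr and eval, and returns undef rather than dying on an invalid pattern. It also shows why delimiters typed by the user make the match fail quietly.

```perl
use strict;
use warnings;

# Hypothetical helper: compile user-supplied text into a regex object.
# Returns undef (instead of dying) when the text is not a valid pattern.
sub compile_pattern {
    my ($text) = @_;
    my $re = eval { qr/$text/ };
    return $re;
}

my $name = 'currentStateVector';

# The pattern alone, with no delimiters: this matches.
my $pattern = compile_pattern('^current.*$');
my $verdict = ($name =~ $pattern) ? 'matches' : "doesn't match";
print "$verdict\n";

# The same pattern typed with slashes: it compiles, but the slashes are
# now literal characters in the pattern, so the match fails quietly.
my $with_slashes = compile_pattern('/^current.*$/');
$verdict = ($name =~ $with_slashes) ? 'matches' : "doesn't match";
print "$verdict\n";

# An invalid pattern yields undef and can be reported to the user.
print "invalid pattern\n" unless defined compile_pattern('(');
```

Compiling a pattern this way does not make untrusted input safe; it only catches syntax errors early.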

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
sub repl {
my ($r) = @_;
my @terms = grep { length } split /(\$\d)/, $r;  # drop empty strings, keep a literal "0"
return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other, in their original order and with everything (other than trailing whitespace) kept. This includes possible (leading) empty strings so those need be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and a dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings mangle many patterns since interpolation is done first -- for example, "\w" is just an escaped w (which is unrecognized as an escape). Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, since that gives us a true regex. Then all modifiers may be used as well.
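Putting the pieces together, here is a minimal end-to-end sketch of the approach, run on the question's own data. The helper quote_literal is a hypothetical addition that escapes backslashes and single quotes in the literal parts, so that a template containing ' still produces valid code. Like the original, it only recognizes single-digit $N references.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap literal text in single quotes, escaping
# any backslash or single quote it contains.
sub quote_literal {
    my ($s) = @_;
    $s =~ s/([\\'])/\\$1/g;
    return "'$s'";
}

# Turn a template such as '_$1$1_' into Perl source such as
# '_' . $1 . $1 . '_', which the second /e then evaluates.
sub repl {
    my ($template) = @_;
    my @terms = grep { length } split /(\$\d)/, $template;
    return join ' . ', map { /^\$\d$/ ? $_ : quote_literal($_) } @terms;
}

# Pattern and replacement as they would be read from a data file.
my ($p, $r) = ('b(.)d', '_$1$1_');

my $text = 'abcdeabCde';
$text =~ s/$p/repl($r)/gee;
print "$text\n";    # a_cc_ea_CC_e
```

The first /e calls repl($r) to build the expression string, and the second /e evaluates that string with $1 bound to the current match. The usual caution about /ee on input you do not control still applies.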

How can I exclude the part of the string that matches a Perl regular expression?

I have a file that has different types of lines. I want to select only those lines that have a user-agent. I know that the line that has this looks something like this:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16
So, I want to identify the line that starts with the string "User-Agent", but after that I want to process the rest of the line excluding this string. My question is does Perl store the remaining string in any special variable that I can use to process further? So, basically I want to match the line that starts with that string but after that work on the rest of it excluding that string.
I search for that line with a simple regexp
/^User-Agent:/
The substr solution:
my $start = "User-Agent: ";
if ($start eq substr $line, 0, length($start)) {
my $remainder = substr $line, length($start);
}
if ($line =~ /^User\-Agent\: (.*?)$/) {
&process_string($1)
}
(my $remainder = $str) =~ s/^User-Agent: //;
You could use the $' variable, but don't--that adds a lot of overhead. Probably just about as good--for the same purposes--is the @+ variable or, in English, @LAST_MATCH_END.
So this will get you there:
use English qw<@LAST_MATCH_END>;
my $value = substr( $line, $LAST_MATCH_END[0] );
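For reference, the same idea without the English module might look like the sketch below. The sub name user_agent_value is hypothetical; $+[0] is the element of @+ that holds the offset just past the end of the whole match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text after the "User-Agent:" prefix,
# or undef when the line does not start with it.
sub user_agent_value {
    my ($line) = @_;
    return undef unless $line =~ /^User-Agent:\s+/;
    # $+[0] is the offset of the end of the entire match
    return substr $line, $+[0];
}

my $value = user_agent_value('User-Agent: Mozilla/5.0 (Windows; U)');
print "$value\n";    # Mozilla/5.0 (Windows; U)
```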
Perl 5.10 has a nice feature that allows you to get the simplicity of the $' solutions without the performance problems. You use the /p flag and the ${^POSTMATCH} variable:
use 5.010;
if( $string =~ m/^User-Agent:\s+/ip ) {
my $agent = ${^POSTMATCH};
say $agent;
}
There are some other tricks though. If you can't use Perl 5.10 or later, you can use a global match in scalar context; the value of pos is then where the match left off in the string. You can use that position in substr:
if( $string =~ m/^User-Agent:\s+/ig ) {
my $agent = substr $string, pos( $string );
print $agent, "\n";
}
The pos is similar to the @+ trick that Axeman shows. I think I have some examples with @+ and @- in Mastering Perl in the first chapter.
With Perl 5.14, which is coming soon, there's another interesting way to do this. The /r flag on the s/// does a non-destructive substitution. That is, it matches the bound string but performs the substitution on a copy and returns the copy:
use 5.013; # for now, but 5.014 when it's released
my $string = 'User-Agent: Firefox';
my $agent = $string =~ s/^User-Agent:\s+//r;
say $agent;
I thought that /r was silly at first, but I'm really starting to love it. So many things turn out to be really easy with it. This is similar to the idiom that M42 shows, but it's a bit tricky because the old idiom does an assignment then a substitution, where the /r feature does a substitution then an assignment. You have to be careful with your parentheses there to ensure the right order happens.
Note in this case that since the version is Perl 5.12 or later, you automatically get strictures.
You can use $' to capture the post-match part of the string:
if ( $line =~ m/^User-Agent: / ) {
warn $';
}
(Note that there's a trailing space after the colon there.)
But note, from perlre:
WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
Use $' to get the part of the string to the right of the match.
There is much wailing and gnashing of teeth in the other answers about the "considerable performance penalty" but unless you actually know that your program is rich in use of regular expressions, and that you have a performance problem, I wouldn't worry about it.
We worry too often about optimizations that have little-to-no impact on the actual code. Chances are, this is one of them, too.

How can I match everything that is after the last occurrence of some char in a perl regular expression?

For example, return the part of the string that is after the last x in axxxghdfx445 (should return 445).
my($substr) = $string =~ /.*x(.*)/;
From perldoc perlre:
By default, a quantified subpattern is "greedy", that is, it will match
as many times as possible (given a particular starting location) while
still allowing the rest of the pattern to match.
That's why .*x will match up to the last occurrence of x.
The simplest way would be to use /([^x]*)$/
The first answer is a good one,
but when talking about "something that does not contain"...
I like to use a regex that matches exactly that:
my ($substr) = $string =~ /.*x([^x]*)$/;
Very useful in some cases.
The simplest way is not a regular expression, but a simple split() and taking the last element.
$string="axxxghdfx445";
@s = split /x/, $string;
print $s[-1];
Yet another way to do it. It's not as simple as a single regular expression, but if you're optimizing for speed, this approach will probably be faster than anything using regex, including split.
my $s = 'axxxghdfx445';
my $p = rindex $s, 'x';
my $match = $p < 0 ? undef : substr($s, $p + 1);
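To make the comparison concrete, here is a small sketch that wraps the rindex approach and the greedy-regex approach in two subs with the same interface. The names after_last and after_last_re are hypothetical; both return undef when the character does not occur.

```perl
use strict;
use warnings;

# Hypothetical helper: text after the last occurrence of $char, via rindex.
sub after_last {
    my ($string, $char) = @_;
    my $pos = rindex $string, $char;
    return $pos < 0 ? undef : substr($string, $pos + 1);
}

# The same result using a greedy regex; \Q...\E quotes the character in
# case it is a regex metacharacter.
sub after_last_re {
    my ($string, $char) = @_;
    my ($tail) = $string =~ /.*\Q$char\E(.*)/s;
    return $tail;
}

print after_last('axxxghdfx445', 'x'), "\n";       # 445
print after_last_re('axxxghdfx445', 'x'), "\n";    # 445
```

Whether the rindex version is actually faster for a given workload is best confirmed by measuring, for example with the core Benchmark module.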
I'm surprised no one has mentioned the special variable that does this, $': "$'" returns everything after the matched string. (perldoc perlre)
my $str = 'axxxghdfx445';
$str =~ /x/;
# $' contains '445';
print $';
However, there is a cost (emphasis mine):
WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere
in the program, it has to provide them for every pattern match. This
may substantially slow your program. Perl uses the same mechanism to
produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parentheses. (To avoid this cost while retaining
the grouping behaviour, use the extended regular expression "(?: ... )"
instead.) But if you never use $&, "$`" or "$'", then patterns without
capturing parentheses will not be penalized. So avoid $&, "$'", and
"$`" if you can, but if you can't (and some algorithms really
appreciate them), once you've used them once, use them at will, because
you've already paid the price. As of 5.005, $& is not so costly as the
other two.
But wait, there's more! You get two operators for the price of one, act NOW!
As a workaround for this problem, Perl 5.10.0 introduces
"${^PREMATCH}", "${^MATCH}" and "${^POSTMATCH}", which are equivalent
to "$`", $& and "$'", except that they are only guaranteed to be
defined after a successful match that was executed with the "/p"
(preserve) modifier. The use of these variables incurs no global
performance penalty, unlike their punctuation char equivalents, however
at the trade-off that you have to tell perl when you want to use them.
my $str = 'axxxghdfx445';
$str =~ /x/p;
# ${^POSTMATCH} contains '445';
print ${^POSTMATCH};
I would humbly submit that this route is the best and most straight-forward
approach in most cases, since it does not require that you do special things
with your pattern construction in order to retrieve the postmatch portion, and there
is no performance penalty.
Regular expression: /([^x]+)$/ # assuming x is not the last character of the string