I'm wondering about using constants in Perl regexes. I want to do something similar to:
use constant FOO => "foo";
use constant BAR => "bar";
$somvar =~ s/prefix1_FOO/prefix2_BAR/g;
Of course, in there, FOO resolves to the three letters F O O instead of expanding to the constant.
I looked online, and someone suggested using either ${\FOO} or @{[FOO]}. Someone else mentioned (?{FOO}). I was wondering if anyone could shed some light on which of these is correct, and whether any of them has an advantage. Alternatively, is it better to just use a non-constant variable? (Performance is a factor in my case.)
There's not much in the way of reasons to use a constant over a variable. It doesn't make a great deal of difference - perl will compile a regex anyway.
For example:
#!/usr/bin/perl
use warnings;
use strict;
use Benchmark qw(:all);
use constant FOO => "foo";
use constant BAR => "bar";
my $FOO_VAR = 'foo';
my $BAR_VAR = 'bar';
sub pattern_replace_const {
my $somvar = "prefix1_foo test";
$somvar =~ s/prefix1_${\FOO}/prefix2_${\BAR}/g;
}
sub pattern_replace_var {
my $somvar = "prefix1_foo test";
$somvar =~ s/prefix1_$FOO_VAR/prefix2_$BAR_VAR/g;
}
cmpthese(
1_000_000,
{ 'const' => \&pattern_replace_const,
'var' => \&pattern_replace_var
}
);
Gives:
          Rate const   var
const 917095/s    --   -1%
var   923702/s    1%    --
Really not enough in it to worry about.
However, it may be worth noting that you can compile a regex with qr// and use that, which, provided the RE is static, might improve performance (but it might not, because perl can detect static regexes and does that itself).
             Rate      var    const compiled
var      910498/s       --      -2%      -9%
const    933097/s       2%       --      -7%
compiled 998502/s      10%       7%       --
With code like:
my $compiled_regex = qr/prefix1_$FOO_VAR/;
sub compiled_regex {
my $somvar = "prefix1_foo test";
$somvar =~ s/$compiled_regex/prefix2_$BAR_VAR/g;
}
Honestly though - this is a micro optimisation. The regex engine is fast compared to your code, so don't worry about it. If performance is critical to your code, then the correct way of dealing with it is to first write the code, and then profile it to look for hotspots to optimise.
The shown problem is due to those constants being barewords (built at compile time)
Constants defined using this module cannot be interpolated into strings like variables.
In the current implementation (of the constant pragma) they are "inlinable subroutines" (see †).
This problem can be solved nicely by using a module like Const::Fast
use Const::Fast;
const my $foo => 'FOO';
const my $bar => 'BAR';
my $var = 'prefix1_FOO_more';
$var =~ s/prefix1_$foo/prefix2_$bar/g;
Now they do get interpolated. Note that more complex replacements may need /e.
These are built at runtime so you can assign results of expressions to them. In particular, you can use the qr operator, for example
const my $patt => qr/$foo/i; # case-insensitive
The qr is the recommended way to build regex patterns. (It interpolates unless the delimiter is '.) The performance gain is most often tiny, but you get a proper regular expression, which can be built and used as such (and in this case a constant as well).
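As a rough sketch of what "built and used as such" can look like, the snippet below composes a larger pattern from smaller qr pieces. The variable names and sample strings are made up for illustration; only qr itself comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical pieces of a larger pattern, each compiled with qr//.
my $prefix = qr/prefix1_/;
my $word   = qr/foo/i;          # the /i flag stays attached to this piece only

# qr objects interpolate into bigger patterns like any other variable.
my $full = qr/^$prefix($word)\b/;

my @hits;
for my $s ('prefix1_foo test', 'prefix1_FOO test', 'PREFIX1_foo test') {
    push @hits, $1 if $s =~ $full;
}
print join(',', @hits), "\n";   # the upper-case prefix does not match
```

Note that the case-insensitivity applies only to the piece that asked for it, which is one practical reason to keep sub-patterns as separate qr objects.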
I recommend Const::Fast module over the other one readily, and in fact over all others at this time. See a recent article with a detailed discussion of both. Here is a review of many other options.
I strongly recommend to use a constant (of your chosen sort) for things meant to be read-only. That is good for the health of the code, and of developers who come into contact with it (yourself in the proverbial six months included).
† These being subroutines, we need to be able to run code in order to have them evaluated and replaced by given values. Can't just "interpolate" (evaluate) a variable -- it's not a variable.
A way to run code inside a string (which need be interpolated, so effectively double quoted) is to de-reference, where there's an expression in a block under a reference; then the expression is evaluated. So we need to first make that reference. So either
say "@{ [FOO] }"; # make array reference, then dereference
or
say "${ \FOO }"; # make scalar reference then dereference
prints foo. See the docs for why this works and for examples. Thus one can do the same inside a regex, and both in matching and replacement parts
s/prefix1_${\FOO}/prefix2_${\BAR}/g;
(or with @{[...]}), since they are evaluated as interpolated strings.
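For anyone who wants to try this, here is a small self-contained script that exercises both dereference tricks. The last part is my own suggestion rather than something from the posts: concatenate the constant into an ordinary lexical once and interpolate that, which avoids the tricks entirely. All variable names are illustrative.

```perl
use strict;
use warnings;
use constant FOO => 'foo';
use constant BAR => 'bar';

# The two dereference tricks, in an ordinary double-quoted string.
my $via_scalar = "prefix1_${\FOO}";     # scalar reference, then dereference
my $via_array  = "prefix1_@{[FOO]}";    # array reference, then dereference

# The same trick on both sides of a substitution.
(my $tricky = 'prefix1_foo more') =~ s/prefix1_${\FOO}/prefix2_${\BAR}/g;

# Alternative (an assumption of mine, not from the posts): build plain
# lexicals by concatenation and interpolate those instead.
my $from = 'prefix1_' . FOO;
my $to   = 'prefix2_' . BAR;
(my $plain = 'prefix1_foo more') =~ s/\Q$from\E/$to/g;

print "$via_scalar $via_array\n$tricky\n$plain\n";
```

Both substitutions should end up with the same string; the second form is just easier on the next reader.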
Which is "better"? These are tricks. There is rarely, if ever, a need for doing this. It has a very good chance to confuse the reader. So I just wouldn't recommend resorting to these at all.
As for (?{ code }), that is a regex feature, whereby code is evaluated inside a pattern (matching side only). It is complex and tricky and very rarely needed. See about it in perlretut and in perlre.
Discussing speed of these things isn't really relevant. They are certainly outside the realm of clean and idiomatic code, while you'd be hard pressed to detect runtime differences.
But if you must use one of these, I'd much rather interpolate inside a scalar reference via a trick than reach for a complex regex feature.
According to PerlMonks, you'd better create an already-interpolated string if you are concerned about performance:
use constant PATTERN => 'def';
my $regex = qr/${\(PATTERN)}/; #options such as /m can go here.
if ($string =~ $regex) { ... }
Here is the link to the whole discussion.
use constant FOO => "foo";
use constant BAR => "bar";
$var =~ s/prefix1_${\FOO}/prefix2_${\BAR}/g;
Credit: https://www.perlmonks.org/?node_id=293323
(sorry for the title, but this "feature" really confuses me)
Learning Perl, I learned that the o modifier for a regular expression using variables would be evaluated only once, even if the variable changes after initial evaluation.
Initially that reads like having no issues, this being clearly specified.
Obviously that initial evaluation cannot happen before the variable being used has got its value.
Now qr made life a bit more interesting.
Consider this code (executed in a loop defining other variables, too):
{
my $n = $name;
$n =~ s/[^\w\.-]/_/g;
$n = qr:^${n}\#${another_variable}$:o;
@a = grep { !/$n/ } @a;
}
When using the regex for qr directly, one could argue that the regex is compiled only once, even if the scope with the variable goes out of scope (is going out of scope being considered as a change of the variable?)
But when using qr to build a regex, assigning it to a lexical variable, the compiled regex would go out of scope, so I was expecting that the regex cannot be reused and would be re-built (The basic idea was that the regex inside grep shouldn't be rebuilt for every iteration).
As life is cruel, it seems the whole regex referenced by $n isn't ever rebuilt, so the first value is used until the program stops.
Interestingly, perlre(1) in Perl 5.18.2 (the version being used) does not mention the o modifier any more, and in Perl 5.26.1 the corresponding page says:
o - pretend to optimize your code, but actually introduce bugs
So can anybody explain the rules for "once" evaluation (and whether the semantics had changed over the lifespan of Perl)?
Related:
Does the 'o' modifier for Perl regular expressions still provide any benefit?
How do you force Perl to re-compile a regex compiled with "/o" on demand?
Perl has a couple of constructs that don't just store state in variables, but also some states in the opcodes themselves. Aside from /o regex patterns, this also includes the .. flip-flop operator (in scalar context) or state variables.
Perhaps state variables are clearest, since it corresponds to static local variables in many other languages (e.g. C). A state variable is initialized at most once during the lifetime of the program. An expression state $var = initialize() can be understood as
my $var;
if (previously_initialized) {
$var = cached_value;
} else {
$var = initialize();
}
This does not track dependencies in the initialize() expression, but only evaluates it once.
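A minimal runnable version of that idea, assuming a made-up initialize() that counts how often it is called (the counter and the function names are only there for the demonstration):

```perl
use strict;
use warnings;
use feature 'state';

my $calls = 0;
sub initialize { $calls++; return 42 }   # hypothetical expensive initializer

sub get_value {
    state $var = initialize();   # runs only on the first call
    return $var;
}

get_value() for 1 .. 3;
print "initialize() ran $calls time(s)\n";
```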
Similarly, it can make sense to consider a regex pattern /.../o as a kind of hidden state variable state $compiled_pattern = qr/.../.
The /o feature was a good idea a very long time ago when regexes were compiled on the fly, similarly to how it works in other languages where regex patterns are provided to a search function as a string.
It hasn't been necessary for performance purposes for a long time, and only has effects when doing variable interpolation. But if you actually want that behaviour, using a state variable would communicate that intent more clearly. Thus, I'd argue that there is no appropriate use for the /o modifier.
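To make the state-variable suggestion concrete, here is a sketch of a function that deliberately compiles its pattern from the first argument it ever sees and keeps it. The function name and inputs are invented for the example.

```perl
use strict;
use warnings;
use feature qw(say state);

sub replace_first_seen {
    my ($str, $v) = @_;
    state $re = qr/$v/;     # built from the first $v only, by design
    $str =~ s/$re/X/;
    return $str;
}

my @out = map { replace_first_seen('a1', $_) } qw(a b c);
say "@out";
```

Every call keeps matching against the pattern built from "a", and the state declaration makes that "once" behaviour visible at the point where the pattern is created.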
The "largely obsolete /o" (perlop) flag still has the "once" meaning and operation. While it is barely mentioned in perlre and in passing, it is addressed in perlop
/PATTERN/msixpodualngc
...
... Perl will not recompile the pattern unless an interpolated variable that it contains changes. You can force Perl to skip the test and never recompile by adding a /o (which stands for "once") after the trailing delimiter. Once upon a time, Perl would recompile regular expressions unnecessarily, and this modifier was useful to tell it not to do so, in the interests of speed. But now, the only reasons to use /o are one of:
The variables are thousands of characters long and you know that they don't change, and you need to wring out the last little bit of speed by having Perl skip testing for that. (There is a maintenance penalty for doing this, as mentioning /o constitutes a promise that you won't change the variables in the pattern. If you do change them, Perl won't even notice.)
you want the pattern to use the initial values of the variables regardless of whether they change or not. (But there are saner ways of accomplishing this than using /o.)
If the pattern contains embedded code, such as
use re 'eval';
$code = 'foo(?{ $x })';
/$code/
then perl will recompile each time, even though the pattern string hasn't changed, to ensure that the current value of $x is seen each time. Use /o if you want to avoid this.
The bottom line is that using /o is almost never a good idea.
So, indeed, apparently it won't even test whether variables to interpolate changed, and this may have a legitimate use. But, indeed, all told it probably shouldn't be used.
An example to demonstrate the "once" operation
perl -Mstrict -wE'
sub tt {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say tt( q(a1), $_ ) for qw(a b c)'
With the /o, either on the qr-ed pattern or on the regex, this matches (changes) that a1 string every time even though a is passed for the pattern only in the first iteration. Clearly the pattern isn't recompiled since the variable later has b and then c and shouldn't match.
Without /o only the first iteration has the regex matching.
In every iteration the lexical $re clearly goes away, with the whole function, but its originally compiled value keeps getting used. This means that some of the operation of /o is "very" global. Such stubbornly global behavior is not unheard of with some other ancient features (glob comes to mind, for one).
It goes the same way if the sub is a code reference in a lexical variable, remade all the time in a dynamic scope.
perl -Mstrict -wE'
for my $c (qw(a b c)) {
my $tt = sub {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say $tt->( q(a1), $c )
}'
Prints X1 all three times.
Recently I have become friendly with regular expressions and used them to overcome a number of tasks very efficiently. As with most of Perl, TIMTOWTDI has clouded my judgement. There are times when I can use either the equality operator or the binding operator. Are there times when it is more appropriate to use one over the other?
Firstly the simplified case
my $name = 'Chris';
if ($name eq 'Chris') { print 'What a great name!'; }
if ($name =~/^Chris$/) { print 'Yip sure is a great name'; }
So in this most simplified case, using the equality operator is less typing; however, in this simplified example is there any benefit to one or the other?
In a slightly more complex example
my $name = 'Christopher';
if ($name eq 'Chris' || $name eq 'Christopher') { print 'What a great name!'; }
if ($name =~ /^Chris(?:topher)?$/) { print 'Yip sure is a great name'; }
Here the binding operator is less typing. However, I am not sure of the benefit either may hold over the other.
So is the general rule to use the equality operator if you are matching an entire string against a fixed value, and to use the binding operator if you're matching a string against a pattern, for example any 5-digit string /\d{5}/?
Is it inappropriate to use the binding operator in the above examples? I appreciate that these examples are made up and may not reflect a real-life problem. However, they were the ones I thought of to try to explain my question.
however in this simplified example is there any benefit to one or the other.
Well, they're not equivalent. /^Chris$/ matches Chris and Chris followed by a newline.
If you had used an equivalent pattern (/^Chris\z/), the difference would have been performance. A single string comparison will be faster than a regex match. It's also clearer.
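The difference between $ and \z is easy to check. This sketch runs three made-up inputs through both anchors and through eq:

```perl
use strict;
use warnings;

my @names = ("Chris", "Chris\n", "Chris\n\n");

# 1 where the test succeeds, 0 where it fails, in input order.
my @dollar = map { /^Chris$/     ? 1 : 0 } @names;   # $ tolerates one trailing newline
my @z      = map { /^Chris\z/    ? 1 : 0 } @names;   # \z is the true end of string
my @eq     = map { $_ eq 'Chris' ? 1 : 0 } @names;

print "\$ : @dollar\n\\z: @z\neq: @eq\n";
```

Only the \z version agrees with eq on all three inputs.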
For more complicated comparisons, you generally want to go with what's simpler, clearer, and more maintainable. Address performance (by profiling and running benchmarks) when it becomes an issue.
I would expect slightly (if at all) better performance from the eq operator because the regular expression might require a compilation phase as well as analysis before coming up with its determination.
So in the case:
if ($name eq 'Chris') { print 'What a great name!'; }
if ($name =~/^Chris$/) { print 'Yip sure is a great name'; }
... I would expect the first statement to be fastest.
In the second example, however, you have to consider the summed times of the failed cases where you've provided a logical OR:
if ($name eq 'Chris' || $name eq 'Christopher') { print 'What a great name!'; }
if ($name =~ /^Chris(?:topher)?$/) { print 'Yip sure is a great name'; }
... here things are less cut and dried. Sure, eq may be faster, but are two eqs faster than a regular expression which doesn't have to backtrack (in this example)? I can't be so sure.
Usually you won't have to consider the performance benefits. So you can't argue one is "better" than the other - I'd usually encourage code clarity in this situation. But it's important to realise that eq is very unforgiving while regular expressions are very flexible - allowing for case-insensitive searches, anchoring to just the beginning, etc. When you do hit some code in which comparison speed is critical then ultimately you'll want to benchmark.
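If you do reach the point of benchmarking, something along these lines would be a starting point. The name list and the sub names are invented, and the pattern uses \z so that both versions accept exactly the same strings. I'm deliberately not quoting any timings, since they depend on the perl version and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: two names should match, two should not.
my @names = qw(Chris Christopher Christine Bob);

sub by_eq { return scalar grep { $_ eq 'Chris' || $_ eq 'Christopher' } @names }
sub by_re { return scalar grep { /^Chris(?:topher)?\z/ } @names }

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, { eq => \&by_eq, regex => \&by_re });
```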
The power of regular expressions is realized in its variability.
When you give a regex engine a template, you "suggest" match outcomes to the engine.
Internally, it's the same C strncmp() and such that you would get in Perl with, e.g., $str eq "asdf"; both are templates.
However, you cannot describe variability very well with just a language; that's why regular expression engines exist.
There is an overhead to entering the engine, i.e. resetting variables, state tracking, etc.
But after that, the engine will outperform any combination of language constructs you can conceive of. Not by a little, but by a huge, huge percentage.
I have multiple text fields, each a paragraph of text, and I want to search those fields for a specific pattern using a regular expression, for example:
$text1 =~ /(my pattern)/ig;
$text2 =~ /(my pattern)/ig;
...
$textn =~ /(my pattern)/ig;
I wonder if there is an efficient way to search multiple texts with the same regular expression in Perl, or should I use the above format?
Use a topicaliser.
for ($text1, $text2, $textn) {
/(my pattern)/ig && do { ... };
}
When you have numbered variables, it's a red flag that you should consider a compound data structure instead. With a simple array it looks nearly the same:
for (@texts) {
/(my pattern)/ig && do { ... };
}
my $pattern = qr/((?i)my pattern)/;
my @matches;
push @matches, $text1 =~ /$pattern/g;
push @matches, $text2 =~ /$pattern/g;
push @matches, $textn =~ /$pattern/g;
That's about as efficient as I can think of - theoretically pre-compiles the regex once, though I'm not sure if interpolating it back into // to get the 'g' modifier undoes any of that compilation. Of course, I also have to wonder if this is really a bottleneck, and if you're just looking at some premature optimisation.
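Putting the array idea and the precompiled pattern together, a sketch could look like this. The sample texts are invented, and the /i flag on qr stands in for the inline case-insensitive modifier.

```perl
use strict;
use warnings;

# Hypothetical input: any list of text fields would do.
my @texts = ('My Pattern here', 'nothing to see', 'my pattern and MY PATTERN');

my $pattern = qr/(my pattern)/i;    # compiled once, reused for every text

my @matches;
for my $text (@texts) {
    # In list context, /g returns every capture found in $text.
    push @matches, $text =~ /$pattern/g;
}

print scalar(@matches), " matches: @matches\n";
```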
The answer to this question depends on whether your pattern contains any variables. If it does not, perl is already smart enough to only build the RE once, as long as it's identical everywhere.
Now, if you do use variables, then @Tanktalus's answer is close, but adds unnecessary complexity by compiling the RE an additional time.
Use this:
my @matches;
push @matches, $text1 =~ /((?i)my pattern with a $variable)/o;
push @matches, $text2 =~ /((?i)my pattern with a $variable)/o;
push @matches, $textn =~ /((?i)my pattern with a $variable)/o;
Why?
By using a variable in the RE pattern, perl is forced to re-compile for every instance, even when that variable is a pre-compiled RE as in @Tanktalus's answer. The /o ensures that it is only compiled once, the first time it's encountered, but it still must be compiled once for every occurrence in the code. This is because Perl has no way of knowing if $pattern changed between the different uses.
Other considerations
In practice, as @Tanktalus also said, I suspect this is a big fat case of premature optimization. /o only matters when your pattern contains variables (otherwise perl is smart enough to only compile once anyway!)
The far more useful reason to use a pre-compiled RE, as @Tanktalus has suggested, is to improve code readability. If you have a big hairy RE, then using $pattern everywhere will greatly improve readability, at only a minor cost in performance (one you're not likely to ever notice).
Conclusion
Just use /o for your REs if they contain variables (unless you actually need the variables to change the RE on every run), and don't worry about it otherwise.
I'm writing a simple Perl script that translates assembly instruction strings to 32-bit binary code.
I decided to handle translation grouping instruction by type (ADD and SUB are R-Type instructions and so on...) so in my code I'm doing something like this:
my $bin = &r_type($instruction) if $instruction =~ /^(?:add|s(?:ub|lt|gt))\s/;
because I want to handle add, sub, slt and sgt in the same way.
I realized however that maybe using that regular expression could be an 'overkill' for the task I'm supposed to do... could the pattern
/^(?:add|sub|slt|sgt)\s/
represent a better use of regular expressions in this case?
Thanks a lot.
Unless you are using a perl older than 5.10, the simple alternation will perform better anyway (see here), so there is no reason to try to optimize it.
Instead of placing the mnemonics buried inside regular expressions, build a dispatch table using a hash. It will be at least as fast and your code far easier to follow:
my %emitter = (add => \&r_type,
sub => \&r_type,
slt => \&r_type,
sgt => \&r_type,
...);
if ($instruction =~ /^(\S+)/) {
my $emitter = $emitter{$1} // die "bad instruction $instruction";
$emitter->($1, $instruction);
}
else {
# error?...
}
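For completeness, here is a runnable sketch of the same dispatch-table idea. The r_type and i_type stubs, the extra I-type mnemonics and the translate wrapper are placeholders I made up; a real translator would return the 32-bit encoding instead of a label.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real encoders.
sub r_type { my ($op, $line) = @_; return "R-type:$op" }
sub i_type { my ($op, $line) = @_; return "I-type:$op" }

# One entry per mnemonic; map keeps the grouping by instruction type visible.
my %emitter = (
    (map { $_ => \&r_type } qw(add sub slt sgt)),
    (map { $_ => \&i_type } qw(addi lw sw)),
);

sub translate {
    my ($instruction) = @_;
    my ($op) = $instruction =~ /^(\S+)/
        or die "empty instruction\n";
    my $emit = $emitter{$op} // die "bad instruction: $instruction\n";
    return $emit->($op, $instruction);
}

print translate('sub $1, $2, $3'), "\n";
```

Adding a new instruction is then a one-word change in the table rather than an edit to a regular expression.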
I like salva's dispatch table (I show a lot of that in Mastering Perl), but I'll answer another aspect of the question in case you need that answer for a different problem someday.
When you want to build some alternations, some of which might be nested, you can use something like Regexp::Trie to build the alternation for you so you don't look at the ugly regex syntax:
use Regexp::Trie;
my $rt = Regexp::Trie->new;
foreach ( qw/add sub slt sgt/ ) {
$rt->add($_);
}
print $rt->regexp, "\n";
That gives you:
(?-xism:(?:add|s(?:gt|lt|ub)))
This way, you list the opcodes like Jonathan suggested, but also get the alternation. As ysth noted, you might get this for free with Perl now anyway.
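If installing a CPAN module is not an option, a plain join over the opcode list gives the flat alternation. This is a different, simpler technique than Regexp::Trie, shown here only as a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my @ops = qw(add sub slt sgt);

# quotemeta guards against regex metacharacters in a mnemonic.
my $alternation = join '|', map { quotemeta($_) } @ops;
my $r_type_re   = qr/^(?:$alternation)\s/;

for my $line ('sub $1, $2, $3', 'addi $1, $2, 4') {
    printf "%-16s %s\n", $line, ($line =~ $r_type_re ? 'R-type' : 'other');
}
```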
Your second version is simpler, more readable, and more maintainable. The performance difference will depend on the regex implementation, but I suspect the nested version will run slower due to its increased complexity.
Yes it's overkill.
I'm running a regular expression against a large scalar. Though this match isn't capturing anything, my process grows by 30M after this match:
# A
if (${$c} =~ m/\G<<\s*/cgs)
{
#B
...
}
$c is a reference to a pretty big scalar (around 21M), but I've verified that pos(${$c}) is in the right place and the expression matches at the first character, with pos(${$c}) being updated to the correct place after the match. But as I mentioned, the process has grown by about 30M between #A and #B, even though I'm not capturing anything with this match. Where is my memory going?
Edit: Yes, use of $& was to blame. We are using Perl 5.8.8, and my script was using Getopt::Declare, which uses the built-in Text::Balanced. The 1.95 version of this module was using $&. The 2.0.0 version that ships with Perl 5.10 has removed the reference to $& and alleviates the problem.
Just a quick sanity check, are you mentioning $&, $` or $' (sometimes called $MATCH, $PREMATCH and $POSTMATCH) anywhere in your code? If so, Perl will copy your entire string for every regular expression match, just in case you want to inspect those variables.
"In your code" in this case means indirectly, including using modules that reference these variables, or writing use English rather than use English qw( -no_match_vars ).
If you're not sure, you can use the Devel::SawAmpersand module to determine if they have been used, and Devel::FindAmpersand to figure out where they are used.
There may be other reasons for the increase in memory (which version of Perl are you using?), but the match variables will definitely blow your memory if they're used, and hence are a likely culprit.
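If the code really needs the matched text or its surroundings, one way to get them without touching $&, $` or $' is to use the offset arrays @- and @+ together with substr. The sample string below is made up. The English import is there only to show the -no_match_vars form and is not otherwise used.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);   # English names without the match-variable aliases

my $text = 'xx << rest';          # hypothetical input
my ($pre, $match, $post);

if ($text =~ /<<\s*/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    $pre   = substr($text, 0, $-[0]);
    $match = substr($text, $-[0], $+[0] - $-[0]);
    $post  = substr($text, $+[0]);
}

print "pre='$pre' match='$match' post='$post'\n";
```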
Cheerio,
Paul