Why is my Perl regex using so much memory?


I'm running a regular expression against a large scalar. Though this match isn't capturing anything, my process grows by 30M after this match:
# A
if (${$c} =~ m/\G<<\s*/cgs)
{
#B
...
}
$c is a reference to a pretty big scalar (around 21M), but I've verified that pos(${$c}) is in the right place and the expression matches at the first character, with pos(${$c}) being updated to the correct place after the match. But as I mentioned, the process has grown by about 30M between #A and #B, even though I'm not capturing anything with this match. Where is my memory going?
Edit: Yes, use of $& was to blame. We are using Perl 5.8.8, and my script was using Getopt::Declare, which uses the built-in Text::Balanced. The 1.95 version of this module was using $&. The 2.0.0 version that ships with Perl 5.10 has removed the reference to $& and alleviates the problem.

Just a quick sanity check, are you mentioning $&, $` or $' (sometimes called $MATCH, $PREMATCH and $POSTMATCH) anywhere in your code? If so, Perl will copy your entire string for every regular expression match, just in case you want to inspect those variables.
"In your code" in this case means indirectly, including using modules that reference these variables, or writing use English rather than use English qw( -no_match_vars ).
If you're not sure, you can use the Devel::SawAmpersand module to determine if they have been used, and Devel::FindAmpersand to figure out where they are used.
There may be other reasons for the increase in memory (which version of Perl are you using?), but the match variables will definitely blow your memory if they're used, and hence are a likely culprit.
Cheerio,
Paul
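As a minimal sketch of the advice above (the helper name matched_text is hypothetical, not from the original posts): if the matched text is needed, it can be rebuilt from the offset arrays @- and @+ rather than read from $&, and use English can be loaded with -no_match_vars so the long names do not pull in the match variables.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);   # long names, but no $MATCH/$PREMATCH/$POSTMATCH

# Hypothetical helper: returns the matched text without touching $&.
sub matched_text {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    # $-[0] and $+[0] are the start and end offsets of the whole match
    return substr($str, $-[0], $+[0] - $-[0]);
}

print matched_text('foo << bar', qr/<<\s*/), "\n";
```

This only covers your own code; a module that mentions $& somewhere still triggers the copying on the Perl versions discussed here.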

Related

What does `/regex/o` really mean (once there was once, but it seems gone now)?

(sorry for the title, but this "feature" really confuses me)
Learning Perl, I learned that with the o modifier, a regular expression using variables is evaluated only once, even if a variable changes after the initial evaluation.
At first that sounds unproblematic, since the behaviour is clearly specified.
Obviously that initial evaluation cannot happen before the variable being used has got its value.
Now qr made life a bit more interesting.
Consider this code (executed in a loop defining other variables, too):
{
my $n = $name;
$n =~ s/[^\w\.-]/_/g;
$n = qr:^${n}\@${another_variable}$:o;
@a = grep { !/$n/ } @a;
}
When using the regex for qr directly, one could argue that the regex is compiled only once, even if the scope with the variable goes out of scope (is going out of scope being considered as a change of the variable?)
But when using qr to build a regex, assigning it to a lexical variable, the compiled regex would go out of scope, so I was expecting that the regex cannot be reused and would be re-built (The basic idea was that the regex inside grep shouldn't be rebuilt for every iteration).
As life is cruel, it seems the whole regex referenced by $n isn't ever rebuilt, so the first value is used until the program stops.
Interestingly, perlre(1) in Perl 5.18.2 (the version being used) no longer mentions the o modifier, and perl 5.26.1 says in the corresponding page:
o - pretend to optimize your code, but actually introduce bugs
So can anybody explain the rules for "once" evaluation (and whether the semantics had changed over the lifespan of Perl)?
Related:
Does the 'o' modifier for Perl regular expressions still provide any benefit?
How do you force Perl to re-compile a regex compiled with "/o" on demand?

Perl has a couple of constructs that don't just store state in variables, but also some states in the opcodes themselves. Aside from /o regex patterns, this also includes the .. flip-flop operator (in scalar context) or state variables.
Perhaps state variables are clearest, since they correspond to static local variables in many other languages (e.g. C). A state variable is initialized at most once during the lifetime of the program. An expression state $var = initialize() can be understood as
my $var;
if (previously_initialized) {
$var = cached_value;
} else {
$var = initialize();
}
This does not track dependencies in the initialize() expression, but only evaluates it once.
Similarly, it can make sense to consider a regex pattern /.../o as a kind of hidden state variable state $compiled_pattern = qr/.../.
The /o feature was a good idea a very long time ago when regexes were compiled on the fly, similarly to how it works in other languages where regex patterns are provided to a search function as a string.
It hasn't been necessary for performance purposes for a long time, and only has effects when doing variable interpolation. But if you actually want that behaviour, using a state variable would communicate that intent more clearly. Thus, I'd argue that there is no appropriate use for the /o modifier.
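To illustrate the state-variable analogy, here is a small sketch (the sub name matches_first_pattern is made up for this example). The pattern is compiled from whatever $v holds on the first call, and every later call reuses it, which is roughly what /o does behind the scenes.

```perl
use strict;
use warnings;
use feature qw(say state);

# Hypothetical helper: the pattern is frozen on the first call, much like /o.
sub matches_first_pattern {
    my ($str, $v) = @_;
    state $re = qr/$v/;    # initialized once; later values of $v are ignored
    return $str =~ $re ? 1 : 0;
}

say matches_first_pattern('a1', $_) for qw(a b c);    # prints 1 three times
```

Unlike /o, the state declaration makes the "only once" behaviour visible to whoever reads the code.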

The "largely obsolete /o" (perlop) flag still has the "once" meaning and operation. While perlre mentions it only in passing, it is addressed in perlop:
/PATTERN/msixpodualngc
...
... Perl will not recompile the pattern unless an interpolated variable that it contains changes. You can force Perl to skip the test and never recompile by adding a /o (which stands for "once") after the trailing delimiter. Once upon a time, Perl would recompile regular expressions unnecessarily, and this modifier was useful to tell it not to do so, in the interests of speed. But now, the only reasons to use /o are one of:
The variables are thousands of characters long and you know that they don't change, and you need to wring out the last little bit of speed by having Perl skip testing for that. (There is a maintenance penalty for doing this, as mentioning /o constitutes a promise that you won't change the variables in the pattern. If you do change them, Perl won't even notice.)
you want the pattern to use the initial values of the variables regardless of whether they change or not. (But there are saner ways of accomplishing this than using /o.)
If the pattern contains embedded code, such as
use re 'eval';
$code = 'foo(?{ $x })';
/$code/
then perl will recompile each time, even though the pattern string hasn't changed, to ensure that the current value of $x is seen each time. Use /o if you want to avoid this.
The bottom line is that using /o is almost never a good idea.
So, indeed, apparently it won't even test whether variables to interpolate changed, and this may have a legitimate use. But, indeed, all told it probably shouldn't be used.
An example to demonstrate the "once" operation
perl -Mstrict -wE'
sub tt {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say tt( q(a1), $_ ) for qw(a b c)'
With the /o, either on the qr-ed pattern or on the regex, this matches (changes) that a1 string every time even though a is passed for the pattern only in the first iteration. Clearly the pattern isn't recompiled since the variable later has b and then c and shouldn't match.
Without /o only the first iteration has the regex matching.
In every iteration the lexical $re clearly goes away, with the whole function, but its originally compiled value keeps getting used. This means that some of the operation of /o is "very" global. Such stubbornly global behavior is not unheard of with some other ancient features (glob comes to mind, for one).
It goes the same way if the sub is a code reference in a lexical variable, remade all the time in a dynamic scope.
perl -Mstrict -wE'
for my $c (qw(a b c)) {
my $tt = sub {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say $tt->( q(a1), $c )
}'
Prints X1 all three times.
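For contrast, a sketch of the kind of loop body the question describes, written without /o. The sub name drop_matching, its arguments, and the "-" separator are assumptions for illustration only. qr// builds a fresh pattern on every call, and to my knowledge grep does not rebuild a pattern that consists solely of one qr object for each element.

```perl
use strict;
use warnings;

# Hypothetical helper, loosely modelled on the loop in the question.
sub drop_matching {
    my ($name, $suffix, @items) = @_;
    (my $n = $name) =~ s/[^\w.-]/_/g;
    my $re = qr/^\Q$n\E-\Q$suffix\E$/;   # no /o: rebuilt on every call
    return grep { !/$re/ } @items;
}

my @left = drop_matching('foo', 'x', 'foo-x', 'bar-x');   # leaves ('bar-x')
@left    = drop_matching('bar', 'x', @left);              # leaves ()
print scalar(@left), "\n";                                # prints 0
```

The \Q...\E quoting is an addition of mine so that a "." in the name matches literally; drop it if the interpolated values are meant to be patterns.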

Using a backreference as key in a hashmap within a regex-substitution?

I am learning the Perl language and I stumbled upon the following question:
Is it possible to use a backreference as a key in a substitution argument, e.g. something like:
$hm{"Cat"} = "Dog";
while(<>){
s/Cat/$hm{\1}/
print;
}
That is, I want to tell Perl to look up a key which is contained in a capture argument.
I know that this is a silly example. But I am just curious on the question as to whether it is possible to use such a key-lookup with a backreference in a substitution.

Use $1 instead.
While backrefs like \1 work in the replacement part of a substitution, they only work in string context. $hm{KEY} accesses an item in a hash. The KEY part can be a bareword or an expression. In an expression, \1 would be a “reference to a literal scalar with value 1”, which would stringify as something like SCALAR(0x55776153ecb0), not a back-reference as in a string. Instead, we can access the value of captures in the regex with variables like $1.
But that requires us to capture a part of the regex. I would write it as:
s/(Cat)/$hm{$1}/;
As a rule of thumb, only use backrefs like \1 within a regex pattern. Everywhere else use capture variables like $1. If you use warnings, Perl will also tell you that \1 is better written as $1, though it wouldn't have detected the particular issue in your case as the \1 was still valid syntax, albeit with a different meaning.

If you are looking at really old code, you'll see people using the \1 form on the replacement side of the substitution. Sometimes you'll see it in really new code; it's a Perl 4 thing that still works, but Perl 5 added a warning. If you have warnings turned on, perl will tell you that (although I don't know when this warning started):
$ perl5.36.0 -wpe 's/cat(dog)/\1/'
\1 better written as $1 at -e line 1.
With diagnostics you get even more information about the warning:
$ perl5.36.0 -Mdiagnostics -wpe 's/cat(dog)/\1/'
\1 better written as $1 at -e line 1 (#1)
(W syntax) Outside of patterns, backreferences live on as variables.
The use of backslashes is grandfathered on the right-hand side of a
substitution, but stylistically it's better to use the variable form
because other Perl programmers will expect it, and it works better if
there are more than 9 backreferences.
There are many other warnings that Perl uses to show you better ways to do things.
A level above that is perlcritic, which is an opinionated set of policies about what some people find to be good style. It's not a terrible place to start before you develop your own ideas about what works for you or your team.

It has been explained that one wants to use a capture for that, not a backreference, like
perl -wE'$_=shift//q(hal); %h = (a => 7); s/(a)/$h{$1}/; say'
What I'd like to add is a note on what happens if there is in fact no key for that capture.
We often capture complex patterns, not simple literals, and it can happen that our anticipated keys don't cover every case that may come up. A way to check for that involves the modifier /e, which makes the replacement part be evaluated as code, and its return is then substituted into the string
perl -wE'$_=shift//q(hal); %h = (a => 7); s{(h)}{$h{$1} // "default"}e; say'
Now if the pattern is matched and captured but isn't a key (h) then the string default is substituted. An often sensible choice for default is the capture itself (if not a key put it back).
The replacement part must now be syntactically correct code, so no bare literals, etc.
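The "put the capture back" default mentioned above can be sketched like this. The hash contents and the sub name translate are made up for the example, and the // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'say';

my %h = (cat => 'dog');   # hypothetical lookup table

sub translate {
    my ($text) = @_;
    # /e evaluates the replacement as code; words without a key are put back as-is
    $text =~ s{(\w+)}{ $h{$1} // $1 }ge;
    return $text;
}

say translate('cat and bird');   # prints "dog and bird"
```

Using exists $h{$1} ? $h{$1} : $1 instead would also keep keys whose value is undef.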

Perl tainting via regular expression

Short version
In the code below, $1 is tainted and I don't understand why.
Long version
I'm running Foswiki on a system with perl v5.14.2 with -T taint check mode enabled.
Debugging a problem with that setup, I managed to construct the following SSCCE. (Note that I edited this post, the first version was longer and more complicated, and comments still refer to that.)
#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");
Although the input string $var is untainted and the regular expression is fixed, the resulting capture group $1 is tainted, which I find really strange.
The perlsec manual has this to say about taint and regular expressions:
Values may be untainted by using them as keys in a hash; otherwise the
only way to bypass the tainting mechanism is by referencing
subpatterns from a regular expression match. Perl presumes that if
you reference a substring using $1, $2, etc., that you knew what you
were doing when you wrote the pattern.
I would imagine that even if the input were tainted, the output would still be untainted. To observe the reverse, tainted output from untainted input, feels like a strange bug in perl. But if one reads more of perlsec, it also points users at the SECURITY section of perllocale. There we read:
when use locale is in effect, Perl uses the tainting mechanism (see
perlsec) to mark string results that become locale-dependent, and
which may be untrustworthy in consequence. Here is a summary of the
tainting behavior of operators and functions that may be affected by
the locale:
Comparison operators (lt, le , ge, gt and cmp) […]
Case-mapping interpolation (with \l, \L, \u or \U) […]
Matching operator (m//):
Scalar true/false result never tainted.
Subpatterns, either delivered as a list-context result or as $1
etc. are tainted if use locale (but not use locale
':not_characters') is in effect, and the subpattern regular
expression contains \w (to match an alphanumeric character), \W
(non-alphanumeric character), \s (whitespace character), or \S
(non whitespace character). The matched-pattern variable, $&, $`
(pre-match), $' (post-match), and $+ (last match) are also
tainted if use locale is in effect and the regular expression contains
\w, \W, \s, or \S.
Substitution operator (s///) […]
[⋮]
This looks like it should be an exhaustive list. And I don't see how it could apply: My regex is not using any of \w, \W, \s or \S, so it should not depend on locale.
Can someone explain why this code taints the variable $1?

There currently is a discrepancy between the documentation as quoted in the question, and the actual implementation as of perl 5.18.1. The problem is character classes. The documentation mentions \w,\s,\W,\S in what sounds like an exhaustive list, while the implementation taints on pretty much every use of […].
The right solution would probably be somewhere in between: character classes like [[:word:]] should taint, since they depend on locale. My fixed list should not. Character ranges like [a-z] depend on collation, so in my personal opinion they should taint as well. \d depends on what a locale considers a digit, so it, too, should taint even if it is neither one of the escape sequences mentioned so far nor a bracketed class.
So in my opinion, both the documentation and the implementation need fixing. Perl devs are working on this. For progress information, please look at the perl bug report I filed.
For a fixed list of characters, one viable workaround appears to be a formulation as a disjunction, i.e. (?:\.|_) instead of [._]. It is more verbose, but should work even with the current (in my opinion buggy) perl versions.
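A quick sketch to check that the alternation form captures the same substrings as the bracketed class for the string from the question. It deliberately does not run under -T or use locale, so it says nothing about taint status, which depends on the Perl version in use.

```perl
use strict;
use warnings;

my $var = "foo.bar_baz";

# Bracketed class, as in the question
my @with_class = $var =~ m/^(.*)[._](.*?)$/;

# Alternation workaround described above
my @with_alt = $var =~ m/^(.*)(?:\.|_)(.*?)$/;

print "class: @with_class\n";
print "alt:   @with_alt\n";
```

To see the taint difference itself, the same two matches would have to be run with -T and use locale on an affected Perl, checking $1 with Scalar::Util's tainted as in the question.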

efficient way to search the same regular expression on multiple text

I have multiple text fields, each a paragraph of text, and I want to search those fields for a specific pattern using a regular expression, for example:
$text1 =~ /(my pattern)/ig;
$text2 =~ /(my pattern)/ig;
...
$textn =~ /(my pattern)/ig;
I wonder if there is an efficient way to search multiple texts with the same regular expression in Perl, or should I use the above format?

Use a topicaliser.
for ($text1, $text2, $textn) {
/(my pattern)/ig && do { ... };
}
When you have numbered variables, it's a red flag that you should consider a compound data structure instead. With a simple array it looks nearly the same:
for (@texts) {
/(my pattern)/ig && do { ... };
}

my $pattern = qr/((?i)my pattern)/;
my @matches;
push @matches, $text1 =~ /$pattern/g;
push @matches, $text2 =~ /$pattern/g;
push @matches, $textn =~ /$pattern/g;
That's about as efficient as I can think of - theoretically pre-compiles the regex once, though I'm not sure if interpolating it back into // to get the 'g' modifier undoes any of that compilation. Of course, I also have to wonder if this is really a bottleneck, and if you're just looking at some premature optimisation.
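Putting the two ideas together (one compiled pattern, texts held in a list), a sketch might look like this. The sub name collect_matches and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one precompiled pattern to any number of texts.
sub collect_matches {
    my ($re, @texts) = @_;
    my @matches;
    push @matches, /$re/g for @texts;   # list context: every capture from each text
    return @matches;
}

my $pattern = qr/(my pattern)/i;        # case-insensitive via the /i flag on qr
my @found = collect_matches(
    $pattern,
    'My Pattern here',
    'nothing to see',
    'my pattern twice: MY PATTERN',
);
print scalar(@found), "\n";             # prints 3
```

The flags given to qr// travel with the compiled pattern, so only /g has to be added at the point of use.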

The answer to this question depends on whether your pattern contains any variables. If it does not, perl is already smart enough to only build the RE once, as long as it's identical everywhere.
Now, if you do use variables, then @Tanktalus's answer is close, but adds unnecessary complexity, by compiling the RE an additional time.
Use this:
my @matches;
push @matches, $text1 =~ /((?i)my pattern with a $variable)/o;
push @matches, $text2 =~ /((?i)my pattern with a $variable)/o;
push @matches, $textn =~ /((?i)my pattern with a $variable)/o;
Why?
By using a variable in the RE pattern, perl is forced to re-compile for every instance, even when that variable is a pre-compiled RE as in @Tanktalus's answer. The /o ensures that it is only compiled once, the first time it's encountered, but it still must be compiled once for every occurrence in the code. This is because Perl has no way of knowing if $pattern changed between the different uses.
Other considerations
In practice, as @Tanktalus also said, I suspect this is a big fat case of premature optimization. /o only matters when your pattern contains variables (otherwise perl is smart enough to only compile once anyway!)
The far more useful reason to use a pre-compiled RE, as @Tanktalus has suggested, is to improve code readability. If you have a big hairy RE, then using $pattern everywhere will greatly improve readability, and with only a minor cost in performance (one you're not likely to ever notice).
Conclusion
Just use /o for your REs if they contain variables (unless you actually need the variables to change the RE on every run), and don't worry about it otherwise.

Any suggestions for improving (optimizing) existing string substitution in Perl code?

Perl 5.8
Improvements for fairly straightforward string substitutions, in an existing Perl script.
The intent of the code is clear, and the code is working.
For a given string, replace every occurrence of a TAB, LF or CR character with a single space, and replace every occurrence of a double quote with two double quotes. Here's a snippet from the existing code:
# replace all tab, newline and return characters with single space
$val01 =~s/[\t\n\r]/ /g;
$val02 =~s/[\t\n\r]/ /g;
$val03 =~s/[\t\n\r]/ /g;
# escape all double quote characters by replacing with two double quotes
$val01 =~s/"/""/g;
$val02 =~s/"/""/g;
$val03 =~s/"/""/g;
Question:Is there a better way to perform these string manipulations?
By "better way", I mean to perform them more efficiently, avoiding use of regular expressions (possibly using tr/// to replace the tab, newline and CR characters), or possibly using qr// to avoid recompilation.
NOTE: I've considered moving the string manipulation operations to a subroutine, to reduce the repetition of the regular expressions.
NOTE: This code works, it isn't really broken. I just want to know if there is a more appropriate coding convention.
NOTE: These operations are performed in a loop, a large number (>10000) of iterations.
NOTE: This script currently executes under perl v5.8.8. (The script has a require 5.6.0, but this can be changed to require 5.8.8. Installing a later version of Perl is not currently an option on the production server.)
> perl -v
This is perl, v5.8.8 built for sun4-solaris-thread-multi
(with 33 registered patches, see perl -V for more detail)

Your existing solution looks fine to me.
As for avoiding recompilation, you don't need to worry about that. Perl's regular expressions are compiled only once as it is, unless they contain interpolated expressions, which yours don't.
For the sake of completeness, I should mention that even if interpolated expressions are present, you can tell Perl to compile the regex once only by supplying the /o flag.
$var =~ s/foo/bar/; # compiles once
$var =~ s/$foo/bar/; # recompiles only when $foo has changed
$var =~ s/$foo/bar/o; # compiles once, using the value $foo has
# the first time the expression is evaluated

TMTOWTDI
You could use the tr or the index or the substr or the split functions as alternatives. But you must make measurements to identify the best method for your particular system.

You might be prematurely optimizing. Have you tried using a profiler, such as Devel::NYTProf, to see where your program spends the most of its time?

My guess would be that tr/// would be (slightly) quicker than s/// in your first regex. How much faster would, of course, be determined by factors that I don't know about your program and your environment. Profiling and benchmarking will answer that question.
But if you're interested in any kind of improvement to your code, can I suggest a maintainability fix? You run the same substitution (or set of substitutions) on three variables. This means that when you change that substitution, you need to change it three times - and doing the same thing three times is always dangerous :)
You might consider refactoring the code to look something like this:
foreach ($val01, $val02, $val03) {
s/[\t\n\r]/ /g;
s/"/""/g;
}
Also, it would probably be a good idea to have those values in an array rather than three such similarly named variables.
foreach (@vals) {
s/[\t\n\r]/ /g;
s/"/""/g;
}
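Combining the tr/// guess with the loop refactoring above gives something like the following sketch. The sub name clean_fields is hypothetical, and whether tr/// is measurably faster here is exactly what a benchmark on the target system would have to show.

```perl
use strict;
use warnings;

# Hypothetical helper: returns cleaned copies of its arguments.
sub clean_fields {
    my @vals = @_;
    for (@vals) {
        tr/\t\n\r/ /;     # each tab, LF, or CR becomes one space
        s/"/""/g;         # double every double quote
    }
    return @vals;
}

my ($field) = clean_fields(qq{say "hi"\tnow\r\n});
print "[$field]\n";       # prints [say ""hi"" now  ]
```

tr/// pads a shorter replacement list with its last character, which is why a single space is enough for all three control characters.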
