efficient way to search the same regular expression on multiple text - regex

I have multiple text fields, each a paragraph of text, and I want to search those fields for a specific pattern using a regular expression. For example:
$text1 =~ /(my pattern)/ig;
$text2 =~ /(my pattern)/ig;
...
$textn =~ /(my pattern)/ig;
Is there an efficient way in Perl to search multiple texts with the same regular expression, or should I use the format above?

Use a topicaliser.
for ($text1, $text2, $textn) {
/(my pattern)/ig && do { ... };
}
When you have numbered variables, it's a red flag that you should consider a compound data structure instead. With a simple array it looks nearly the same:
for (@texts) {
/(my pattern)/ig && do { ... };
}

my $pattern = qr/((?i)my pattern)/;
my @matches;
push @matches, $text1 =~ /$pattern/g;
push @matches, $text2 =~ /$pattern/g;
push @matches, $textn =~ /$pattern/g;
That's about as efficient as I can think of - theoretically pre-compiles the regex once, though I'm not sure if interpolating it back into // to get the 'g' modifier undoes any of that compilation. Of course, I also have to wonder if this is really a bottleneck, and if you're just looking at some premature optimisation.
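A minimal sketch of the suggestions so far, combined: the pattern is compiled once with qr// and then applied to every element of an array. The sample texts and the name @texts are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data; in real code these would be your text fields.
my @texts = (
    'My Pattern appears here, and my pattern appears again.',
    'Nothing relevant in this one.',
    'One more: MY PATTERN.',
);

# Compile the pattern once; /i makes it case-insensitive.
my $pattern = qr/(my pattern)/i;

# In list context, /g returns every captured match in the string.
my @matches;
for my $text (@texts) {
    push @matches, $text =~ /$pattern/g;
}

print scalar(@matches), " matches: @matches\n";
```

Whether this is measurably faster than repeating the literal pattern is something only a benchmark on your data can tell; the main gain is that the pattern is written in one place.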

The answer to this question depends on whether your pattern contains any variables. If it does not, perl is already smart enough to only build the RE once, as long as it's identical everywhere.
Now, if you do use variables, then @Tanktalus's answer is close, but adds unnecessary complexity, by compiling the RE an additional time.
Use this:
my @matches;
push @matches, $text1 =~ /((?i)my pattern with a $variable)/o;
push @matches, $text2 =~ /((?i)my pattern with a $variable)/o;
push @matches, $textn =~ /((?i)my pattern with a $variable)/o;
Why?
By using a variable in the RE pattern, perl is forced to re-compile for every instance, even when that variable is a pre-compiled RE as in @Tanktalus's answer. The /o ensures that it is only compiled once, the first time it's encountered, but it still must be compiled once for every occurrence in the code. This is because Perl has no way of knowing if $pattern changed between the different uses.
Other considerations
In practice, as @Tanktalus also said, I suspect this is a big fat case of premature optimization. /o only matters when your pattern contains variables (otherwise perl is smart enough to only compile once anyway!)
The far more useful reason to use a pre-compiled RE, as @Tanktalus has suggested, is to improve code readability. If you have a big hairy RE, then using $pattern everywhere will greatly improve readability, with only a minor cost in performance (one you're not likely to ever notice).
Conclusion
Just use /o for your REs if they contain variables (unless you actually need the variables to change the RE on every run), and don't worry about it otherwise.

Related

What does `/regex/o` really mean (once there was once, but it seems gone now)?

(sorry for the title, but this "feature" really confuses me)
Learning Perl, I learned that a regular expression using variables together with the o modifier is evaluated only once, even if the variable changes after the initial evaluation.
Initially that reads as unproblematic, since it is clearly specified.
Obviously that initial evaluation cannot happen before the variable being used has got its value.
Now qr made life a bit more interesting.
Consider this code (executed in a loop defining other variables, too):
{
my $n = $name;
$n =~ s/[^\w\.-]/_/g;
$n = qr:^${n}\@${another_variable}$:o;
@a = grep { !/$n/ } @a;
}
When using the regex for qr directly, one could argue that the regex is compiled only once, even if the scope with the variable goes out of scope (is going out of scope being considered as a change of the variable?)
But when using qr to build a regex, assigning it to a lexical variable, the compiled regex would go out of scope, so I was expecting that the regex cannot be reused and would be re-built (The basic idea was that the regex inside grep shouldn't be rebuilt for every iteration).
As life is cruel, it seems the whole regex referenced by $n isn't ever rebuilt, so the first value is used until the program stops.
Interestingly, perlre(1) in Perl 5.18.2 (the version being used) no longer mentions the o modifier, and in Perl 5.26.1 the corresponding page says:
o - pretend to optimize your code, but actually introduce bugs
So can anybody explain the rules for "once" evaluation (and whether the semantics had changed over the lifespan of Perl)?
Related:
Does the 'o' modifier for Perl regular expressions still provide any benefit?
How do you force Perl to re-compile a regex compiled with "/o" on demand?
Perl has a couple of constructs that don't just store state in variables, but also some states in the opcodes themselves. Aside from /o regex patterns, this also includes the .. flip-flop operator (in scalar context) or state variables.
Perhaps state variables are clearest, since they correspond to static local variables in many other languages (e.g. C). A state variable is initialized at most once during the lifetime of the program. An expression state $var = initialize() can be understood as
my $var;
if (previously_initialized) {
$var = cached_value;
} else {
$var = initialize();
}
This does not track dependencies in the initialize() expression, but only evaluates it once.
Similarly, it can make sense to consider a regex pattern /.../o as a kind of hidden state variable state $compiled_pattern = qr/.../.
The /o feature was a good idea a very long time ago when regexes were compiled on the fly, similarly to how it works in other languages where regex patterns are provided to a search function as a string.
It hasn't been necessary for performance purposes for a long time, and only has an effect when doing variable interpolation. But if you actually want that behaviour, using a state variable would communicate that intent more clearly. Thus, I'd argue that there is no appropriate use for the /o modifier.
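As a sketch of the state-variable alternative described above (requires Perl 5.10 or later; the helper name replace_first is made up for illustration), the pattern is built from the first value ever passed in and then reused, which mirrors what /o does but makes the intent visible:

```perl
use strict;
use warnings;
use feature qw(say state);

# Hypothetical helper: $re is initialized on the first call only,
# so later values of $v are ignored, just as they would be with /o.
sub replace_first {
    my ($str, $v) = @_;
    state $re = qr/$v/;
    (my $copy = $str) =~ s/$re/X/;
    return $copy;
}

say replace_first('a1', $_) for qw(a b c);    # prints X1 three times
```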
The "largely obsolete /o" (perlop) flag still has the "once" meaning and operation. While it is mentioned only in passing in perlre, it is addressed in perlop:
/PATTERN/msixpodualngc
...
... Perl will not recompile the pattern unless an interpolated variable that it contains changes. You can force Perl to skip the test and never recompile by adding a /o (which stands for "once") after the trailing delimiter. Once upon a time, Perl would recompile regular expressions unnecessarily, and this modifier was useful to tell it not to do so, in the interests of speed. But now, the only reasons to use /o are one of:
The variables are thousands of characters long and you know that they don't change, and you need to wring out the last little bit of speed by having Perl skip testing for that. (There is a maintenance penalty for doing this, as mentioning /o constitutes a promise that you won't change the variables in the pattern. If you do change them, Perl won't even notice.)
you want the pattern to use the initial values of the variables regardless of whether they change or not. (But there are saner ways of accomplishing this than using /o.)
If the pattern contains embedded code, such as
use re 'eval';
$code = 'foo(?{ $x })';
/$code/
then perl will recompile each time, even though the pattern string hasn't changed, to ensure that the current value of $x is seen each time. Use /o if you want to avoid this.
The bottom line is that using /o is almost never a good idea.
So, indeed, apparently it won't even test whether variables to interpolate changed, and this may have a legitimate use. But, indeed, all told it probably shouldn't be used.
An example to demonstrate the "once" operation
perl -Mstrict -wE'
sub tt {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say tt( q(a1), $_ ) for qw(a b c)'
With the /o, either on the qr-ed pattern or on the regex, this matches (changes) that a1 string every time even though a is passed for the pattern only in the first iteration. Clearly the pattern isn't recompiled since the variable later has b and then c and shouldn't match.
Without /o only the first iteration has the regex matching.
In every iteration the lexical $re clearly goes away, with the whole function, but its originally compiled value keeps getting used. This means that some of the operation of /o is "very" global. Such stubbornly global behavior is not unheard of with some other ancient features (glob comes to mind, for one).
It goes the same way if the sub is a code reference in a lexical variable, remade all the time in a dynamic scope.
perl -Mstrict -wE'
for my $c (qw(a b c)) {
my $tt = sub {
my ($str, $v) = @_;
my $re = qr/$v/o;
$str =~ s/$re/X/;
return $str
};
say $tt->( q(a1), $c )
}'
Prints X1 all three times.
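For contrast, a sketch of the same function without /o, written as a standalone script rather than a one-liner. Here the pattern is rebuilt from the current $v on every call, so only the first call matches:

```perl
use strict;
use warnings;
use feature 'say';

sub tt {
    my ($str, $v) = @_;
    my $re = qr/$v/;      # no /o: rebuilt from the current $v on each call
    $str =~ s/$re/X/;
    return $str;
}

say tt('a1', $_) for qw(a b c);    # prints X1, a1, a1
```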

perl using constant in regex

I'm wondering about using constants in perl regex's. I want to do something similar to:
use constant FOO => "foo";
use constant BAR => "bar";
$somvar =~ s/prefix1_FOO/prefix2_BAR/g;
of course, in there, FOO resolves to the three letters F O O instead of expanding to the constant.
I looked online, and someone was suggesting using either ${\FOO} or @{[FOO]}. Someone else mentioned (?{FOO}). I was wondering if anyone could shed some light on which of these is correct, and if there's any advantage to any of them. Alternatively, is it better to just use a non-constant variable? (Performance is a factor in my case.)
There's not much in the way of reasons to use a constant over a variable. It doesn't make a great deal of difference - perl will compile a regex anyway.
For example:
#!/usr/bin/perl
use warnings;
use strict;
use Benchmark qw(:all);
use constant FOO => "foo";
use constant BAR => "bar";
my $FOO_VAR = 'foo';
my $BAR_VAR = 'bar';
sub pattern_replace_const {
my $somvar = "prefix1_foo test";
$somvar =~ s/prefix1_${\FOO}/prefix2_${\BAR}/g;
}
sub pattern_replace_var {
my $somvar = "prefix1_foo test";
$somvar =~ s/prefix1_$FOO_VAR/prefix2_$BAR_VAR/g;
}
cmpthese(
1_000_000,
{ 'const' => \&pattern_replace_const,
'var' => \&pattern_replace_var
}
);
Gives:
          Rate const   var
const 917095/s    --   -1%
var   923702/s    1%    --
Really not enough in it to worry about.
However, it may be worth noting that you can compile a regex with qr// and do it that way, which, provided the RE is static, might improve performance (but it might not, because perl can detect static regexes and does that itself).
             Rate   var const compiled
var      910498/s    --   -2%      -9%
const    933097/s    2%    --      -7%
compiled 998502/s   10%    7%       --
With code like:
my $compiled_regex = qr/prefix1_$FOO_VAR/;
sub compiled_regex {
my $somvar = "prefix1_foo test";
$somvar =~ s/$compiled_regex/prefix2_$BAR_VAR/g;
}
Honestly though - this is a micro optimisation. The regex engine is fast compared to your code, so don't worry about it. If performance is critical to your code, then the correct way of dealing with it is first write the code, and then profile it to look for hotspots to optimise.
The problem shown is due to those constants being barewords (built at compile time):
Constants defined using this module cannot be interpolated into strings like variables.
In the current implementation (of the constant pragma) they are "inlinable subroutines" (see †).
This problem can be solved nicely by using a module like Const::Fast
use Const::Fast;
const my $foo => 'FOO';
const my $bar => 'BAR';
my $var = 'prefix1_FOO_more';
$var =~ s/prefix1_$foo/prefix2_$bar/g;
Now they do get interpolated. Note that more complex replacements may need /e.
These are built at runtime so you can assign results of expressions to them. In particular, you can use the qr operator, for example
const my $patt => qr/$foo/i; # case-insensitive
The qr is the recommended way to build regex patterns. (It interpolates unless the delimiter is '.) The performance gain is most often tiny, but you get a proper regular expression, which can be built and used as such (and in this case a constant as well).
I recommend Const::Fast module over the other one readily, and in fact over all others at this time. See a recent article with a detailed discussion of both. Here is a review of many other options.
I strongly recommend to use a constant (of your chosen sort) for things meant to be read-only. That is good for the health of the code, and of developers who come into contact with it (yourself in the proverbial six months included).
† These being subroutines, we need to be able to run code in order to have them evaluated and replaced by given values. Can't just "interpolate" (evaluate) a variable -- it's not a variable.
A way to run code inside a string (which need be interpolated, so effectively double quoted) is to de-reference, where there's an expression in a block under a reference; then the expression is evaluated. So we need to first make that reference. So either
say "@{ [FOO] }"; # make array reference, then dereference
or
say "${ \FOO }"; # make scalar reference then dereference
prints foo. See the docs for why this works and for examples. Thus one can do the same inside a regex, and both in matching and replacement parts
s/prefix1_${\FOO}/prefix2_${\BAR}/g;
(or with @{[...]}), since they are evaluated as interpolated strings.
Which is "better"? These are tricks. There is rarely, if ever, a need for doing this. It has a very good chance to confuse the reader. So I just wouldn't recommend resorting to these at all.
As for (?{ code }), that is a regex feature, whereby code is evaluated inside a pattern (matching side only). It is complex and tricky and very rarely needed. See about it in perlretut and in perlre.
Discussing speed of these things isn't really relevant. They are certainly outside the realm of clean and idiomatic code, while you'd be hard pressed to detect runtime differences.
But if you must use one of these, I'd much rather interpolate inside a scalar reference via a trick than reach for a complex regex feature.
According to PerlMonks, you are better off creating an already-interpolated string if you are concerned about performance:
use constant PATTERN => 'def';
my $regex = qr/${\(PATTERN)}/; #options such as /m can go here.
if ($string =~ $regex) { ... }
Here is the link to the whole discussion.
use constant FOO => "foo";
use constant BAR => "bar";
$var =~ s/prefix1_${\FOO}/prefix2_${\BAR}/g;
Credit: https://www.perlmonks.org/?node_id=293323

Any suggestions for improving (optimizing) existing string substitution in Perl code?

Perl 5.8
Improvements for fairly straightforward string substitutions, in an existing Perl script.
The intent of the code is clear, and the code is working.
For a given string, replace every occurrence of a TAB, LF or CR character with a single space, and replace every occurrence of a double quote with two double quotes. Here's a snippet from the existing code:
# replace all tab, newline and return characters with single space
$val01 =~s/[\t\n\r]/ /g;
$val02 =~s/[\t\n\r]/ /g;
$val03 =~s/[\t\n\r]/ /g;
# escape all double quote characters by replacing with two double quotes
$val01 =~s/"/""/g;
$val02 =~s/"/""/g;
$val03 =~s/"/""/g;
Question: Is there a better way to perform these string manipulations?
By "better way", I mean performing them more efficiently, for example by avoiding regular expressions (possibly using tr/// to replace the tab, newline and return characters), or by using qr// to avoid recompilation.
NOTE: I've considered moving the string manipulation operations to a subroutine, to reduce the repetition of the regular expressions.
NOTE: This code works, it isn't really broken. I just want to know if there is a more appropriate coding convention.
NOTE: These operations are performed in a loop, a large number (>10000) of iterations.
NOTE: This script currently executes under perl v5.8.8. (The script has a require 5.6.0, but this can be changed to require 5.8.8. Installing a later version of Perl is not currently an option on the production server.)
> perl -v
This is perl, v5.8.8 built for sun4-solaris-thread-multi
(with 33 registered patches, see perl -V for more detail)
Your existing solution looks fine to me.
As for avoiding recompilation, you don't need to worry about that. Perl's regular expressions are compiled only once as it is, unless they contain interpolated expressions, which yours don't.
For the sake of completeness, I should mention that even if interpolated expressions are present, you can tell Perl to compile the regex once only by supplying the /o flag.
$var =~ s/foo/bar/; # compiles once
$var =~ s/$foo/bar/; # re-checked each time, recompiled when $foo changes
$var =~ s/$foo/bar/o; # compiles once, using the value $foo has
# the first time the expression is evaluated
TMTOWTDI
You could use the tr or the index or the substr or the split functions as alternatives. But you must make measurements to identify the best method for your particular system.
You might be prematurely optimizing. Have you tried using a profiler, such as Devel::NYTProf, to see where your program spends the most of its time?
My guess would be that tr/// would be (slightly) quicker than s/// in your first regex. How much faster would, of course, be determined by factors that I don't know about your program and your environment. Profiling and benchmarking will answer that question.
But if you're interested in any kind of improvement to your code, can I suggest a maintainability fix? You run the same substitution (or set of substitutions) on three variables. This means that when you change that substitution, you need to change it three times - and doing the same thing three times is always dangerous :)
You might consider refactoring the code to look something like this:
foreach ($val01, $val02, $val03) {
s/[\t\n\r]/ /g;
s/"/""/g;
}
Also, it would probably be a good idea to have those values in an array rather than three such similarly named variables.
foreach (@vals) {
s/[\t\n\r]/ /g;
s/"/""/g;
}
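The tr/// suggestion from earlier in this thread can be combined with the loop above. A minimal sketch, using made-up sample values in place of the real data; it only relies on features available in Perl 5.8:

```perl
use strict;
use warnings;

# Hypothetical sample values standing in for $val01 .. $val03.
my @vals = ("a\tb", "line1\r\nline2", 'say "hi"');

for (@vals) {
    tr/\t\n\r/ /;    # each TAB, LF or CR becomes one space
    s/"/""/g;        # tr/// cannot do this: one character becomes two
}

print "[$_]\n" for @vals;
```

tr/// maps characters one-to-one and involves no regex engine, so it only fits the first substitution. Whether it is noticeably faster on your data is a question for a benchmark or profiler.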

Lengthy perl regex

This may seem a somewhat odd question, but to the point:
I have a string that I need to search for many possible character sequences in several combinations (so character classes are out of the question). What would be the most efficient way to do this?
I was thinking of either stacking it all into one regex:
if ($txt =~ /^(?:really |really |long | regex here)$/){}
or using several 'smaller' comparisons, though I'd assume this won't be very efficient:
if ($txt =~ /^regex1$/ || $txt =~ /^regex2$/ || $txt =~ /^regex3$/) {}
or perhaps nest several if comparisons.
I will appreciate any extra suggestions and other input on this issue.
Thanks
Ever since way back in v5.9.2, Perl compiles a set of N alternatives like:
/string1|string2|string3|string4|string5|.../
into a trie data structure, and if that is the first thing in the pattern, even uses Aho–Corasick matching to find the start point very quickly.
That means that your match of N alternatives will now run in O(1) time instead of in the O(N) time that this:
if (/string1/ || /string2/ || /string3/ || /string4/ || /string5/ || ...)
will run in.
So you can have O(1) or O(N) performance: your choice.
If you use re "debug" or -Mre-debug, Perl will show these trie structures in your patterns.
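When the alternatives are literal strings held in a list, the single pattern can be built programmatically. A minimal sketch, assuming a made-up word list (@words is hypothetical); quotemeta escapes any regex metacharacters in the strings:

```perl
use strict;
use warnings;

# Hypothetical list of literal strings to look for.
my @words = ('really', 'long', 'regex here');

# Escape metacharacters, then join into one anchored alternation.
my $alternation = join '|', map { quotemeta } @words;
my $re = qr/^(?:$alternation)$/;

for my $txt ('long', 'longer', 'regex here') {
    print "$txt: ", ($txt =~ $re ? 'match' : 'no match'), "\n";
}
```

Because the alternatives are plain strings, this is the kind of pattern the trie optimization described above applies to.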
This will not replace some timing tests. I would suggest using the o flag if possible, so that Perl doesn't recompile your (large) regex on every evaluation. Of course, this is only possible if those combinations of characters do not change between evaluations.
I think it depends on how long your regex is. Sometimes it is better to divide very long expressions.

Why is my Perl regex using so much memory?

I'm running a regular expression against a large scalar. Though this match isn't capturing anything, my process grows by 30M after this match:
# A
if (${$c} =~ m/\G<<\s*/cgs)
{
#B
...
}
$c is a reference to a pretty big scalar (around 21M), but I've verified that pos(${$c}) is in the right place and the expression matches at the first character, with pos(${$c}) being updated to the correct place after the match. But as I mentioned, the process has grown by about 30M between #A and #B, even though I'm not capturing anything with this match. Where is my memory going?
Edit: Yes, use of $& was to blame. We are using Perl 5.8.8, and my script was using Getopt::Declare, which uses the built-in Text::Balanced. The 1.95 version of this module was using $&. The 2.0.0 version that ships with Perl 5.10 has removed the reference to $& and alleviates the problem.
Just a quick sanity check, are you mentioning $&, $` or $' (sometimes called $MATCH, $PREMATCH and $POSTMATCH) anywhere in your code? If so, Perl will copy your entire string for every regular expression match, just in case you want to inspect those variables.
"In your code" in this case means indirectly, including using modules that reference these variables, or writing use English rather than use English qw( -no_match_vars ).
If you're not sure, you can use the Devel::SawAmpersand module to determine if they have been used, and Devel::FindAmpersand to figure out where they are used.
There may be other reasons for the increase in memory (which version of Perl are you using?), but the match variables will definitely blow your memory if they're used, and hence are a likely culprit.
Cheerio,
Paul
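If code needs the matched text but should stay clear of $&, the offsets in the built-in arrays @- and @+ can be used instead. A minimal sketch with a made-up sample string:

```perl
use strict;
use warnings;

my $text = 'header << body';    # hypothetical sample input

my ($start, $matched);
if ($text =~ /<<\s*/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match,
    # so the matched text can be recovered without touching $&.
    $start   = $-[0];
    $matched = substr($text, $start, $+[0] - $start);
    print "matched '$matched' at offset $start\n";
}
```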