Unexpected speed behaviour when benchmarking Perl regexs

Unexpected speed behaviour when benchmarking Perl regexs - regex

Whilst discussing the relative merits of using index() in Perl to search for substrings I decided to write a micro benchmark to prove what I had seen before than index is faster than regular expressions when looking for a substring. Here is the benchmarking code:
use strict;
use warnings;
use Benchmark qw(:all);
my #random_data;
for (1..100000) {
push(#random_data, int(rand(1000)));
}
my $warn_about_counts = 0;
my $count = 100;
my $search = '99';
cmpthese($count, {
'Using regex' => sub {
my $instances = 0;
my $regex = qr/$search/;
foreach my $i (#random_data) {
$instances++ if $i =~ $regex;
}
warn $instances if $warn_about_counts;
return;
},
'Uncompiled regex with scalar' => sub {
my $instances = 0;
foreach my $i (#random_data) {
$instances++ if $i =~ /$search/;
}
warn $instances if $warn_about_counts;
return;
},
'Uncompiled regex with literal' => sub {
my $instances = 0;
foreach my $i (#random_data) {
$instances++ if $i =~ /99/;
}
warn $instances if $warn_about_counts;
return;
},
'Using index' => sub {
my $instances = 0;
foreach my $i (#random_data) {
$instances++ if index($i, $search) > -1;
}
warn $instances if $warn_about_counts;
return;
},
});
What I was surprised at was how these performed (using Perl 5.10.0 on a recent MacBook Pro). In descending order of speed:
Uncompiled regex with literal (69.0 ops/sec)
Using index (61.0 ops/sec)
Uncompiled regex with scalar (56.8 ops/sec)
Using regex (17.0 ops/sec)
Can anyone offer an explanation as to what voodoo Perl is using to get the speed of the two uncomplied regular expressions to perform as well as the index operation? Is it an issue in the data I've used to generate the benchmark (looking for the occurrence of 99 in 100,000 random integers) or is Perl able to do a runtime optimisation?

Wholesale revision
In light of #Ven'Tatsu's comment, I changed the benchmark a bit:
use strict; use warnings;
use Benchmark qw(cmpthese);
use Data::Random qw( rand_words );
use Data::Random::WordList;
my $wl = Data::Random::WordList->new;
my #data_1 = (rand_words( size => 10000 )) x 10;
my #data_2 = #data_1;
my $pat = 'a(?=b)';
my $re = qr/^$pat/;
cmpthese(1, {
'qr/$search/' => sub {
my $instances = grep /$re/, #data_1;
return;
},
'm/$search/' => sub {
my $search = 'a(?=b)';
my $instances = grep /^$search/, #data_2;
return;
},
});
On Windows XP with ActiveState perl 5.10.1:
Rate qr/$search/ m/$search/
qr/$search/ 5.40/s -- -73%
m/$search/ 20.1/s 272% --
On Windows XP with Strawberry perl 5.12.1:
Rate qr/$search/ m/$search/
qr/$search/ 6.42/s -- -66%
m/$search/ 18.6/s 190% --
On ArchLinux with bleadperl:
Rate qr/$search/ m/$search/
qr/$search/ 9.25/s -- -38%
m/$search/ 14.8/s 60% --

Well, your case "Using regex" is so slow because you are compiling it each time. Try moving it out of the subroutine.

Perl optimizes a lot of things. Your pattern with no special regex features and literal characters allows perl's regex engine to simplify many things. Using use re 'debug' can show you what's actually happening behind the scenes.

Related

Regex performance

I am benchmarking different approaches to RegEx and seeing something I really don't understand. I am specifically comparing using the -match operator vs using the [regex]::Matches() accelerator.
I started with
(Measure-Command {
foreach ($i in 1..10000) {
$path -match $testPattern
}
}).TotalSeconds
(Measure-Command {
foreach ($i in 1..10000) {
[regex]::Matches($path, $testPattern)
}
}).TotalSeconds
and -match is always very slightly faster. But it's also not apples to apples because I need to assign the [Regex] results to a variable to use it. So I added that
(Measure-Command {
foreach ($i in 1..10000) {
$path -match $testPattern
}
}).TotalSeconds
(Measure-Command {
foreach ($i in 1..10000) {
$test = [regex]::Matches($path, $testPattern)
}
}).TotalSeconds
And now [Regex] is consistently slightly faster, which makes no sense because I added to the workload with the variable assignment. The performance difference is ignorable, 1/100th of a second when doing 10,000 matches, but I wonder what is going on under the hood to make [Regex] faster when there is a variable assignment involved?
For what it's worth, without the variable assignment -match is faster, .05 seconds vs .03 seconds. With variable assignment [Regex] is faster by .03 seconds vs .02 seconds. So while it IS all negligible, adding the variable cuts [Regex] processing time more than in half, which is a (relatively) huge delta.

The outputs of both tests are different.
The accelerator output a lot more text.
Even though they are not displayed when wrapped in the Measure-Command cmdlet, they are part of the calculation.
Output of $path -match $testPattern
$true
Output of [regex]::Matches($path,$testPattern
Groups : {0}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 0
Value :
Writing stuff is slow.
In your second example, you take care of the accelerator output by assigning it to a variable. That's why it is significantly faster.
You can see the difference without assignment by voiding the outputs
If you do that, you'll see the accelerator is consistently slightly faster.
(Measure-Command {
foreach ($i in 1..10000) {
[void]($path -match $testPattern)
}
}).TotalSeconds
(Measure-Command {
foreach ($i in 1..10000) {
[void]([regex]::Matches($path, $testPattern))
}
}).TotalSeconds
Additional note
void is always more efficient than Command | Out-null.
Pipeline is slower but memory efficient.

This isn't an answer to the direct question asked, but it's an expansion on the performance of pre-compiled regexes that I mentioned in comments...
First, here's my local performance benchmark for the original code in the question for comparison (with some borrowed text and patterns):
$text = "foo" * 1e6;
$pattern = "f?(o)";
$count = 1000000;
# example 1
(Measure-Command {
foreach ($i in 1..$count) {
$text -match $pattern
}
}).TotalSeconds
# 8.010825
# example 2
(Measure-Command {
foreach ($i in 1..$count) {
$result = [regex]::Matches($text, $pattern)
}
}).TotalSeconds
# 6.8186813
And then using a pre-compiled regex, which according to Compilation and Reuse in Regular Expressions emits a native assembly to process the regex rather than the default "sequence of internal instructions" - whatever that actually means :-).
$text = "foo" * 1e6;
$pattern = "f?(o)";
$count = 1000000;
# example 3
$regex = [regex]::new($pattern, "Compiled");
(Measure-Command {
foreach ($i in 1..$count) {
$result = $regex.Matches($text)
}
}).TotalSeconds
# 5.8794981
# example 4
(Measure-Command {
$regex = [regex]::new($pattern, "Compiled");
foreach ($i in 1..$count) {
$result = $regex.Matches($text)
}
}).TotalSeconds
# 3.6616832
# example 5
# see https://github.com/PowerShell/PowerShell/issues/8976
(Measure-Command {
& {
$regex = [regex]::new($pattern, "Compiled");
foreach ($i in 1..$count) {
$result = $regex.Matches($text);
}
}
}).TotalSeconds
# 1.5474028
Note that Example 3 has a performance overhead of finding / resolving the $regex variable from inside each iteration because it's defined outside the Measure-Command's -Expresssion scriptblock - see https://github.com/PowerShell/PowerShell/issues/8976 for details.
Example 5 defines the variable inside a nested scriptblock and so is a lot faster. I'm not sure why Example 4 sits in between the two in performance, but it's useful to note there's a definite difference :-)
Also, as an aside, in my comments above, my original version of Example 5 didn't have the &, which meant I was timing the effort required to define the scriptblock, not execute it, so my numbers were way off. In practice, the performance increase is a lot less than my comment suggested, but it's still a decent improvement if you're executing millions of matches in a tight loop...

In perl, Is there a more compact way to search for a number of patterns, and for each one, substitute with an expression

In perl, I am reading a line and trying to replace a set of strings with corresponding expressions using a sequence of if statements. For example:
my #shiftInstructions=("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)]
$line =~ s/\$ish/$r/;
}
}
I don't like this approach for a number of reasons. First, it is repetitive. I have to first check if the pattern exists, and then if it does, execute a function to generate a replacement value, then substitute. So it is both verbose, and slow (2 regex searches per pattern, perhaps eventually dozens of pattern strings).
I thought of a map where a number of codes are mapped to corresponding code to execute.
I can imagine mapping to a string and then using eval but then I can't check the code except at runtime. Is there any cleaner way of doing this?
I found the execute option in regex. What about writing a set of subroutines to process each regex, then creating a mapping:
my %regexMap = (
"\$fn", &foundFunc,
"\$hw", &hex8,
"\$hb", &hex2,
"\$sh", &rand6,
"\$ish", &shiftInst,
);
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish|)/$regexMap{$1}/e;
print $line;
}

if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
is a poor way of writing
$line =~ s/\$sh/ int(rand(6)) /e;
So
my #shiftInstructions=("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)]
$line =~ s/\$ish/$r/;
}
print($line);
}
can be written as
my #shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$sh/ int(rand(6)) /e;
$line =~ s/\$ish/ $shiftInstructions[rand(#shiftInstructions)] /e;
print($line);
}
But that means you are scanning the string over and over again. Let's avoid that.
my #shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$(sh|ish)/
if ( $1 eq "sh" ) { int(rand(6)) }
elsif ( $1 eq "ish" ) { $shiftInstructions[rand(#shiftInstructions)] }
/eg;
print($line);
}
Unfortunately, that reintroduces repetition. We can solve that using a dispatch table.
my #shiftInstructions = qw( lsr lsl rol ror );
my %replacements = (
sh => sub { int(rand(6)) },
ish => sub { $shiftInstructions[rand(#shiftInstructions)] },
);
my $alt = join '|', map quotemeta, keys(%replacements);
my $re = qr/\$($alt)/;
while (my $line = <>) {
print $line =~ s/$re/ $replacements{$1}->() /reg;
}
Now we have an efficient solution that can be extended without slowing down the matching, all while avoiding repetition.
The solution you added to your question was close, but it had two bugs.
&foo calls foo. To get a reference to it, use \&foo.
my %regexMap = (
"\$fn", \&foundFunc,
"\$hw", \&hex8,
"\$hb", \&hex2,
"\$sh", \&rand6,
"\$ish", \&shiftInst,
);
$regexMap{$1} now returns the reference. You want to call the referenced sub, which can be done using $regexMap{$1}->().
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish|)/ $regexMap{$1}->() /e;
print $line;
}

In these cases, I often make some sort of data structure that holds the patterns and their actions:
my #tuples = (
[ qr/.../, sub { ... } ]
[ ... ].
);
Now the meat of the process stays the same no matter how many patterns I want to try:
while( <> ) {
foreach $tuple ( #tuples ) {
$tuple->[1]() if /$tuple[0]/
}
}
Abstract this a little further with a subroutine that takes the data structure. Then you can pass it different tables depending on what you would like to do:
sub some_sub {
my #tuples = #_;
while( <> ) {
foreach $tuple ( #tuples ) {
$tuple->[1]() if /$tuple[0]/
}
}
}
I've written about this sort of thing in Mastering Perl and Effective Perl Programming, and it's the sort of thing that does into my obscure modules like Brick and Data::Constraint.
I've been thinking about this more, and I wonder if regexes are actually part of what you are trying to do. It looks like you are matching literal strings, but using the match operator to do it. You don't give details of the input, so I'm guessing here—it looks like there's an operation (e.g. $fn, and you want to match exactly that operation. The problem is finding that string then mapping it onto code. That looks something like this (and ikegami's answer is another form of this idea). Instead of an alternation, I match anything that might look like the string:
while( <> ) {
# find the string. Need example input to guess better
if( m/(\$[a-z]+)/ ) {
$table{$1}->() if exists $table{$1};
}
}
But again, it's dependent on the input, how many actual substrings you might want to match (so, the number of branches in an alternation), how many lines you want to process, and so on. There was a wonderful talk about processing apache log files with Regex::Trie and the various experiments they tried to make things faster. I've forgotten all the details, but very small adjustments made noticeable differences over tens of millions of lines.
Interesting reading:
Maybe this talk? An exploration of trie regexp matching
http://taint.org/2006/07/07/184022a.html
Matching a long list of phrases

OP's code can be written in following form
use strict;
use warnings;
use feature 'say';
my %regexMap = (
'$fn' => \&foundFunc,
'$hw' => \&hex8,
'$hb' => \&hex2,
'$sh' => \&rand6,
'$ish' => \&shiftInst,
);
my #keys = map { "\\$_" } keys %regexMap;
my $re = join('|', #keys);
while (<DATA>) {
chomp;
next unless /($re)/;
$regexMap{$1}->();
}
sub foundFunc { say 'sub_foundFunc' }
sub hex8 { say 'sub_hex8' }
sub hex2 { say 'sub_hex2' }
sub rand6 { say 'sub_rand6' }
sub shiftInst { say 'sub_shiftInst' }
__DATA__
$fn
$hw
$ac
$hb
$sh
$fn
$mf
$hb
$ish
$hw
Output
sub_foundFunc
sub_hex8
sub_hex2
sub_rand6
sub_foundFunc
sub_hex2
sub_shiftInst
sub_hex8

Dynamically capture regular expression match in Perl

I'm trying to dynamically catch regex matching in Perl. I've known that eval will help me do this but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?

I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", #_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}

I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are useing eval with a BLOCK of code. That's like a try block. If it dies inside of that eval block, it will catch that error. It will not run your string like it was code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says _take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in #captured by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an # sigil) in #+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.

Apologies! I realized both my question and sample code were both vague. But after reading your suggestions I came of with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my #matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < #matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";

Evaluing the regexp in list context returns the matches. so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my #a = ($str =~/$pattern/);
print Dumper(\#a);
}
would do the job.
HTH
Georg

Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my #patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( #patterns ) {
my #captured = ($str =~ $pattern)
or next;
print "'$_'\n" for #captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my #tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern})
or next;
unless (#captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my #tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern});
ok #captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

Regex simple replace document from dictionary hash (Perl)

I need to find and replace keywords from a hash in a large documents as fast as possible.
I tired the below two methods, one is faster by 320% but I am sure I am doing this the wrong way and sure there is a better way to do it.
The idea I want to replace only the keywords that exist in the dictionary hash and keep those that does not exist so I know it is not in the dictionary.
Both methods below scan twice to find and replace as I think. I am sure the regex like look ahead or behind can optimize it much faster.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
my %dictionary = (
pollack => "pollard",
polynya => "polyoma",
pomaces => "pomaded",
pomades => "pomatum",
practic => "praetor",
prairie => "praised",
praiser => "praises",
prajnas => "praline",
quakily => "quaking",
qualify => "quality",
quamash => "quangos",
quantal => "quanted",
quantic => "quantum",
);
my $content =qq{
Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.
};
# just duplicate content many times
$content .= $content;
cmpthese(100000, {
replacer_1 => sub {my $text = replacer1($content)},
replacer_2 => sub {my $text = replacer2($content)},
});
print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;
sub replacer1 {
my ($content) = shift;
$content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
return $content;
}
sub replacer2 {
my ($content) = shift;
my #names = $content =~ /\{(.+?)\}/g;
foreach my $name (#names) {
if (exists $dictionary{$name}) {
$content =~ s/\{$name\}/\[$dictionary{$name}\]/;
}
}
return $content;
}
Here is the benchmark result:
Rate replacer_2 replacer_1
replacer_2 5565/s -- -76%
replacer_1 23397/s 320% --

Here's a way that's a little faster and more compact:
sub replacer3 {
my ($content) = shift;
$content =~ s#\{(.+?)\}#"[".($dictionary{$1} // $1)."]"#ge;
return $content;
}
In Perl 5.8, it is ok to use || instead of // if none of your dictionary values are "false".
There's also a little to be gained by using a dictionary that already contains the braces and brackets:
sub replacer5 {
my ($content) = shift;
our %dict2;
if (!%dict2) {
%dict2 = map { "{".$_."}" => "[".$dictionary{$_}."]" } keys %dictionary
}
$content =~ s#(\{.+?\})#$dict2{$1} || $1#ge;
return $content;
}
Benchmark results:
Rate replacer_2 replacer_1 replacer_3 replacer_5
replacer_2 2908/s -- -79% -83% -84%
replacer_1 14059/s 383% -- -20% -25%
replacer_3 17513/s 502% 25% -- -7%
replacer_5 18741/s 544% 33% 7% --

It helps to build a regex that will match any of the hash keys beforehand. Like this
my $pattern = join '|', sort {length $b <=> length $a } keys %dictionary;
$pattern = qr/$pattern/;
sub replacer4 {
my ($string) = #_;
$string =~ s# \{ ($pattern) \} #"[$dictionary{$1}]"#gex;
$string;
}
with these results
Rate replacer_2 replacer_1 replacer_3 replacer_4
replacer_2 4883/s -- -80% -84% -85%
replacer_1 24877/s 409% -- -18% -22%
replacer_3 30385/s 522% 22% -- -4%
replacer_4 31792/s 551% 28% 5% --
It would also make an improvement if you could the braces and brackets in the hash, instead of having to add them each time.

I'd recommend using meaningful names for your benchmarking subroutines, it'll make the output and intent more clear.
The following reproduces a bit of what Borodin and mob have tried out, and then combines them as well.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'state';
use Benchmark qw(:all);
# Data separated by paragraph mode.
my %dictionary = split ' ', do {local $/ = ''; <DATA>};
my $content = do {local $/; <DATA>};
# Quadruple Content
$content = $content x 4;
cmpthese(100_000, {
original => sub { my $text = original($content) },
build_list => sub { my $text = build_list($content) },
xor_regex => sub { my $text = xor_regex($content) },
list_and_xor => sub { my $text = list_and_xor($content) },
});
exit;
sub original {
my $content = shift;
$content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
return $content;
}
sub build_list {
my $content = shift;
state $list = join '|', map quotemeta, keys %dictionary;
$content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
return $content;
}
sub xor_regex {
my $content = shift;
state $with_brackets = {
map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
};
$content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;
return $content;
}
sub list_and_xor {
my $content = shift;
state $list = join '|', map quotemeta, keys %dictionary;
state $with_brackets = {
map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
};
$content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;
return $content;
}
__DATA__
pollack pollard
polynya polyoma
pomaces pomaded
pomades pomatum
practic praetor
prairie praised
praiser praises
prajnas praline
quakily quaking
qualify quality
quamash quangos
quantal quanted
quantic quantum
Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.
Outputs:
Rate original xor_regex build_list list_and_xor
original 19120/s -- -23% -24% -29%
xor_regex 24938/s 30% -- -1% -8%
build_list 25253/s 32% 1% -- -7%
list_and_xor 27027/s 41% 8% 7% --
My solutions make heavy use of state variables to avoid reinitializing static data structures. However, one could also use closures or our $var; $var ||= VAL.
Addendum about enhancing the LHS of the regex
Actually, editing the LHS to use an explicit list is about improving the regular expression. And this change showed a 30% improvement in speed.
There isn't likely to be any magic solution to this. You have a list of values, that you're wanting to replace. It isn't like there is some mysterious way to simplify the language of this goal.
You could perhaps use a code block in the LHS to Fail and skip if the word does not exist in the dictionary hash. However, the following shows that this is actually 36% slower than your original method:
sub skip_fail {
my $content = shift;
$content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;
return $content;
}
Outputs:
Rate skip_fail original xor_regex build_list list_and_xor
skip_fail 6769/s -- -36% -46% -49% -53%
original 10562/s 56% -- -16% -21% -27%
xor_regex 12544/s 85% 19% -- -6% -14%
build_list 13355/s 97% 26% 6% -- -8%
list_and_xor 14537/s 115% 38% 16% 9% --

Does the 'o' modifier for Perl regular expressions still provide any benefit?

It used to be considered beneficial to include the 'o' modifier at the end of Perl regular expressions. The current Perl documentation does not even seem to list it, certainly not at the modifiers section of perlre.
Does it provide any benefit now?
It is still accepted, for reasons of backwards compatibility if nothing else.
As noted by J A Faucett and brian d foy, the 'o' modifier is still documented, if you find the right places to look (one of which is not the perlre documentation). It is mentioned in the perlop pages. It is also found in the perlreref pages.
As noted by Alan M in the accepted answer, the better modern technique is usually to use the qr// (quoted regex) operator.

/o is deprecated. The simplest way to make sure a regex is compiled only once is to use use a regex object, like so:
my $reg = qr/foo$bar/;
The interpolation of $bar is done when the variable $reg is initialized, and the cached, compiled regex will be used from then on within the enclosing scope. But sometimes you want the regex to be recompiled, because you want it to use the variable's new value. Here's the example Friedl used in The Book:
sub CheckLogfileForToday()
{
my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];
my $today_regex = qr/^$today:/i; # compiles once per function call
while (<LOGFILE>) {
if ($_ =~ $today_regex) {
...
}
}
}
Within the scope of the function, the value of $today_regex stays the same. But the next time the function is called, the regex will be recompiled with the new value of $today. If he had just used:
if ($_ =~ m/^$today:/io)
...the regex would never be updated. So, with the object form you have the efficiency of /o without sacrificing flexibility.

The /o modifier is in the perlop documentation instead of the perlre documentation since it is a quote-like modifier rather than a regex modifier. That has always seemed odd to me, but that's how it is. Since Perl 5.20, it's now listed in perlre simply to note that you probably shouldn't use it.
Before Perl 5.6, Perl would recompile the regex even if the variable had not changed. You don't need to do that anymore. You could use /o to compile the regex once despite further changes to the variable, but as the other answers noted, qr// is better for that.

In the Perl 5 version 20.0 documentation
http://perldoc.perl.org/perlre.html
it states
Modifiers
Other Modifiers
…
o - pretend to optimize your code, but actually introduce bugs
which may be a humorous way of saying it was supposed to perform some kind of optimisation, but the implementation is broken.
Thus the option might be best avoided.

This is an optimization in the case that the regex includes a variable reference. It indicates that the regex does not change even though it has a variable within it. This allows for optimizations that would not be possible otherwise.

Here are timings for different ways to call matching.
$ perl -v | grep version
This is perl 5, version 20, subversion 1 (v5.20.1) built for x86_64-linux-gnu-thread-multi
$ perl const-in-re-once.pl | sort
0.200 =~ CONST
0.200 =~ m/$VAR/o
0.204 =~ m/literal-wo-vars/
0.252 =~ m,#{[ CONST ]},o
0.260 =~ $VAR
0.276 =~ m/$VAR/
0.336 =~ m,#{[ CONST ]},
My code:
#! /usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw/ tv_interval clock_gettime gettimeofday /;
use BSD::Resource qw/ getrusage RUSAGE_SELF /;
use constant RE =>
qr{
https?://
(?:[^.]+-d-[^.]+\.)?
(?:(?: (?:dev-)? nind[^.]* | mr02 )\.)?
(?:(?:pda|m)\.)?
(?:(?:news|haber)\.)
(?:.+\.)?
yandex\.
.+
}x;
use constant FINAL_RE => qr,^#{[ RE ]}(/|$),;
my $RE = RE;
use constant ITER_COUNT => 1e5;
use constant URL => 'http://news.trofimenkov.nerpa.yandex.ru/yandsearch?cl4url=www.forbes.ru%2Fnews%2F276745-visa-otklyuchila-rossiiskie-banki-v-krymu&lr=213&lang=ru';
timeit(
'=~ m/literal-wo-vars/',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m{
^https?://
(?:[^.]+-d-[^.]+\.)?
(?:(?: (?:dev-)? nind[^.]* | mr02 )\.)?
(?:(?:pda|m)\.)?
(?:(?:news|haber)\.)
(?:.+\.)?
yandex\.
.+
(/|$)
}x
}
}
);
timeit(
'=~ m/$VAR/',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^$RE(/|$),
}
}
);
timeit(
'=~ $VAR',
ITER_COUNT,
sub {
my $r = qr,^$RE(/|$),o;
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ $r
}
}
);
timeit(
'=~ m/$VAR/o',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^$RE(/|$),o
}
}
);
timeit(
'=~ m,#{[ CONST ]},',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^#{[ RE ]}(/|$),
}
}
);
timeit(
'=~ m,#{[ CONST ]},o',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^#{[ RE ]}(/|$),o
}
}
);
timeit(
'=~ CONST',
ITER_COUNT,
sub {
my $r = qr,^$RE(/|$),o;
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ FINAL_RE
}
}
);
sub timeit {
my ($name, $iters, $code) = #_;
#my $t0 = [gettimeofday];
my $t0 = (getrusage RUSAGE_SELF)[0];
$code->();
#my $el = tv_interval($t0);
my $el = (getrusage RUSAGE_SELF)[0] - $t0;
printf "%.3f\t%-17s\t%.9f\n", $el, $name, $el / $iters
}

Yep and Nope
I ran a simple comparison using the follow script:
perl -MBenchmark=cmpthese -E 'my #n = 1..10000; cmpthese(10000, {string => sub{"a1b" =~ /a\d+c/ for #n}, o_flag => sub{"a1b" =~ /a\d+c/o for #n}, qr => sub{my $qr = qr/a\d+c/; "a1b" =~ /$qr/ for #n } })'
Here are the results:
Rate qr string o_flag
qr 760/s -- -72% -73%
string 2703/s 256% -- -5%
o_flag 2833/s 273% 5% --
So, clearly the /o flag is much faster than using qr.
But apparently the /o flag may cause bugs:
Perl regex /o optimization or bug?

One thing it, mystifyingly, does not do is, allow a ONCE block, at least at 5.8.8.
perl -le 'for (1..3){
print;
m/${\(print( "between 1 and 2 only"), 3)}/o and print "matched"
}'

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Unexpected speed behaviour when benchmarking Perl regexs - regex

Well, your case "Using regex" is so slow because you are compiling it each time. Try moving it out of the subroutine.

Perl optimizes a lot of things. Your pattern with no special regex features and literal characters allows perl's regex engine to simplify many things. Using use re 'debug' can show you what's actually happening behind the scenes.

Related

Regex performance

In perl, Is there a more compact way to search for a number of patterns, and for each one, substitute with an expression

Dynamically capture regular expression match in Perl

Regex simple replace document from dictionary hash (Perl)

Does the 'o' modifier for Perl regular expressions still provide any benefit?

Categories

Resources