How do I efficiently generate a list of primes in Perl 6?

Generating a list of primes is incredibly easy in Perl 6 using is-prime:
my @primes = (^∞).grep: *.is-prime;
This works well enough if you need a relatively small number of primes, but is very inefficient for large numbers, since every number is independently checked.
Is there a way to access Perl 6's built-in prime checking logic to efficiently create a list of primes?
Otherwise I'll need to build a sieve myself. Easy enough, but I'm afraid a sieve in high-level Perl 6 code will be almost as inefficient as the code I started with.

If you run your program with --profile, you will see that more than 99% of the time is spent in Int.is-prime. Since that effectively is just a wrapper around nqp::isprime_I(), I have tried to run similar code without the wrapper. But that doesn't change anything noticeably. So the brunt of the work is being done in nqp::isprime_I().
So the only option you really have is to parallelize your searches for primes. In the (nearer) future, hyper would be your friend here. But that is currently at the "initial naive implementation" phase, with a more robust implementation being discussed in: https://gist.github.com/jnthn/6a80a9712fb38b32537f9f0e46fca6d7
Until then, if you want to run things faster, you would have to manually break up the range of values you want to check for primality, run each sub-range inside a start block, and collect the results from the resulting Promises.
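For illustration, a minimal sketch of that manual approach (my code, not part of the original answer; the limit and chunk count are arbitrary): split the range into chunks, check each chunk in its own start block, and await the Promises.
my $limit  = 1_000_000;
my $chunks = 4;
my $size   = $limit div $chunks;
# one Promise per chunk; each start block greps its own sub-range with is-prime
my @promises = (^$chunks).map: -> $i {
    start ($i * $size ..^ ($i + 1) * $size).grep(*.is-prime).List
};
# await returns the per-chunk lists in order, so the primes come out sorted
my @primes = flat await @promises;
say +@primes;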

I wrote some Perl 6 bindings for primesieve:
https://github.com/CurtTilmes/perl6-primesieve
Math::Primesieve

This isn't really an answer to my question, but I benchmarked a few ways to get a list of primes.
Conclusions:
.is-prime is indeed way too slow for this (although @DanaJ's branch hopefully improves this a bit).
A sieve in Perl 6 code isn't as slow as I feared, as long as you optimize things a bit (i.e. make the code less pretty Perl6ish).
Native code (by way of a Perl 5 module) is still way faster.
Cheating is the fastest. 😋
Edit: added Curt Tilmes' Primesieve module. Wow, it's fast! It beats cheating (i.e. reading primes from a file)!
Well, that might be because Primesieve doesn't support a generator/iterator (yet?), so I'm just returning the whole list at once, while all other versions use a generator / lazy list.
Edit: added “even more optimized” sieve based on Timbus' comment. This one has a pretty decent performance, but the code is almost impossible to follow...
Edit: added an even better pure Perl6 version with a flexible sieve size. I start with a (pre-initialized) sieve for the odd numbers 1..99, and double the sieve size as necessary. Together with some further optimizations (e.g. when extending the sieve, we only have to check primes up to √(sieve size)) this is by far the fastest pure Perl 6 version so far. Further advantages: you don't have to know a limit in advance; and it gives the small primes a lot faster if you do need a high limit.
Edit: Math::Primesieve now supports an iterator, so include that in the script.
#!/usr/bin/env perl6
use v6.c;
# The easy but slow way
sub primes-is-prime
{
(^∞).grep: *.is-prime;
}
# Use a sieve (simple Perl 6 style)
sub primes-sieve(Int $max)
{
my @sieve;
lazy gather for 2..$max -> $p {
next if @sieve[$p];
take $p;
for 2*$p, 3*$p ... $max -> $n {
@sieve[$n] = True;
}
}
}
# Use a sieve (optimized)
sub primes-sieve2(Int $max)
{
my int @sieve;
lazy gather {
take 2;
loop (my int $p = 3; $p ≤ $max; $p += 2) {
next if @sieve[$p];
take $p;
loop (my int $n = 3*$p; $n ≤ $max; $n += 2*$p) {
@sieve[$n] = 1;
}
}
}
}
# Use a sieve (even more optimized)
sub primes-sieve3(Int $max)
{
my int @sieve;
my $max2 = ($max-1) div 2;
lazy gather {
take 2;
for 1 .. $max2 -> int $i {
next if @sieve[$i];
take 2*$i + 1;
my $max3 = ($max2 - $i) div (2*$i + 1);
for 1 .. $max3 -> int $j {
@sieve[(2*$j + 1)*$i + $j] = 1;
}
}
}
}
# Use a flexible sieve size (and further optimized)
sub primes-sieve4
{
# Pre-initialize our sieve with the odd numbers from 1 to 99
my $max = 100;
my int @sieve = 1,0,0,0,1,0,0,1,0,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,1,
1,0,1,1,0,0,1,1,0,1,0,0,1,1,0,1,0,1,1,0,1,1,1,0,1;
lazy gather {
# Don't forget our single even prime number
take 2;
my int $min-i = 1;
loop {
# Take all primes in the new part of the sieve
my int $max-i = ($max-1) div 2;
for $min-i .. $max-i -> int $i {
take 2*$i + 1 unless @sieve[$i];
}
# Extend sieve by factor 2
# We must check the primes from 3 to √(2*max) in the sieve
# for max to 2*max
for 1 .. ((2*$max).sqrt.floor-1) div 2 -> int $i {
next if @sieve[$i];
my int $p = 2*$i + 1;
my int $min-j = max(($max-i - $i) div $p, $i);
my int $max-j = (2*$max-i + 1 - $i) div $p;
for $min-j .. $max-j -> int $j {
@sieve[$i + $p*$j] = 1;
}
}
# Double the sieve size, and start the next iteration
# in the second half of the sieve
$max *= 2;
$min-i = $max-i+1;
}
}
}
# Use a Perl 5 module
sub primes-perl5
{
use Math::Prime::Util:from<Perl5> <prime_iterator>;
my $it = prime_iterator;
lazy gather {
loop {
take $it.();
}
}
}
# Use Primesieve module
sub primes-primesieve($max)
{
# See:
# - http://primesieve.org/
# - https://github.com/CurtTilmes/perl6-primesieve
use Math::Primesieve;
# No iterator support (yet?), so just return the whole list
return Math::Primesieve.new.primes($max);
}
# Use Primesieve module (iterator)
sub primes-primesieve-iterator
{
# See:
# - http://primesieve.org/
# - https://github.com/CurtTilmes/perl6-primesieve
use Math::Primesieve;
my $iterator = Math::Primesieve::iterator.new;
lazy gather {
loop {
take $iterator.next;
}
}
}
# Cheat
# Source: https://primes.utm.edu/lists/small/millions/ - first million
# (Unzip and remove the first few lines from the file.)
sub primes-cheat
{
lazy $*PROGRAM.sibling('primes1.txt').words.map(+*);
}
sub timer(&code)
{
my $start = now;
&code();
my $elapsed = now - $start;
say "Elapsed: $elapsed.fmt('%.3f')s";
}
sub MAIN
{
#my $nth = 1_000;
#my $max = 8_000;
#my $nth = 10_000;
#my $max = 105_000;
my $nth = 50_000;
my $max = 612_000;
timer {
my @primes = primes-is-prime;
say "Using .is-prime: @primes[$nth]";
}
timer {
my @primes = primes-sieve($max);
say "Using a sieve (simple Perl 6 style): @primes[$nth]";
}
timer {
my @primes = primes-sieve2($max);
say "Using a sieve (optimized): @primes[$nth]";
}
timer {
my @primes = primes-sieve3($max);
say "Using a sieve (even more optimized): @primes[$nth]";
}
timer {
my @primes = primes-sieve4;
say "Using a flexible sieve size (further optimized): @primes[$nth]";
}
timer {
my @primes = primes-perl5;
say "Using a Perl 5 module: @primes[$nth]";
}
timer {
my @primes = primes-primesieve($max);
say "Using Primesieve module: @primes[$nth]";
}
timer {
my @primes = primes-primesieve-iterator;
say "Using Primesieve module (iterator): @primes[$nth]";
}
timer {
my @primes = primes-cheat;
say "Cheating: @primes[$nth]";
}
}
# 4 year old Linux server, running Rakudo Star 2017.04:
#
# Using .is-prime: 611957
# Elapsed: 216.134s
# Using a sieve (simple Perl 6 style): 611957
# Elapsed: 124.087s
# Using a sieve (optimized): 611957
# Elapsed: 41.129s
# Using a sieve (even more optimized): 611957
# Elapsed: 7.285s
# Using a flexible sieve size (further optimized): 611957
# Elapsed: 3.897s
# Using a Perl 5 module: 611957
# Elapsed: 10.031s
# Using Primesieve module: 611957
# Elapsed: 0.312s
# Using Primesieve module (iterator): 611957
# Elapsed: 1.460s
# Cheating: 611957
# Elapsed: 2.017s


Perl anchored regex performance

Problem and Data
At the bottom of this post is the entire script from which this NYTProf data was generated. The script builds a hash and then attempts to delete keys that contain certain bad patterns. Running the code through NYTProf generates the following:
delete @$hash{ grep { /\Q$bad_pattern\E/ } sort keys %$hash };
# spent 7.29ms making 2 calls to main::CORE:sort, avg 3.64ms/call
# spent 808µs making 7552 calls to main::CORE:match, avg 107ns/call
# spent 806µs making 7552 calls to main::CORE:regcomp, avg 107ns/call
There are over 7,000 calls being made to main::CORE:match and main::CORE:regcomp. The assumption is that this is a sufficient number of calls to reduce noise levels.
Moving on! The bad patterns only need to be deleted if they appear at the beginning of a key. Sounds great! Adding a ^ to anchor the regex should improve performance. However, NYTProf generates the following (NYTProf has been run many times and this is quite consistent):
delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
# spent 7.34ms making 2 calls to main::CORE:sort, avg 3.67ms/call
# spent 1.62ms making 7552 calls to main::CORE:regcomp, avg 214ns/call
# spent 723µs making 7552 calls to main::CORE:match, avg 96ns/call
Questions
The anchored regex nearly doubles the amount of time spent in these main::CORE:* methods. But an anchored regex should improve performance. What is unique about this dataset that causes the anchored regex to take so much additional time?
Entire Script
use strict;
use Devel::NYTProf;
my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);
my $hash;
for my $state (@states) {
for my $city (@cities) {
for my $street (@streets) {
for my $season (@seasoncode) {
for my $history (@historycode) {
for my $side (@sides) {
$hash->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
}
}
}
}
}
}
sub CleanseHash {
my @bad_patterns = (
'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
);
for my $bad_pattern (@bad_patterns) {
delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
}
}
DB::enable_profile();
CleanseHash();
DB::finish_profile();
It's very unlikely you can optimise the regex engine. If performance is your goal, though, you can concentrate on other parts of the code. For example, try this:
for my $bad_pattern (@bad_patterns) {
my $re = qr/^\Q$bad_pattern\E/;
delete @$hash{ grep /$re/, sort keys %$hash };
}
On my machine, it runs much faster (regardless of the presence of the anchor), because the expression form of grep doesn't have to create a scope and the complex compilation of the regex happens just once for each bad pattern.
That's fairly straightforward matching, with the pattern being a fixed string, so the anchored pattern must be faster in general. The profiling confirms as much, with 96 ns/call vs 107 ns/call.
But when I benchmark anchored and un-anchored versions of the code, they run neck and neck. That is because of the rest of the code, which overwhelms the regex match: the sort of the keys is unneeded for the comparison, and the regex is compiled inside grep's loop, also unnecessarily.
Once that is remedied, I do get the anchored version to be 11-15% faster (over multiple runs):
use warnings;
use strict;
use feature 'say';
use Data::Dump;
use Storable qw(dclone);
use Benchmark qw(cmpthese);
my $runfor = shift // 3;
my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);
my @bad_patterns = (
'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
);
my $hash_1;
for my $state (@states) {
for my $city (@cities) {
for my $street (@streets) {
for my $season (@seasoncode) {
for my $history (@historycode) {
for my $side (@sides) {
$hash_1->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
}
}
}
}
}
}
my $hash_2 = dclone $hash_1;
#say for @bad_patterns; say '---'; dd $hash_1; exit;
sub no_anchor {
for my $bad_pattern (@bad_patterns) {
my $re = qr/\Q$bad_pattern\E/;
delete @$hash_2{ grep { /$re/ } keys %$hash_2 };
}
}
sub w_anchor {
for my $bad_pattern (@bad_patterns) {
my $re = qr/^\Q$bad_pattern\E/;
delete @$hash_1{ grep { /$re/ } keys %$hash_1 };
}
}
cmpthese( -$runfor, {
'no_anchor' => sub { no_anchor() },
'w_anchor' => sub { w_anchor() },
});
I have the comparison subs use external data (not passed to the tested subs as usual), to cut out any extra work, and I use separate hashref copies obtained with Storable::dclone.
The output of benchmark above run with 10 seconds (pass 10 to program when run):
Rate no_anchor w_anchor
no_anchor 296/s -- -13%
w_anchor 341/s 15% --
So the anchored version does win, albeit by a modest margin. With this data the match fails in about 96% of cases, and for all of those the un-anchored version does more work, having to search through the whole string; I'd expect a larger difference.
The relative closeness of the runtimes is due to the rest of the code (grep, hash manipulation, loop), and in particular the regex compilation cost, being included in the timing, which dilutes the difference in matching efficiency itself.
This teaches an important lesson about timing code: it can be subtle. One needs to ensure that only the relevant sections are compared, and fairly (under equal conditions).

Merge/combine two lists line by line?

I have two lists stored in variables: $list1 and $list2, for example:
$list1:
a
b
c
d
$list2:
1
2
3
4
How do I merge them together line by line so that I end up with:
a1
b2
c3
d4
I have tried using an array (@) but it just combines them one after the other, not line by line, for example:
$list1 = @(command)
$list1 += @($list2)
If you prefer pipelining, you can also do it in one line:
0 .. ($list1.count -1) | ForEach-Object { $list1[$_]+$list2[$_] }
You could do this with a For loop that iterates through the index of each object until it reaches the total (.count) of the first object:
$list1 = 'a','b','c','d'
$list2 = 1,2,3,4
For ($i=0; $i -lt $list1.count; $i++) {
$list1[$i]+$list2[$i]
}
Output:
a1
b2
c3
d4
If you want the results to go to a variable, you could put (for example) $list = before the For.
To complement Mark Wragg's helpful for-based answer and Martin Brandl's helpful pipeline-based answer:
Combining foreach with .., the range operator, allows for a concise solution that also performs well:
foreach ($i in 0..($list1.count-1)) { "$($list1[$i])$($list2[$i])" }
Even though an entire array of indices is constructed first - 0..($list1.count-1) - this slightly outperforms the for solution with large input lists, and both foreach and for will be noticeably faster than the pipeline-based solution - see below.
Also note how string interpolation (variable references and subexpressions inside a single "..." string) is used to ensure that the result is always a string.
By contrast, if you use +, it is the type of the LHS that determines the output type, which can result in errors or unwanted output; e.g., 1 + 'a' causes an error, because 1 is an integer and 'a' cannot be converted to an integer.
Optional reading: performance considerations
Generally, foreach and for solutions are noticeably faster than pipeline-based (ForEach-Object cmdlet-based) solutions.
Pipelines are elegant and concise, but they are comparatively slow.
That shouldn't stop you from using them, but it's important to be aware that they can be a performance bottleneck.
Pipelines are memory-efficient, and for processing large collections that don't fit into memory as a whole they are always the right tool to use.
PSv4 introduced the little-known .ForEach() collection operator (method), whose performance is in between that of for / foreach and the ForEach-Object cmdlet.
The following compares the relative performance with large lists (100,000 items); the absolute timing numbers will vary based on many factors, but they should give you a general sense:
# Define two large lists.
$list1 = 1..100000
$list2 = 1..100000
# Define the commands as script blocks:
$cmds = { foreach ($i in 0..($list1.count-1)) { "$($list1[$i])$($list2[$i])" } },
{ for ($i=0; $i -lt $list1.count; $i++) { "$($list1[$i])$($list2[$i])" } },
{ 0..($list1.count -1) | ForEach-Object { "$($list1[$_])$($list2[$_])" } },
{ (0..($list1.count-1)).ForEach({ "$($list1[$_])$($list2[$_])" }) }
# Time each command.
$cmds | ForEach-Object { '{0:0.0}' -f (Measure-Command $_).TotalSeconds }
In a 2-core Windows 10 VM running PSv5.1 I get the following results after running the tests several times:
0.5 # foreach
0.7 # for
1.8 # ForEach-Object (pipeline)
1.2 # .ForEach() operator

How to get the value from the list that appears only once?

I saw this question on the internet: get the only number that is present just once in the list, while all other numbers are present twice. The data is large, about a million unsorted numbers in random order, possibly including negative numbers; every number appears twice except one, which appears only once.
my @array = (1,1,2,3,3,4,4)
output :
2
Only 2 is not repeated in the list. These are the solutions I tried.
my $unique;
$unique ^= $_ for(@array);
say $unique;
It doesn't work on negative numbers but fast.
I tried a hash where the key is the number and the value is the number of times it is present in the list. Reverse the hash, then print the value stored under the key 1; all other numbers end up under the key 2, since they appear twice. The hash solution works for negative numbers but is slow with a large input of one million numbers.
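For reference, a minimal sketch of that counting-hash idea (variable names are mine; it assumes @array holds the input):
my %count;
$count{$_}++ for @array;            # tally each number
my %by_count = reverse %count;      # flip to count => number (collisions don't matter for count 1)
print "$by_count{1}\n";             # the number that appears exactly once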
I tried a regex way of combining the entire list with tab and then used
my $combined = join " ", @array;
$combined !~ (\d+).*$1;
say $1;
but I get only the last number of the list
Is there a fast way to do it? Any idea of using a regex?
Edit: Rephrased the title for better answers.
This seems pretty fast:
use v5.10; use strict; use warnings;
sub there_can_be_only_one {
my @counts;
$counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]++ for @{$_[0]};
$counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]==1 and return $_ for @{$_[0]};
return;
}
my @array = (1,1,-4,-4,2,3,-1,3,4,-1,4);
say there_can_be_only_one(\@array);
It's basically a variation of the hash technique, but using an array instead of a hash. Because we need to deal with negative numbers, we can't use them unmodified in the @counts array. Negative indexes do work in Perl of course, but they'd overwrite our data for positive indexes. Fail.
So we use something similar to two's complement. We store positive numbers in the array as 2*$_ and negative numbers as (-2*$_)-1. That is:
Integer: ... -3 -2 -1 0 1 2 3 ...
Stored as: ... 5 3 1 0 2 4 6 ...
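As a tiny illustration of that mapping (the helper name is mine):
sub to_index { my $n = shift; $n >= 0 ? 2*$n : -2*$n - 1 }
print join(' ', map { to_index($_) } -3 .. 3), "\n";   # prints: 5 3 1 0 2 4 6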
Because this solution doesn't rely on sorting the list, and simply does two passes over it (well, on average, one and a half passes), it performs at O(n), in contrast to Schwern's O(n log n) solution. Thus for larger lists (a few million integers) it should be significantly faster. Here's a quick comparison on my (fairly low-powered) netbook:
use v5.10; use strict; use warnings;
use Benchmark qw(timethese);
use Time::Limit '60';
sub tobyink {
my @counts;
$counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]++ for @{$_[0]};
$counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]==1 and return $_ for @{$_[0]};
return;
}
sub schwern {
my @nums = sort @{$_[0]};
return $nums[0] if $nums[0] != $nums[1];
for (1..$#nums-1) {
my($prev, $this, $next) = @nums[$_-1, $_, $_+1];
return $this if $prev != $this && $next != $this;
}
return $nums[-1] if $nums[-1] != $nums[-2];
}
my @input = (
1..2_000_000, # 1_000_001 only appears once
1..1_000_000, 1_000_002..2_000_000,
);
timethese(1, {
tobyink => sub { tobyink(\@input) },
schwern => sub { schwern(\@input) },
});
__END__
Benchmark: timing 1 iterations of schwern, tobyink...
schwern: 11 wallclock secs ( 8.72 usr + 0.92 sys = 9.64 CPU) @ 0.10/s (n=1)
(warning: too few iterations for a reliable count)
tobyink: 5 wallclock secs ( 5.01 usr + 0.08 sys = 5.09 CPU) @ 0.20/s (n=1)
(warning: too few iterations for a reliable count)
UPDATE: in my initial answer I missed the detail that no number will appear more than twice. I'd assumed that it was possible for some numbers to appear three or more times. Using this additional detail, we can go even faster:
sub there_can_be_only_one {
my $tmp;
$tmp ^= $_>=0 ? 2*$_ : (-2*$_)-1 for @{$_[0]};
$tmp%2 ? ($tmp+1)/-2 : $tmp/2;
}
say there_can_be_only_one(\@array);
This runs about 30% faster than my initial answer.
The standard way to deal with this is to throw it all into a hash.
use v5.10;
use strict;
use warnings;
my @nums = (2..500_000, 500_002..1_000_000, 0..1_000_001);
my %count;
for (@nums) {
$count{$_}++
}
for (keys %count) {
say $_ if $count{$_} == 1;
}
But yes, it's quite slow.
Then I thought maybe I could avoid having to loop through the hash to find the singles...
my @nums = (2..500_000, 500_002..1_000_000, 0..1_000_001);
my %uniqs;
my %dups;
for (@nums) {
if( $uniqs{$_} ) {
delete $uniqs{$_};
$dups{$_} = 1;
}
elsif( !$dups{$_} ) {
$uniqs{$_} = 1;
}
}
print join ", ", keys %uniqs;
But that was even slower.
This is the fastest thing I've come up with, takes about half the time as the above.
use v5.10;
use strict;
use warnings;
my @nums = (2..500_000, 500_002..1_000_000, 0..1_000_001);
@nums = sort @nums;
say $nums[0] if $nums[0] != $nums[1];
for (1..$#nums-1) {
my($prev, $this, $next) = @nums[$_-1, $_, $_+1];
say $this if $prev != $this && $next != $this;
}
say $nums[-1] if $nums[-1] != $nums[-2];
By sorting the list, you can iterate through it and check if a given entry's neighbors are duplicates. Have to be careful about the first and last elements. I put their checks outside the loop to avoid having to run a special case for every iteration.
Because sort is O(nlogn), as the list of numbers gets larger this solution will eventually be slower than the hash-based one, but you'll probably run out of memory before that happens.
Finally, if this list is large, you should consider storing it on disk in a database. Then you can avoid using up memory and let the database do the work efficiently.
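A hedged sketch of that last suggestion (my own code, assuming DBD::SQLite is available; the file, table, and column names are made up): let SQL do the counting with GROUP BY/HAVING.
use strict;
use warnings;
use DBI;

my @nums = (2..500_000, 500_002..1_000_000, 0..1_000_001);
my $dbh = DBI->connect('dbi:SQLite:dbname=nums.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS nums (n INTEGER)');
my $ins = $dbh->prepare('INSERT INTO nums (n) VALUES (?)');
$dbh->begin_work;                   # one transaction, otherwise the inserts crawl
$ins->execute($_) for @nums;
$dbh->commit;
my ($single) = $dbh->selectrow_array(
    'SELECT n FROM nums GROUP BY n HAVING COUNT(*) = 1'
);
print "$single\n";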
It doesn't work on negative numbers but fast.
Actually, if you want xor to work on negative numbers, you just need to stringify them:
my @array = (-10..-7,-5..10,-10..10);
my $unique;
$unique ^= "$_" for @array;
say $unique;
Outputs
-6
And doing some quick benchmarks:
Benchmark: timing 100 iterations of schwern, there_can_be_only_one, tobyink, xor_string...
schwern: 323 wallclock secs (312.42 usr + 7.08 sys = 319.51 CPU) @ 0.31/s (n=100)
there_can_be_only_one: 114 wallclock secs (113.49 usr + 0.02 sys = 113.51 CPU) @ 0.88/s (n=100)
tobyink: 177 wallclock secs (176.76 usr + 0.14 sys = 176.90 CPU) @ 0.57/s (n=100)
xor_string: 98 wallclock secs (97.05 usr + 0.00 sys = 97.05 CPU) @ 1.03/s (n=100)
Shows that xor-ing the string goes 15% faster than xor-ing the mathematical translation to the positive numbers.
Corollary - What about with a sorted list?
Schwern's solution brings up an interesting corollary. He sorted the list and then did a search for all of the unique elements.
If we use the additional information that there is only one singleton in a crowd of doubletons, we can simplify the search by doing a pairwise comparison, which reduces our comparisons by a factor of 4.
However, we can do even better with a binary search. If we split the list at a boundary between a known matched pair, then whichever of the two remaining sublists has odd length contains our singleton. I did some benchmarking of this solution, and it's orders of magnitude faster than anything else (of course):
use strict;
use warnings;
use Benchmark qw(timethese);
sub binary_search {
my $nums = $_[0];
my $min = 0;
my $max = $#$nums;
while ($min < $max) {
my $half = ($max - $min) / 2; # should always be an integer
my ($prev, $this, $next) = ($min+$half-1) .. ($min+$half+1);
if ($nums->[$prev] == $nums->[$this]) {
if ($half % 2) { # 0 0 1 1 2 2 3 ( half = 3 )
$min = $next;
} else { # 0 1 1 2 2 ( half = 2 )
$max = $prev - 1;
}
} elsif ($nums->[$this] == $nums->[$next]) {
if ($half % 2) { # 0 1 1 2 2 3 3 ( half = 3 )
$max = $prev;
} else { # 0 0 1 1 2 ( half = 2 )
$min = $next + 1;
}
} else {
$max = $min = $this;
}
}
return $nums->[$min];
}
sub xor_string {
my $tmp;
$tmp ^= "$_" for #{$_[0]};
}
sub brute {
my $nums = $_[0];
return $nums->[0] if $nums->[0] != $nums->[1];
for (1..$#$nums-1) {
my($prev, $this, $next) = @$nums[$_-1, $_, $_+1];
return $this if $prev != $this && $next != $this;
}
return $nums->[-1] if $nums->[-1] != $nums->[-2];
}
sub pairwise_search {
my $nums = $_[0];
for (my $i = 0; $i <= $#$nums; $i += 2) {
if ($nums->[$i] != $nums->[$i+1]) {
return $nums->[$i];
}
}
}
# Note: this test data is very specific and is intended to take near the maximum
# number of steps for a binary search while shortcutting halfway for brute force
# and pairwise
my @input = sort {$a <=> $b} (0..500_003, 500_005..1_000_000, 0..1_000_000);
#my @input = sort {$a <=> $b} (0..499_996, 499_998..1_000_000, 0..1_000_000);
timethese(1000, {
brute => sub { brute(\@input) },
pairwise => sub { pairwise_search(\@input) },
xor_string => sub { xor_string(\@input) },
binary => sub { binary_search(\@input) },
});
Results:
Benchmark: timing 1000 iterations of binary, brute, pairwise, xor_string...
binary: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @ 62500.00/s (n=1000)
(warning: too few iterations for a reliable count)
brute: 472 wallclock secs (469.92 usr + 0.05 sys = 469.97 CPU) @ 2.13/s (n=1000)
pairwise: 216 wallclock secs (214.74 usr + 0.00 sys = 214.74 CPU) @ 4.66/s (n=1000)
xor_string: 223 wallclock secs (221.74 usr + 0.06 sys = 221.80 CPU) @ 4.51/s (n=1000)

profiling a simple regex search

I want to test the performance of two different approaches, in perl, of checking that one string is contained entirely within another.
The first approach is to take a string convert it to an array and test character by character whilst the second approach simply evaluates a regular expression (which I believe has the same order as a linear search through all the characters but doesn't incur the cost of assigning memory for an array, and copying characters into it (though it might have other costs involved)).
My initial approach to this test was to just put both procedures (see below) in a big for loop (0 to 999999) and then time how long the program takes to finish; at first it looked as though the regex match was much faster (12.926s vs 0.318s). But I then considered the possibility that after evaluating the regex once, the following iterations are trivial because it is cached. To test this I instead put my for loop on the command line (making each iteration of the perl script, looping through 0 to 0, "memory-less") and noticed that both are similar (albeit with some wild divergence from the average at times). But I strongly suspect this might be a poor conclusion, because the time taken to start the script probably dominates the execution time of the script.
Is there a way (especially for when I want to look at something less trivial), of turning off the caching (if that's what is happening of course) so that I can fairly run procedures within a for loop (so I can call the script only once)?
Or is it the case that there is nothing clever going on and that a regex search really is much quicker in this example!?
my $teststr = "testing testing";
my $teststr2 = "testing tasted";
my $match = 1;
#procedure one - character by character checking
for (my $i = 0; $i < 1; $i++)
{
my @chrArr = split //, $teststr;
my @chrArr2 = split //, $teststr2;
for (my $j = 0; $j < @chrArr2; $j++)
{
if($chrArr[$j] != $chrArr2[$j])
{
$match = 0;
break;
}
}
}
#procedure 2 - regex matching
for (my $i = 0; $i < 1; $i++)
{
if($teststr !~ m/$teststr2/)
{
$match = 0;
}
}
Why don't you use the Benchmark module? It should fit perfectly here.
use Benchmark qw( timethese cmpthese);
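For example, here is a minimal sketch with cmpthese (my code, not the answerer's; I use ne/last in place of the !=/break from the question, and \Q...\E so the needle is matched literally):
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $teststr  = "testing testing";
my $teststr2 = "testing tasted";

cmpthese(-3, {                      # run each variant for at least 3 CPU seconds
    char_by_char => sub {
        my @a = split //, $teststr;
        my @b = split //, $teststr2;
        my $match = 1;
        for my $j (0 .. $#b) {
            if ($a[$j] ne $b[$j]) { $match = 0; last }
        }
    },
    regex_match => sub {
        my $match = ($teststr =~ /\Q$teststr2\E/) ? 1 : 0;
    },
});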
Regular expression matching/searching is linear. Compiling the pattern is expensive. If you change $teststr2 on every iteration, no caching will be possible. For example:
#procedure 2 - regex matching
for (my $i = 0; $i < 1; $i++)
{
if($teststr !~ m/${i}$teststr2/)
{
$match = 0;
}
}

Find words that are substrings of other words efficiently

I have an Ispell list of English words (nearly 50,000 words). My homework in Perl is to quickly (say, in under one minute) get a list of all strings that are substrings of some other word. I have tried a solution with two nested foreach loops comparing all words, but even with some optimizations, it's still too slow. I think the right solution could be some clever use of regular expressions on the array of words. Do you know how to solve this problem quickly (in Perl)?
I have found a fast solution, which can find all these substrings in about 15 seconds on my computer, using just one thread. Basically, for each word, I create an array of every possible substring (eliminating substrings which differ only in "s" or "'s" endings):
#take word and return list of all valid substrings
sub split_to_all_valid_subwords {
my $word = $_[0];
my @split_list;
my ($i, $j);
for ($i = 0; $i < length($word); ++$i){
for ($j = 1; $j <= length($word) - $i; ++$j){
unless
(
($j == length($word)) or
($word =~ m/s$/ and $i == 0 and $j == length($word) - 1) or
($word =~ m/\'s$/ and $i == 0 and $j == length($word) - 2)
)
{
push(@split_list, substr($word, $i, $j));
}
}
}
return @split_list;
}
Then I just create a list of all substring candidates and intersect it with the words:
my @substring_candidates;
foreach my $word (@words) {
push( @substring_candidates, split_to_all_valid_subwords($word));
}
#make intersection between substring candidates and words
my %substring_candidates=map{$_ =>1} @substring_candidates;
my %words=map{$_=>1} @words;
my @substrings = grep( $substring_candidates{$_}, @words );
Now @substrings contains all the words that are substrings of some other word.
Perl regular expressions will optimize patterns like foo|bar|baz into an Aho-Corasick match - up to a certain limit of total compiled regex length. Your 50000 words will probably exceed that length, but could be broken into smaller groups. (Indeed, you probably want to break them up by length and only check words of length N for containing words of length 1 through N-1.)
Alternatively, you could just implement Aho-Corasick in your perl code - that's kind of fun to do.
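To give an idea of that second suggestion, here is a rough, untuned sketch of a pure-Perl Aho-Corasick pass over the word list (my own code, not a drop-in homework solution; it reads one word per line and prints every word that occurs inside a different word):
use strict;
use warnings;

chomp(my @words = <>);                       # one word per line on stdin / in a file argument
@words = grep { length } @words;

# Each node: [ \%goto, $fail, \@output ]; node 0 is the root.
my @node = ([ {}, 0, [] ]);

# 1. Build the keyword trie from all words.
for my $w (@words) {
    my $state = 0;
    for my $c (split //, $w) {
        $node[$state][0]{$c} //= do { push @node, [ {}, 0, [] ]; $#node };
        $state = $node[$state][0]{$c};
    }
    push @{ $node[$state][2] }, $w;
}

# 2. BFS to set failure links and merge output lists.
my @queue = values %{ $node[0][0] };         # depth-1 nodes already fail to the root
while (@queue) {
    my $r = shift @queue;
    for my $c (keys %{ $node[$r][0] }) {
        my $s = $node[$r][0]{$c};
        push @queue, $s;
        my $f = $node[$r][1];
        $f = $node[$f][1] while $f and !exists $node[$f][0]{$c};
        $node[$s][1] = $node[$f][0]{$c} // 0;
        push @{ $node[$s][2] }, @{ $node[ $node[$s][1] ][2] };
    }
}

# 3. Scan every word; any keyword reported while scanning a *different* word
#    is a substring of some other word.
my %is_substring;
for my $w (@words) {
    my $state = 0;
    for my $c (split //, $w) {
        $state = $node[$state][1] while $state and !exists $node[$state][0]{$c};
        $state = $node[$state][0]{$c} // 0;
        $is_substring{$_} = 1 for grep { $_ ne $w } @{ $node[$state][2] };
    }
}
print "$_\n" for sort keys %is_substring;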
update
Ondra supplied a beautiful solution in his answer; I leave my post here as an example of overthinking a problem and failed optimisation techniques.
My worst case kicks in for a word that doesn't match any other word in the input. In that case, it goes quadratic. The OPT_PRESORT was an attempt to avert the worst case for most words. The OPT_CONSECUTIVE was a linear-complexity filter that reduced the total number of items in the main part of the algorithm, but it is just a constant factor when considering the complexity. However, it is still useful with Ondra's algorithm and saves a few seconds, as building his split list is more expensive than comparing two consecutive words.
I updated the code below to select ondras algorithm as a possible optimisation. Paired with zero threads and the presort optimisation, it yields maximum performance.
I would like to share a solution I coded. Given an input file, it outputs all those words that are a substring of any other word in the same input file. Therefore, it computes the opposite of ysth's ideas, but I took the idea of optimisation #2 from his answer. There are the following three main optimisations that can be deactivated if required.
Multithreading
The questions "Is word A in list L? Is word B in L?" can be easily parallelised.
Pre-sorting all the words for their length
I create an array that points to the list of all words that are longer than a certain length, for every possible length. For long words, this can cut down the number of possible words dramatically, but it trades quite a lot of space, as one word of length n appears in all lists from length 1 to length n.
Testing consecutive words
In my /usr/share/dict/words, most consecutive lines look quite similar:
Abby
Abby's
for example. As every word that would match the first word also matches the second one, I immediately add the first word to the list of matching words, and only keep the second word for further testing. This saved about 30% of words in my test cases. Because I do that before optimisation No 2, this also saves a lot of space. Another trade-off is that the output will not be sorted.
The script itself is ~120 lines long; I explain each sub before showing it.
head
This is just a standard script header for multithreading. Oh, and you need perl 5.10 or better to run this. The configuration constants define the optimisation behaviour. Add the number of processors of your machine in that field. The OPT_MAX variable can take the number of words you want to process, however this is evaluated after the optimisations have taken place, so the easy words will already have been caught by the OPT_CONSECUTIVE optimisation. Adding anything there will make the script seemingly slower. $|++ makes sure that the status updates are shown immediately. I exit after the main is executed.
#!/usr/bin/perl
use strict; use warnings; use feature qw(say); use threads;
$|=1;
use constant PROCESSORS => 0; # (false, n) number of threads
use constant OPT_MAX => 0; # (false, n) number of words to check
use constant OPT_PRESORT => 0; # (true / false) sorts words by length
use constant OPT_CONSECUTIVE => 1; # (true / false) prefilter data while loading
use constant OPT_ONDRA => 1; # select the awesome Ondra algorithm
use constant BLABBER_AT => 10; # (false, n) print progress at n percent
die q(The optimisations Ondra and Presort are mutually exclusive.)
if OPT_PRESORT and OPT_ONDRA;
exit main();
main
Encapsulates the main logic, and does multi-threading. The number of words that will be matched will be considerably smaller than the number of input words, if the input was sorted. After I have selected all matched words, I print them to STDOUT. All status updates etc. are printed to STDERR, so that they don't interfere with the output.
sub main {
my @matching; # the matching words.
my @words = load_words(\@matching); # the words to be searched
say STDERR 0+@words . " words to be matched";
my $prepared_words = prepare_words(@words);
# do the matching, possibly multithreading
if (PROCESSORS) {
my @threads =
map {threads->new(
\&test_range,
$prepared_words,
@words[$$_[0] .. $$_[1]] )
} divide(PROCESSORS, OPT_MAX || 0+@words);
push @matching, $_->join for @threads;
} else {
push @matching, test_range(
$prepared_words,
@words[0 .. (OPT_MAX || 0+@words)-1]);
}
say STDERR 0+@matching . " words matched";
say for @matching; # print out the matching words.
0;
}
load_words
This reads all the words from the input files which were supplied as command line arguments. Here the OPT_CONSECUTIVE optimisation takes place. The $last word is either put into the list of matching words, or into the list of words to be matched later. The -1 != index($a, $b) decides if the word $b is a substring of word $a.
sub load_words {
my $matching = shift;
my @words;
if (OPT_CONSECUTIVE) {
my $last;
while (<>) {
chomp;
if (defined $last) {
push @{-1 != index($_, $last) ? $matching : \@words}, $last;
}
$last = $_;
}
push @words, $last // ();
} else {
@words = map {chomp; $_} <>;
}
@words;
}
prepare_words
This "blows up" the input words, sorting them after their length into each slot, that has the words of larger or equal length. Therefore, slot 1 will contain all words. If this optimisation is deselected, it is a no-op and passes the input list right through.
sub prepare_words {
if (OPT_ONDRA) {
my $ondra_split = sub { # evil: using $_ as implicit argument
my @split_list;
for my $i (0 .. length $_) {
for my $j (1 .. length($_) - ($i || 1)) {
push @split_list, substr $_, $i, $j;
}
}
@split_list;
};
return +{map {$_ => 1} map &$ondra_split(), @_};
} elsif (OPT_PRESORT) {
my @prepared = ([]);
for my $w (@_) {
push @{$prepared[$_]}, $w for 1 .. length $w;
}
return \@prepared;
} else {
return [@_];
}
}
test
This tests if the word $w is a substring in any of the other words. $wbl points to the data structure that was created by the previous sub: Either a flat list of words, or the words sorted by length. The appropriate algorithm is then selected. Nearly all of the running time is spent in this loop. Using index is much faster than using a regex.
sub test {
my ($w, $wbl) = @_;
my $l = length $w;
if (OPT_PRESORT) {
for my $try (@{$$wbl[$l + 1]}) {
return 1 if -1 != index $try, $w;
}
} else {
for my $try (@$wbl) {
return 1 if $w ne $try and -1 != index $try, $w;
}
}
return 0;
}
divide
This just encapsulates an algorithm that guarantees a fair distribution of $items items into $parcels buckets. It outputs the bounds of a range of items.
sub divide {
my ($parcels, $items) = @_;
say STDERR "dividing $items items into $parcels parcels.";
my ($min_size, $rest) = (int($items / $parcels), $items % $parcels);
my @distributions =
map [
$_ * $min_size + ($_ < $rest ? $_ : $rest),
($_ + 1) * $min_size + ($_ < $rest ? $_ : $rest - 1)
], 0 .. $parcels - 1;
say STDERR "range division: @$_" for @distributions;
return @distributions;
}
test_range
This calls test for each word in the input list, and is the sub that is multithreaded. grep selects all those elements in the input list where the code (given as first argument) returns true. It also regularly outputs a status message like thread 2 at 10%, which makes waiting for completion much easier. This is a psychological optimisation ;-).
sub test_range {
my $wbl = shift;
if (BLABBER_AT) {
my $range = @_;
my $step = int($range / 100 * BLABBER_AT) || 1;
my $i = 0;
return
grep {
if (0 == ++$i % $step) {
printf STDERR "... thread %d at %2d%%\n",
threads->tid,
$i / $step * BLABBER_AT;
}
OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)
} @_;
} else {
return grep {OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)} @_;
}
}
invocation
Using bash, I invoked the script like
$ time (head -n 1000 /usr/share/dict/words | perl script.pl >/dev/null)
Where 1000 is the number of lines I wanted to input, dict/words was the word list I used, and /dev/null is the place I want to store the output list, in this case, throwing the output away. If the whole file should be read, it can be passed as an argument, like
$ perl script.pl input-file >output-file
time just tells us how long the script ran. Using 2 slow processors and 50000 words, it executed in just over two minutes in my case, which is actually quite good.
update: more like 6–7 seconds now, with the Ondra + Presort optimisation, and no threading.
further optimisations
update: overcome by better algorithm. This section is no longer completely valid.
The multithreading is awful. It allocates quite some memory and isn't exactly fast. This isn't surprising considering the amount of data. I considered using a Thread::Queue, but that thing is slow like $#*! and therefore is a complete no-go.
If the inner loop in test were coded in a lower-level language, some performance might be gained, as the index built-in wouldn't have to be called. If you can code C, take a look at the Inline::C module. If the whole script were coded in a lower-level language, array access would also be faster. A language like Java would also make the multithreading less painful (and less expensive).
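As a hedged illustration of that Inline::C idea (my code, not integrated with or tested against the script above; it assumes a working C compiler and the Inline module are installed, and the helper name is made up):
use strict;
use warnings;
use Inline C => <<'END_C';
#include <string.h>
/* returns 1 if needle occurs inside haystack, 0 otherwise */
int c_contains(char *haystack, char *needle) {
    return strstr(haystack, needle) != NULL;
}
END_C

print c_contains("overthinking", "thin"), "\n";   # prints 1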