Perl list interpolation performance

Background
The perldoc for List::Util suggests that some uses of map may be replaced by reduce in order to avoid creating an unnecessary intermediate list:
For example, to find the total length of all the strings in a
list, we could use
$total = sum map { length } @strings;
However, this produces a list of temporary integer values as long as
the original list of strings, only to reduce it down to a single value
again. We can compute the same result more efficiently by using reduce
with a code block that accumulates lengths by writing this instead as:
$total = reduce { $a + length $b } 0, @strings;
That makes sense. However, in order to work in this example, reduce needs an "identity value" that is prepended to the input list:
$total = reduce { $a + length $b } 0, @strings;
#                                  ^^^^^^^^^^^
That makes me think: doesn't 0, @strings create a new list, thus offsetting any gains from not creating a list in map?
Question
How does list interpolation ($scalar, @list) work in Perl? Does it involve copying elements from the source list, or is it done in some smarter way? My simple benchmark suggests that copying takes place:
use strict;
use warnings;
use Benchmark qw/cmpthese/;
my @a1 = 1..10;
my @a2 = 1..100;
my @a3 = 1..1000;
my @a4 = 1..10000;
my @a5 = 1..100000;
my @a6 = 1..1000000;
cmpthese(10000, {
'a1' => sub { my @l = (0, @a1); },
'a2' => sub { my @l = (0, @a2); },
'a3' => sub { my @l = (0, @a3); },
'a4' => sub { my @l = (0, @a4); },
'a5' => sub { my @l = (0, @a5); },
'a6' => sub { my @l = (0, @a6); },
});
Results:
(warning: too few iterations for a reliable count)
Rate a6 a5 a4 a3 a2 a1
a6 17.6/s -- -90% -99% -100% -100% -100%
a5 185/s 952% -- -90% -99% -100% -100%
a4 1855/s 10438% 902% -- -90% -99% -100%
a3 17857/s 101332% 9545% 862% -- -91% -98%
a2 200000/s 1135940% 107920% 10680% 1020% -- -80%
a1 1000000/s 5680100% 540000% 53800% 5500% 400% --
Bonus question: If my assumptions are correct (i.e. 0, @strings creates a new list), does replacing map with reduce make sense?

doesn't 0, @strings create a new list
Not really. If you decompile the code, it's just one additional SVOP.
But you're measuring the wrong thing. The values are flattened and passed into the map or reduce subroutine in both cases!
The documentation is talking about what happens inside the subroutine. map creates a list with as many elements as there are input values and returns it, and then sum takes the list and condenses it into a single value. The return list is ephemeral and is not represented directly in the code. (This list passing is not that efficient; it could be made faster by using references.)
In contrast, in reduce there is no such return list. reduce only works on the input list of values and returns a single value.
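As an aside on that last parenthetical: a sub that takes an array reference puts only one scalar (the reference) on the stack instead of every element. A minimal sketch, with total_length as a made-up name for illustration:
use strict;
use warnings;

# Hypothetical helper: the caller passes \@strings, so only the
# reference crosses the sub boundary; no per-element flattening.
sub total_length {
    my ($aref) = @_;
    my $total = 0;
    $total += length for @$aref;
    return $total;
}

my @strings = qw(foo quux x);
print total_length(\@strings), "\n";   # prints 8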

"This produces a list of temporary integer values as long as the original list of strings" refers to map putting N scalars on the stack. The thing is, the reduce approach creates just as many scalars, and they also all go on the stack. The only difference is that the reduce approach only keeps one on them on the stack at once. That means the reduce approach uses less memory, but it doesn't speak to its performance at all. The reason it gives for reduce computing the same result more efficiently is nonsense.
There could be a performance difference, but not for that reason. If you want to find out which one performs better for you, you will need to run a benchmark.
That makes me think, doesn't 0, @strings create a new list
No. reduce creates a single list unconditionally. This is unrelated to the number of expressions in the argument list.
Lists aren't arrays. When we say "the sub returns a list" or "the op evaluates to a list", we actually mean "the sub or op places some quantity of scalars on the stack".
Lists are created for ops that will pop a variable number of scalars from the stack. This is done by simply pushing a mark onto the stack. For example, reduce { ... } 0, @a would create a list for the entersub op. { ... } will end up leaving one code ref on the list/stack, 0 will end up leaving a number on the list/stack, and @a will end up leaving its elements on the list/stack. One last thing is added to the list/stack before the sub is called: the glob *reduce.
Note that creating the list is effectively free, since it's simply pushing a mark on the stack. Placing an array on the stack is proportional to the number of its elements, but it's still quite cheap since we're only copying a block of pointers (in the C sense of the word).
That means there's effectively no performance difference between reduce { ... } @strings and reduce { ... } 0, @strings. Both create a single list, and both add roughly the same number of elements to the list/stack.
Exceptions:
for (@a) is optimized to be for* (\@a). This saves memory, and it saves time if the loop is exited prematurely.
sub f(\@); f(@a) is equivalent to &f(\@a) (a small demonstration follows after this list).
AFAIK, map and grep aren't optimized in this manner.
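Here is the promised demonstration of the prototype exception; the \@ prototype makes the compiler pass a reference instead of flattening the array:
use strict;
use warnings;

sub f (\@) {
    my ($aref) = @_;   # receives \@a, not a copy of the elements
    print "got ", scalar(@$aref), " elements by reference\n";
}

my @a = (1 .. 5);
f(@a);                 # compiled as &f(\@a)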
In detail:
$ perl -MO=Concise,-exec -MList::Util=reduce -e'reduce { ... } @a'
...
3 <0> pushmark s <-- Creates list (adds mark to the stack).
4 <$> anoncode[CV ] sRM <-- Adds CV to the stack.
5 <1> srefgen sKM/1 <-- Replaces CV with a ref to the CV.
6 <#> gv[*a] s <-- Places *a on the stack.
7 <1> rv2av[t4] lKM/1 <-- Replaces *a with the contents of @a.
8 <#> gv[*reduce] s <-- Places *reduce on the stack.
9 <1> entersub[t5] vKS/TARG <-- Will remove the entire list from the stack.
...
$ perl -MO=Concise,-exec -MList::Util=reduce -e'reduce { ... } 0, @a'
...
3 <0> pushmark s
4 <$> anoncode[CV ] sRM
5 <1> srefgen sKM/1
6 <$> const[IV 0] sM <-- The only difference.
7 <#> gv[*a] s
8 <1> rv2av[t4] lKM/1
9 <#> gv[*reduce] s
a <1> entersub[t5] vKS/TARG
...
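To verify on your own machine that the leading 0 itself costs nothing measurable, a minimal Benchmark sketch along these lines should report near-identical rates for the two argument forms (this compares the forms, not map vs reduce; exact numbers will vary):
use strict;
use warnings;
use List::Util qw(reduce);
use Benchmark qw(cmpthese);

my @nums = 1 .. 10_000;
cmpthese(-1, {
    bare   => sub { my $t = reduce { $a + $b } @nums },
    with_0 => sub { my $t = reduce { $a + $b } 0, @nums },
});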

The direct question can be answered directly by a benchmark
use strict;
use warnings;
use List::Util qw(sum reduce);
use Benchmark qw(cmpthese);
my @ary = 1..10_000;
sub by_reduce { my $res = reduce { $a + length $b } 0, @ary }
sub by_map { my $res = sum map { length } @ary }
cmpthese(-3, {
reduce => sub { by_reduce },
map => sub { by_map },
});
which prints on my v5.16 at hand
Rate map reduce
map 780/s -- -41%
reduce 1312/s 68% --
Thus reduce does something significantly better for this task.
As for the question of lists in general, it would have to depend on how the full list is used.
In your benchmark there is an assignment to a new array, so the data copy clearly must be done. Longer arrays then take proportionally longer, by about an order of magnitude per step, quite like the ratio of their sizes.
With list inputs for functions like map and reduce I don't see a reason for an additional data copy. This can be checked by a benchmark comparing identical operations:
my @ary = 1..10_000;
# benchmark:
my $r1 = sum map { length } @ary;
my $r2 = sum map { length } (1..5000, 5001..10_000);
The reported rates are nearly identical, for example, 780/s and 782/s, showing that the flattening of the ranges for map input doesn't involve a data copy. (The ranges are converted to arrays at compile time, thanks to ikegami for comments.)

Related

Merge/combine two lists line by line?

I have two lists stored in variables: $list1 and $list2, for example:
$list1:
a
b
c
d
$list2:
1
2
3
4
How do I merge them together line by line so that I end up with:
a1
b2
c3
d4
I have tried using an array (@) but it just combines them one after the other, not line by line, for example:
$list1 = @(command)
$list1 += @($list2)
If you prefer pipelining, you can also do it in one line:
0 .. ($list1.count -1) | ForEach-Object { $list1[$_]+$list2[$_] }
You could do this with a for loop that iterates through the index of each object until it reaches the total count (.count) of the first object:
$list1 = 'a','b','c','d'
$list2 = 1,2,3,4
For ($i=0; $i -lt $list1.count; $i++) {
$list1[$i]+$list2[$i]
}
Output:
a1
b2
c3
d4
If you want the results to go to a variable, you could put (for example) $list = before the For.
To complement Mark Wragg's helpful for-based answer and Martin Brandl's helpful pipeline-based answer:
Combining foreach with .., the range operator, allows for a concise solution that also performs well:
foreach ($i in 0..($list1.count-1)) { "$($list1[$i])$($list2[$i])" }
Even though an entire array of indices is constructed first - 0..($list1.count-1) - this slightly outperforms the for solution with large input lists, and both foreach and for will be noticeably faster than the pipeline-based solution - see below.
Also note how string interpolation (variable references and subexpressions inside a single "..." string) is used to ensure that the result is always a string.
By contrast, if you use +, it is the type of the LHS that determines the output type, which can result in errors or unwanted output; e.g., 1 + 'a' causes an error, because 1 is an integer and 'a' cannot be converted to an integer.
Optional reading: performance considerations
Generally, foreach and for solutions are noticeably faster than pipeline-based (ForEach-Object cmdlet-based) solutions.
Pipelines are elegant and concise, but they are comparatively slow.
That shouldn't stop you from using them, but it's important to be aware that they can be a performance bottleneck.
Pipelines are memory-efficient, and for processing large collections that don't fit into memory as a whole they are always the right tool to use.
PSv4 introduced the little-known .ForEach() collection operator (method), whose performance is in between that of for / foreach and the ForEach-Object cmdlet.
The following compares the relative performance with large lists (100,000 items); the absolute timing numbers will vary based on many factors, but they should give you a general sense:
# Define two large lists.
$list1 = 1..100000
$list2 = 1..100000
# Define the commands as script blocks:
$cmds = { foreach ($i in 0..($list1.count-1)) { "$($list1[$i])$($list2[$i])" } },
{ for ($i=0; $i -lt $list1.count; $i++) { "$($list1[$i])$($list2[$i])" } },
{ 0..($list1.count -1) | ForEach-Object { "$($list1[$_])$($list2[$_])" } },
{ (0..($list1.count-1)).ForEach({ "$($list1[$_])$($list2[$_])" }) }
# Time each command.
$cmds | ForEach-Object { '{0:0.0}' -f (Measure-Command $_).TotalSeconds }
In a 2-core Windows 10 VM running PSv5.1 I get the following results after running the tests several times:
0.5 # foreach
0.7 # for
1.8 # ForEach-Object (pipeline)
1.2 # .ForEach() operator

Sort a set of variables based on frequency from a Perl RegEx

I am attempting to use a table or array to list and sort the following
sequence of letters, or items, by what is 'eaten' first (as captured
by the Perl regex). These four lists are the exact same items, just
entered in a different order, in succession.
Input items: these letters represent an action or an input into the
client.
a b c d
b c d a
c d b a
d a b c
perl regex:
^(\w+) eats (a|an) (\w+)\.$
So matches[4] will be the item captured.
This regex will fire in the client with each set of letters (a, b, c,
d) entered separately. So four sets of a, b, c, d will be input in
succession, on a rotating order basis. The above regex will fire 16x
(once for each letter). I need to be able to sort it so that, if (a)
is eaten first every time, then that will have priority at the top
going down. But it might not always be (a); it could be any of the
letters that holds priority.
I need this priority list to be displayed in a Geyser console such as
PrioList= Geyser.MiniConsole:new({
name="PrioList",
x="70%", y="50%",
width="30%", height="50%",
})
I then need to be able to set each letter to a different priority list
or variable, because each separate letter will indicate a different
action that needs to be taken, so I will need to say
if (a == highestpriority) then
do action / function()
end
I am unsure of how to write the 'for' statement that will be able to
sort and list these items based off the 4 groups of letters. I figure
the list will have to be saved and reset after each sequence, then
somehow entered into a table or array and compared for the highest
priority. This is seriously beyond what I know how to script, but I
would definitely love to learn.
If I'm correctly understanding you, one option is to use 1) a hash to tally the frequency of the first letter entered, and 2) a dispatch table to associate each letter with a subroutine:
use strict;
use warnings;
use List::Util qw/shuffle/;
my %seen;
my %dispatchTable = (
a => \&a_priority,
b => \&b_priority,
c => \&c_priority,
d => \&d_priority
);
for my $i ( 1 .. 4 ) {
my @chars = shuffle qw/a b c d/;
print "Round $i: @chars\n";
$seen{ $chars[0] }++;
}
my $priority = ( sort { $seen{$b} <=> $seen{$a} } keys %seen )[0];
print "Priority: $priority\n";
$dispatchTable{$priority}->();
sub a_priority {
print "a priority sub called\n";
}
sub b_priority {
print "b priority sub called\n";
}
sub c_priority {
print "c priority sub called\n";
}
sub d_priority {
print "d priority sub called\n";
}
Sample run output:
Round 1: d c a b
Round 2: b a d c
Round 3: d b a c
Round 4: c d a b
Priority: d
d priority sub called
You said, "I need to be able to sort it so, if (a) is eaten first every time..." The above attempts to select the item with the highest frequency--not the item that was first all four times.
You'll need to decide what to do in cases where more than one letter shares the same frequency, but perhaps this will help provide some direction.
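For instance, a small addition on top of the %seen and %dispatchTable structures above could detect a tie before dispatching (how you break the tie is up to you):
my @ranked = sort { $seen{$b} <=> $seen{$a} } keys %seen;
if ( @ranked > 1 and $seen{ $ranked[0] } == $seen{ $ranked[1] } ) {
    print "Tie between $ranked[0] and $ranked[1]; no single priority\n";
}
else {
    $dispatchTable{ $ranked[0] }->();
}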

Perl Deleting element from array

I have a 2D array in Perl. I want to delete all elements which match the pattern <<< or >>>.
I have written Perl code that works well up to matching the pattern; however, it cannot delete the element, and an error occurs.
foreach my $x(@array)
{
foreach my $y(@$x)
{
if($y =~ (m/^(\<+)|(\>+)$/ig))
{
delete $y;
}
}
}
Can you help me to delete that particular element that matches the pattern. (I want to delete and remove from array, not undef it)
Let's say your array looks like this:
1 2 3 4
5 X 6 7
8 9 A B
You want to delete X. What do you want to happen? What should your new array look like after the delete?
Do you want this:
1 2 3 4
5 6 7
8 9 A B
Or this?
1 2 3 4
5 9 6 7
8 A B
That's the first thing you need to decide. Second, you can't use delete. The delete command deletes a keyed value from a hash and not an array. If you have an array like this:
my @array = qw(0 1 2 3 4 5 X 7 8 9);
And you want to delete the X (which is $array[6]), you'd use the splice command:
splice @array, 6, 1;
Finally, Perl does not have 2 dimensional arrays, so you can't delete a value from a 2 dimensional array.
What you have is an array of references to a second array. Think of it this way:
my @row0 = qw(1 2 3 4);
my @row1 = qw(5 X 6 7);
my @row2 = qw(8 9 A B);
my @two_d_array = (\@row0, \@row1, \@row2);
Or, I could do this by column:
my @col0 = qw(1 5 8);
my @col1 = qw(2 X 9);
my @col2 = qw(3 6 A);
my @col3 = qw(4 7 B);
my @two_d_array = (\@col0, \@col1, \@col2, \@col3);
When you talk about
if ( $two_d_array[1][1] eq "X" ) {
What is going on is that Perl is messing with your mind. It is making you think there's a two dimensional array involved, but it's not really there.
A more accurate way of writing this would be:
if ( ${ $two_d_array[1] }[1] eq "X" ) {
or, more cleanly:
if ( $two_d_array[1]->[1] eq "X" ) {
So first, decide what you mean by deleting a value. In a two dimensional array, if you actually delete that value, you end up ruining the dimensional structure of that array. Maybe you can replace the value at that point with an undef.
Once you do that, you must understand what you're actually dealing with: An array of references to arrays.
for my $array_reference ( @two_d_array ) {
for my $value ( @{ $array_reference } ) {
if ( $value =~ /^(<+|>+)$/ ) {
$value = undef; #See Note #1
}
}
}
Note #1: When you use a for loop, the loop variable is an alias to the actual value in the array. Therefore, when you change the loop variable, you're changing the actual value. That's why this will work.
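A tiny illustration of that aliasing:
my @nums = (1, 2, 3);
$_ *= 10 for @nums;      # each $_ aliases an element of @nums
print "@nums\n";         # prints: 10 20 30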
If you really, really want to delete the element using splice, you will have to decide if you want your elements moving up to replace the deleted value or moving to the left to replace the deleted value. If you want the values moving left, you want an array of references to row arrays. If you want the values moving up to fill in the deleted value, you want an array of references to column arrays.
Remember that computers will do exactly what you tell them to do and not what you want them to do. Make sure you understand exactly what you want.
You are applying delete to a scalar value, $y, and delete is only meant to be applied to hashes and arrays. You would need to do
for my $x (0 .. $#array) {
for my $y (0 .. $#{$array[$x]}) {
if (...) { delete $array[$x][$y]; }
}
}
The best solution, in my opinion, is to remove the value before storing it in the array. I am guessing you read it in from some data source such as a file, and that would be the best place to filter it out. E.g.
while (<$fh>) {
....
@values = grep !/^[<>]+/, @values; # filtering
push @array, \@values; # storing
}
On that note, you can also do it afterwards, of course, with something like:
for (@array) {
@$_ = grep !/^[<>]+/, @$_;
}
You can delete elements from arrays with the splice function:
splice(@array, $index, 1); # 1 in this example is the number of elements you want to delete
The delete function only sets the array value to undef.
delete does not alter array indices so it is not what you want. If you want to delete elements by value, use something like this:
foreach my $x(@array)
{
$x = [ grep { $_ !~ (m/^(\<+)|(\>+)$/ig)} @$x ];
print join(",", @$x), "\n";
}
or, use splice. But then you will need to iterate the array using indices rather than values.
Also see Perl-delete, Perl-splice.

Find words, that are substrings of other words efficiently

I have an Ispell list of English words (nearly 50,000 words), and my homework in Perl is to quickly (in under one minute) get a list of all strings that are substrings of some other word. I have tried a solution with two foreach loops comparing all words, but even with some optimizations, it is still too slow. I think the right solution could be some clever use of regular expressions on the array of words. Do you know how to solve this problem quickly (in Perl)?
I have found a fast solution, which can find all these substrings in about 15 seconds on my computer, using just one thread. Basically, for each word, I create an array of every possible substring (eliminating substrings which differ only in "s" or "'s" endings):
#take word and return list of all valid substrings
sub split_to_all_valid_subwords {
my $word = $_[0];
my @split_list;
my ($i, $j);
for ($i = 0; $i < length($word); ++$i){
for ($j = 1; $j <= length($word) - $i; ++$j){
unless
(
($j == length($word)) or
($word =~ m/s$/ and $i == 0 and $j == length($word) - 1) or
($word =~ m/\'s$/ and $i == 0 and $j == length($word) - 2)
)
{
push(@split_list, substr($word, $i, $j));
}
}
}
return @split_list;
}
Then I just create a list of all substring candidates and intersect it with the words:
my @substring_candidates;
foreach my $word (@words) {
push( @substring_candidates, split_to_all_valid_subwords($word));
}
#make intersection between substring candidates and words
my %substring_candidates=map{$_ =>1} @substring_candidates;
my %words=map{$_=>1} @words;
my @substrings = grep( $substring_candidates{$_}, @words );
Now @substrings holds all words that are substrings of some other word.
Perl regular expressions will optimize patterns like foo|bar|baz into an Aho-Corasick match - up to a certain limit of total compiled regex length. Your 50000 words will probably exceed that length, but could be broken into smaller groups. (Indeed, you probably want to break them up by length and only check words of length N for containing words of length 1 through N-1.)
Alternatively, you could just implement Aho-Corasick in your perl code - that's kind of fun to do.
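To sketch what that length-grouping might look like, here is a rough, untested illustration of the idea above, not a tuned solution; the zero-width lookahead is there so overlapping matches are not skipped:
use strict;
use warnings;

my @words = map { chomp; $_ } <>;   # one word per line

# Bucket the words by length.
my %by_len;
push @{ $by_len{ length $_ } }, $_ for @words;

my %found;   # words that occur inside some longer word
my @lengths = sort { $a <=> $b } keys %by_len;

for my $len (@lengths) {
    # One alternation per length group; perl compiles literal
    # alternations into a trie up to an internal size limit, so
    # very large groups may still need chunking.
    my $alt = join '|', map quotemeta, @{ $by_len{$len} };
    my $re  = qr/(?=($alt))/;       # zero-width: finds overlaps too
    for my $longer ( grep { $_ > $len } @lengths ) {
        for my $w ( @{ $by_len{$longer} } ) {
            $found{$1} = 1 while $w =~ /$re/g;
        }
    }
}
print "$_\n" for sort keys %found;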
update
Ondra supplied a beautiful solution in his answer; I leave my post here as an example of overthinking a problem and failed optimisation techniques.
My worst case kicks in for a word that doesn't match any other word in the input. In that case, it goes quadratic. The OPT_PRESORT was an attempt to avert the worst case for most words. The OPT_CONSECUTIVE was a linear-complexity filter that reduced the total number of items in the main part of the algorithm, but it is just a constant factor when considering the complexity. However, it is still useful with Ondra's algorithm, and saves a few seconds, as building his split list is more expensive than comparing two consecutive words.
I updated the code below to select Ondra's algorithm as a possible optimisation. Paired with zero threads and the presort optimisation, it yields maximum performance.
I would like to share a solution I coded. Given an input file, it outputs all those words that are a substring of any other word in the same input file. Therefore, it computes the opposite of ysth's ideas, but I took the idea of optimisation #2 from his answer. There are the following three main optimisations that can be deactivated if required.
Multithreading
The questions "Is word A in list L? Is word B in L?" can be easily parallelised.
Pre-sorting all the words for their length
I create an array that points to the list of all words that are longer than a certain length, for every possible length. For long words, this can cut down the number of possible words dramatically, but it trades quite a lot of space, as one word of length n appears in all lists from length 1 to length n.
Testing consecutive words
In my /usr/share/dict/words, most consecutive lines look quite similar:
Abby
Abby's
for example. As every word that would match the first word also matches the second one, I immediately add the first word to the list of matching words, and only keep the second word for further testing. This saved about 30% of words in my test cases. Because I do that before optimisation No 2, this also saves a lot of space. Another trade-off is that the output will not be sorted.
The script itself is ~120 lines long; I explain each sub before showing it.
head
This is just a standard script header for multithreading. Oh, and you need perl 5.10 or better to run this. The configuration constants define the optimisation behaviour. Add the number of processors of your machine in that field. The OPT_MAX variable can take the number of words you want to process, however this is evaluated after the optimisations have taken place, so the easy words will already have been caught by the OPT_CONSECUTIVE optimisation. Adding anything there will make the script seemingly slower. $|++ makes sure that the status updates are shown immediately. I exit after the main is executed.
#!/usr/bin/perl
use strict; use warnings; use feature qw(say); use threads;
$|=1;
use constant PROCESSORS => 0; # (false, n) number of threads
use constant OPT_MAX => 0; # (false, n) number of words to check
use constant OPT_PRESORT => 0; # (true / false) sorts words by length
use constant OPT_CONSECUTIVE => 1; # (true / false) prefilter data while loading
use constant OPT_ONDRA => 1; # select the awesome Ondra algorithm
use constant BLABBER_AT => 10; # (false, n) print progress at n percent
die q(The optimisations Ondra and Presort are mutually exclusive.)
if OPT_PRESORT and OPT_ONDRA;
exit main();
main
Encapsulates the main logic, and does multi-threading. The number of matched words will be considerably smaller than the number of input words, if the input was sorted. After I have selected all matched words, I print them to STDOUT. All status updates etc. are printed to STDERR, so that they don't interfere with the output.
sub main {
my @matching; # the matching words.
my @words = load_words(\@matching); # the words to be searched
say STDERR 0+@words . " words to be matched";
my $prepared_words = prepare_words(@words);
# do the matching, possibly multithreading
if (PROCESSORS) {
my @threads =
map {threads->new(
\&test_range,
$prepared_words,
@words[$$_[0] .. $$_[1]] )
} divide(PROCESSORS, OPT_MAX || 0+@words);
push @matching, $_->join for @threads;
} else {
push @matching, test_range(
$prepared_words,
@words[0 .. (OPT_MAX || 0+@words)-1]);
}
say STDERR 0+@matching . " words matched";
say for @matching; # print out the matching words.
0;
}
load_words
This reads all the words from the input files which were supplied as command line arguments. Here the OPT_CONSECUTIVE optimisation takes place. The $last word is either put into the list of matching words, or into the list of words to be matched later. The -1 != index($a, $b) decides if the word $b is a substring of word $a.
sub load_words {
my $matching = shift;
my @words;
if (OPT_CONSECUTIVE) {
my $last;
while (<>) {
chomp;
if (defined $last) {
push @{-1 != index($_, $last) ? $matching : \@words}, $last;
}
$last = $_;
}
push @words, $last // ();
} else {
@words = map {chomp; $_} <>;
}
@words;
}
prepare_words
This "blows up" the input words, sorting them after their length into each slot, that has the words of larger or equal length. Therefore, slot 1 will contain all words. If this optimisation is deselected, it is a no-op and passes the input list right through.
sub prepare_words {
if (OPT_ONDRA) {
my $ondra_split = sub { # evil: using $_ as implicit argument
my @split_list;
for my $i (0 .. length $_) {
for my $j (1 .. length($_) - ($i || 1)) {
push @split_list, substr $_, $i, $j;
}
}
@split_list;
};
return +{map {$_ => 1} map &$ondra_split(), @_};
} elsif (OPT_PRESORT) {
my @prepared = ([]);
for my $w (@_) {
push @{$prepared[$_]}, $w for 1 .. length $w;
}
return \@prepared;
} else {
return [@_];
}
}
test
This tests if the word $w is a substring in any of the other words. $wbl points to the data structure that was created by the previous sub: Either a flat list of words, or the words sorted by length. The appropriate algorithm is then selected. Nearly all of the running time is spent in this loop. Using index is much faster than using a regex.
sub test {
my ($w, $wbl) = @_;
my $l = length $w;
if (OPT_PRESORT) {
for my $try (@{$$wbl[$l + 1]}) {
return 1 if -1 != index $try, $w;
}
} else {
for my $try (@$wbl) {
return 1 if $w ne $try and -1 != index $try, $w;
}
}
return 0;
}
divide
This just encapsulates an algorithm that guarantees a fair distribution of $items items into $parcels buckets. It outputs the bounds of a range of items.
sub divide {
my ($parcels, $items) = @_;
say STDERR "dividing $items items into $parcels parcels.";
my ($min_size, $rest) = (int($items / $parcels), $items % $parcels);
my @distributions =
map [
$_ * $min_size + ($_ < $rest ? $_ : $rest),
($_ + 1) * $min_size + ($_ < $rest ? $_ : $rest - 1)
], 0 .. $parcels - 1;
say STDERR "range division: @$_" for @distributions;
return @distributions;
}
test_range
This calls test for each word in the input list, and is the sub that is multithreaded. grep selects all those elements in the input list where the code (given as the first argument) returns true. It also regularly outputs a status message like thread 2 at 10%, which makes waiting for completion much easier. This is a psychological optimisation ;-).
sub test_range {
my $wbl = shift;
if (BLABBER_AT) {
my $range = @_;
my $step = int($range / 100 * BLABBER_AT) || 1;
my $i = 0;
return
grep {
if (0 == ++$i % $step) {
printf STDERR "... thread %d at %2d%%\n",
threads->tid,
$i / $step * BLABBER_AT;
}
OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)
} @_;
} else {
return grep {OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)} @_;
}
}
invocation
Using bash, I invoked the script like
$ time (head -n 1000 /usr/share/dict/words | perl script.pl >/dev/null)
Where 1000 is the number of lines I wanted to input, dict/words was the word list I used, and /dev/null is the place I want to store the output list, in this case, throwing the output away. If the whole file should be read, it can be passed as an argument, like
$ perl script.pl input-file >output-file
time just tells us how long the script ran. Using 2 slow processors and 50000 words, it executed in just over two minutes in my case, which is actually quite good.
update: more like 6–7 seconds now, with the Ondra + Presort optimisation, and no threading.
further optimisations
update: overcome by better algorithm. This section is no longer completely valid.
The multithreading is awful. It allocates quite some memory and isn't exactly fast. This isn't surprising considering the amount of data. I considered using a Thread::Queue, but that thing is slow like $#*! and therefore is a complete no-go.
If the inner loop in test was coded in a lower-level language, some performance might be gained, as the index built-in wouldn't have to be called. If you can code C, take a look at the Inline::C module. If the whole script were coded in a lower-level language, array access would also be faster. A language like Java would also make the multithreading less painful (and less expensive).

What's the point of Perl's map?

Not really getting the point of the map function. Can anyone explain its use with examples?
Are there any performance benefits to using this instead of a loop or is it just sugar?
Any time you want to generate a list based on another list:
# Double all elements of a list
my @double = map { $_ * 2 } (1,2,3,4,5);
# @double = (2,4,6,8,10);
Since lists are easily converted pairwise into hashes, if you want a hash table for objects based on a particular attribute:
# @user_objects is a list of objects having a unique_id() method
my %users = map { $_->unique_id() => $_ } @user_objects;
# %users = ( $id => $obj, $id => $obj, ...);
It's a really general-purpose tool; you just have to start using it to find good uses in your applications.
Some might prefer verbose looping code for readability purposes, but personally, I find map more readable.
First of all, it's a simple way of transforming an array: rather than saying e.g.
my @raw_values = (...);
my @derived_values;
for my $value (@raw_values) {
push (@derived_values, _derived_value($value));
}
you can say
my @raw_values = (...);
my @derived_values = map { _derived_value($_) } @raw_values;
It's also useful for building up a quick lookup table: rather than e.g.
my $sentence = "...";
my @stopwords = (...);
my @foundstopwords;
for my $word (split(/\s+/, $sentence)) {
for my $stopword (@stopwords) {
if ($word eq $stopword) {
push (@foundstopwords, $word);
}
}
}
you could say
my $sentence = "...";
my @stopwords = (...);
my %is_stopword = map { $_ => 1 } @stopwords;
my @foundstopwords = grep { $is_stopword{$_} } split(/\s+/, $sentence);
It's also useful if you want to derive one list from another, but don't particularly need to have a temporary variable cluttering up the place, e.g. rather than
my %params = ( username => '...', password => '...', action => $action );
my @parampairs;
for my $param (keys %params) {
push (@parampairs, $param . '=' . CGI::escape($params{$param}));
}
my $url = $ENV{SCRIPT_NAME} . '?' . join('&', @parampairs);
you say the much simpler
my %params = ( username => '...', password => '...', action => $action );
my $url = $ENV{SCRIPT_NAME} . '?'
. join('&', map { $_ . '=' . CGI::escape($params{$_}) } keys %params);
(Edit: fixed the missing "keys %params" in that last line)
The map function is used to transform lists. It's basically syntactic sugar for replacing certain types of for[each] loops. Once you wrap your head around it, you'll see uses for it everywhere:
my @uppercase = map { uc } @lowercase;
my @hex = map { sprintf "0x%x", $_ } @decimal;
my %hash = map { $_ => 1 } @array;
sub join_csv { join ',', map {'"' . $_ . '"' } @_ }
See also the Schwartzian transform for advanced usage of map.
It's also handy for making lookup hashes:
my %is_boolean = map { $_ => 1 } qw(true false);
is equivalent to
my %is_boolean = ( true => 1, false => 1 );
There's not much savings there, but suppose you wanted to define %is_US_state?
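With map, the 50-state version stays a one-liner (abbreviated here):
my %is_US_state = map { $_ => 1 } qw(AL AK AZ AR CA CO CT DE FL GA);  # ...and the other 40
print "yes\n" if $is_US_state{CA};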
map is used to create a list by transforming the elements of another list.
grep is used to create a list by filtering elements of another list.
sort is used to create a list by sorting the elements of another list.
Each of these operators receives a code block (or an expression) which is used to transform, filter or compare elements of the list.
For map, the result of the block becomes one (or more) element(s) in the new list. The current element is aliased to $_.
For grep, the boolean result of the block decides if the element of the original list will be copied into the new list. The current element is aliased to $_.
For sort, the block receives two elements (aliased to $a and $b) and is expected to return one of -1, 0 or 1, indicating whether $a is less than, equal to, or greater than $b.
The Schwartzian Transform uses these operators to efficiently cache values (properties) to be used in sorting a list, especially when computing these properties has a non-trivial cost.
It works by creating an intermediate array which has as elements array references with the original element and the computed value by which we want to sort. This array is passed to sort, which compares the already computed values, creating another intermediate array (this one is sorted) which in turn is passed to another map which throws away the cached values, thus restoring the array to its initial list elements (but in the desired order now).
Example (creates a list of files in the current directory sorted by the time of their last modification):
@file_list = glob('*');
@file_modify_times = map { [ $_, (stat($_))[9] ] } @file_list;
@files_sorted_by_mtime = sort { $a->[1] <=> $b->[1] } @file_modify_times;
@sorted_files = map { $_->[0] } @files_sorted_by_mtime;
By chaining the operators together, no declaration of variables is needed for the intermediate arrays:
@sorted_files = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, (stat($_))[9] ] } glob('*');
You can also filter the list before sorting by inserting a grep (if you want to filter on the same cached value):
Example (a list of the files modified in the last 24 hours, sorted by the last modification time):
@sorted_files = map { $_->[0] } sort { $a->[1] <=> $b->[1] } grep { $_->[1] > (time - 24 * 3600) } map { [ $_, (stat($_))[9] ] } glob('*');
The map function is an idea from the functional programming paradigm. In functional programming, functions are first-class objects, meaning that they can be passed as arguments to other functions. Map is a simple but very useful example of this. It takes as its arguments a function (let's call it f) and a list l. f has to be a function taking one argument, and map simply applies f to every element of the list l. f can do whatever you need done to every element: add one to every element, square every element, write every element to a database, or open a web browser window for every element that happens to be a valid URL.
The advantage of using map is that it nicely encapsulates iterating over the elements of the list. All you have to do is say "do f to every element", and it is up to map to decide how best to do that. For example, map may be implemented to split up its work among multiple threads, and it would be totally transparent to the caller.
Note, that map is not at all specific to Perl. It is a standard technique used by functional languages. It can even be implemented in C using function pointers, or in C++ using "function objects".
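To make the first-class-function point concrete in Perl itself, here is a hedged sketch of a home-grown map that takes a code reference (my_map is a made-up name; the built-in map is of course preferable in real code):
use strict;
use warnings;

# Apply the code ref $f to every element and collect the results.
sub my_map {
    my ($f, @list) = @_;
    my @results;
    push @results, $f->($_) for @list;
    return @results;
}

my @squares = my_map( sub { $_[0] ** 2 }, 1 .. 5 );
print "@squares\n";   # 1 4 9 16 25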
The map function runs an expression on each element of a list, and returns the list of results. Let's say I had the following list
@names = ("andrew", "bob", "carol" );
and I wanted to capitalize the first letter of each of these names. I could loop through them and call ucfirst on each element, or I could just do the following
@names = map (ucfirst, @names);
"Just sugar" is harsh. Remember, a loop is just sugar -- if's and goto can do everything loop constructs do and more.
Map is a high enough level function that it helps you hold much more complex operations in your head, so you can code and debug bigger problems.
To paraphrase "Effective Perl Programming" by Hall & Schwartz,
map can be abused, but I think that it's best used to create a new list from an existing list.
Create a list of the squares of 3,2, & 1:
@numbers = (3,2,1);
@squares = map { $_ ** 2 } @numbers;
Generate password:
$ perl -E'say map {chr(32 + 95 * rand)} 1..16'
# -> j'k=$^o7\l'yi28G
You use map to transform a list and assign the results to another list, grep to filter a list and assign the results to another list. The "other" list can be the same variable as the list you are transforming/filtering.
my @array = ( 1..5 );
@array = map { $_+5 } @array;
print "@array\n";
@array = grep { $_ < 7 } @array;
print "@array\n";
It allows you to transform a list as an expression rather than in statements. Imagine a hash of soldiers defined like so:
{ name => 'John Smith'
, rank => 'Lieutenant'
, serial_number => '382-293937-20'
};
then you can operate on the list of names separately.
For example,
map { $_->{name} } values %soldiers
is an expression. It can go anywhere an expression is allowed--except you can't assign to it.
${[ sort map { $_->{name} } values %soldiers ]}[-1]
indexes the array, taking the max.
my %soldiers_by_sn = map { $_->{serial_number} => $_ } values %soldiers;
I find that one of the advantages of operational expressions is that it cuts down on the bugs that come from temporary variables.
If Mr. McCoy wants to filter out all the Hatfields for consideration, you can add that check with minimal coding.
my %soldiers_by_sn
= map { $_->{serial_number}, $_ }
grep { $_->{name} !~ m/Hatfield$/ }
values %soldiers
;
I can continue chaining these expression so that if my interaction with this data has to reach deep for a particular purpose, I don't have to write a lot of code that pretends I'm going to do a lot more.
It's used anytime you would like to create a new list from an existing list.
For instance you could map a parsing function on a list of strings to convert them to integers.
As others have said, map creates lists from lists. Think of "mapping" the contents of one list into another. Here's some code from a CGI program to take a list of patent numbers and print hyperlinks to the patent applications:
my @patents = ('7,120,721', '6,809,505', '7,194,673');
print join(", ", map { "$_" } @patents);
As others have said, map is most useful for transforming a list. What hasn't been mentioned is the difference between map and an "equivalent" for loop.
One difference is that for doesn't work well for an expression that modifies the list it's iterating over. One of these terminates, and the other doesn't:
perl -e '@x=("x"); map { push @x, $_ } @x'
perl -e '@x=("x"); push @x, $_ for @x'
Another small difference is that the context inside the map block is a list context, but the for loop imparts a void context.
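That list context also means a map block may return zero or more elements per input, something a for loop cannot do as an expression; for example:
my @pairs = map { ($_, $_ * 2) } 1 .. 3;     # each block returns two elements
print "@pairs\n";                            # 1 2 2 4 3 6
my @odds  = map { $_ % 2 ? $_ : () } 1 .. 6; # returning () drops an element
print "@odds\n";                             # 1 3 5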