How to use regular expression to find keys in hash - regex

I have 6 million hashes and need to count how many of them have keys that start with AA00, how many have keys that start with AB10, and how many have keys starting with both strings.
For each hash I have done this:
if (exists $hash{AA00}) {
$AA00 +=1;
}
if (exists $hash{AB10}) {
$AB10 += 1;
}
if (exists $hash{AA00} and exists $hash{AB10}) {
$both += 1;
}
But this counts only the hashes that contain exactly AA00 or AB10 as keys; I would also like to count hashes that contain, say, AA001. Can I use a regular expression for this?

I completely misunderstood your question. To find the number of hashes with keys matching a regex (as opposed to the number of keys matching a regex in a single hash), you can still use the grep approach I outlined in my earlier answer. This time, however, you need to loop through your hashes (I assume you're storing them in an array if you have 6 million of them) and run grep twice on each one:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my @array = (
{ AA00 => 'foo' },
{ AB10 => 'bar' },
{ AA001 => 'foo' },
{ AA00 => 'foo', AB10 => 'bar' }
);
my ($hashes_with_aa00, $hashes_with_ab10, $hashes_with_both) = (0, 0, 0);
foreach my $hash (@array) {
my $aa_count = grep { /^AA00/ } keys %$hash;
my $ab_count = grep { /^AB10/ } keys %$hash;
$hashes_with_aa00++ if $aa_count;
$hashes_with_ab10++ if $ab_count;
$hashes_with_both++ if $aa_count and $ab_count;
}
say "AA00: $hashes_with_aa00";
say "AB10: $hashes_with_ab10";
say "Both: $hashes_with_both";
Output:
AA00: 3
AB10: 2
Both: 1
This works, but is pretty poor in terms of performance: grep loops through every element in the list of keys for each hash, and we're calling it twice per hash!
Since we don't care how many keys match in each hash, only whether there is a match, a better solution would be any from List::MoreUtils. any works much like grep but returns as soon as it finds a match. To use any instead of grep, change this:
foreach my $hash (@array) {
my $aa_count = grep { /^AA00/ } keys %$hash;
my $ab_count = grep { /^AB10/ } keys %$hash;
$hashes_with_aa00++ if $aa_count;
$hashes_with_ab10++ if $ab_count;
$hashes_with_both++ if $aa_count and $ab_count;
}
to this:
use List::MoreUtils 'any';
foreach my $hash (@array) {
my $aa_exists = any { /^AA00/ } keys %$hash;
my $ab_exists = any { /^AB10/ } keys %$hash;
$hashes_with_aa00++ if $aa_exists;
$hashes_with_ab10++ if $ab_exists;
$hashes_with_both++ if $aa_exists and $ab_exists;
}
Note that I changed the variable names to better reflect their meaning.
This is much better in terms of performance, but as Borodin notes in a comment on your question, you're losing the speed advantage of hashes by not accessing them with specific keys. You might want to change your data structure accordingly.
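For comparison, here is a minimal sketch of the same short-circuit idea, written in Python purely for illustration. The helper name `count_prefix_hits` and the sample data are assumptions, not part of the original code. Python's built-in `any` stops at the first matching key, much like `any` from List::MoreUtils.

```python
def count_prefix_hits(hashes, first="AA00", second="AB10"):
    """Count dicts with a key starting with each prefix, and dicts with both."""
    with_first = with_second = with_both = 0
    for h in hashes:
        # any() returns as soon as one key matches, so a hit avoids a full scan.
        has_first = any(key.startswith(first) for key in h)
        has_second = any(key.startswith(second) for key in h)
        with_first += has_first
        with_second += has_second
        with_both += has_first and has_second
    return with_first, with_second, with_both


# Hypothetical sample data mirroring the four hashes used above.
sample = [
    {"AA00": "foo"},
    {"AB10": "bar"},
    {"AA001": "foo"},
    {"AA00": "foo", "AB10": "bar"},
]
print(count_prefix_hits(sample))  # (3, 2, 1)
```

A hash with no matching key still costs a full pass over its keys, so the data-structure concern above applies here as well.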
Original Answer: Counting keys that match a regex in a single hash
This is my original answer based on a misunderstanding of your question. I'm leaving it up because I think it could be useful for similar situations.
To count the number of keys that match a regex in a single hash, you can use grep:
my $aa_count = grep { /^AA00/ } keys %hash;
my $ab_count = grep { /^AB10/ } keys %hash;
my $both = $aa_count + $ab_count;
As HunterMcMillen points out in the comments, there's no need to search through the hash keys again to get the total count; in this case, you can simply add the two subtotals. You can get away with this because the two patterns you're searching for are mutually exclusive; in other words, you cannot have a key that both begins with AA00 and AB10.
In the more general case, it might be possible for a single key to match both patterns (thanks Borodin). In that case, you cannot simply add up the two subtotals. For example, if you wanted your keys to merely contain AA00 or AB10 anywhere in the string, not necessarily at the beginning, you would need to do something like this:
my $aa_count = grep { /AA00/ } keys %hash;
my $ab_count = grep { /AB10/ } keys %hash;
my $both = grep { /(?:AA00|AB10)/ } keys %hash;
Note that this calls grep multiple times, which means traversing the entire hash multiple times. This could be done more efficiently using a single for loop like FlyingFrog and Kenosis did.

Related

Perl anchored regex performance

Problem and Data
At the bottom of this post is the entire script from which this NYTProf data was generated. The script builds a hash and then attempts to delete keys that contain certain bad pattern. Running the code through NYTProf generates the following
delete @$hash{ grep { /\Q$bad_pattern\E/ } sort keys %$hash };
# spent 7.29ms making 2 calls to main::CORE:sort, avg 3.64ms/call
# spent 808µs making 7552 calls to main::CORE:match, avg 107ns/call
# spent 806µs making 7552 calls to main::CORE:regcomp, avg 107ns/call
There are over 7,000 calls to each of main::CORE:match and main::CORE:regcomp. The assumption is that this is enough calls to average out the noise.
Moving on! The bad patterns only need to be deleted if they appear at the beginning of a key. Sounds great! Adding a ^ to anchor the regex should improve performance. However, NYTProf generates the following. NYTProf has been run many times and the result is quite consistent:
delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
# spent 7.34ms making 2 calls to main::CORE:sort, avg 3.67ms/call
# spent 1.62ms making 7552 calls to main::CORE:regcomp, avg 214ns/call
# spent 723µs making 7552 calls to main::CORE:match, avg 96ns/call
Questions
The anchored regex nearly doubles the amount of time spent in these main::CORE:* methods. But an anchored regex should improve performance. What is unique about this dataset that causes the anchored regex to take so much additional time?
Entire Script
use strict;
use Devel::NYTProf;
my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);
my $hash;
for my $state (@states) {
for my $city (@cities) {
for my $street (@streets) {
for my $season (@seasoncode) {
for my $history (@historycode) {
for my $side (@sides) {
$hash->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
}
}
}
}
}
}
sub CleanseHash {
my @bad_patterns = (
'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
);
for my $bad_pattern (@bad_patterns) {
delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
}
}
DB::enable_profile();
CleanseHash();
DB::finish_profile();
It's very unlikely you can optimise the regex engine. If performance is your goal, though, you can concentrate on other parts of the code. For example, try this:
for my $bad_pattern (@bad_patterns) {
my $re = qr/^\Q$bad_pattern\E/;
delete @$hash{ grep /$re/, sort keys %$hash };
}
On my machine, it runs much faster (regardless of the presence of the anchor), because the expression form of grep doesn't have to create a scope and the complex compilation of the regex happens just once for each bad pattern.
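The "compile once per pattern" idea is not specific to Perl. As a rough sketch in Python (the function name `cleanse` and the tiny sample table are made up for illustration), the literal prefix is escaped and compiled once outside the per-key loop, which is the role `qr//` plays above:

```python
import re


def cleanse(table, bad_prefixes):
    """Delete every key that starts with one of the literal bad prefixes."""
    for prefix in bad_prefixes:
        # Escape and compile once per prefix, not once per key.
        pattern = re.compile(re.escape(prefix))
        # pattern.match() only tries position 0, so it behaves like an anchored match.
        doomed = [key for key in table if pattern.match(key)]
        for key in doomed:
            del table[key]
    return table


# Hypothetical data: only the first two keys start with the bad prefix.
data = {"KansasCity[0].A": 1, "KansasCity[0].B": 1, "Idaho[0].KansasCity[0].A": 1}
cleanse(data, ["KansasCity[0]"])
print(sorted(data))  # ['Idaho[0].KansasCity[0].A']
```

For a fixed prefix like this, `str.startswith` would do the same job without a regex at all; the regex is kept only to mirror the Perl code.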
This is fairly straightforward matching, with the pattern being a fixed string, so the anchored pattern should be faster in general. The profiling confirms that much for the match itself, with 96 ns/call vs 107 ns/call.
But when I benchmark the anchored and un-anchored versions of the code they run neck and neck. This is down to the rest of the code, which overwhelms the regex match: sorting the keys is not needed for the comparison, and the regex is needlessly being compiled inside grep's loop.
Once that is removed I do get the anchored call to be 11-15% faster (over multiple runs):
use warnings;
use strict;
use feature 'say';
use Data::Dump;
use Storable qw(dclone);
use Benchmark qw(cmpthese);
my $runfor = shift // 3;
my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);
my @bad_patterns = (
'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
);
my $hash_1;
for my $state (@states) {
for my $city (@cities) {
for my $street (@streets) {
for my $season (@seasoncode) {
for my $history (@historycode) {
for my $side (@sides) {
$hash_1->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
}
}
}
}
}
}
my $hash_2 = dclone $hash_1;
#say for @bad_patterns; say '---'; dd $hash_1; exit;
sub no_anchor {
for my $bad_pattern (@bad_patterns) {
my $re = qr/\Q$bad_pattern\E/;
delete @$hash_2{ grep { /$re/ } keys %$hash_2 };
}
}
sub w_anchor {
for my $bad_pattern (@bad_patterns) {
my $re = qr/^\Q$bad_pattern\E/;
delete @$hash_1{ grep { /$re/ } keys %$hash_1 };
}
}
cmpthese( -$runfor, {
'no_anchor' => sub { no_anchor() },
'w_anchor' => sub { w_anchor() },
});
I have the comparison subs use external data (rather than passing it to the tested subs, as one usually would) to cut out any extra work, and I use separate hashref copies obtained with Storable::dclone.
The output of the benchmark above when run for 10 seconds (pass 10 to the program):
Rate no_anchor w_anchor
no_anchor 296/s -- -13%
w_anchor 341/s 15% --
So the anchored version does win, albeit by a modest margin. With this data the match fails in about 96% of cases, and in all of those the un-anchored version does more work, having to search through the whole string; I'd expect a larger difference.
The relative closeness of the runtimes is due to the rest of the code (grep, hash manipulation, loop), and in particular the regex compilation cost, being included in the timing, which dilutes the difference in the matching efficiency itself.
This teaches an important lesson about timing code: it can be subtle. One needs to ensure that only the relevant sections are compared, and fairly (in equal situations).

Regular expression is too complex error in tcl

I have not seen this error for a small list. Issue popped up when the list went >10k. Is there any limit on the number of regex patterns in tcl?
puts "#LEVELSHIFTER_TEMPLATES_LIMITSFILE:$perc_limit(levelshifter_templates)"
puts "#length of templates is :[llength $perc_limit(levelshifter_templates)]"
if { [regexp [join $perc_limit(levelshifter_templates) |] $temp] }
#LEVELSHIFTER_TEMPLATES_LIMITSFILE:HDPELT06_LVLDBUF_CAQDP_1 HDPELT06_LVLDBUF_CAQDPNRBY2_1 HDPELT06_LVLDBUF_CAQDP_1....
#length of templates is :13520
ERROR: couldn't compile regular expression pattern: regular expression is too complex
If $temp is a single word and you're really just doing a literal test, you should invert the check. One of the easiest ways might be:
if {$temp in $perc_limit(levelshifter_templates)} {
# ...
}
But if you're doing that a lot (well, more than a small number of times, 3 or 4 say) then building a dictionary for this might be best:
# A one-off cost
foreach key $perc_limit(levelshifter_templates) {
# Value is arbitrary
dict set perc_limit_keys $key 1
}
# This is now very cheap
if {[dict exists $perc_limit_keys $temp]} {
# ...
}
If you've got multiple words in $temp, split and check (using the second technique, which is now definitely worthwhile). This is where having a helper procedure can be a good plan.
proc anyWordIn {inputString keyDictionary} {
foreach word [split $inputString] {
if {[dict exists $keyDictionary $word]} {
return true
}
}
return false
}
if {[anyWordIn $temp $perc_limit_keys]} {
# ...
}
Assuming you want to see if the value in temp is an exact match for one of the elements of the list in perc_limit(levelshifter_templates), here are a few ways that are better than trying to use regular expressions:
Using lsearch:
# Sort the list after populating it so we can do an efficient binary search
set perc_limit(levelshifter_templates) [lsort $perc_limit(levelshifter_templates)]
# ...
# See if the value in temp exists in the list
if {[lsearch -sorted $perc_limit(levelshifter_templates) $temp] >= 0} {
# ...
}
Storing the elements of the list in a dict (or array if you prefer) ahead of time for an O(1) lookup:
foreach item $perc_limit(levelshifter_templates) {
dict set lookup $item 1
}
# ...
if {[dict exists $lookup $temp]} {
# ...
}
I found a simple workaround for this problem by using a foreach statement to loop over all the regexes in the list instead of joining them and searching, which failed for a super-long list.
foreach pattern $perc_limit(levelshifter_templates) {
if { [regexp $pattern $temp]} {
#puts "$fullpath: [is_std_cell_dev $dev]"
puts "##matches: $pattern return 0"
return 0
}
}

regular expression help: catch this: |TrxId=475665|

For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
after TrxId= it could be any numbers and any amount of them, so regex should catch as well:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|
TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.
The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.
An example solution in Python:
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'
/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
for example in perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
will print TrxId=475665
You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my @first_array = split(/\|/,$input); #splitting $input on "|"
#Note: although the last character of $input is "|", split drops
#trailing empty fields, so no empty last element is produced.
#The filter below is just a safeguard against undefined elements.
@first_array = grep{defined}@first_array;
#Also filter out elements that do not have an equals sign appearing.
@first_array = grep{/=/}@first_array;
#Now, put these elements into an associative array:
my %assoc_array;
foreach(@first_array)
{
if(/^([^=]+)=(.+)$/)
{
$assoc_array{$1} = $2;
}
else
{
#Something weird may be happening...
#we may have an element starting with "=" for example.
#Do what you want: throw a warning, die, silently move on, etc.
}
}
if(exists $assoc_array{TrxId})
{
print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.
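The repeated-key variant mentioned above can be sketched as follows. This is an illustrative Python version, not the Perl modification itself; the function name `parse_fields` and the shortened message with a repeated TrxId are assumptions made for the example.

```python
from collections import defaultdict


def parse_fields(message):
    """Split a 'key=value|key=value|' message into {key: [values, ...]}."""
    fields = defaultdict(list)
    for part in message.split("|"):
        key, sep, value = part.partition("=")
        if sep and key:  # skip empty pieces and pieces without a key
            fields[key].append(value)
    return dict(fields)


# Hypothetical message in which TrxId appears twice.
msg = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|TrxId=123|"
parsed = parse_fields(msg)
print(parsed["TrxId"])  # ['475665', '123']
```

Unlike the Perl code above, this sketch keeps fields with an empty value (such as `WepNr=`) and stores an empty string for them.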

Find words, that are substrings of other words efficiently

I have an Ispell list of English words (nearly 50,000 words). My homework in Perl is to quickly (in under one minute) produce a list of all strings that are substrings of some other word. I have tried a solution with two foreach loops comparing all words, but even with some optimizations it's still too slow. I think the right solution could be some clever use of regular expressions on the array of words. Do you know how to solve this problem quickly (in Perl)?
I have found a fast solution, which can find all these substrings in about 15 seconds on my computer, using just one thread. Basically, for each word, I create an array of every possible substring (eliminating substrings which differ only in "s" or "'s" endings):
#take word and return list of all valid substrings
sub split_to_all_valid_subwords {
my $word = $_[0];
my @split_list;
my ($i, $j);
for ($i = 0; $i < length($word); ++$i){
for ($j = 1; $j <= length($word) - $i; ++$j){
unless
(
($j == length($word)) or
($word =~ m/s$/ and $i == 0 and $j == length($word) - 1) or
($word =~ m/\'s$/ and $i == 0 and $j == length($word) - 2)
)
{
push(@split_list, substr($word, $i, $j));
}
}
}
return @split_list;
}
Then I just create a list of all substring candidates and intersect it with the words:
my @substring_candidates;
foreach my $word (@words) {
push( @substring_candidates, split_to_all_valid_subwords($word));
}
#make intersection between substring candidates and words
my %substring_candidates=map{$_ =>1} @substring_candidates;
my %words=map{$_=>1} @words;
my @substrings = grep( $substring_candidates{$_}, @words );
Now @substrings holds all the words that are substrings of some other word.
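The core of this approach (generate every proper substring, then intersect with the word list) can be sketched compactly in Python. This is a simplified illustration: the function name is made up, and the special handling of "s" and "'s" endings from the Perl code is deliberately left out.

```python
def words_inside_other_words(words):
    """Return the words that occur as a proper substring of a word in the list."""
    word_set = set(words)
    candidates = set()
    for word in word_set:
        n = len(word)
        for start in range(n):
            for end in range(start + 1, n + 1):
                if end - start < n:  # skip the word itself
                    candidates.add(word[start:end])
    # Set intersection replaces the hash lookup inside grep.
    return sorted(word_set & candidates)


print(words_inside_other_words(["cat", "catalog", "log", "dog", "at"]))
# ['at', 'cat', 'log']
```

The work grows with the square of each word's length rather than with the square of the number of words, which is why it stays fast for a dictionary of short words.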
Perl regular expressions will optimize patterns like foo|bar|baz into an Aho-Corasick match - up to a certain limit of total compiled regex length. Your 50000 words will probably exceed that length, but could be broken into smaller groups. (Indeed, you probably want to break them up by length and only check words of length N for containing words of length 1 through N-1.)
Alternatively, you could just implement Aho-Corasick in your perl code - that's kind of fun to do.
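To give an idea of what a hand-written Aho-Corasick matcher involves, here is a small textbook-style sketch in Python rather than Perl. All names are illustrative, and it is not tuned for a 50,000-word list. It builds a trie with failure links over all words, then streams every word through the automaton and records the other words found inside it.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links and output lists for the words."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: a state's failure link is the longest proper
    # suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def words_inside_others(words):
    """Return the set of words that occur inside some other word."""
    goto, fail, out = build_automaton(words)
    found = set()
    for text in words:
        state = 0
        for ch in text:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            for word in out[state]:
                if word != text:
                    found.add(word)
    return found


print(sorted(words_inside_others(["cat", "catalog", "log", "dog", "at"])))
# ['at', 'cat', 'log']
```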
update
Ondra supplied a beautiful solution in his answer; I leave my post here as an example of overthinking a problem and failed optimisation techniques.
My worst case kicks in for a word that doesn't match any other word in the input. In that case, it goes quadratic. The OPT_PRESORT was an attempt to avert the worst case for most words. The OPT_CONSECUTIVE was a linear-complexity filter that reduced the total number of items in the main part of the algorithm, but it is just a constant factor when considering the complexity. However, it is still useful with Ondra's algorithm and saves a few seconds, as building his split list is more expensive than comparing two consecutive words.
I updated the code below to select Ondra's algorithm as a possible optimisation. Paired with zero threads and the presort optimisation, it yields maximum performance.
I would like to share a solution I coded. Given an input file, it outputs all those words that are a substring of any other word in the same input file. Therefore, it computes the opposite of ysth's ideas, but I took the idea of optimisation #2 from his answer. There are the following three main optimisations that can be deactivated if required.
Multithreading
The questions "Is word A in list L? Is word B in L?" can be easily parallelised.
Pre-sorting all the words for their length
I create an array that points to the list of all words that are longer than a certain length, for every possible length. For long words, this can cut down the number of possible words dramatically, but it trades quite a lot of space, as one word of length n appears in all lists from length 1 to length n.
Testing consecutive words
In my /usr/share/dict/words, most consecutive lines look quite similar:
Abby
Abby's
for example. As every word that would match the first word also matches the second one, I immediately add the first word to the list of matching words, and only keep the second word for further testing. This saved about 30% of words in my test cases. Because I do that before optimisation No 2, this also saves a lot of space. Another trade-off is that the output will not be sorted.
The script itself is ~120 lines long; I explain each sub before showing it.
head
This is just a standard script header for multithreading. Oh, and you need perl 5.10 or better to run this. The configuration constants define the optimisation behaviour. Add the number of processors of your machine in that field. The OPT_MAX variable can take the number of words you want to process, however this is evaluated after the optimisations have taken place, so the easy words will already have been caught by the OPT_CONSECUTIVE optimisation. Adding anything there will make the script seemingly slower. $|++ makes sure that the status updates are shown immediately. I exit after the main is executed.
#!/usr/bin/perl
use strict; use warnings; use feature qw(say); use threads;
$|=1;
use constant PROCESSORS => 0; # (false, n) number of threads
use constant OPT_MAX => 0; # (false, n) number of words to check
use constant OPT_PRESORT => 0; # (true / false) sorts words by length
use constant OPT_CONSECUTIVE => 1; # (true / false) prefilter data while loading
use constant OPT_ONDRA => 1; # select the awesome Ondra algorithm
use constant BLABBER_AT => 10; # (false, n) print progress at n percent
die q(The optimisations Ondra and Presort are mutually exclusive.)
if OPT_PRESORT and OPT_ONDRA;
exit main();
main
Encapsulates the main logic, and does multi-threading. The number of words to be matched will be considerably smaller than the number of input words if the input was sorted. After I have selected all matched words, I print them to STDOUT. All status updates etc. are printed to STDERR, so that they don't interfere with the output.
sub main {
my @matching; # the matching words.
my @words = load_words(\@matching); # the words to be searched
say STDERR 0+@words . " words to be matched";
my $prepared_words = prepare_words(@words);
# do the matching, possibly multithreading
if (PROCESSORS) {
my @threads =
map {threads->new(
\&test_range,
$prepared_words,
@words[$$_[0] .. $$_[1]] )
} divide(PROCESSORS, OPT_MAX || 0+@words);
push @matching, $_->join for @threads;
} else {
push @matching, test_range(
$prepared_words,
@words[0 .. (OPT_MAX || 0+@words)-1]);
}
say STDERR 0+@matching . " words matched";
say for @matching; # print out the matching words.
0;
}
load_words
This reads all the words from the input files which were supplied as command line arguments. Here the OPT_CONSECUTIVE optimisation takes place. The $last word is either put into the list of matching words, or into the list of words to be matched later. The -1 != index($a, $b) decides if the word $b is a substring of word $a.
sub load_words {
my $matching = shift;
my @words;
if (OPT_CONSECUTIVE) {
my $last;
while (<>) {
chomp;
if (defined $last) {
push @{-1 != index($_, $last) ? $matching : \@words}, $last;
}
$last = $_;
}
push @words, $last // ();
} else {
@words = map {chomp; $_} <>;
}
@words;
}
prepare_words
This "blows up" the input words, sorting them after their length into each slot, that has the words of larger or equal length. Therefore, slot 1 will contain all words. If this optimisation is deselected, it is a no-op and passes the input list right through.
sub prepare_words {
if (OPT_ONDRA) {
my $ondra_split = sub { # evil: using $_ as implicit argument
my @split_list;
for my $i (0 .. length $_) {
for my $j (1 .. length($_) - ($i || 1)) {
push @split_list, substr $_, $i, $j;
}
}
@split_list;
};
return +{map {$_ => 1} map &$ondra_split(), @_};
} elsif (OPT_PRESORT) {
my @prepared = ([]);
for my $w (@_) {
push @{$prepared[$_]}, $w for 1 .. length $w;
}
return \@prepared;
} else {
return [@_];
}
}
test
This tests if the word $w is a substring in any of the other words. $wbl points to the data structure that was created by the previous sub: Either a flat list of words, or the words sorted by length. The appropriate algorithm is then selected. Nearly all of the running time is spent in this loop. Using index is much faster than using a regex.
sub test {
my ($w, $wbl) = @_;
my $l = length $w;
if (OPT_PRESORT) {
for my $try (@{$$wbl[$l + 1]}) {
return 1 if -1 != index $try, $w;
}
} else {
for my $try (@$wbl) {
return 1 if $w ne $try and -1 != index $try, $w;
}
}
return 0;
}
divide
This just encapsulates an algorithm that guarantees a fair distribution of $items items into $parcels buckets. It outputs the bounds of a range of items.
sub divide {
my ($parcels, $items) = @_;
say STDERR "dividing $items items into $parcels parcels.";
my ($min_size, $rest) = (int($items / $parcels), $items % $parcels);
my @distributions =
map [
$_ * $min_size + ($_ < $rest ? $_ : $rest),
($_ + 1) * $min_size + ($_ < $rest ? $_ : $rest - 1)
], 0 .. $parcels - 1;
say STDERR "range division: @$_" for @distributions;
return @distributions;
}
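The index arithmetic in divide is easy to get wrong, so here is the same fair split as a small Python sketch (an illustrative rewrite, not the author's code). It hands the first `items % parcels` parcels one extra item and returns inclusive index ranges, as the Perl sub does.

```python
def divide(parcels, items):
    """Return inclusive (first, last) index ranges covering 0 .. items-1."""
    min_size, rest = divmod(items, parcels)
    ranges = []
    start = 0
    for i in range(parcels):
        # The first `rest` parcels each take one extra item.
        size = min_size + (1 if i < rest else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(divide(3, 10))  # [(0, 3), (4, 6), (7, 9)]
print(divide(3, 11))  # [(0, 3), (4, 7), (8, 10)]
```

If there are fewer items than parcels, the surplus parcels come out as empty ranges whose last index is one below their first.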
test_range
This calls test for each word in the input list, and is the sub that is multithreaded. grep selects all those elements in the input list where the code (given as first argument) returns true. It also regularly outputs a status message like thread 2 at 10%, which makes waiting for completion much easier. This is a psychological optimisation ;-).
sub test_range {
my $wbl = shift;
if (BLABBER_AT) {
my $range = @_;
my $step = int($range / 100 * BLABBER_AT) || 1;
my $i = 0;
return
grep {
if (0 == ++$i % $step) {
printf STDERR "... thread %d at %2d%%\n",
threads->tid,
$i / $step * BLABBER_AT;
}
OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)
} @_;
} else {
return grep {OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)} @_;
}
}
invocation
Using bash, I invoked the script like
$ time (head -n 1000 /usr/share/dict/words | perl script.pl >/dev/null)
Where 1000 is the number of lines I wanted to input, dict/words was the word list I used, and /dev/null is the place I want to store the output list, in this case, throwing the output away. If the whole file should be read, it can be passed as an argument, like
$ perl script.pl input-file >output-file
time just tells us how long the script ran. Using 2 slow processors and 50000 words, it executed in just over two minutes in my case, which is actually quite good.
update: more like 6–7 seconds now, with the Ondra + Presort optimisation, and no threading.
further optimisations
update: overcome by better algorithm. This section is no longer completely valid.
The multithreading is awful. It allocates quite some memory and isn't exactly fast. This isn't surprising considering the amount of data. I considered using a Thread::Queue, but that thing is slow like $#*! and therefore is a complete no-go.
If the inner loop in test was coded in a lower-level language, some performance might be gained, as the index built-in wouldn't have to be called. If you can code C, take a look at the Inline::C module. If the whole script were coded in a lower language, array access would also be faster. A language like Java would also make the multithreading less painful (and less expensive).

What's the point of Perl's map?

Not really getting the point of the map function. Can anyone explain with examples its use?
Are there any performance benefits to using this instead of a loop or is it just sugar?
Any time you want to generate a list based on another list:
# Double all elements of a list
my @double = map { $_ * 2 } (1,2,3,4,5);
# @double = (2,4,6,8,10);
Since lists are easily converted pairwise into hashes, if you want a hash table for objects based on a particular attribute:
# @user_objects is a list of objects having a unique_id() method
my %users = map { $_->unique_id() => $_ } @user_objects;
# %users = ( $id => $obj, $id => $obj, ...);
It's a really general purpose tool, you have to just start using it to find good uses in your applications.
Some might prefer verbose looping code for readability purposes, but personally, I find map more readable.
First of all, it's a simple way of transforming an array: rather than saying e.g.
my @raw_values = (...);
my @derived_values;
for my $value (@raw_values) {
push (@derived_values, _derived_value($value));
}
you can say
my @raw_values = (...);
my @derived_values = map { _derived_value($_) } @raw_values;
It's also useful for building up a quick lookup table: rather than e.g.
my $sentence = "...";
my @stopwords = (...);
my @foundstopwords;
for my $word (split(/\s+/, $sentence)) {
for my $stopword (@stopwords) {
if ($word eq $stopword) {
push (@foundstopwords, $word);
}
}
}
you could say
my $sentence = "...";
my @stopwords = (...);
my %is_stopword = map { $_ => 1 } @stopwords;
my @foundstopwords = grep { $is_stopword{$_} } split(/\s+/, $sentence);
It's also useful if you want to derive one list from another, but don't particularly need to have a temporary variable cluttering up the place, e.g. rather than
my %params = ( username => '...', password => '...', action => $action );
my @parampairs;
for my $param (keys %params) {
push (@parampairs, $param . '=' . CGI::escape($params{$param}));
}
my $url = $ENV{SCRIPT_NAME} . '?' . join('&', @parampairs);
you say the much simpler
my %params = ( username => '...', password => '...', action => $action );
my $url = $ENV{SCRIPT_NAME} . '?'
. join('&', map { $_ . '=' . CGI::escape($params{$_}) } keys %params);
(Edit: fixed the missing "keys %params" in that last line)
The map function is used to transform lists. It's basically syntactic sugar for replacing certain types of for[each] loops. Once you wrap your head around it, you'll see uses for it everywhere:
my @uppercase = map { uc } @lowercase;
my @hex = map { sprintf "0x%x", $_ } @decimal;
my %hash = map { $_ => 1 } @array;
sub join_csv { join ',', map {'"' . $_ . '"' } @_ }
See also the Schwartzian transform for advanced usage of map.
It's also handy for making lookup hashes:
my %is_boolean = map { $_ => 1 } qw(true false);
is equivalent to
my %is_boolean = ( true => 1, false => 1 );
There's not much savings there, but suppose you wanted to define %is_US_state?
map is used to create a list by transforming the elements of another list.
grep is used to create a list by filtering elements of another list.
sort is used to create a list by sorting the elements of another list.
Each of these operators receives a code block (or an expression) which is used to transform, filter or compare elements of the list.
For map, the result of the block becomes one (or more) element(s) in the new list. The current element is aliased to $_.
For grep, the boolean result of the block decides if the element of the original list will be copied into the new list. The current element is aliased to $_.
For sort, the block receives two elements (aliased to $a and $b) and is expected to return one of -1, 0 or 1, indicating whether $a is less than, equal to or greater than $b.
The Schwartzian Transform uses these operators to efficiently cache values (properties) to be used in sorting a list, especially when computing these properties has a non-trivial cost.
It works by creating an intermediate array which has as elements array references with the original element and the computed value by which we want to sort. This array is passed to sort, which compares the already computed values, creating another intermediate array (this one is sorted) which in turn is passed to another map which throws away the cached values, thus restoring the array to its initial list elements (but in the desired order now).
Example (creates a list of files in the current directory sorted by the time of their last modification):
#file_list = glob('*');
#file_modify_times = map { [ $_, (stat($_))[8] ] } #file_list;
#files_sorted_by_mtime = sort { $a->[1] <=> $b->[1] } #file_modify_times;
#sorted_files = map { $_->[0] } #files_sorted_by_mtime;
By chaining the operators together, no declaration of variables is needed for the intermediate arrays;
#sorted_files = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, (stat($_))[8] ] } glob('*');
You can also filter the list before sorting by inserting a grep (if you want to filter on the same cached value):
Example (a list of the files modified in the last 24 hours, sorted by the last modification time):
@sorted_files = map { $_->[0] } sort { $a->[1] <=> $b->[1] } grep { $_->[1] > (time - 24 * 3600) } map { [ $_, (stat($_))[9] ] } glob('*');
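To see that the Schwartzian Transform really computes each sort key only once, here is a self-contained sketch. It swaps the stat call for a hypothetical expensive_key function that counts its own calls; the word list and all names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical stand-in for an expensive computation such as stat().
sub expensive_key {
    my ($word) = @_;
    $calls++;
    return length $word;
}

# Hypothetical sample data with distinct lengths (no ties).
my @words = ('banana', 'fig', 'cherries', 'kiwi', 'apple');

my @sorted =
    map  { $_->[0] }                      # 3. throw away the cached key
    sort { $a->[1] <=> $b->[1] }          # 2. compare the cached keys only
    map  { [ $_, expensive_key($_) ] }    # 1. compute each key exactly once
    @words;

print "@sorted\n";                            # fig kiwi apple banana cherries
print "expensive_key called $calls times\n";  # 5, one call per element
```

A plain sort { expensive_key($a) <=> expensive_key($b) } would call the function twice per comparison instead, which is the cost the transform avoids.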
The map function is an idea from the functional programming paradigm. In functional programming, functions are first-class objects, meaning that they can be passed as arguments to other functions. Map is a simple but very useful example of this. It takes as its arguments a function (let's call it f) and a list l. f has to be a function taking one argument, and map simply applies f to every element of the list l. f can do whatever you need done to every element: add one to every element, square every element, write every element to a database, or open a web browser window for every element that happens to be a valid URL.
The advantage of using map is that it nicely encapsulates iterating over the elements of the list. All you have to do is say "do f to every element", and it is up to map to decide how best to do that. For example, map may be implemented to split up its work among multiple threads, and it would be totally transparent to the caller.
Note that map is not at all specific to Perl. It is a standard technique used by functional languages. It can even be implemented in C using function pointers, or in C++ using "function objects".
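As a rough illustration of "a function that takes a function", here is a hand-written map in Perl using code references. The name apply_to_each is hypothetical, and this is only a sketch of the idea, not how the built-in map is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a code reference $f and applies it
# to every element of the list, collecting the results.
sub apply_to_each {
    my ($f, @list) = @_;
    my @result;
    push @result, $f->($_) for @list;
    return @result;
}

my @plus_one = apply_to_each(sub { $_[0] + 1 },  1, 2, 3);
my @squared  = apply_to_each(sub { $_[0] ** 2 }, 1, 2, 3);

# The built-in map gives the same result for the same transformation.
my @builtin = map { $_ ** 2 } 1, 2, 3;

print "@plus_one\n";   # 2 3 4
print "@squared\n";    # 1 4 9
print "@builtin\n";    # 1 4 9
```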
The map function runs an expression on each element of a list and returns the list of results. Let's say I had the following list
@names = ("andrew", "bob", "carol");
and I wanted to capitalize the first letter of each of these names. I could loop through them and call ucfirst on each element, or I could just do the following
@names = map(ucfirst, @names);
"Just sugar" is harsh. Remember, a loop is just sugar -- if's and goto can do everything loop constructs do and more.
Map is a high enough level function that it helps you hold much more complex operations in your head, so you can code and debug bigger problems.
To paraphrase "Effective Perl Programming" by Hall & Schwartz,
map can be abused, but I think that it's best used to create a new list from an existing list.
Create a list of the squares of 3,2, & 1:
@numbers = (3,2,1);
@squares = map { $_ ** 2 } @numbers;
Generate password:
$ perl -E'say map {chr(32 + 95 * rand)} 1..16'
# -> j'k=$^o7\l'yi28G
You use map to transform a list and assign the results to another list, grep to filter a list and assign the results to another list. The "other" list can be the same variable as the list you are transforming/filtering.
my @array = ( 1..5 );
@array = map { $_+5 } @array;
print "@array\n";
@array = grep { $_ < 7 } @array;
print "@array\n";
It allows you to transform a list as an expression rather than in statements. Imagine a hash of soldiers, %soldiers, in which each value is a record defined like so:
{ name => 'John Smith'
, rank => 'Lieutenant'
, serial_number => '382-293937-20'
};
then you can operate on the list of names separately.
For example,
map { $_->{name} } values %soldiers
is an expression. It can go anywhere an expression is allowed--except you can't assign to it.
${[ sort map { $_->{name} } values %soldiers ]}[-1]
indexes the array, taking the max.
my %soldiers_by_sn = map { $_->{serial_number} => $_ } values %soldiers;
I find that one of the advantages of operational expressions is that it cuts down on the bugs that come from temporary variables.
If Mr. McCoy wants to filter out all the Hatfields for consideration, you can add that check with minimal coding.
my %soldiers_by_sn
= map { $_->{serial_number}, $_ }
grep { $_->{name} !~ m/Hatfield$/ }
values %soldiers
;
I can continue chaining these expressions so that if my interaction with this data has to reach deep for a particular purpose, I don't have to write a lot of code that pretends I'm going to do a lot more.
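Putting the soldier snippets together into something runnable might look like the sketch below. Only the first record's fields come from the text above; the hash keys a, b, c and the other two records are invented sample data.

```perl
use strict;
use warnings;

# Hypothetical sample data; only the first record mirrors the text above.
my %soldiers = (
    a => { name => 'John Smith',    rank => 'Lieutenant', serial_number => '382-293937-20' },
    b => { name => 'Bill Hatfield', rank => 'Private',    serial_number => '111-000001-01' },
    c => { name => 'Ann Jones',     rank => 'Sergeant',   serial_number => '222-000002-02' },
);

# The list of names as a single expression, sorted alphabetically.
my @names = sort map { $_->{name} } values %soldiers;

# Index the sorted list directly to take the "max" name.
my $last_name = ${[ sort map { $_->{name} } values %soldiers ]}[-1];

# Re-key the records by serial number, skipping the Hatfields.
my %soldiers_by_sn =
    map  { $_->{serial_number} => $_ }
    grep { $_->{name} !~ m/Hatfield$/ }
    values %soldiers;

print "$last_name\n";                                 # John Smith
print join(', ', sort keys %soldiers_by_sn), "\n";    # 222-000002-02, 382-293937-20
```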
It's used anytime you would like to create a new list from an existing list.
For instance you could map a parsing function on a list of strings to convert them to integers.
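A small sketch of that idea, assuming a hypothetical parse_int helper built on Perl's int and some made-up input strings:

```perl
use strict;
use warnings;

# Hypothetical parsing function: converts a numeric string to an integer,
# truncating any fractional part toward zero.
sub parse_int {
    my ($string) = @_;
    return int($string);
}

# Hypothetical input strings.
my @strings  = ('42', '007', '-5', '3.9');
my @integers = map { parse_int($_) } @strings;

print "@integers\n";   # 42 7 -5 3
```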
As others have said, map creates lists from lists. Think of "mapping" the contents of one list into another. Here's some code from a CGI program to take a list of patent numbers and print hyperlinks to the patent applications:
my @patents = ('7,120,721', '6,809,505', '7,194,673');
print join(", ", map { "$_" } @patents);
As others have said, map is most useful for transforming a list. What hasn't been mentioned is the difference between map and an "equivalent" for loop.
One difference is that for doesn't work well for an expression that modifies the list it's iterating over. One of these terminates, and the other doesn't:
perl -e '@x=("x"); map { push @x, $_ } @x'
perl -e '@x=("x"); push @x, $_ for @x'
Another small difference is that the context inside the map block is a list context, but the for loop imparts a void context.
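The context difference can be observed with wantarray. The sketch below uses a hypothetical probe function that records the context it was called in.

```perl
use strict;
use warnings;

my @seen;

# Hypothetical probe that records the context it was called in.
sub probe {
    my $ctx = wantarray;
    push @seen, $ctx ? 'list' : defined($ctx) ? 'scalar' : 'void';
    return;
}

my @unused = map { probe() } 1;   # the map block is evaluated in list context
probe() for 1;                    # the for body is evaluated in void context

print "@seen\n";   # list void
```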