What's the point of Perl's map? - list

Not really getting the point of the map function. Can anyone explain with examples its use?
Are there any performance benefits to using this instead of a loop or is it just sugar?

Any time you want to generate a list based on another list:
# Double all elements of a list
my @double = map { $_ * 2 } (1,2,3,4,5);
# @double = (2,4,6,8,10);
Since lists are easily converted pairwise into hashes, if you want a hash table for objects based on a particular attribute:
# @user_objects is a list of objects having a unique_id() method
my %users = map { $_->unique_id() => $_ } @user_objects;
# %users = ( $id => $obj, $id => $obj, ...);
It's a really general-purpose tool; you just have to start using it to find good uses in your applications.
Some might prefer verbose looping code for readability purposes, but personally, I find map more readable.
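As a runnable sketch of the lookup-hash idea above: the records here are plain hash references with a made-up id field, standing in for objects with a unique_id() method, so treat the data as purely illustrative.

```perl
use strict;
use warnings;

# Hypothetical records standing in for objects with a unique_id() method.
my @user_objects = (
    { id => 17, name => 'alice' },
    { id => 42, name => 'bob'   },
);

# Each element yields a (key, value) pair; the flat list initialises the hash.
my %users = map { $_->{id} => $_ } @user_objects;

print $users{42}{name}, "\n";   # prints "bob"
```

After this, finding a record by id is a single hash access instead of a scan over the list.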

First of all, it's a simple way of transforming an array: rather than saying e.g.
my @raw_values = (...);
my @derived_values;
for my $value (@raw_values) {
    push (@derived_values, _derived_value($value));
}
you can say
my @raw_values = (...);
my @derived_values = map { _derived_value($_) } @raw_values;
It's also useful for building up a quick lookup table: rather than e.g.
my $sentence = "...";
my @stopwords = (...);
my @foundstopwords;
for my $word (split(/\s+/, $sentence)) {
    for my $stopword (@stopwords) {
        if ($word eq $stopword) {
            push (@foundstopwords, $word);
        }
    }
}
you could say
my $sentence = "...";
my @stopwords = (...);
my %is_stopword = map { $_ => 1 } @stopwords;
my @foundstopwords = grep { $is_stopword{$_} } split(/\s+/, $sentence);
It's also useful if you want to derive one list from another, but don't particularly need to have a temporary variable cluttering up the place, e.g. rather than
my %params = ( username => '...', password => '...', action => $action );
my @parampairs;
for my $param (keys %params) {
    push (@parampairs, $param . '=' . CGI::escape($params{$param}));
}
my $url = $ENV{SCRIPT_NAME} . '?' . join('&', @parampairs);
you say the much simpler
my %params = ( username => '...', password => '...', action => $action );
my $url = $ENV{SCRIPT_NAME} . '?'
. join('&', map { $_ . '=' . CGI::escape($params{$_}) } keys %params);
(Edit: fixed the missing "keys %params" in that last line)
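Here is a self-contained sketch of the query-string version. It does not load CGI; escape_value is a hypothetical stand-in for CGI::escape that percent-encodes anything outside the unreserved URL characters, and the parameter values are invented. The keys are sorted only so that the output order is predictable.

```perl
use strict;
use warnings;

# Stand-in for CGI::escape (an assumption, not the CGI module's code):
# percent-encode every character outside the unreserved URL set.
sub escape_value {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

my %params = (username => 'j smith', action => 'log&in');

# map builds one "key=value" string per key; join glues them together.
my $query = join '&',
    map { $_ . '=' . escape_value($params{$_}) } sort keys %params;

print "$query\n";   # prints "action=log%26in&username=j%20smith"
```

The map result is fed straight into join, so no temporary array of pairs is ever named.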

The map function is used to transform lists. It's basically syntactic sugar for replacing certain types of for[each] loops. Once you wrap your head around it, you'll see uses for it everywhere:
my @uppercase = map { uc } @lowercase;
my @hex = map { sprintf "0x%x", $_ } @decimal;
my %hash = map { $_ => 1 } @array;
sub join_csv { join ',', map {'"' . $_ . '"' } @_ }

See also the Schwartzian transform for advanced usage of map.

It's also handy for making lookup hashes:
my %is_boolean = map { $_ => 1 } qw(true false);
is equivalent to
my %is_boolean = ( true => 1, false => 1 );
There's not much savings there, but suppose you wanted to define %is_US_state?

map is used to create a list by transforming the elements of another list.
grep is used to create a list by filtering elements of another list.
sort is used to create a list by sorting the elements of another list.
Each of these operators receives a code block (or an expression) which is used to transform, filter or compare elements of the list.
For map, the result of the block becomes one (or more) element(s) in the new list. The current element is aliased to $_.
For grep, the boolean result of the block decides if the element of the original list will be copied into the new list. The current element is aliased to $_.
For sort, the block receives two elements (aliased to $a and $b) and is expected to return a negative number, zero or a positive number (typically -1, 0 or 1), indicating whether $a is less than, equal to or greater than $b.
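A small sketch of the three operators side by side, using made-up numbers. It also shows that a map block may return zero, one or several values per input element.

```perl
use strict;
use warnings;

my @numbers = (1 .. 4);

# One result per element.
my @squares = map { $_ * $_ } @numbers;           # (1, 4, 9, 16)

# Two results per element: the number and its double.
my @pairs = map { ($_, $_ * 2) } @numbers;        # (1, 2, 2, 4, 3, 6, 4, 8)

# Zero or one result per element: an empty list drops the element.
my @evens = map { $_ % 2 ? () : $_ } @numbers;    # (2, 4)

# grep expresses the same filter directly.
my @evens_grep = grep { $_ % 2 == 0 } @numbers;   # (2, 4)

# sort with an explicit comparator: descending numeric order.
my @descending = sort { $b <=> $a } @numbers;     # (4, 3, 2, 1)
```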
The Schwartzian Transform uses these operators to efficiently cache values (properties) to be used in sorting a list, especially when computing these properties has a non-trivial cost.
It works by creating an intermediate array which has as elements array references with the original element and the computed value by which we want to sort. This array is passed to sort, which compares the already computed values, creating another intermediate array (this one is sorted) which in turn is passed to another map which throws away the cached values, thus restoring the array to its initial list elements (but in the desired order now).
Example (creates a list of files in the current directory sorted by the time of their last modification):
@file_list = glob('*');
@file_modify_times = map { [ $_, (stat($_))[9] ] } @file_list;
@files_sorted_by_mtime = sort { $a->[1] <=> $b->[1] } @file_modify_times;
@sorted_files = map { $_->[0] } @files_sorted_by_mtime;
By chaining the operators together, no declaration of variables is needed for the intermediate arrays:
@sorted_files = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, (stat($_))[9] ] } glob('*');
You can also filter the list before sorting by inserting a grep (if you want to filter on the same cached value):
Example (a list of the files modified in the last 24 hours, sorted by last modification time):
@sorted_files = map { $_->[0] } sort { $a->[1] <=> $b->[1] } grep { $_->[1] > (time - 24 * 3600) } map { [ $_, (stat($_))[9] ] } glob('*');
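To try the same decorate, sort, undecorate pattern without touching the filesystem, here is a sketch that sorts a made-up word list by length (the cached key), breaking ties alphabetically. Read the chain from the bottom up.

```perl
use strict;
use warnings;

my @words = qw(pear fig banana kiwi apple);

my @by_length =
    map  { $_->[0] }                        # 3. throw away the cached key
    sort { $a->[1] <=> $b->[1]              # 2. compare cached keys,
           or $a->[0] cmp $b->[0] }         #    then the words themselves
    map  { [ $_, length($_) ] }             # 1. pair each word with its key
    @words;

print "@by_length\n";   # prints "fig kiwi pear apple banana"
```

With length the caching buys nothing, since the key is cheap to compute; the pattern pays off when the key is expensive, such as a stat call per file.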

The map function is an idea from the functional programming paradigm. In functional programming, functions are first-class objects, meaning that they can be passed as arguments to other functions. Map is a simple but very useful example of this. It takes as its arguments a function (let's call it f) and a list l. f has to be a function taking one argument, and map simply applies f to every element of the list l. f can do whatever you need done to every element: add one to every element, square every element, write every element to a database, or open a web browser window for every element that happens to be a valid URL.
The advantage of using map is that it nicely encapsulates iterating over the elements of the list. All you have to do is say "do f to every element", and it is up to map to decide how best to do that. For example, map may be implemented to split up its work among multiple threads, and it would be totally transparent to the caller.
Note that map is not at all specific to Perl. It is a standard technique used by functional languages. It can even be implemented in C using function pointers, or in C++ using "function objects".
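In Perl terms, the function f can be passed around as a code reference. The sketch below wraps map in a hypothetical helper, apply_to_each, purely to show the shape of the idea; Perl's built-in map takes a block or expression rather than a code reference.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a code reference to every element of a list.
sub apply_to_each {
    my ($f, @list) = @_;
    return map { $f->($_) } @list;
}

my $add_one = sub { $_[0] + 1 };
my $square  = sub { $_[0] ** 2 };

my @incremented = apply_to_each($add_one, 1, 2, 3);   # (2, 3, 4)
my @squared     = apply_to_each($square,  1, 2, 3);   # (1, 4, 9)

print "@incremented\n@squared\n";
```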

The map function runs an expression on each element of a list and returns the list of results. Let's say I had the following list
@names = ("andrew", "bob", "carol" );
and I wanted to capitalize the first letter of each of these names. I could loop through them and call ucfirst on each element, or I could just do the following
@names = map (ucfirst, @names);

"Just sugar" is harsh. Remember, a loop is just sugar -- if's and goto can do everything loop constructs do and more.
Map is a high enough level function that it helps you hold much more complex operations in your head, so you can code and debug bigger problems.

To paraphrase "Effective Perl Programming" by Hall & Schwartz,
map can be abused, but I think that it's best used to create a new list from an existing list.
Create a list of the squares of 3,2, & 1:
@numbers = (3,2,1);
@squares = map { $_ ** 2 } @numbers;

Generate password:
$ perl -E'say map {chr(32 + 95 * rand)} 1..16'
# -> j'k=$^o7\l'yi28G

You use map to transform a list and assign the results to another list, grep to filter a list and assign the results to another list. The "other" list can be the same variable as the list you are transforming/filtering.
my @array = ( 1..5 );
@array = map { $_+5 } @array;
print "@array\n";
@array = grep { $_ < 7 } @array;
print "@array\n";

It allows you to transform a list as an expression rather than in statements. Imagine a hash of soldiers defined like so:
{ name => 'John Smith'
, rank => 'Lieutenant'
, serial_number => '382-293937-20'
};
then you can operate on the list of names separately.
For example,
map { $_->{name} } values %soldiers
is an expression. It can go anywhere an expression is allowed--except you can't assign to it.
${[ sort map { $_->{name} } values %soldiers ]}[-1]
indexes the array, taking the max.
my %soldiers_by_sn = map { $_->{serial_number} => $_ } values %soldiers;
I find that one of the advantages of operational expressions is that it cuts down on the bugs that come from temporary variables.
If Mr. McCoy wants to filter out all the Hatfields for consideration, you can add that check with minimal coding.
my %soldiers_by_sn
    = map { $_->{serial_number}, $_ }
      grep { $_->{name} !~ m/Hatfield$/ }
      values %soldiers
    ;
I can continue chaining these expressions so that if my interaction with this data has to reach deep for a particular purpose, I don't have to write a lot of code that pretends I'm going to do a lot more.

It's used anytime you would like to create a new list from an existing list.
For instance you could map a parsing function on a list of strings to convert them to integers.

As others have said, map creates lists from lists. Think of "mapping" the contents of one list into another. Here's some code from a CGI program to take a list of patent numbers and print hyperlinks to the patent applications:
my @patents = ('7,120,721', '6,809,505', '7,194,673');
print join(", ", map { "$_" } @patents);

As others have said, map is most useful for transforming a list. What hasn't been mentioned is the difference between map and an "equivalent" for loop.
One difference is that for doesn't work well for an expression that modifies the list it's iterating over. One of these terminates, and the other doesn't (map works on the list as it stood when it started, whereas for keeps seeing the elements being pushed):
perl -e '@x=("x"); map { push @x, $_ } @x'
perl -e '@x=("x"); push @x, $_ for @x'
Another small difference is that the context inside the map block is a list context, but the for loop imparts a void context.
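The context difference can be observed with the built-in wantarray. This is a sketch with hypothetical helper names (probe, observe); the calls are wrapped in a sub so that the for statement is not the last statement of its enclosing scope.

```perl
use strict;
use warnings;

my @seen;

# Record the context this sub was called in.
sub probe {
    push @seen, wantarray ? 'list' : defined(wantarray) ? 'scalar' : 'void';
    return;
}

sub observe {
    @seen = ();
    my @unused = map { probe() } 1;   # the block's value feeds the result list
    probe() for 1;                    # the statement's value is discarded
    return @seen;
}

print join(' ', observe()), "\n";   # prints "list void"
```

This matters when the code inside the block or loop body behaves differently depending on context.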

Related

Regular expression is too complex error in tcl

I have not seen this error for a small list. Issue popped up when the list went >10k. Is there any limit on the number of regex patterns in tcl?
puts "#LEVELSHIFTER_TEMPLATES_LIMITSFILE:$perc_limit(levelshifter_templates)"
puts "#length of templates is :[llength $perc_limit(levelshifter_templates)]"
if { [regexp [join $perc_limit(levelshifter_templates) |] $temp] }
#LEVELSHIFTER_TEMPLATES_LIMITSFILE:HDPELT06_LVLDBUF_CAQDP_1 HDPELT06_LVLDBUF_CAQDPNRBY2_1 HDPELT06_LVLDBUF_CAQDP_1....
#length of templates is :13520
ERROR: couldn't compile regular expression pattern: regular expression is too complex
If $temp is a single word and you're really just doing a literal test, you should invert the check. One of the easiest ways might be:
if {$temp in $perc_limit(levelshifter_templates)} {
# ...
}
But if you're doing that a lot (well, more than a small number of times, 3 or 4 say) then building a dictionary for this might be best:
# A one-off cost
foreach key $perc_limit(levelshifter_templates) {
# Value is arbitrary
dict set perc_limit_keys $key 1
}
# This is now very cheap
if {[dict exists $perc_limit_keys $temp]} {
# ...
}
If you've got multiple words in $temp, split and check (using the second technique, which is now definitely worthwhile). This is where having a helper procedure can be a good plan.
proc anyWordIn {inputString keyDictionary} {
foreach word [split $inputString] {
if {[dict exists $keyDictionary $word]} {
return true
}
}
return false
}
if {[anyWordIn $temp $perc_limit_keys]} {
# ...
}
Assuming you want to see if the value in temp is an exact match for one of the elements of the list in perf_limit(levelshifter_templates), here's a few ways that are better than trying to use regular expressions:
Using lsearch:
# Sort the list after populating it so we can do an efficient binary search
set perf_limit(levelshifter_templates) [lsort $perf_limit(levelshifter_templates)]
# ...
# See if the value in temp exists in the list
if {[lsearch -sorted $perf_limit(levelshifter_templates) $temp] >= 0} {
# ...
}
Storing the elements of the list in a dict (or array if you prefer) ahead of time for an O(1) lookup:
foreach item $perf_limit(levelshifter_templates) {
dict set lookup $item 1
}
# ...
if {[dict exists $lookup $temp]} {
# ...
}
I found a simple workaround for this problem by using a foreach statement to loop over all the regexes in the list instead of joining them and searching, which failed for a super-long list.
foreach pattern $perc_limit(levelshifter_templates) {
    if {[regexp $pattern $temp]} {
        #puts "$fullpath: [is_std_cell_dev $dev]"
        puts "##matches: $pattern return 0"
        return 0
    }
}

Perl list interpolation performance

Background
Perldoc for List::Util suggests that some uses of map may be replaced by reduce in order to avoid creating an unnecessary intermediate list:
For example, to find the total length of all the strings in a
list, we could use
$total = sum map { length } @strings;
However, this produces a list of temporary integer values as long as
the original list of strings, only to reduce it down to a single value
again. We can compute the same result more efficiently by using reduce
with a code block that accumulates lengths by writing this instead as:
$total = reduce { $a + length $b } 0, @strings;
That makes sense. However, for reduce to work in this example it needs an "identity value" that is prepended to the input list:
$total = reduce { $a + length $b } 0, @strings;
# ^^^^^^^^^^^
That makes me think: doesn't 0, @strings create a new list, thus offsetting any gains from not creating a list in map?
Question
How does list interpolation ($scalar, @list) work in Perl? Does it involve copying elements from the source list, or is it done in some smarter way? My simple benchmark suggests that copying takes place:
use strict;
use warnings;
use Benchmark qw/cmpthese/;
my @a1 = 1..10;
my @a2 = 1..100;
my @a3 = 1..1000;
my @a4 = 1..10000;
my @a5 = 1..100000;
my @a6 = 1..1000000;
cmpthese(10000, {
    'a1' => sub { my @l = (0, @a1); },
    'a2' => sub { my @l = (0, @a2); },
    'a3' => sub { my @l = (0, @a3); },
    'a4' => sub { my @l = (0, @a4); },
    'a5' => sub { my @l = (0, @a5); },
    'a6' => sub { my @l = (0, @a6); },
});
Results:
(warning: too few iterations for a reliable count)
Rate a6 a5 a4 a3 a2 a1
a6 17.6/s -- -90% -99% -100% -100% -100%
a5 185/s 952% -- -90% -99% -100% -100%
a4 1855/s 10438% 902% -- -90% -99% -100%
a3 17857/s 101332% 9545% 862% -- -91% -98%
a2 200000/s 1135940% 107920% 10680% 1020% -- -80%
a1 1000000/s 5680100% 540000% 53800% 5500% 400% --
Bonus question: If my assumptions are correct (i.e. 0, @strings creates a new list), does replacing map with reduce make sense?
doesn't 0, @strings create a new list
Not really. If you decompile the code, it's just one additional SVOP.
But you're measuring the wrong thing. The values are flattened and passed into the map or reduce subroutine in both cases!
The documentation is talking about what happens inside the subroutine. map creates a list with as many values as its input and returns it, and then sum takes that list and condenses it into a single value. The return list is ephemeral and is not represented directly in the code. (This list passing is not that efficient; it could be made faster by using references.)
In contrast, in reduce, there is no such return list. reduce only works on the input list of values and returns a single value.
"This produces a list of temporary integer values as long as the original list of strings" refers to map putting N scalars on the stack. The thing is, the reduce approach creates just as many scalars, and they also all go on the stack. The only difference is that the reduce approach only keeps one of them on the stack at once. That means the reduce approach uses less memory, but it doesn't speak to its performance at all. The reason it gives for reduce computing the same result more efficiently is nonsense.
There could be a performance difference, but not for that reason. If you want to find which one performs better for you, you will need to run a benchmark.
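Before benchmarking, it is worth confirming that the two forms agree. A minimal check with made-up strings, using List::Util (shipped with Perl):

```perl
use strict;
use warnings;
use List::Util qw(sum reduce);

my @strings = qw(alpha beta gamma);

# map produces one length per string, then sum adds them up.
my $via_map = sum map { length($_) } @strings;

# reduce carries a running total in $a, seeded with the leading 0.
my $via_reduce = reduce { $a + length($b) } 0, @strings;

print "$via_map $via_reduce\n";   # prints "14 14"
```

The leading 0 also makes the reduce form return 0 instead of undef when @strings is empty.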
That makes me think, doesn't 0, @strings create a new list
No. reduce creates a single list unconditionally. This is unrelated to the number of expressions in the argument list.
Lists aren't arrays. When we say "the sub returns a list" or "the op evaluates to a list", we actually mean "the sub or op places some quantity of scalars on the stack".
Lists are created for ops that will pop a variable number of scalars from the stack. This is done by simply pushing a mark onto the stack. For example, reduce { ... } 0, @a would create a list for the entersub op. { ... } will end up leaving one code ref on the list/stack, 0 will end up leaving a number on the list/stack, and @a will end up leaving its elements on the list/stack. One last thing is added to the list/stack before the sub is called: the glob *reduce.
Note that creating the list is effectively free, since it's simply pushing a mark on the stack. Placing an array on the stack is proportional to the number of its elements, but it's still quite cheap since we're only copying a block of pointers (in the C sense of the word).
That means there's effectively no performance difference between reduce { ... } @strings and reduce { ... } 0, @strings. Both create a single list, and both add roughly the same number of elements to the list/stack.
Exceptions:
for (@a) is optimized to be for* (\@a). This saves memory, and it saves time if the loop is exited prematurely.
sub f(\@); f(@a) is equivalent to &f(\@a).
AFAIK, map and grep aren't optimized in this manner.
In detail:
$ perl -MO=Concise,-exec -MList::Util=reduce -e'reduce { ... } @a'
...
3 <0> pushmark s <-- Creates list (adds mark to the stack).
4 <$> anoncode[CV ] sRM <-- Adds CV to the stack.
5 <1> srefgen sKM/1 <-- Replaces CV with a ref to the CV.
6 <#> gv[*a] s <-- Places *a on the stack.
7 <1> rv2av[t4] lKM/1 <-- Replaces *a with the contents of @a.
8 <#> gv[*reduce] s <-- Places *reduce on the stack.
9 <1> entersub[t5] vKS/TARG <-- Will remove the entire list from the stack.
...
$ perl -MO=Concise,-exec -MList::Util=reduce -e'reduce { ... } 0, @a'
...
3 <0> pushmark s
4 <$> anoncode[CV ] sRM
5 <1> srefgen sKM/1
6 <$> const[IV 0] sM <-- The only difference.
7 <#> gv[*a] s
8 <1> rv2av[t4] lKM/1
9 <#> gv[*reduce] s
a <1> entersub[t5] vKS/TARG
...
The direct question can be answered directly by a benchmark
use strict;
use warnings;
use List::Util qw(sum reduce);
use Benchmark qw(cmpthese);
my @ary = 1..10_000;
sub by_reduce { my $res = reduce { $a + length $b } 0, @ary }
sub by_map { my $res = sum map { length } @ary }
cmpthese(-3, {
reduce => sub { by_reduce },
map => sub { by_map },
});
which prints on my v5.16 at hand
Rate map reduce
map 780/s -- -41%
reduce 1312/s 68% --
Thus reduce does something significantly better for this task.
As for the question of lists in general, it would have to depend on how the full list is used.
In your benchmark there is an assignment to a new array so the data copy clearly must be done. Then longer arrays take longer, and by about an order of magnitude quite like the ratio of their sizes.
With list inputs for functions like map and reduce I don't see a reason for an additional data copy. This can be checked by a benchmark, comparing an identical operation
my @ary = 1..10_000;
# benchmark:
my $r1 = sum map { length } @ary;
my $r2 = sum map { length } (1..5000, 5001..10_000);
The reported rates are nearly identical, for example 780/s and 782/s, showing that the flattening of the ranges for map input doesn't involve a data copy. (The ranges are converted to arrays at compile time, thanks to ikegami for comments.)

How to update hash by adding keys from the list in perl

Adding a set of keys when declaring a hash is straightforward:
my %hash = map { $_ => 1 } @list;
If I want to add more keys from another list, how can I achieve that with a single line?
With %hash declared you may use a hash slice, @hash{LIST}, with existing or new keys in LIST
@hash{ @more_keys } = @values_for_new_keys;
See Slices in perldata
If you meant to initialize the new keys to a fixed value, you can do for example
@hash{ @more_keys } = (1) x @more_keys;
where (1) x N returns a list of 1s of length N, and @more_keys in scalar context returns its length.
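Put together as a runnable sketch, with invented key names:

```perl
use strict;
use warnings;

my @list      = qw(red green);
my @more_keys = qw(blue green);

my %hash = map { $_ => 1 } @list;

# Hash slice: assign 1 to every key in @more_keys in one statement.
# Keys that already exist (green) are simply overwritten.
@hash{@more_keys} = (1) x @more_keys;

print join(',', sort keys %hash), "\n";   # prints "blue,green,red"
```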

How to use regular expression to find keys in hash

I have 6 million hashes and need to count how many of them have keys that start with AA00, how many have keys that start with AB10, and how many have keys starting with both strings.
For each hash I have done this:
if (exists $hash{AA00}) {
$AA00 +=1;
}
if (exists $hash{AB10}) {
$AB10 += 1;
}
if (exists $hash{AA00} and exists $hash{AB10}) {
$both += 1;
}
but then I count only the number of hashes that contains exactly AA00 or AB10 as keys, but I would also like to count hashes that contain, say AA001. Can I use regular expression for this?
I completely misunderstood your question. To find the number of hashes with keys matching a regex (as opposed to the number of keys matching a regex in a single hash), you can still use the grep approach I outlined in my earlier answer. This time, however, you need to loop through your hashes (I assume you're storing them in an array if you have 6 million of them) and run grep twice on each one:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my @array = (
{ AA00 => 'foo' },
{ AB10 => 'bar' },
{ AA001 => 'foo' },
{ AA00 => 'foo', AB10 => 'bar' }
);
my ($hashes_with_aa00, $hashes_with_ab10, $hashes_with_both) = (0, 0, 0);
foreach my $hash (@array) {
my $aa_count = grep { /^AA00/ } keys %$hash;
my $ab_count = grep { /^AB10/ } keys %$hash;
$hashes_with_aa00++ if $aa_count;
$hashes_with_ab10++ if $ab_count;
$hashes_with_both++ if $aa_count and $ab_count;
}
say "AA00: $hashes_with_aa00";
say "AB10: $hashes_with_ab10";
say "Both: $hashes_with_both";
Output:
AA00: 3
AB10: 2
Both: 1
This works, but is pretty poor in terms of performance: grep loops through every element in the list of keys for each hash, and we're calling it twice per hash!
Since we don't care how many keys match in each hash, only whether there is a match, a better solution would be any from List::MoreUtils. any works much like grep but returns as soon as it finds a match. To use any instead of grep, change this:
foreach my $hash (@array) {
my $aa_count = grep { /^AA00/ } keys %$hash;
my $ab_count = grep { /^AB10/ } keys %$hash;
$hashes_with_aa00++ if $aa_count;
$hashes_with_ab10++ if $ab_count;
$hashes_with_both++ if $aa_count and $ab_count;
}
to this:
use List::MoreUtils 'any';
foreach my $hash (@array) {
my $aa_exists = any { /^AA00/ } keys %$hash;
my $ab_exists = any { /^AB10/ } keys %$hash;
$hashes_with_aa00++ if $aa_exists;
$hashes_with_ab10++ if $ab_exists;
$hashes_with_both++ if $aa_exists and $ab_exists;
}
Note that I changed the variable names to better reflect their meaning.
This is much better in terms of performance, but as Borodin notes in a comment on your question, you're losing the speed advantage of hashes by not accessing them with specific keys. You might want to change your data structure accordingly.
Original Answer: Counting keys that match a regex in a single hash
This is my original answer based on a misunderstanding of your question. I'm leaving it up because I think it could be useful for similar situations.
To count the number of keys that match a regex in a single hash, you can use grep:
my $aa_count = grep { /^AA00/ } keys %hash;
my $ab_count = grep { /^AB10/ } keys %hash;
my $both = $aa_count + $ab_count;
As HunterMcMillen points out in the comments, there's no need to search through the hash keys again to get the total count; in this case, you can simply add the two subtotals. You can get away with this because the two patterns you're searching for are mutually exclusive; in other words, you cannot have a key that both begins with AA00 and AB10.
In the more general case, it might be possible for a single key to match both patterns (thanks Borodin). In that case, you cannot simply add up the two subtotals. For example, if you wanted your keys to merely contain AA00 or AB10 anywhere in the string, not necessarily at the beginning, you would need to do something like this:
my $aa_count = grep { /AA00/ } keys %hash;
my $ab_count = grep { /AB10/ } keys %hash;
my $both = grep { /(?:AA00|AB10)/ } keys %hash;
Note that this calls grep multiple times, which means traversing the entire hash multiple times. This could be done more efficiently using a single for loop like FlyingFrog and Kenosis did.
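A sketch of what such a single-pass loop might look like (this is not the code from the answers mentioned, just one way to write it, with invented keys). Note that the alternation pattern above counts keys matching either string, so the third counter is named $either here.

```perl
use strict;
use warnings;

my %hash = (AA001 => 1, AB10x => 1, 'AA00-AB10' => 1, ZZ99 => 1);

my ($aa_count, $ab_count, $either) = (0, 0, 0);

# Visit each key once and test it against both patterns.
for my $key (keys %hash) {
    my $has_aa = $key =~ /AA00/;
    my $has_ab = $key =~ /AB10/;
    $aa_count++ if $has_aa;
    $ab_count++ if $has_ab;
    $either++   if $has_aa or $has_ab;
}

print "$aa_count $ab_count $either\n";   # prints "2 2 3"
```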

Matching Values in Hashes

I have two arrays of hashes. I want to narrow down the second one according to variables in the first.
The first array contains hashes with keys seqname, source, feature, start, end, score, strand, frame, geneID and transcriptID.
The second array contains hashes with keys
organism, geneID, number, motifnumber, position, strand and sequence.
What I want to do, is remove from the first array of hashes, all the hashes which have a variable geneID which is not found in any of the hashes of the second array. - Note both types of hash have the geneID key. Simply put, I want to keep those hashes in the first array, which have geneID values which are found in the hashes of the second array.
My attempt at this so far was with two loops:
my @subset; # define a new array for the wanted hashes to go into.
for my $i (0 .. $#first_hash_array){ # Begin loop to go through the hashes of the first array.
for my $j (0 .. $#second_hash_array){ # Begin loop through the hashes of the 2nd array.
if ($second_hash_array[$j]{geneID} =~ m/$first_hash_array[$i]{geneID}/)
{
push @subset, $second_hash_array[$j];
}
}
}
However I'm not sure that this is the right way to go about this.
For starters, $a =~ /$b/ doesn't check for equality. You'd need
$second_hash_array[$j]{geneID} =~ m/^\Q$first_hash_array[$i]{geneID}\E\z/
or simply
$second_hash_array[$j]{geneID} eq $first_hash_array[$i]{geneID}
for that.
Secondly,
for my $i (0 .. $#first_hash_array) {
... $first_hash_array[$i] ...
}
can be written more succinctly as
for my $first (@first_hash_array) {
... $first ...
}
Next on the list is that
for my $second (@second_hash_array) {
    if (...) {
        push @subset, $second;
    }
}
can add $second to @subset more than once. You either need to add a last
# Perform the push if the condition is true for any element.
for my $second (@second_hash_array) {
    if (...) {
        push @subset, $second;
last;
}
}
or move the push out of the loop
# Perform the push if the condition is true for all elements.
my $flag = 1;
for my $second (@second_hash_array) {
if (!...) {
$flag = 0;
last;
}
}
if ($flag) {
push @subset, $second;
}
depending on what you want to do.
To remove from an array, one would use splice. But removing from an array messes up all the indexes, so it's better to iterate the array backwards (from last to first index).
Not only is it complicated, it's also expensive. Every time you splice, all subsequent elements in the array need to be moved.
A better approach is to filter the elements and assign the resulting elements to the array.
my @new_first_hash_array;
for my $first (@first_hash_array) {
    my $found = 0;
    for my $second (@second_hash_array) {
        if ($first->{geneID} eq $second->{geneID}) {
            $found = 1;
            last;
        }
    }
    if ($found) {
        push @new_first_hash_array, $first;
    }
}
@first_hash_array = @new_first_hash_array;
Iterating through @second_hash_array repeatedly is needlessly expensive.
my %geneIDs_to_keep;
for (@second_hash_array) {
    ++$geneIDs_to_keep{ $_->{geneID} };
}
my @new_first_hash_array;
for (@first_hash_array) {
    if ($geneIDs_to_keep{ $_->{geneID} }) {
        push @new_first_hash_array, $_;
    }
}
@first_hash_array = @new_first_hash_array;
Finally, we can replace that for with a grep to give the following simple and efficient answer:
my %geneIDs_to_keep;
++$geneIDs_to_keep{ $_->{geneID} } for @second_hash_array;
@first_hash_array = grep $geneIDs_to_keep{ $_->{geneID} }, @first_hash_array;
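For anyone who wants to run this, here is the same approach as a complete script. The records are invented and carry only a couple of the keys from the question; the lookup hash is built with map rather than the ++ loop, which works equally well for a simple membership test.

```perl
use strict;
use warnings;

# Hypothetical sample data; real records would carry more keys.
my @first_hash_array = (
    { geneID => 'g1', feature => 'exon' },
    { geneID => 'g2', feature => 'CDS'  },
    { geneID => 'g3', feature => 'exon' },
);
my @second_hash_array = (
    { geneID => 'g3', organism => 'yeast' },
    { geneID => 'g1', organism => 'yeast' },
    { geneID => 'g1', organism => 'mouse' },
);

# Build the lookup once, then filter the first array in a single pass.
my %geneIDs_to_keep = map { $_->{geneID} => 1 } @second_hash_array;
@first_hash_array = grep { $geneIDs_to_keep{ $_->{geneID} } } @first_hash_array;

print join(' ', map { $_->{geneID} } @first_hash_array), "\n";   # prints "g1 g3"
```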
This is how I would do it.
Create an array req_geneID for the required geneIDs and put all the geneIDs from the second array's hashes in it.
Traverse the first array and check whether each geneID is contained in the req_geneID array. (It's easy in Ruby using "include?", but you can do the equivalent in Perl.)
and,
Finally, delete the hashes that do not match any geneID in req_geneID, using this in Perl
for (keys %hash)
{
delete $hash{$_};
}
Hope this helps.. :)