I need to find and replace keywords from a hash in a large document as fast as possible.
I tried the two methods below; one is faster by 320%, but I am sure I am doing this the wrong way and that there is a better approach.
The idea is to replace only the keywords that exist in the dictionary hash and keep those that do not, so I know they are not in the dictionary.
As far as I can tell, both methods scan the text twice to find and replace. I suspect a regex technique such as lookahead or lookbehind could make this much faster.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
my %dictionary = (
pollack => "pollard",
polynya => "polyoma",
pomaces => "pomaded",
pomades => "pomatum",
practic => "praetor",
prairie => "praised",
praiser => "praises",
prajnas => "praline",
quakily => "quaking",
qualify => "quality",
quamash => "quangos",
quantal => "quanted",
quantic => "quantum",
);
my $content =qq{
Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.
};
# just duplicate content many times
$content .= $content;
cmpthese(100000, {
replacer_1 => sub {my $text = replacer1($content)},
replacer_2 => sub {my $text = replacer2($content)},
});
print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;
sub replacer1 {
my ($content) = shift;
$content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
return $content;
}
sub replacer2 {
my ($content) = shift;
my @names = $content =~ /\{(.+?)\}/g;
foreach my $name (@names) {
if (exists $dictionary{$name}) {
$content =~ s/\{$name\}/\[$dictionary{$name}\]/;
}
}
return $content;
}
Here is the benchmark result:
Rate replacer_2 replacer_1
replacer_2 5565/s -- -76%
replacer_1 23397/s 320% --
Here's a way that's a little faster and more compact:
sub replacer3 {
my ($content) = shift;
$content =~ s#\{(.+?)\}#"[".($dictionary{$1} // $1)."]"#ge;
return $content;
}
The defined-or operator // requires Perl 5.10; in Perl 5.8 and earlier you can use || instead, which is fine as long as none of your dictionary values are "false".
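The difference shows up when a dictionary value is defined but false, such as "0" or the empty string. A minimal sketch (with a made-up %dict) illustrating both operators:

```perl
use strict;
use warnings;

# Hypothetical dictionary with one "false" but defined value.
my %dict = ( zero => "0", word => "replacement" );
my $text = "{zero} {word} {missing}";

# Defined-or: falls back to the key only when there is no entry at all.
( my $with_dor = $text ) =~ s{\{(.+?)\}}{"[" . ($dict{$1} // $1) . "]"}ge;

# Logical-or: also falls back when the entry is defined but false ("0").
( my $with_or = $text ) =~ s{\{(.+?)\}}{"[" . ($dict{$1} || $1) . "]"}ge;

print "$with_dor\n";   # [0] [replacement] [missing]
print "$with_or\n";    # [zero] [replacement] [missing]
```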
There's also a little to be gained by using a dictionary that already contains the braces and brackets:
sub replacer5 {
my ($content) = shift;
our %dict2;
if (!%dict2) {
%dict2 = map { "{".$_."}" => "[".$dictionary{$_}."]" } keys %dictionary
}
$content =~ s#(\{.+?\})#$dict2{$1} || $1#ge;
return $content;
}
Benchmark results:
Rate replacer_2 replacer_1 replacer_3 replacer_5
replacer_2 2908/s -- -79% -83% -84%
replacer_1 14059/s 383% -- -20% -25%
replacer_3 17513/s 502% 25% -- -7%
replacer_5 18741/s 544% 33% 7% --
It helps to build a regex beforehand that will match any of the hash keys, like this:
my $pattern = join '|', sort {length $b <=> length $a } keys %dictionary;
$pattern = qr/$pattern/;
sub replacer4 {
my ($string) = @_;
$string =~ s# \{ ($pattern) \} #"[$dictionary{$1}]"#gex;
$string;
}
with these results
Rate replacer_2 replacer_1 replacer_3 replacer_4
replacer_2 4883/s -- -80% -84% -85%
replacer_1 24877/s 409% -- -18% -22%
replacer_3 30385/s 522% 22% -- -4%
replacer_4 31792/s 551% 28% 5% --
It would also be an improvement if you could keep the braces and brackets in the hash, instead of having to add them each time.
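Combining the two ideas (a prebuilt alternation and a hash that already carries the braces and brackets) might look like this sketch; the two-entry dictionary here is a made-up example:

```perl
use strict;
use warnings;

my %dictionary = ( qualify => "quality", quantic => "quantum" );

# Decorate the hash once: keys get braces, values get brackets.
my %dict2 = map { "{$_}" => "[$dictionary{$_}]" } keys %dictionary;

# One alternation over the decorated keys, longest first.
my $pattern = join '|', map quotemeta,
              sort { length $b <=> length $a } keys %dict2;

my $content = "a {quantic} b {qualify} c {doesNotExist}";
$content =~ s/($pattern)/$dict2{$1}/g;

print "$content\n";   # a [quantum] b [quality] c {doesNotExist}
```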
I'd recommend using meaningful names for your benchmarking subroutines; it will make the output and the intent clearer.
The following reproduces a bit of what Borodin and mob have tried out, and then combines them as well.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'state';
use Benchmark qw(:all);
# Data separated by paragraph mode.
my %dictionary = split ' ', do {local $/ = ''; <DATA>};
my $content = do {local $/; <DATA>};
# Quadruple Content
$content = $content x 4;
cmpthese(100_000, {
original => sub { my $text = original($content) },
build_list => sub { my $text = build_list($content) },
xor_regex => sub { my $text = xor_regex($content) },
list_and_xor => sub { my $text = list_and_xor($content) },
});
exit;
sub original {
my $content = shift;
$content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
return $content;
}
sub build_list {
my $content = shift;
state $list = join '|', map quotemeta, keys %dictionary;
$content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
return $content;
}
sub xor_regex {
my $content = shift;
state $with_brackets = {
map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
};
$content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;
return $content;
}
sub list_and_xor {
my $content = shift;
state $list = join '|', map quotemeta, keys %dictionary;
state $with_brackets = {
map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
};
$content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;
return $content;
}
__DATA__
pollack pollard
polynya polyoma
pomaces pomaded
pomades pomatum
practic praetor
prairie praised
praiser praises
prajnas praline
quakily quaking
qualify quality
quamash quangos
quantal quanted
quantic quantum
Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.
Outputs:
Rate original xor_regex build_list list_and_xor
original 19120/s -- -23% -24% -29%
xor_regex 24938/s 30% -- -1% -8%
build_list 25253/s 32% 1% -- -7%
list_and_xor 27027/s 41% 8% 7% --
My solutions make heavy use of state variables to avoid reinitializing static data structures. However, one could also use closures or our $var; $var ||= VAL.
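For completeness, here is what the our-based variant of build_list might look like on perls without state (a sketch, not benchmarked):

```perl
use strict;
use warnings;

my %dictionary = ( qualify => "quality", quantic => "quantum" );

sub build_list_our {
    my $content = shift;
    # Package variable instead of 'state': initialized on the first call only.
    our $list;
    $list ||= join '|', map quotemeta, keys %dictionary;
    $content =~ s/\{($list)\}/[$dictionary{$1}]/g;
    return $content;
}

print build_list_our("x {quantic} y {missing}"), "\n";
# x [quantum] y {missing}
```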
Addendum about enhancing the LHS of the regex
Actually, editing the LHS to use an explicit list is about improving the regular expression. And this change showed a 30% improvement in speed.
There isn't likely to be any magic solution to this. You have a list of values, that you're wanting to replace. It isn't like there is some mysterious way to simplify the language of this goal.
You could perhaps use a code block in the LHS to Fail and skip if the word does not exist in the dictionary hash. However, the following shows that this is actually 36% slower than your original method:
sub skip_fail {
my $content = shift;
$content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;
return $content;
}
Outputs:
Rate skip_fail original xor_regex build_list list_and_xor
skip_fail 6769/s -- -36% -46% -49% -53%
original 10562/s 56% -- -16% -21% -27%
xor_regex 12544/s 85% 19% -- -6% -14%
build_list 13355/s 97% 26% 6% -- -8%
list_and_xor 14537/s 115% 38% 16% 9% --
Related
In perl, I am reading a line and trying to replace a set of strings with corresponding expressions using a sequence of if statements. For example:
my @shiftInstructions = ("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)];
$line =~ s/\$ish/$r/;
}
}
I don't like this approach for a number of reasons. First, it is repetitive. I have to first check if the pattern exists, and then if it does, execute a function to generate a replacement value, then substitute. So it is both verbose, and slow (2 regex searches per pattern, perhaps eventually dozens of pattern strings).
I thought of a map where a number of codes are mapped to corresponding code to execute.
I can imagine mapping to a string and then using eval but then I can't check the code except at runtime. Is there any cleaner way of doing this?
I found the execute option in regex. What about writing a set of subroutines to process each regex, then creating a mapping:
my %regexMap = (
"\$fn", &foundFunc,
"\$hw", &hex8,
"\$hb", &hex2,
"\$sh", &rand6,
"\$ish", &shiftInst,
);
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish|)/$regexMap{$1}/e;
print $line;
}
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
is a poor way of writing
$line =~ s/\$sh/ int(rand(6)) /e;
So
my @shiftInstructions = ("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)];
$line =~ s/\$ish/$r/;
}
print($line);
}
can be written as
my @shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$sh/ int(rand(6)) /e;
$line =~ s/\$ish/ $shiftInstructions[rand(@shiftInstructions)] /e;
print($line);
}
But that means you are scanning the string over and over again. Let's avoid that.
my @shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$(sh|ish)/
if ( $1 eq "sh" ) { int(rand(6)) }
elsif ( $1 eq "ish" ) { $shiftInstructions[rand(@shiftInstructions)] }
/eg;
print($line);
}
Unfortunately, that reintroduces repetition. We can solve that using a dispatch table.
my @shiftInstructions = qw( lsr lsl rol ror );
my %replacements = (
sh => sub { int(rand(6)) },
ish => sub { $shiftInstructions[rand(@shiftInstructions)] },
);
my $alt = join '|', map quotemeta, keys(%replacements);
my $re = qr/\$($alt)/;
while (my $line = <>) {
print $line =~ s/$re/ $replacements{$1}->() /reg;
}
Now we have an efficient solution that can be extended without slowing down the matching, all while avoiding repetition.
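A self-contained sketch of the dispatch-table approach applied to a fixed sample line (instead of reading from <>):

```perl
use strict;
use warnings;

my @shiftInstructions = qw( lsr lsl rol ror );
my %replacements = (
    sh  => sub { int(rand(6)) },
    ish => sub { $shiftInstructions[rand(@shiftInstructions)] },
);

# Build the alternation once, outside any loop.
my $alt = join '|', map quotemeta, keys %replacements;
my $re  = qr/\$($alt)/;

my $line = 'shift $ish by $sh bits';
( my $out = $line ) =~ s/$re/ $replacements{$1}->() /ge;
print "$out\n";   # e.g. "shift rol by 3 bits"
```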
The solution you added to your question was close, but it had two bugs.
&foo calls foo. To get a reference to it, use \&foo.
my %regexMap = (
"\$fn", \&foundFunc,
"\$hw", \&hex8,
"\$hb", \&hex2,
"\$sh", \&rand6,
"\$ish", \&shiftInst,
);
$regexMap{$1} now returns the reference. You want to call the referenced sub, which can be done using $regexMap{$1}->().
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish)/ $regexMap{$1}->() /e;
print $line;
}
In these cases, I often make some sort of data structure that holds the patterns and their actions:
my @tuples = (
[ qr/.../, sub { ... } ],
[ ... ],
);
Now the meat of the process stays the same no matter how many patterns I want to try:
while( <> ) {
foreach my $tuple ( @tuples ) {
$tuple->[1]() if /$tuple->[0]/;
}
}
Abstract this a little further with a subroutine that takes the data structure. Then you can pass it different tables depending on what you would like to do:
sub some_sub {
my @tuples = @_;
while( <> ) {
foreach my $tuple ( @tuples ) {
$tuple->[1]() if /$tuple->[0]/;
}
}
}
I've written about this sort of thing in Mastering Perl and Effective Perl Programming, and it's the sort of thing that goes into my obscure modules like Brick and Data::Constraint.
I've been thinking about this more, and I wonder if regexes are actually part of what you are trying to do. It looks like you are matching literal strings, but using the match operator to do it. You don't give details of the input, so I'm guessing here: it looks like there's an operation (e.g. $fn), and you want to match exactly that operation. The problem is finding that string and then mapping it onto code. That looks something like this (and ikegami's answer is another form of this idea). Instead of an alternation, I match anything that might look like the string:
while( <> ) {
# find the string. Need example input to guess better
if( m/(\$[a-z]+)/ ) {
$table{$1}->() if exists $table{$1};
}
}
But again, it's dependent on the input, how many actual substrings you might want to match (so, the number of branches in an alternation), how many lines you want to process, and so on. There was a wonderful talk about processing apache log files with Regex::Trie and the various experiments they tried to make things faster. I've forgotten all the details, but very small adjustments made noticeable differences over tens of millions of lines.
Interesting reading:
Maybe this talk? An exploration of trie regexp matching
http://taint.org/2006/07/07/184022a.html
Matching a long list of phrases
OP's code can be written in following form
use strict;
use warnings;
use feature 'say';
my %regexMap = (
'$fn' => \&foundFunc,
'$hw' => \&hex8,
'$hb' => \&hex2,
'$sh' => \&rand6,
'$ish' => \&shiftInst,
);
my @keys = map { "\\$_" } keys %regexMap;
my $re = join('|', @keys);
while (<DATA>) {
chomp;
next unless /($re)/;
$regexMap{$1}->();
}
sub foundFunc { say 'sub_foundFunc' }
sub hex8 { say 'sub_hex8' }
sub hex2 { say 'sub_hex2' }
sub rand6 { say 'sub_rand6' }
sub shiftInst { say 'sub_shiftInst' }
__DATA__
$fn
$hw
$ac
$hb
$sh
$fn
$mf
$hb
$ish
$hw
Output
sub_foundFunc
sub_hex8
sub_hex2
sub_rand6
sub_foundFunc
sub_hex2
sub_shiftInst
sub_hex8
I have a large flat text file with lines that hold name/value pairs ("varname=value"). These pairs are separated by a multi-character delimiter. So a single line in this file might look like this:
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
Each line holds about 50 name/value pairs.
I need to iterate through the lines of this file (there are about 100,000 lines) and store the name/value pairs in a hash so that
$field{'var1'} = value1
$field{'var2'} = value2
etc...
What I did was this:
# $line holds a single line from the file
my @fields = split( /\Q|^|\E/, $line );
foreach my $field (@fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
Doing this for each line of the entire file takes (on my PC) about 2 seconds. This doesn't seem like a long time, but I really want to speed this up by quite a bit.
Of this 2 seconds, the first split takes about 0.6 seconds, while the foreach loop takes about 1.4 seconds. So I thought I'd get rid of the foreach loop and put it all in a single split:
%hash = split( /\Q|^|\E|=/, $line );
Much to my surprise, parsing the entire file this way took a full second longer! My question isn't really why this takes longer (although that would be a nice bonus to understand), but whether there are any other (faster) ways to get the job done.
Thanks in advance.
------ Edit below this line ------
I just found out that changing this:
%hash = split( /\Q|^|\E|=/, $line );
into this:
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
makes it three times faster! Parsing the entire file this way now takes just over a second...
------ Snippet below this line ------
use strict;
use Time::HiRes qw( time );
my $line = "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
ResetTimer();
my %hash;
for( my $i = 1; $i <= 100000; $i++ ) {
my @fields = split( /\Q|^|\E/, $line );
foreach my $field (@fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i <= 100000; $i++ ) {
%hash = split( /\Q|^|\E|=/, $line );
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i<=100000; $i++ ) {
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
}
print Elapsed() . "\n";
################################################################################################################################
BEGIN {
my $startTime;
sub ResetTimer {
$startTime = time();
return $startTime;
}
sub Elapsed {
return time() - $startTime;
}
}
I can't easily answer your performance question, because I'd need a test case. But I'd guess that it's to do with how the regular expression is being processed.
You can see what that's doing with use re 'debug';, and that'll print the regular expression steps.
But for the broader question, I'd probably just tackle it with a global match (assuming your data is as simple as the example):
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
my %row = m/(\w+)=(\w+)/g;
print Dumper \%row;
}
__DATA__
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
You can use lookahead/behind to match delimiters if you've got more complicated things in there, but because it's one regex per line, you're invoking the regex engine less often, and that'll probably be faster. (But I can't tell you for sure without a test case).
If your data is more complicated, then perhaps:
my %row = s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
This will 'force' splitting the input into a new line, and then match 'anything' = 'anything'. But that's probably overkill unless your values include whitespace/pipes/metachars.
After editing your test case to use Benchmark:
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw ( cmpthese );
my $line =
"a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
sub double_split {
my %hash;
my @fields = split( /\Q|^|\E/, $line );
foreach my $field (@fields) {
my ( $name, $value ) = split( /=/, $field );
$hash{$name} = $value;
}
}
sub single_split {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub re_replace_then_split {
$line =~ s/\Q|^|\E/=/g;
my %hash = split( /=/, $line );
}
sub single_regex {
my %hash = $line =~ m/(\w+)=(\w+)/g;
}
sub compound {
my %hash = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
}
cmpthese(
1_000_000,
{ "Double Split" => \&double_split,
"single split with regex" => \&single_split,
"Replace then split" => \&re_replace_then_split,
"Single Regex" => \&single_regex,
"regex to linefeed then match" => \&compound
}
);
Looks like the results come out like:
Rate Double Split single split with regex Single Regex Replace then split regex to linefeed then match
Double Split 18325/s -- -4% -34% -56% -97%
single split with regex 19050/s 4% -- -31% -54% -97%
Single Regex 27607/s 51% 45% -- -34% -96%
Replace then split 41733/s 128% 119% 51% -- -93%
regex to linefeed then match 641026/s 3398% 3265% 2222% 1436% --
... I'm a bit suspicious of that last, because it is absurdly fast. The likely culprit is that re_replace_then_split modifies the shared $line in place; once it has run, the |^| delimiters are already gone, so compound's substitution has nothing to do and its single greedy match is very cheap.
But looking at it, what's slowing you down is the alternation in the regex:
sub single_split_with_alt {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub single_split {
my %hash = split( /[\|\^\=]+/, $line );
}
(I know the latter might not be quite what you want, but it's for illustrative purposes)
Gives:
Rate alternation single split
alternation 19135/s -- -37%
single split 30239/s 58% --
But there does come a point where this is moot, because your limiting factor is disk IO, not CPU.
I'm trying to dynamically catch regex matches in Perl. I thought eval would help me do this, but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", @_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are using eval with a BLOCK of code. That's like a try block: if the code dies inside of that eval block, the error is caught. It will not run your string as if it were code; you need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says "take the first argument (1$) and format it as a string (%s)". You don't need to localize or assign to $_ to do a match; the =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} delimiters, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in @captures by evaluating the array in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element of @+ ($#+, using the $# sigil for the last index of an array that usually has an @ sigil); @+ holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if @+ has only one element, there are no capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
Apologies! I realize both my question and sample code were vague. But after reading your suggestions I came up with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
my $result;
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my @matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < @matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
Evaluating the regexp in list context returns the matches, so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my @a = ($str =~ /$pattern/);
print Dumper(\@a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my @patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( @patterns ) {
my @captured = ($str =~ $pattern)
or next;
print "'$_'\n" for @captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my @tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
my @captured = ($str =~ $test->{pattern})
or next;
unless (@captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my @tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
my @captured = ($str =~ $test->{pattern});
ok @captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1
I get the following result from my code below.
For example, with $seq set to aaaGACGTCaaaGAATTCaaaGACGTCaaa:
RE: AatII
GACGTC found at 4-9
GACGTC found at 22-27
RE: EcoRI
GACGTC found at 4-9
GACGTC found at 22-27
GAATTC found at 13-18
While this is pretty close to what I want to achieve, what I really want to do is use a list of "reference enzymes" (stored as hash keys; in the example below, AatII and EcoRI in %RE) in order to find the best cut sites in the $seq sequence string and then store the results in a data structure such as a hash.
The cut site selection is done according to an associated "enzyme string" value for each reference enzyme key in the hash. In the code below the key AatII has the value GACGTC 5; it will cut GACGTC after the fifth position (GACGT|C), whereas EcoRI => GAATTC 1 splits the sequence GAATTC after the first position (G|AATTC). So, for each enzyme key in my hash I find a site in the sequence $seq based on that key's associated string, and a "cut site" from the number associated with that string in the key's hash value. The number refers to the position of the cut, counted from position "1" of the enzyme string.
For the sequence $seq the results would be: (aaaGACGT)(CaaaG)(AATTCaaaGACGT)(Caaa) (here brackets are used to show cut points). The positions of the cut sites in the string would be as follows:
0------8 9---13 14----------26 27--30
This is based on a run of the script using both EcoRI and AatII to select enzyme sequences and cut the full sequence leaving: aaaGACGT CaaaG AATTCaaaGACGT Caaa
I would like my script to store the results of each run in a hash with keys/values such as {0=>8, 9=>13, 14=>26, 27=>30}. By sorting the keys and values after each iteration, then using a binary search to find the closest previous cut site and adding 1 to get the $end value for the current iteration, there should be as many entries in the hash as there are cut sites.
I do not know if this is possible. If it is, can someone point me in the right direction as to how to approach this problem?
#!/usr/bin/perl
use warnings;
use strict;
my %RE =( 'AatII' => 'GACGTC 5', 'EcoRI' => 'GAATTC 1' );
my $input='';
my @matches = ();
my @enz = keys %RE;
my @value = values %RE;
print "Seq:";
my $seq = <STDIN>;
chomp $seq;
print "OK \n";
while ($input ne 'quit') {
print "RE:";
$input = <STDIN>;
chomp $input;
foreach (@enz) {
if ($input =~ /$_/) {
#print "Key:", $_," Value:", $RE{$_};
my @seqval = $seq;
my $val = $RE{$_};
my $real = substr($val, 0, -2);
#my $cut = substr($val, 0, (length($val)-3));
my $cut = chop $val;
my $length = length ($real);
my $mew = substr ($real, 0, $cut);
my $two = substr ($real, -1, ($length-$cut));
#my $push = push @valval;
#chomp %RE{$_};
while ($seq =~ /($real)/g) {
my $match = $1;
#print "$match", "\n";
my $length = length($&);
#print "$length", "\n";
my $pos = length($`);
#print "$pos", "\n";
my $start = $pos + 1;
#print "$start", "\n";
my $end = $pos + $length;
#print "$end", "\n";
my $hitpos = "$start-$end";
#print "$hitpos", "\n";
push @matches, "$match found at $hitpos ";
#print "\tfound:", "\n","\n";
#print "\t\t\t$1$mew", "\n";
#print "\t\t\t$two$3", "\n";
#print "next restriction enzyme:","\n";
} print "$_\n" foreach @matches;
}
}
}
Hi Khuram, and welcome to Stack Overflow :-)
It seems you may have abandoned your question, but I'm adding this answer to make the thread more complete and potentially useful to others who find it. As @mappec suggested, you should consult the BioPerl website, where you may find more resources.
While there may be simpler ways of doing this, your idea of creating a hash to store the cut sites is a good one, because it leverages one of the strengths of Perl: the ability to create arbitrarily complex data structures on the fly. That said, it can sometimes be complicated to get your data back out! :-)
As @user1937198 notes, hashes are unordered, so if you want your output to preserve the order/positions of the enzyme strings in your sequence, you'll not only have to sort your hash by its keys, you'll also have to create sortable keys to begin with. In your question your sample output shows found at 4-9, ... 22-27, and ... 13-18 out of order because you haven't sorted a data structure. Fixing that part is not too hard. To prove it, here's your script with some of the print statements removed and with the $seq sequence string processed into a HoH (hash of hashes) called %cuttings that is sorted by its keys (but remember, the hash itself is unordered):
#!/usr/bin/perl
use warnings;
use strict;
my %RE =( 'AatII' => 'GACGTC 5', 'EcoRI' => 'GAATTC 1' );
my %cuttings = ();
my $input='';
my #enz = keys %RE;
print "Seq:";
my $seq = "aaaGACGTCaaaGAATTCaaaGACGTCaaa";
chomp $seq;
print "OK \n";
while ($input ne 'quit') {
print "RE:";
$input = <STDIN>;
chomp $input;
foreach (@enz) {
if ($input =~ /$_/) {
my @seqval = $seq;
my $val = $RE{$_};
my $real = substr($val, 0, -2);
my $cut = chop $val;
my $cutsite = 0 ;
my $length = length ($real);
my $mew = substr ($real, 0, $cut);
my $two = substr ($real, -1, ($length-$cut));
while ($seq =~ /($real)/g) {
my $match = $1;
my $length = length($&);
my $pos = length($`); #`fix SO syntax highlighting :)
my $start = $pos + 1;
my $end = $pos + $length;
my $hitpos = "$start..$end";
my $cutsite = $end ;
${$cuttings{ $cutsite }}{ $input } = "$match at $hitpos ";
}
}
}
foreach my $cutsite (sort { $a <=> $b} keys %cuttings) {
print " $cuttings{$cutsite}{$_}\n" for ( keys %{ $cuttings{$cutsite} } );
}
}
The output would be:
$ ~/tmp/ perl biogenetic.pl
Seq:OK
RE:EcoRI
GAATTC found at 13..18
RE:AatII
GACGTC found at 4..9
GAATTC found at 13..18
GACGTC found at 22..27
RE:quit
The AatII enzyme cut sites are sorted correctly "around" the first EcoRI reference enzyme. If you want to see what the hash looks like as you go along, you could use Data::Dumper or Data::Printer (also known as DDP) to dump the hash when the program exits, in an END block:
END {
use DDP;
p %cuttings ;
}
That would show the following:
{
9 {
AatII "GACGTC found at 4..9 "
},
18 {
EcoRI "GAATTC found at 13..18 "
},
27 {
AatII "GACGTC found at 22..27 "
}
}
NB: I've just reused your code to do this, so you were most of the way there as it was. I'm not a geneticist, so there may still be issues if enzyme strings do things like overlap (do they do that?). There are a lot of variable names to keep track of in your code, and there's probably a way to refactor things to be a bit simpler or more elegant, which I leave as an exercise for you and other contributors :-) If you use Perl frequently, you get good at it very quickly.
HTH. Good luck with your project.
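Going one step further toward the {0=>8, 9=>13, ...} start/end hash asked about in the question, the sorted cut sites can be post-processed into segment pairs. A sketch, assuming the cut offset is counted from the start of the recognition string and using 0-based inclusive indices (so the example sequence yields 0=>7, 8=>12, 13=>25, 26=>29 rather than the rougher numbering in the question):

```perl
use strict;
use warnings;

my %RE  = ( AatII => [ 'GACGTC', 5 ], EcoRI => [ 'GAATTC', 1 ] );
my $seq = 'aaaGACGTCaaaGAATTCaaaGACGTCaaa';

# Collect absolute cut positions: match offset plus the enzyme's cut offset.
my @cuts;
for my $enz ( keys %RE ) {
    my ( $site, $off ) = @{ $RE{$enz} };
    while ( $seq =~ /\Q$site\E/g ) {
        push @cuts, $-[0] + $off;   # index of the first char after the cut
    }
}

# Turn the sorted cut positions into start => end segment pairs.
my %segments;
my $start = 0;
for my $cut ( sort { $a <=> $b } @cuts ) {
    $segments{$start} = $cut - 1;
    $start = $cut;
}
$segments{$start} = length($seq) - 1;

for my $s ( sort { $a <=> $b } keys %segments ) {
    my $e = $segments{$s};
    printf "%2d => %2d  %s\n", $s, $e, substr( $seq, $s, $e - $s + 1 );
}
```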
I just need to extract the numbers in each of these items and store them separately. What's the best way to do this?
If the data is something like
p °c 4'i
App data usage stats
E
iii
! 12:12PM
Received data
Sent data
Current usage
598KB
28KB
626KB :18%
Get Current Stat Browser App
J
Battery Level
I tried this, but I get only 18 as an output in this case.
foreach my $line (@lines) {
if ($line =~/ :[ ]*(\d+)[ ]*(KB|%)/) {
$value = $1;
print "the value is $value\n";
push (@array, $1);
}
}
Thanks,
Loop over every line, and using a regular expression
foreach my $line (@lines) {
if ($line =~ /(\d+)/) {
push (#array, $1);
}
}
And you'll have all the numbers in your @array array.
Here's one way to do it. Note that it does not care about which kind of numbers it extracts, as per your request.
It splits the line on colons in max two fields, key and value. Then we extract numbers from the values and insert into the hash. This part will effectively skip all lines where values do not contain numbers. This is also where you would insert stricter checks, e.g. if ($value =~ /(\d+)\s*KB/i) would only capture numbers followed by KB (I opted to add case insensitivity).
use strict;
use warnings;
use Data::Dumper;
my %hash;
while (<DATA>) {
my ($key, $value) = split /\s*:\s*/, $_, 2;
if ($value =~ /(\d+)/) {
$hash{$key} = $1;
}
}
print Dumper \%hash;
__DATA__
Received data : 598 KB
Sent data : 28 KB
Current usage : 626 KB
Battery Level : 35 %
Output:
$VAR1 = {
'Sent data' => '28',
'Current usage' => '626',
'Battery Level' => '35',
'Received data' => '598'
};