I have a list of strings which denote time values, i.e. 1us, 100ps, 10s, 7fs (microseconds, picoseconds, seconds, femtoseconds, etc).
I need to sort the list in ascending order. To do that, I wrote a comparison routine which when given two strings, translates each string into the corresponding numerical value and compares them. Still, the list is not sorted correctly. After much debugging, I'm at a loss. Here's my code:
#!/usr/bin/perl
use warnings;
use strict;
my %scales = ('fs'=>1e-15,'ps'=>1e-12,'ns'=>1e-9,'us'=>1e-6,'ms'=>1e-3,'s'=>1);
my #times = ('1s','7ns','100ps','500ms','9us');
my #stimes = sort bytime #times;
sub bytime {
my $op1 = $a;
my $op2 = $b;
$op1 =~ /\b(\d+)([munpf]*s)\b/;
my $v1 = $1 * $scales{$2};
$op2 =~ /\b(\d+)([munpf]*s)\b/;
my $v2 = $1 * $scales{$2};
return $v1 > $v2;
}
print "#times"."\n"."#stimes"."\n";
The result I get is:
1s 7ns 100ps 500ms 9us
100ps 7ns 500ms 1s 9us
which is clearly wrong, even though some sorting takes place.
What's going on?
Your error is on the line
return $v1 > $v2;
It will return 1 (true) if $v1 is greater than $v2 and 0 (false) in all other cases.
You want to instead use the numerical comparison operator <=>:
return $v1 <=> $v2;
which will return 1, 0 or -1 depending on the comparison result and will allow the sort to do its work correctly. (If you were dealing with strings instead of numbers, the operator would be cmp.)
Complete working code:
#!/usr/bin/perl
use warnings;
use strict;
my %scales = ('fs'=>1e-15,'ps'=>1e-12,'ns'=>1e-9,'us'=>1e-6,'ms'=>1e-3,'s'=>1);
my #times = ('1s','7ns','100ps','500ms','9us');
my #stimes = sort bytime #times;
sub bytime {
my $op1 = $a;
my $op2 = $b;
$op1 =~ /\b(\d+)([munpf]*s)\b/;
my $v1 = $1 * $scales{$2};
$op2 =~ /\b(\d+)([munpf]*s)\b/;
my $v2 = $1 * $scales{$2};
return $v1 <=> $v2;
}
print "#times"."\n"."#stimes"."\n";
Output:
1s 7ns 100ps 500ms 9us
100ps 7ns 9us 500ms 1s
Related
i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows
sub countnmstr
{
$count =0;
$count++ while $_[0] =~ /$_[1]/g;
return $count;
}
$count = countnmstr("aaa","aa");
print "$count\n";
now this is what i would normally do. however, in the implementation above i want to count occurrence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.
can anyone suggest how to implement such a function??
Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.
The Perl idiom for counting matches is then:
my $count = () = $_[0] =~ /($pattern)/g;
The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.
Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:
my $count = () = 'aaa' =~ /((?<=a)a)/g;
Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.
See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.
You can use positive lookahead as suggested by others, and write the function as:
sub countnmstr {
my ($haystack, $needle) = #_;
my ($first, $rest) = $needle =~ /^(.)(.*)$/;
return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}
You can also use pos to adjust where the next search picks up from:
#!/usr/bin/perl
use strict; use warnings;
sub countnmstr {
my ($haystack, $needle) = #_;
my $adj = length($needle) - 1;
die "Search string cannot be empty!" if $adj < 0;
my $count = 0;
while ( $haystack =~ /\Q$needle/g ) {
pos $haystack -= $adj;
$count += 1;
}
return $count;
}
print countnmstr("aaa","aa"), "\n";
Output:
C:\Temp> t
2
sub countnmstr
{
my ($string, $substr) = #_;
return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}
$count = countnmstr("aaa","aa");
print "$count\n";
A few points:
//g in list context matches as many times as possible.
\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.
Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.
This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.
$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().
Update: s///g is faster, though not as fast as using index:
sub countnmstr
{
my ($string, $substr) = #_;
return scalar( $string =~ s/(?=\Q$substr\E)//g );
}
You could use a lookahead assertion in the regular expression:
sub countnmstr {
my #matches = $_[0] =~ /(?=($_[1]))/g;
return scalar #matches;
}
I suspect Sinan's suggestion will be quicker though.
you can try this, no more regex than needed.
$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
$ind = index($haystack,$needle);
if ( $ind == -1 ) {last};
$haystack = substr($haystack,$ind+1);
$count++;
}
print "Total count: $count\n";
output
$ ./perl.pl
Total count: 4
If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.
use strict;
use warnings;
sub countnmstr_regex {
my ($haystack, $needle) = #_;
return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}
sub countnmstr_index {
my ($haystack, $needle) = #_;
my $i = 0;
my $tally = 0;
while (1){
$i = index($haystack, $needle, $i);
last if $i == -1;
$tally ++;
$i ++;
}
return $tally;
}
use Benchmark qw(cmpthese);
my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';
cmpthese( -2, {
countnmstr_regex => sub { countnmstr_regex($h, $n) },
countnmstr_index => sub { countnmstr_index($h, $n) },
} );
__END__
# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 93701/s -- -66%
countnmstr_index 271893/s 190% --
# Result using a large haystack ($size = 100).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 929/s -- -81%
countnmstr_index 4960/s 434% --
I am parsing a text file with the following format that needs to be distinguished.
cat dog [numeric, 00-23]
pickle fence [numeric, 0-5]
tack glue [numeric,1-53]
dent paint [numeric01-15]
If the minimum of the range is a single digit, then I need to process it a certain way. If the minimum of the range is double-digit (including 00,01,02,etc), I need to process it another way.
if($line =~ /\[numeric.*(\d+)\-(\d+)/i){
$rangemin=$1;
$rangemax=$2;
#find # of digits in $rangemin
#length() doesn't work
#I'm trying to find a function that finds number of digits so length of `00` or `01` or `02` or etc. returns `2`
}
How do I find the # of digits of $rangemin?
Your regular expression grabs the leading 0 because .* is very greedy.
use warnings;
use strict;
while (my $line = <DATA>) {
if ($line =~ /\[numeric[\D]*(\d+)\-(\d+)/i){
my $rangemin = $1;
my $len = length $rangemin;
print "rangemin=$rangemin len=$len\n";
}
}
__DATA__
cat dog [numeric, 00-23]
pickle fence [numeric, 0-5]
tack glue [numeric,1-53]
dent paint [numeric01-15]
Output:
rangemin=00 len=2
rangemin=0 len=1
rangemin=1 len=1
rangemin=01 len=2
You can use the length built-in, just like you would for a string; Perl implicitly converts arguments based on the operator/builtin that they are being passed to:
#!perl
use strict;
use warnings;
my $number = 5;
print length($number), "\n";
You can also use a base-10 logarithm to determine how long a number is:
#!perl
use strict;
use warnings;
my $number = 12345;
my $length = $number > 0
? int(log($number)/log(10)) + 1
: $number == 0
? 0
: int(log(abs($number))/log(10)) + 2;
print $length, "\n";
# 5
**EDIT: as #Keith Thompson points out, log is not defined for numbers <= 0. Use abs to prevent this condition.
I want to compare two numbers isolated from this sample data:
'gi|112807938|emb|CU075707.1|_Xenopus_tropicalis_finished_cDNA,_clone_TNeu129d01 C1:TCONS_00039972(XLOC_025068),_12.9045:32.0354,_Change:1.3118,_p:0.00025,_q:0.50752 C2:TCONS_00045925(XLOC_029835),_10.3694:43.8379,_Change:2.07985,_p:0.0004,_q:0.333824',
'gi|115528274|gb|BC124894.1|_Xenopus_laevis_islet-1,_mRNA_(cDNA_clone_MGC:154537_IMAGE:8320777),_complete_cds C1:TCONS_00080221(XLOC_049570),_17.9027:40.8136,_Change:1.18887,_p:0.00535,_q:0.998852 C2:TCONS_00092192(XLOC_059015),_17.8995:35.5534,_Change:0.990066,_p:0.0355,_q:0.998513',
'gi|118404233|ref|NM_001078963.1|_Xenopus_(Silurana)_tropicalis_pancreatic_lipase-related_protein_2_(pnliprp2),_mRNA C1:TCONS_00031955(XLOC_019851),_0.944706:5.88717,_Change:2.63964,_p:0.01915,_q:0.998852 C2:TCONS_00036655(XLOC_023660),_2.31819:11.556,_Change:2.31757,_p:0.0358,_q:0.998513',
using the following regex:
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
my (#log_change, #largest_change);
foreach (#intersect) {
chomp;
my #condition1_match = ($_ =~ /C1:.*?Change:(-?\d+\.\d+)|C1:.*?Change:(-?inf)/); # Sometimes the value is 'inf' or '-inf'. This allows either a numerical or inf value to be captured.
my #condition2_match = ($_ =~ /C2:.*?Change:(-?\d+\.\d+)|C2:.*?Change:(-?inf)/);
push #log_change, "#condition1_match\t#condition2_match";
}
print Dumper (\#log_change);
Which gives this output:
'1.3118 2.07985 ',
'1.18887 0.990066 ',
'2.63964 2.31757 ',
Ideally, within the same loop I now want to make a comparison between the values held in #condition1_match and #condition2_match such that the larger value is pushed onto a new array, unless comparing against a non numerical 'inf' in which case push the numerical value.
Something like this:
my (#log_change, #largest_change);
foreach (#intersect) {
chomp;
my #condition1_match = ($_ =~ /C1:.*?Change:(-?\d+\.\d+)|C1:.*?Change:(-?inf)/);
my #condition2_match = ($_ =~ /C2:.*?Change:(-?\d+\.\d+)|C2:.*?Change:(-?inf)/);
push #log_change, "#condition1_match\t#condition2_match";
unless ($_ =~ /Change:-?inf/) {
if (#condition1_match > #condition2_match) {
push #largest_change, #condition1_match;
}
else {
push #largest_change, #condition2_match;
}
}
}
print Dumper (\#largest_change);
Which gives:
'2.07985',
undef,
'0.990066',
undef,
'2.31757',
undef,
as well as a lot of this error message:
Use of uninitialized value $condition2_match[1] in join or string at intersect.11.8.pl line 114.
I'm unsure as to what exactly the error message means, as well as why I'm getting undef values in my #largest_change
As you've written your code, #condition_match1 and #condition_match2 will be created with 2 elements -- corresponding to the 2 capture groups in your regular expression -- each time there is a match. But one of these elements will always necessarily be undef, leading to the uninitialized ... warnings.
In this case, you can repair this program by putting the | inside the capture group:
my ($condition1_match) = ($_ =~ /C1:.*?Change:(-?\d+\.\d+|-?inf)/);
my ($condition2_match) = ($_ =~ /C2:.*?Change:(-?\d+\.\d+|-?inf)/);
so that there is a single capture group and the matching operation produces a list with a single, defined element.
In addition, the comparison
if (#condition1_match > #condition2_match) {
is probably not doing what you think it is doing. In Perl, a numerical comparison between two arrays is a comparison of array lengths. What you apparently mean to do is to compare the defined value in each of those arrays, so you would need to do something more cumbersome like:
my $condition1_match = $condition1_match[0] // $condition1_match[1];
my $condition2_match = $condition2_match[0] // $condition2_match[1];
if ($condition1_match > $condition2_match) {
push #largest_change, $condition1_match;
} else {
push #largest_change, $condition2_match;
}
I have input like this:
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone";
It appears as a continuous line, there are no line breaks. I need the
largest value out of the values between [ and the first occurrence of
|. In this case, for example, the largest value is 204. Once
that is obtained, I want to print the contents of that element
between []. In this case, it would be "204|0|{A=9,B=201,C=61,D=11}|Calculator".
I've tried something like this, but it is not going anywhere:
my #array1;
my $data = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=1
+7}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,
+D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C
+=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}
+|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,
+D=14}|phone";
my $high = 0;
my #values = split(/\[([^\]]+)\]/,$data) ;
print "Values is #values \n";
foreach (#values) {
# I want the value that preceeds the first occurence of | in each array
# element, i.e. 0,0,196,204, etc.
my ($conf,$rest)= split(/\|/,$_);
print "Conf is $conf \n";
print "Rest is $rest \n";
push(#array1, $conf);
push (#array2, $rest);
print "Array 1 is #array1 \n";
print "Array 2 is #array2 \n";
}
$conf = highest(#array1);
my $i=0;
# I want the index value of the element that contains the highest conf value,
# in this case 204.
for (#myarray1) { last if $conf eq $_; $i++; };
print "$conf=$i\n";
# I want to print the rest of the string that was split in the same index
# position.
$rest = #array2[$i];
print "Rest is $rest \n";
# To get the highest conf value
sub highest {
my #data = #_;
my $high = 0;
for(#data) {
$high = $_ if $_ > $high;
}
$high;
}
Maybe I should be using a different approach. Could someone help me, please?
One way of doing it:
#!/usr/bin/perl
use strict;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]";
my #parts = split(/\]/, $s);
my $max = 0;
my $data = "";
foreach my $part (#parts) {
if ($part =~ /\[(\d+)/) {
if ($1 > $max) {
$max = $1;
$data = substr($part, 1);
}
}
}
print $data."\n";
A couple of notes:
you can split your original string by \], so you get parts like [0|0|{A=145,B=2,C=12,D=18}|!
then you parse each part to get the integer after the initial [
the rest it's easy: keep track of the biggest integer and of the corresponding part, and output it at the end.
In shell script:
#!/bin/bash
MAXVAL=$(cat /tmp/data | tr [ "\\n" | cut -d"|" -f1 | sort -n | tail -1)
cat /tmp/data | tr [] "\\n" | grep ^$MAXVAL
The first line cuts your big mass of data into lines, extracts just the first field, sorts it and takes the max. The second line cuts the data into lines again and greps for that max val.
If you have a LOT of data, this could be slow, so you could put the "lined" data into a temp file or something.
split() is the Right Tool when you know what you want to throw away. Capturing or m//g is the Right Tool when you know what you want to keep. (paraphrased from a Randal Schwartz quote).
You want to specify what to keep (between square brackets) rather than what to throw away (nothing!).
Luckily, your data is "hash shaped" (ie. alternating keys and values), so load it into a hash, sort the keys, and output the value for the highest key:
my %data = $data =~ /\[
(\d+) # digits are the keys
([^]]+) # rest are the values
\]/gx;
my($highest) = sort {$b <=> $a} keys %data; # inefficent if $data is big
print $highest, $data{$highest}, "\n";
Another way of doing this :
#!/usr/bin/perl
use strict;
my $str = '[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone';
my $maxval = 0;
my $pattern;
while ( $str =~ /(\[(\d+)\|.+?\])/g)
{
if ( $maxval < $2 ) {
$maxval = $2;
$pattern = $1;
}
}
print "Maximum value = $maxval and the associate pattern = $pattern \n";
# In this example $maxvalue = 204
# and $pattern = [204|0|{A=9,B=201,C=61,D=11}|Calculator]
I have the following Perl script counting the number of Fs and Ts in a string:
my $str = "GGGFFEEIIEETTGGG";
my $ft_count = 0;
$ft_count++ while($str =~ m/[FT]/g);
print "$ft_count\n";
Is there a more concise way to get the count (in other words, to combine line 2 and 3)?
my $ft_count = $str =~ tr/FT//;
See perlop.
If the REPLACEMENTLIST is empty, the
SEARCHLIST is replicated. This latter is useful for counting
characters in a class …
$cnt = $sky =~ tr/*/*/; # count the stars in $sky
$cnt = tr/0-9//; # count the digits in $_
Here's a benchmark:
use strict; use warnings;
use Benchmark qw( cmpthese );
my ($x, $y) = ("GGGFFEEIIEETTGGG" x 1000) x 2;
cmpthese -5, {
'tr' => sub {
my $cnt = $x =~ tr/FT//;
},
'm' => sub {
my $cnt = ()= $y =~ m/[FT]/g;
},
};
Rate tr m
Rate m tr
m 108/s -- -99%
tr 8118/s 7440% --
With ActiveState Perl 5.10.1.1006 on 32 Windows XP.
The difference seems to be starker with
C:\Temp> c:\opt\strawberry-5.12.1\perl\bin\perl.exe t.pl
Rate m tr
m 88.8/s -- -100%
tr 25507/s 28631% --
When the "m" operator has the /g flag AND is executed in list context, it returns a list of matching substrings. So another way to do this would be:
my #ft_matches = $str =~ m/[FT]/g;
my $ft_count = #ft_matches; # count elements of array
But that's still two lines. Another weirder trick that can make it shorter:
my $ft_count = () = $str =~ m/[FT]/g;
The "() =" forces the "m" to be in list context. Assigning a list with N elements to a list of zero variables doesn't actually do anything. But then when this assignment expression is used in a scalar context ($ft_count = ...), the right "=" operator returns the number of elements from its right-hand side - exactly what you want.
This is incredibly weird when first encountered, but the "=()=" idiom is a useful Perl trick to know, for "evaluate in list context, then get size of list".
Note: I have no data on which of these are more efficient when dealing with large strings. In fact, I suspect your original code might be best in that case.
Yes, you can use the CountOf secret operator:
my $ft_count = ()= $str =~ m/[FT]/g;
You can combine line 2, 3 and 4 into one like so:
my $str = "GGGFFEEIIEETTGGG";
print $str =~ s/[FT]//g; #Output 4;