I have recently noticed that a quick script I had written in Perl that was designed to be used on sub 10MB files has been modified, re-tasked and used in 40MB+ text files with significant performance issues in a batch environment.
The jobs have been running for about 12 hours per run when encountering a large text file and I am wondering how do I improve the perfomance of the code? Should I slurp the file into memory and if I do it will break the jobs reliance on the line numbers in the file. Any constructive thought would be greatly appreciated, I know the job is looping through the file too many times but how to reduce that?
#!/usr/bin/perl
use strict;
use warnings;
my $filename = "$ARGV[0]"; # This is needed for regular batch use
my $cancfile = "$ARGV[1]"; # This is needed for regular batch use
my #num =();
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
push (#num, $.) if (/^P\|/)
}
close FILE;
my $start;
my $end;
my $loop = scalar(#num);
my $counter =1;
my $test;
open (OUTCANC, ">>$cancfile") || error ("Could not open file: ($!)");
#Lets print out the letters minus the CANCEL letters
for ( 1 .. $loop )
{
$start = shift(#num) if ( ! $start );
$end = shift(#num);
my $next = $end;
$end--;
my $exclude = "FALSE";
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
my $line = $_;
$test = $. if ( eof );
if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
{
print OUTCANC "$line";
$exclude = "TRUECANC";
next;
}
if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
{
print OUTCANC "$line";
} elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
print $_;
}
}
close FILE;
$end = ++$test if ( $end < $start );
$start = $next if ($next);
}
#Lets print the last letter in the file
my $exclude = "FALSE";
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
my $line = $_;
if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
{
$exclude = "TRUECANC";
next;
}
if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
{
print OUTCANC "$line";
} elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
print $_;
}
}
close FILE;
close OUTCANC;
#----------------------------------------------------------------
sub message
{
my $m = shift or return;
print("$m\n");
}
sub error
{
my $e = shift || 'unknown error';
print("$0: $e\n");
exit 0;
}
There are some things that could speed the script up, like removing unneccessary regex usage.
/^P\|/ is equivalent to "P|" eq substr $_, 0, 2.
$foo =~ "BAR" could be -1 != index $foo, "BAR".
Then there is some repeated code. Factoring that out into a sub will not increase performance per se, but makes it easier to reason about the behaviour of the script.
There are a lot of unneccessary stringifications like "$filename" – $filename alone is sufficient.
But the worst offender would be this:
for ( 1 .. $loop ) {
...
open FILE, "<", $filename or ...
while (<FILE>) {
...
}
...
}
You only need to read that file in once, preferably into an array. You can the loop over the indices:
for ( 1 .. $loop ) {
...
for my $i (0 .. $#file_contents) {
my $line = $file_contents[$i];
... # swap $. for $i, but avoid off-by-one error
}
...
}
Disk IO is slow, so cache where you can!
I also see that you are using the $exclude variable as a boolean with the values FALSE and TRUECANC. Why not 0 and 1, so you can use it directly in a conditional?
You can factor out common tests in if/elsif:
if (FOO && BAR) { THING_A }
elsif (FOO && BAZ) { THING_B }
should be
if (FOO) {
if (BAR) { THING_A }
elsif (BAZ) { THING_B }
}
The $. == $start && $line =~ /^P\|.../ test may be silly, because $start contains only the numbers of lines that start with P| – so the regex may be sufficient here.
Edit
If I have understood the script correctly then the following should yield a significant performance increase:
#!/usr/bin/perl
use strict;
use warnings;
my ($filename, $cancfile) = #ARGV;
open my $fh, "<", $filename or die "$0: Couldn't open $filename: $!";
my (#num, #lines);
while (<$fh>)
{
push #lines, $_;
push #num, $#lines if "P|" eq substr $_, 0, 2;
}
open my $outcanc, ">>", $cancfile or die "$0: Couldn't open $cancfile: $!";
for my $i ( 0 .. $#num )
{
my $start = $num[$i];
my $end = ($num[$i+1] // #lines) - 1;
# pre v5.10:
# my $end = (defined $num[$i+1] ? $num[$i+1] : #lines) - 1
if ($lines[$start] =~ /^P[|][0-9]{9}[|]1I[|]IR[|]/) {
print {$outcanc} #lines[$start .. $end];
} else {
print STDOUT #lines[$start .. $end];
}
}
The script is cleaned up. The file is cached in an array. Only the parts of the array are iterated that are actually needed – we are down to O(n) from the previous O(n · m).
For your future scripts: Proving behaviour around loops and mutating variables is not impossible, but tedious and annoying. Realizing that
for (1 .. #num) {
$start = shift #num unless $next; # aka "do this only in the first iteration"
$next = shift #num:
$end = $next - 1:
while (<FH>) {
...
$test = $. if eof
...
}
$end = ++test if $end < $start;
$start = $next if $next;
}
is actually all about circumventing a possible undef in the 2nd shift takes some time. Instead of testing for eof in the inner loop, we can just pick the line number after the loop, so we don't need $test. Then we get:
$start = shift #num;
for my $i (1 .. #num) {
$end = $num[$i] - 1:
while (<FH>) { ... }
$end = $. + 1 if $end < $start; # $end < $start only true if not defined $num[$i]
$start = $num[$i] if $num[$i];
}
After translating $i down by one we confine the out-of-bounds problem to one point only:
for my $i (0 .. $#num) {
$start = $num[$i];
$end = $num[$i+1] - 1; # HERE: $end = -1 if $i == $#num
while (<FH>) { ... }
}
$end = $. + 1 if $end < $start;
After replacing the file reading with an array (careful, there is a difference of one between the array index and the line number), we see that the final file reading loop can be avoided if we pull that iteration into the for loop, because we know how many lines there are in total. So to say, we do
$end = ($num[$i+1] // $last_line_number) - 1;
Hopefully my cleaned up code is indeed equivalent to the original.
Related
With the use of perl regex, if two consecutive lines match than count the number of lines.
I want the number of lines until matches the pattern
D001
0000
open ($file, "$file") || die;
my #lines_f = $file;
my $total_size = $#lines_f +1;
foreach my $line (#lines_f)
{
if ($line =~ /D001/) {
$FSIZE = $k + 1;
} else {
$k++;}
}
Instead of just D001, I also want to check if the next line is 0000. If so $FSIZE is the $file size.
The $file would look something like this
00001
00002
.
.
.
D0001
00000
00000
Here is an example. This sets $FSIZE to undef if it cannot find the marker lines:
use strict;
use warnings;
my $fn = 'test.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
chomp (my #lines = <$fh>);
close $fh;
my $FSIZE = undef;
for my $i (0..$#lines) {
if ($lines[$i] =~ /D0001/) {
if ( $i < $#lines ) {
if ( $lines[$i+1] =~ /00000/ ) {
$FSIZE = $i + 1;
last;
}
}
}
}
I have a fasta file. I need to remove sequences containing “N” or did not contain at least 3 unique bases.
The code so far is below. Also how would I remove the sequence ID line as go along for sequences I delete.
#!/usr/bin/perl
use strict;
use warnings;
open FILE, '<', $ARGV[0] or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open match_fh, ">$ARGV[0]_trimmed.fasta"
or die qq{Failed to open for output: $!\n};
while ( my $line = <FILE> ) {
chomp($line);
if ( $line =~ m/^>/ ) {
print match_fh "$line\n";
my #data = split( /\|/, $line );
my $nextline = <FILE>;
if ( $nextline !~ /N+/g ) {
if ( $nextline =~ /[ATGC]{3}/g ) {
}
print match_fh "$nextline";
}
}
}
close FILE;
close match_fh;
INPUT
>seq1
ATGCGGGATGATCCGAACGTTTAATCTCGTATGCCGTCTTCTATCTCNNN
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq3
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTC
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA
OUTPUT
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA
while(my $head = <FILE>) {
next if($head !~ /^>/);
$_=<FILE>;
if(!/N+/ && /A/+/T/+/G/+/C/ >= 3) {
print match_fh $head, $_;
}
}
I come to you with my question: what is the best way to replace/remove pieces of C code dynamically? I already did something in Perl using regular expressions and reading what to replace/remove from a configuration file but I can't make it dynamic.
Code:
#autoflush output so it will not interfere with calling application
local $| = 1;
##
#libraries
use warnings;
use strict;
use Switch;
use Cwd;
use File::Find;
##
#global variables
my #config;
my $file;
my $directory;
my $result;
##
if(#ARGV < 1)
{
$directory = cwd();
}
else
{
$directory = $ARGV[0];
}
$result = $directory . "\\result";
if(! -d $result)
{
mkdir ($result);
}
open LOG, ">", $result . "\\log.log";
sub start
{
my $configFile = $_[0];
open CONFIG, $configFile;
local $/;
my $conf = <CONFIG>;
close CONFIG;
foreach my $line (split(/\n\*/, $conf))
{
if(index($line, "*") == 0)
{
$line = substr($line, 1);
}
setConfig($line);
}
processFiles();
}
sub setConfig
{
my $line = $_[0];
my $count = () = $line =~ /\s*==>\s*/;
switch($count)
{
case 0
{
remove($line);
}
case 1
{
replace($line);
}
}
}
sub addSlashes
{
$_[0] =~ s/([\.\\\/\+\*\?\[\^\]\(\)\{\}\=\!\<\>\|\:\-])/\\$1/xg;
if($_[1] == 1)
{
$_[0] =~ s/([\$])/\\$1/xg;
}
return;
}
sub remove
{
my $line = $_[0];
addSlashes($line, 1);
$line =~ s/(\\\$){3}/\.\+/g;
$config[#config][0] = qr/$line/;
$config[#config - 1][1] = q("");
$line = "\\(" . $line . "\\(";
$config[#config - 1][2] = qr/$line/;
}
sub replace
{
my $line = $_[0];
my #split = split(/\s*==>\s*/, $line);
my $original = $split[0];
my $replace = $split[1];
my $regex;
addSlashes($original, 1);
addSlashes($replace, 0);
my $counter = 1;
while($original =~ /\\\$([\d]{1,3})\\\$/g)
{
if($1 <= $counter && $1 > 0)
{
$counter++;
}
else
{
print "Invalid format\n";
return;
}
}
if($counter == 1)
{
$config[#config][0] = qr/$original/;
$config[#config - 1][1] = q(") . $replace . q(");
$original = "\\(" . $original . "\\(";
$config[#config - 1][2] = qr/$original/;
return;
}
while($replace =~ /\$([\d]{1,3})\$/g)
{
if($1 <= 0 && $1 >= $counter)
{
print "Invalid format\n";
return;
}
}
$original =~ s/\\\$\d{1,3}\\\$/\(\.\+\?\)/xg;
$original =~ s/\?\)$/\)/xg;
$replace =~ s/\$(\d{1,3})\$/\$$1/xg;
$config[#config][0] = qr/$original/;
$config[#config - 1][1] = q(") . $replace . q(");
$original = "\\(" . $original . "\\(";
$config[#config - 1][2] = qr/$original/;
}
sub processFiles
{
my #files = grep { ! -d } glob "$directory\\*";
foreach my $file (#files)
{
if($file =~ /\.(h|c)$/)
{
process($file);
}
}
}
sub process
{
my $file = $_[0];
open READ, $file;
local $/;
my $text = <READ>;
close READ;
print LOG "\n--> $file <--\n";
for(my $i = 0; $i < #config; $i++)
{
my $original = $config[$i][0];
my $replace = $config[$i][1];
my $log = $config[$i][2];
while($text =~ /$log/g)
{
print LOG $log . " ----> " . $1 . "\n";
}
$text =~ s/$original/$replace/eeg;
print LOG "\n";
}
$file = $result . substr($file, rindex($file, "\\"));
open WRITE, ">", $file;
print WRITE $text;
close WRITE;
}
start("qm2asil.cfg");
close LOG;
Configuration file content:
*static
*GET_$1$() ==> $1$
*GET_$1$($2$) ==> $1$[$2$]
*SET_$1$($2$,$3$); ==> $1$[$2$] = $3$;
*SET_$1$($2$); ==> $1$ = $2$;
The idea is that there are already a few rules to replace/remove and they work but can exist more complex rules that I couldn't manage.
Example:
SET_VAR1((i),(u8)(((s32)(((s32)GET_VAR2 ((i))) != 0)) && ((s32)((u8)(((s32) (((s32)VAR3[i]) != 0)) ^ ((s32)(((s32) VAR4[i]) != 0)))))));
I want to remove SET function and make it an assignment to the variable (VAR1[i] = ...). This is one of many variations of things that need to be removed/replaced.
What do you advise me to do? Can I make it work using Perl and regex or I should reorientate to another method and/or programming language?
EDIT: I already create regexes based on the configuration file but I have problem matching unknown expressions (currently I use .+). The main idea is that I want to keep the configuration as simple I can.
regexes
How can I print a matched line as well as a line before and after it? I've currently got:
perl -lane 'print if $F[3] > 100000 && $F[2] =~ /^C$/ && print $last; $last = $_'
Which is capable of printing both the matched line and a line before it - but I am not sure how to include the line after.
You can read the next line from the file directly using scalar <>:
perl -lane 'print, print scalar <>
if $F[3] > 100000 && $F[2] =~ /^C$/
&& print $last;
$last = $_' input
Or use a sliding window for overlapping matches:
perl -ane 'BEGIN { #b = map [], 1 .. 3 }
sub out {
shift #b;
if ($b[1][3] > 100_000 && $b[1][2] =~ /^C$/) {
print for map $_->[-1], #b;
}
}
push #b, [#F, $_];
out()
}{ out() # Process the last line
' input
The following handles overlapping matches. $N is the number of lines to printe before and after the matching lines.
perl -lane'
BEGIN { $N = 1 }
if ($F[3] > 100000 && $F[2] =~ /^C$/) { print for splice(#buf), $_; $next=$N }
elsif ($next) { --$next; print }
else { push #buf, $_; splice #buf, 0, -$N }
'
Since we know $N = 1, we can simplify the above into the following:
perl -lane'
if ($F[3] > 100000 && $F[2] =~ /^C$/) { print for splice(#buf), $_; $next=1 }
elsif ($next) { $next=0; print }
else { #buf = $_ }
'
You can also use seek and tell and rewind back one line for overlapping matches:
#!/usr/bin/perl
use strict;
use warnings;
open my $fh ,'<', 'input' or die "unable to open file: $!\n";
my $last="";
while(<$fh>){
my #b=split;
if(($b[3] > 100000) && ($b[2] =~ /^C$/)){
print $last if $last;
print;
my $t=tell $fh;
print scalar <$fh>,"\n";
seek $fh,$t,0; #rewind
}
$last=$_;
}
close($fh);
I am taking the user input via -f option, and whatever he enters, accordingly files are being searched recursively.
My problem is: When user enters "tmp*", then also it searches for "abctmp", "xyztmp" etc. What I want to do is, only files starting with tmp should come.
In short, whatever user enters accordingly files should be pushed to array.
Currently I am doing this, but I am sure there's some classy, short way to do it.
#! /perl/bin/perl
use strict;
use warnings;
use File::Find;
use getopt::Long;
my $filename="tmp*.txt";
find( { wanted => \&wanted,
preprocess => \&dir_search,
}, '.');
sub wanted{
my $regex;
my $myop;
my #mylist;
my $firstchar= substr($filename, 0,1); # I am checking first character.
# Whether it's ".*tmp" or just "tmp*"
if($filename=~ m/[^a-zA-Z0-9_]/g){ #If contain wildcard
if($firstchar eq "."){ # first character "."
my $myop = substr($filename, 1,1);
my $frag = substr($filename,2);
$filename = $frag;
$regex = '\b(\w' . ${myop}. ${filename}. '\w*)\b';
# Has to find whatever comes before 'tmp', too
} else {
$regex = '\b(' . ${myop}. ${filename}. '\w*)\b';
# Like, "tmp+.txt" Only search for patterns starting with tmp
}
if($_ =~ /$regex/) {
push(#mylist, $_);
}
} else {
if($_ eq $filename) { #If no wildcard, match the exact name only.
push(#mylist, $_);
}
}
}
sub dir_search {
my (#entries) = #_;
if ($File::Find::dir eq './a') {
#entries = grep { ((-d && $_ eq 'g') ||
((-d && $_ eq 'h') ||
(!(-d && $_ eq 'x')))) } #entries;
# Want from 'g' and 'h' folders only, not from 'x' folder
}
return #entries;
}
And another thing is, I want to search for only '.txt' files. Where should I put that condition?
#!/perl/bin/perl
sub rec_dir {
($dir,$tmpfile_ref) = #_;
opendir(CURRENT, $dir);
#files = readdir(CURRENT);
closedir(CURRENT);
foreach $file (#files) {
if( $file eq ".." || $file eq "." ) { next; }
if( -d $dir."/".$file ) { rec_dir($dir."/".$file,$tmpfile_ref); }
elsif( $file =~ /^tmp/ && $file =~ /\.txf$/ ) { push(#{$tmpfile_ref},$dir."/".$file); }
}
}
#matching_files = ();
$start_dir = ".";
rec_dir($start_dir,\#matching_files);
foreach $file (#matching_files) { print($file."\n"); }
I didn't test it. Barring typographical errors I think it will work.