search and replace between two files -post#2 - regex

I need to replace a variable in one file with information/regex found in another file. For example, I need to grab the first string/variable (that starts with MSTRG) from each row in file1:
MSTRG.5734 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
MSTRG.26588 MSTRG.26588 196 5 0 12 0 0
and use this to search in file2 that will look something like this:
Chr1 StringTie transcript 24039360 24041181 1000 - . gene_id "MSTRG.5734"; transcript_id "MSTRG.5734.1";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.5734"; transcript_id "Transcript:AT1G64700.1"; ref_gene_id "Gene:AT1G64700"
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.19266"; transcript_id "MSTRG.19266.1"; exon_number "2";
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.26588"; transcript_id "MSTRG.26588.1"; exon_number "2";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.26588"; transcript_id "Transcript:AT5G41000.1"; ref_gene_id "Gene:AT5G41000";
Ideally when e.g. MSTRG.5734 is found on a line in file2 that also contains the string e.g. Gene:AT1G64700, it will grab the information Gene:AT1G64700 and replace MSTRG.5734 in file1. So, every MSTRG on each row in file1 is unique and it will theoretically match a unique Gene in file2. If it does not match a Gene then I need the original row in file1 to be maintained.
File1 output should then look like:
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
My current perl code is:
use strict;
use warnings;
use vars qw($outfile #id $mstrg $gene);
open(SEARCH, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
open(FILE, $ARGV[1]) or die "Couldn't open $ARGV[1]: $!";
$outfile = "testout.txt";
open (OUT, ">$outfile") || die "Can't open $outfile for creation: $!\n";
my %mstrg;
while (<SEARCH>) {
chomp;
if (/^MSTRG/) {
chomp $_;
#id = split (/\t/, $_);
$mstrg{$id[1]}.="$id[1]";
}
}
while (<FILE>) # {#ffn=<FILE>};
{
chomp ($gene=$_);
if ($mstrg =~ /$gene/) {
print OUT "$id[1]\t$id[2]";}
else {
#print OUT "$_\n";
}
next;
}
close FILE;
This is unfortunately where I get stuck and not sure how to proceed?
Thanks for any help given, appreciated and apologies for the previous post where I did not include any code to those that saw.

I would do this somewhat like:
use 5.014; #needed min 5.014 because the /r modifier
use warnings;
use Path::Tiny '0.077'; #added the min. req. version
my $file1='file1.txt';
my $file2='file2.txt';
my %mstmap = map { split /\s+/, s/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+)".*/$1 $2/r }
grep { /ref_gene_id.*Gene:/ } path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });
for your input files produces
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
It create a hash for the pairs: MSTRG.number => Gene:String (from the file2), and using the Path::Tiny module editing fucntion doing the replaces in the file1.
After #Borodin comments the above could be reduced to:
use 5.014;
use warnings;
use Path::Tiny '0.077';
my $file1='f1';
my $file2='f2';
my %mstmap = map {
/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+).*/
} path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });

There's no need to write impregnable code to achieve the result you want
This program reads through $file2 building hash %mstrg from all the lines that contain both an MSTRG. and a Gene: string. It then creates a regex in $re that will match any one of the MSTR. strings found
$file1 is opened, and that regex is used to replace any of the hash keys with the corresponding hash value it appears at the start of the line. The line is then printed
It isn't clear whether the first two fields of file1.txt are always the same, but I've opted to alter only the first field
I've used the autodie pragma to avoid having to explicitly check for success of any file IO operations
The program prints the output to STDOUT, so you can redirect it wherever you like on the command line
use strict;
use warnings 'all';
use autodie;
my ( $file1, $file2 ) = #ARGV;
my %mstrg;
{
open my $fh, '<', $file2;
while ( <$fh> ) {
$mstrg{$1} = $2 if /"(MSTRG.\d+)".*"(Gene:\w+)"/;
}
}
my $re = join '|', sort { length $b <=> length $a } keys %mstrg;
open my $fh, '<', $file1;
while ( <$fh> ) {
s/^($re)\b/$mstrg{$1}/;
print;
}
output
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0

Related

Why do I get the first capture group only?

(https://stackoverflow.com/a/2304626/6607497 and https://stackoverflow.com/a/37004214/6607497 did not help me)
Analyzing a problem with /proc/stat in Linux I started to write a small utility, but I can't get the capture groups the way I wanted.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
if (open(my $fh, '<', my $file = '/proc/stat')) {
while (<$fh>) {
if (my ($cpu, #vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
print "$cpu $#vals\n";
}
}
close($fh);
} else {
die "$file: $!\n";
}
For example with these input lines I get the output:
> cat /proc/stat
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
0
0 0
1 0
2 0
3 0
So the match actually works, but I don't get the capture groups into #vals (perls 5.18.2 and 5.26.1).
Only the last of the repeated matches from a single pattern is captured.
Instead, can just split the line and then check on -- and adjust -- the first field
while (<$fh>) {
my ($cpu, #vals) = split;
next if not $cpu =~ s/^cpu//;
print "$cpu $#vals\n";
}
If the first element of the split's return doesn't start with cpu the regex substition fails and so the line is skipped. Otherwise, you get the number following cpu (or an empty string), as in OP.†
Or, can use the particular structure of the line you process
while (<$fh>) {
if (my ($cpu, #vals) = map { split } /^cpu([0-9]*) \s+ (.*)/x) {
print "$cpu $#vals\n";
}
}
The regex returns two items and each is split in the map, except that the first one is just passed as is into $cpu (being either a number or an empty string), while the other yields the numbers.
Both these produce the needed output in my tests.
† Since we always check for ^cpu (and remove it) it makes sense to do that first, and only then split -- when needed. However, that gets a little tricky for the following reason.
That bare split strips the leading (and trailing) whitespaces by its default, so for lines where cpu string has no trailing digits (cpu 2709779...) we would end up having the next number for what should be the cpu designation! A quiet error.
Thus we need to specify for split to use spaces, as it then leaves the leading spaces
while (<$fh>) {
next if not s/^cpu//;
my ($cpu, #vals) = split /\s+/; # now $cpu may be space(s)
print "$cpu $#vals\n";
}
This now works as intended as the cpu without trailing numbers gets space(s), a case to handle but clear. But this is misleading and an unaware maintainer -- or us the proverbial six months later -- may be tempted to remove the seemingly "unneeded" /\s+/, introducing an error.
Going by the example input, following content inside the while loop should work.
if (/^cpu(\d*)/) {
my $cpu = $1;
my (#vals) = /(?:\s+(\d+))+/g;
print "$cpu $#vals\n";
}
In an exercise for Learning Perl, we state a problem that's easy to solve with two simple regexes but hard with one (but then in Mastering Perl I pull out the big guns). We don't tell people this because we want to highlight the natural behavior to try to write everything in a single regex. Some of the contortions in other answers remind me of that, and I wouldn't want to maintain any of them.
First, there's the issue of only processing the interesting lines. Then, once we have that line, grab all the numbers. Translating that problem statement into code is very simple and straightforward. No acrobatics here because assertions and anchors do most of the work:
use v5.10;
while( <DATA> ) {
next unless /\A cpu(\d*) \s /ax;
my $cpu = $1;
my #values = / \b (\d+) \b /agx;
say "$cpu " . #values;
}
__END__
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
Note that the OP still has to decide how to handle the cpu case with no trailing digits. Don't know what you want to do with the empty string.
Perl's regex engine will only remember the last capture group from a repeated expression. If you want to capture each number in a separate capture group, then one option would be to use an explicit regex pattern:
if (open(my $fh, '<', my $file = '/proc/stat')) {
while (<$fh>) {
if (my ($cpu, #vals) = /^cpu(\d*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/) {
print "$cpu $#vals\n";
}
}
close($fh);
} else {
die "$file: $!\n";
}
Replacing
while (<$fh>) {
if (my ($cpu, #vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
with
while (<$fh>) {
my #vals;
if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(#vals, $^N) }))+$/) {
does what I wanted (requires perl 5.8 or newer).
he's my example. I thought I'd add it because I like simple code. It also allows "cpu7" with no trailing digits.
#!/usr/bin/perl
use strict;
use warnings;
my $file = "/proc/stat";
open(my $fh, "<", $file) or die "$file: $!\n";
while (<$fh>)
{
if ( /^cpu(\d+)(\s+)?(.*)$/ )
{
my $cpu = $1;
my $vals = scalar split( /\s+/, $3 ) ;
print "$cpu $vals\n";
}
}
close($fh);
Just adding to Tim's answer:
You can capture multiple values with one group (using the g-modifier), but then you have to split the statement.
if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+))+$/) {
my #vals= /(?:\s+(\d+))/g;
print "$cpu $#vals\n";
}

Pattern counter in fasta file

I am trying to get the count of matching patterns in the fasta file. I am starting from a fasta file containing 57k sequences. I want to pull out the count of my matching pattern sequence and show the starting position of the pattern
Input file:
chr1 ATTAG**CAGAT**GTGACGTCGATGT**CAGAT**TG
chr2 TGAGCTG**CAGAT**CGTAGATGATTCTGCAGGAACCT
chr3 TCTTT**CAGAT**GCCTCTG**CAGAT**TC
Searching pattern "CAGAT"
Output Required:
chr Count P1 P2
chr1 -- 2 -- 6 -- 25
chr2 -- 1 -- 8
chr3 -- 2 -- 6 -- 19
Thanks in advance
I am assuming your file to be seperated with tab as delimiter.
open(my $in,"<:utf8","in.txt") or die "Cannot open FILE in.txt : $!\n";
while(<$in>) {
chomp($_);
my $cur = $_;
#$print "iam $cur\n";
my #tt = split(/\t/,$cur); #assuming you file tobe tab seperated
my $s1 = $tt[1];
my $s2 = "CAGAT";
my #val;
print "$cur\t";
while ($s1 =~ /($s2)/g) {
push(#val, $-[0]); #$-[0] is the offset of the start of the last successful match.
}
my $count = #val;
#val = join(",",#val);
print " No of Matches:$count Starting positions:#val\n";
}
in.txt
chr1 ATTAG**CAGAT**GTGACGTCGATGT**CAGAT**TG
chr2 TGAGCTG**CAGAT**CGTAGATGATTCTGCAGGAACCT
chr3 TCTTT**CAGAT**GCCTCTG**CAGAT**TC

Using regex to calculate the number of ints within a data set [Perl]

I am having difficulty with a certain set of regex that needs to be solved to calculate the frequency of positive, negative, and 0 integers within the data set inside of the sample code. I have successfully gotten it to solve the negative integers, but no such luck with positive and 0.
#!/usr/bin/perl
use strict;
use warnings;
my ( $ctrP, $ctrN, $ctrZ ) = ( 0, 0, 0 );
while( my $num = <DATA> ) {
chomp($num);
## print "num=[$num]\n";
if ( $num =~ /^-\d+$/ ) {
$ctrN++;
}
elsif ( $num =~ /^[1-9]\d*$/ ) {
$ctrZ++;
}
else {
$ctrP++;
}
}
printf("freq(Z+):%8s\n", $ctrP );
printf("freq(Z-):%8s\n", $ctrN );
printf("freq(0):%9s\n", $ctrZ );
printf("Total:%11s\n", ($ctrP+$ctrN+$ctrZ) );
exit;
__DATA__
29
42
324
-511
32
354
0
-29
765
17
-32
You can use the numeric comparison operator <=> which returns -1, 0, or 1 according to whether the first operand is less than, equal to, or greater than the second, respectively. If you use it to compare each value to zero and add one to the result then you can index into an array
Like this
use strict;
use warnings 'all';
my #counts;
++$counts[($_ <=> 0) + 1] while <DATA>;
my ($ctrN, $ctrZ, $ctrP) = #counts;
printf "freq(Z+): %4d\n", $ctrP;
printf "freq(Z-): %4d\n", $ctrN;
printf "freq(0): %4d\n", $ctrZ;
printf "Total: %4d\n", $ctrP + $ctrN + $ctrZ;
__DATA__
29
42
324
-511
32
354
0
-29
765
17
-32
output
freq(Z+): 7
freq(Z-): 3
freq(0): 1
Total: 11
Swap $ctrZ++; and $ctrP++; lines:
# ...................
elsif ( $num =~ /^[1-9]\d*$/ ) {
$ctrP++;
}
else {
$ctrZ++;
}
# ...................

Perl Regex Help - find lines in another file

I want the match to occur but it's not happening.
I have an issue. I wish to find occurences of lines from one file in another.
Here's one file (#file)
735 1 1
1891 1 0
2021 1 1
1892 2 1
667 1 0
802 2 1
665 1 0
666 1 1
596 1 0
3193 2 1
Here's the one in which I have to find above lines (#file1)
1521 1 0 : 1167 0 0 : 1167 2 0 : 1167 1 0 ;
2605 1 1 ;
2280 0 1 : 2280 2 0 : 1892 0 0 : 2280 1 0 : 2021 0 0 ;
1892 2 1 : 667 0 1 : 667 1 0 ;
1892 1 1 ;
Here's the code I wrote
foreach $leadline (#file1) {
foreach $line (#file) {
$_ = ' ' . $leadline;
$line = ' ' . $line;
if (m/$line/) {
push #final, $_;
}
}
}
But I am unable to detect the lines.
#file1 and #file variables store the contents of the files.
I either get no lines detected or all lines detected.
The reason I am concatenating a Space before the two lines is , that sometimes 667 1 0 can occur as the very first phrase in a given line.I am not comfortable with Regex to do that in Regex directly.
Note :- If line i and line j in first file occur as a pattern in the same line of the other file then output should be just one of the lines. Also , if a pattern 1667 1 0 is found, it shouldn't be confused with 667 1 0. Hence I added the whitespace.
I was able to achieve this goal in Python but unable to replicate it in Perl . Here's the Python snippet :-
for line1 in file1:
for j in range(0,len(file0)-1):# in file0:
if ' '+lines[j][0:len(file0[j])-1] in ' '+line1:
i = i + 1
print line1[0:len(line1)-1]
break
Expected output is :-
1892 2 1 : 667 0 1 : 667 1 0 ;
I now think this is a solution to a different problem, but here it is anyway!
use warnings;
use strict;
use 5.010;
use Array::Utils 'array_diff';
open my $fh, '<', 'f1.txt' or die $!;
my #f1;
while ( <$fh> ) {
push #f1, [split];
}
my #final;
open $fh, '<', 'f2.txt' or die $!;
while ( <$fh> ) {
my #f2 = map [ /\d+/g ], split /:/;
for my $f1 ( #f1 ) {
my #matches = grep { not array_diff(#$f1, #$_) } #f2;
push #final, map "#$_", #matches;
}
}
say for #final;
output
1892 2 1
667 0 1
667 1 0
Update
Okay here's my second attempt! This is essentially what choroba wrote but using map and with the addition of stripping all trailing whitespace on the data from the first file.
use warnings;
use strict;
use 5.014; # For non-destructive substitution
open my $fh, '<', 'f1.txt' or die $!;
my #f1 = map s/\s+\z//r, <$fh>;
my $re = join '|', #f1;
open $fh, '<', 'f2.txt' or die $!;
my #final = grep /\b(?:$re)\b/, <$fh>;
print for #final;
output
1892 2 1 : 667 0 1 : 667 1 0 ;
You can create a regex by joining the lines from file1 by | (and applying quotemeta on each). \b should prevent matching 667 in 1667.
#!/usr/bin/perl
use warnings;
use strict;
my #search;
open my $F1, '<', 'file1' or die $!;
while (<$F1>) {
chomp;
push #search, quotemeta;
}
my $regex = join '|', #search;
$regex = qr/\b(?:$regex)\b/;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
print if /$regex/;
}
Here is how I'd do the job:
use Modern::Perl;
use Data::Dumper;$Data::Dumper::Indent = 1;
my #file = (
'735 1 1',
'1891 1 0',
'2021 1 1',
'1892 2 1',
'667 1 0',
'802 2 1',
'665 1 0',
'666 1 1',
'596 1 0',
'3193 2 1',
);
my #final;
while(my $line = <DATA>) {
chomp $line;
if (grep{$line =~ /\b$_\b/} #file) {
push #final, $line;
}
}
say Dumper\#final;
__DATA__
1521 1 0 : 1167 0 0 : 1167 2 0 : 1167 1 0 ;
2605 1 1 ;
2280 0 1 : 2280 2 0 : 1892 0 0 : 2280 1 0 : 2021 0 0 ;
1892 2 1 : 667 0 1 : 667 1 0 ;
1892 1 1 ;
Output:
$VAR1 = [
'1892 2 1 : 667 0 1 : 667 1 0 ; '
];
With your files:
use Modern::Perl;
use Data::Dumper;$Data::Dumper::Indent = 1;
open my $fh, '<', 'file.txt' or die "unable to open 'file.txt': $!";
my #file = <$fh>;
chomp #file;
my #final;
open $fh, '<', 'file1.txt' or die "unable to open 'file1.txt': $!";
while(my $line = <$fh>) {
chomp $line;
if (grep{$line =~ /\b$_\b/} #file) {
push #final, $line;
}
}
say Dumper\#final;

How to print next N lines for all instances of a string pattern?

I have a file which looks like the following:
File
variableStep chrom=chr1 span=25
10076 0.84
10101 1
10126 1
10151 1
10176 1
10201 1
10226 1.72
variableStep chrom=chr1 span=25
10251 2
10276 1.16
10301 1
10326 1
10351 1
10376 1
10401 1
10426 0.28
11451 0.04
variableStep chrom=chr2 span=25
9781451 2
19781476 2
19781501 2
19781526 2
19781551 1
19781576 1
19781601 0.48
variableStep chrom=chr2 span=25
19781826 0.28
19781851 1
19781876 1
19781901 1
19781926 1
19781951 1.48
19781976 3.68
19782001 4.56
19782026 4
variableStep chrom=chr3 span=25
4813476 1
24813501 1
24813526 1
24813551 1
24813576 1.88
24813601 2
variableStep chrom=chr3 span=25
24813626 1.4
24813651 1.48
24813676 2
24813701 2
24813726 2
24813751 2
variableStep chrom=chr4 span=25
24815401 2.24
24815426 3
24815451 3
24815476 3
24815501 3
24815526 2.04
variableStep chrom=chr4 span=25
24815551 2
24815576 1.76
24815601 0.76
24815951 0.48
24815976 1
24816001 1
24816026 1
24816051 1
variableStep chrom=chr5 span=25
24817226 0.92
24817251 1.48
24817276 3
24817301 3
variableStep chrom=chr5 span=25
24817326 3
24817351 3
24817376 3
24817401 3.04
24817426 3.08
What Is Needed
What I need to do is, for all instances of say variableStep chrom=chr1 span=25, print out the subsequent n lines to an output file. n I must mention, is highly variable. It can vary anywhere from 300,000 to 500,000+ in the actual file.
Desired Output
1.Output_file_1_for_variableStep chrom=chr1 span=25
10076 0.84
10101 1
10126 1
10151 1
10176 1
10201 1
10226 1.72
10251 2
10276 1.16
10301 1
10326 1
10351 1
10376 1
10401 1
10426 0.28
11451 0.04
2._Output_file_2_for_variableStep chrom=chr2 span=25
9781451 2
19781476 2
19781501 2
19781526 2
19781551 1
19781576 1
19781601 0.48
19781826 0.28
19781851 1
19781876 1
19781901 1
19781926 1
19781951 1.48
19781976 3.68
19782001 4.56
19782026 4
3._Output_file_3_for_variableStep chrom=chr3 span=25
4813476 1
24813501 1
24813526 1
24813551 1
24813576 1.88
24813601 2
24813626 1.4
24813651 1.48
24813676 2
24813701 2
24813726 2
24813751 2
4._Output_file_4_for_variableStep chrom=chr4 span=25
24815401 2.24
24815426 3
24815451 3
24815476 3
24815501 3
24815526 2.04
24815551 2
24815576 1.76
24815601 0.76
24815951 0.48
24815976 1
24816001 1
24816026 1
24816051 1
5._Output_file_5_for_variableStep chrom=chr5 span=25
24817226 0.92
24817251 1.48
24817276 3
24817301 3
24817326 3
24817351 3
24817376 3
24817401 3.04
24817426 3.08
Background
I still consider myself a Perl newbie so the code I've written doesn't quite completely accomplish the task.
In fact the following code depicts the 3 ways in which I was trying to get it to work. For the code with pattern variableStep chrom=chr1 span=25 I tried to print the subsequent lines after the regex match manually.
From that I figured, I need a loop to run through all the subsequent lines,which is what i've written with the pattern variableStep chrom=chr1 span=25. But then, I realized I need an exit mechanism as otherwise all the subsequent lines were getting printed.
It is this exit pattern written as last if /^v.*$/which I need to figure out. As the one I have is currently printing only the very first instance of the specific pattern. There is no blank line either on which I could exit. Had I had a blank line, this piece of code is working perfectly fine( modifying to last if /^$/ ). I even tried using a non decimal character as /^\D.*$/, but it isn't working. What exit pattern should I use?
The remaining part of the code was my baby attempts to get the program to work and it only prints the single subsequent line after the pattern match.
Code
#Trial code to parse main file
use 5.014;
use warnings;
#Assign filename
my $file = 'trial.txt';
#Open filename
open my $fh, '<' , $file || die $!;
#Open output
open OUT1, ">Trial_chr1.out" || die $!;
open OUT2, ">Trial_chr2.out" || die $!;
open OUT3, ">Trial_chr3.out" || die $!;
open OUT4, ">Trial_chr4.out" || die $!;
open out5, ">Trial_chr5.out" || die $!;
#Read in file
while(<$fh>){
chomp;
if (/^variableStep chrom=chr1 span=25/){
my $nextline1 = <$fh>;#means next line after pattern match
my $nextline2 = <$fh>;
my $nextline3 = <$fh>;
my $nextline4 = <$fh>;
my $nextline5 = <$fh>;
my $nextline6 = <$fh>;
my $nextline7 = <$fh>;
print OUT1 $nextline1;
print OUT1 $nextline2;
print OUT1 $nextline3;
print OUT1 $nextline4;
print OUT1 $nextline5;
print OUT1 $nextline6;
print OUT1 $nextline7;
}elsif(/^variableStep chrom=chr2 span=25/){
my #grabbed_lines; #Initialize array to store lines after pattern match
while (<$fh>){ #Read subsequent lines while in a loop
last if /^v.*$/; #Break out of the loop if line encountered begins with v
push #grabbed_lines, $_;# As long as the above condition is false, push the lines into the array
}print OUT2 #grabbed_lines; # Print the grabbed lines
}elsif(/^variableStep chrom=chr3 span=25/){
my $nextline = <$fh>;
print OUT3 $nextline;
}elsif(/^variableStep chrom=chr4 span=25/){
my $nextline = <$fh>;
print OUT4 $nextline;
}elsif(/^variableStep chrom=chr5 span=25/){
my $nextline = <$fh>;
print out5 $nextline;
}
}
#Exit
exit;
Thank you for taking the time to go through my question. I'd be grateful for any tips and suggestions.
Ok I missunderstood the n part, it is different for each match, this is tested and working:
my $found = 0;
while (<$fh>) {
if ( $found && /^\d/ ) {
print $_;
}
else {
$found = 0;
}
if (/^variableStep chrom=chr2 span=25/) {
$found = 1;
}
}
this way it prints all following rows that start with a digit.
Explanation:
The problem here is, that each time you are calling <$fh> it reads the next row, so if you test the row content and your test fails, you should not do the next loop, because then the next row is read and you lost the row where the test failed.
So I came to this solution:
I use a flag to know in which mode I am, am I searching for rows to print or not?
The first if is only entered
if I have been in the second if if in a loop before and the flag has been set to "1"
AND the row start with a digit.
When this test fails, i.e. there is no row with a digit at the beginning, I reset the flag and have then the chance to look again at the same row if it starts with "variableStep ..."
Oneliner below should do the trick (assuming that output files doesn't already exist):
perl -lne '/variableStep/ && open($fh, ">>", $_) && next; print $fh $_;' input.txt
--
btw: || operator has high priority (man perlop), so command:
open OUT1, ">Trial_chr1.out" || die $!;
is understood by perl as
open OUT1, (">Trial_chr1.out" || die $!);
To do error checking you should use and operator instead or use parentheses to enforce desired behaviour
I love perl, but awk is more suitable in this case, see :
$ awk '
{if ($0 ~ /^variableStep/) {file="output_file_"++c"_"$1"_"$2"_"$3}
else{print $0 > file}}
' file.txt
$ ls -l output_file_*
Using perl and File::Slurp useful module :
use strict; use warnings;
use File::Slurp;
my ($c, $file);
while (<>) {
if (/^variableStep\s+chrom=\w+\s+span=\d+/) {
$c++;
$file = $&;
$file =~ s/\s/_/g;
$file = "output_file_${c}_" . $file;
}
else {
append_file $file, $_;
}
}
Usage :
$ perl ./script.pl file.txt
$ ls -l output_file_*