I want the match to occur but it's not happening.
I have an issue. I wish to find occurences of lines from one file in another.
Here's one file (#file)
735 1 1
1891 1 0
2021 1 1
1892 2 1
667 1 0
802 2 1
665 1 0
666 1 1
596 1 0
3193 2 1
Here's the one in which I have to find above lines (#file1)
1521 1 0 : 1167 0 0 : 1167 2 0 : 1167 1 0 ;
2605 1 1 ;
2280 0 1 : 2280 2 0 : 1892 0 0 : 2280 1 0 : 2021 0 0 ;
1892 2 1 : 667 0 1 : 667 1 0 ;
1892 1 1 ;
Here's the code I wrote
foreach $leadline (#file1) {
foreach $line (#file) {
$_ = ' ' . $leadline;
$line = ' ' . $line;
if (m/$line/) {
push #final, $_;
}
}
}
But I am unable to detect the lines.
#file1 and #file variables store the contents of the files.
I either get no lines detected or all lines detected.
The reason I am concatenating a Space before the two lines is , that sometimes 667 1 0 can occur as the very first phrase in a given line.I am not comfortable with Regex to do that in Regex directly.
Note :- If line i and line j in first file occur as a pattern in the same line of the other file then output should be just one of the lines. Also , if a pattern 1667 1 0 is found, it shouldn't be confused with 667 1 0. Hence I added the whitespace.
I was able to achieve this goal in Python but unable to replicate it in Perl . Here's the Python snippet :-
for line1 in file1:
for j in range(0,len(file0)-1):# in file0:
if ' '+lines[j][0:len(file0[j])-1] in ' '+line1:
i = i + 1
print line1[0:len(line1)-1]
break
Expected output is :-
1892 2 1 : 667 0 1 : 667 1 0 ;
I now think this is a solution to a different problem, but here it is anyway!
use warnings;
use strict;
use 5.010;
use Array::Utils 'array_diff';
open my $fh, '<', 'f1.txt' or die $!;
my #f1;
while ( <$fh> ) {
push #f1, [split];
}
my #final;
open $fh, '<', 'f2.txt' or die $!;
while ( <$fh> ) {
my #f2 = map [ /\d+/g ], split /:/;
for my $f1 ( #f1 ) {
my #matches = grep { not array_diff(#$f1, #$_) } #f2;
push #final, map "#$_", #matches;
}
}
say for #final;
output
1892 2 1
667 0 1
667 1 0
Update
Okay here's my second attempt! This is essentially what choroba wrote but using map and with the addition of stripping all trailing whitespace on the data from the first file.
use warnings;
use strict;
use 5.014; # For non-destructive substitution
open my $fh, '<', 'f1.txt' or die $!;
my #f1 = map s/\s+\z//r, <$fh>;
my $re = join '|', #f1;
open $fh, '<', 'f2.txt' or die $!;
my #final = grep /\b(?:$re)\b/, <$fh>;
print for #final;
output
1892 2 1 : 667 0 1 : 667 1 0 ;
You can create a regex by joining the lines from file1 by | (and applying quotemeta on each). \b should prevent matching 667 in 1667.
#!/usr/bin/perl
use warnings;
use strict;
my #search;
open my $F1, '<', 'file1' or die $!;
while (<$F1>) {
chomp;
push #search, quotemeta;
}
my $regex = join '|', #search;
$regex = qr/\b(?:$regex)\b/;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
print if /$regex/;
}
Here is how I'd do the job:
use Modern::Perl;
use Data::Dumper;$Data::Dumper::Indent = 1;
my #file = (
'735 1 1',
'1891 1 0',
'2021 1 1',
'1892 2 1',
'667 1 0',
'802 2 1',
'665 1 0',
'666 1 1',
'596 1 0',
'3193 2 1',
);
my #final;
while(my $line = <DATA>) {
chomp $line;
if (grep{$line =~ /\b$_\b/} #file) {
push #final, $line;
}
}
say Dumper\#final;
__DATA__
1521 1 0 : 1167 0 0 : 1167 2 0 : 1167 1 0 ;
2605 1 1 ;
2280 0 1 : 2280 2 0 : 1892 0 0 : 2280 1 0 : 2021 0 0 ;
1892 2 1 : 667 0 1 : 667 1 0 ;
1892 1 1 ;
Output:
$VAR1 = [
'1892 2 1 : 667 0 1 : 667 1 0 ; '
];
With your files:
use Modern::Perl;
use Data::Dumper;$Data::Dumper::Indent = 1;
open my $fh, '<', 'file.txt' or die "unable to open 'file.txt': $!";
my #file = <$fh>;
chomp #file;
my #final;
open $fh, '<', 'file1.txt' or die "unable to open 'file1.txt': $!";
while(my $line = <$fh>) {
chomp $line;
if (grep{$line =~ /\b$_\b/} #file) {
push #final, $line;
}
}
say Dumper\#final;
Related
I need to replace a variable in one file with information/regex found in another file. For example, I need to grab the first string/variable (that starts with MSTRG) from each row in file1:
MSTRG.5734 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
MSTRG.26588 MSTRG.26588 196 5 0 12 0 0
and use this to search in file2 that will look something like this:
Chr1 StringTie transcript 24039360 24041181 1000 - . gene_id "MSTRG.5734"; transcript_id "MSTRG.5734.1";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.5734"; transcript_id "Transcript:AT1G64700.1"; ref_gene_id "Gene:AT1G64700"
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.19266"; transcript_id "MSTRG.19266.1"; exon_number "2";
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.26588"; transcript_id "MSTRG.26588.1"; exon_number "2";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.26588"; transcript_id "Transcript:AT5G41000.1"; ref_gene_id "Gene:AT5G41000";
Ideally when e.g. MSTRG.5734 is found on a line in file2 that also contains the string e.g. Gene:AT1G64700, it will grab the information Gene:AT1G64700 and replace MSTRG.5734 in file1. So, every MSTRG on each row in file1 is unique and it will theoretically match a unique Gene in file2. If it does not match a Gene then I need the original row in file1 to be maintained.
File1 output should then look like:
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
My current perl code is:
use strict;
use warnings;
use vars qw($outfile #id $mstrg $gene);
open(SEARCH, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
open(FILE, $ARGV[1]) or die "Couldn't open $ARGV[1]: $!";
$outfile = "testout.txt";
open (OUT, ">$outfile") || die "Can't open $outfile for creation: $!\n";
my %mstrg;
while (<SEARCH>) {
chomp;
if (/^MSTRG/) {
chomp $_;
#id = split (/\t/, $_);
$mstrg{$id[1]}.="$id[1]";
}
}
while (<FILE>) # {#ffn=<FILE>};
{
chomp ($gene=$_);
if ($mstrg =~ /$gene/) {
print OUT "$id[1]\t$id[2]";}
else {
#print OUT "$_\n";
}
next;
}
close FILE;
This is unfortunately where I get stuck and not sure how to proceed?
Thanks for any help given, appreciated and apologies for the previous post where I did not include any code to those that saw.
I would do this somewhat like:
use 5.014; #needed min 5.014 because the /r modifier
use warnings;
use Path::Tiny '0.077'; #added the min. req. version
my $file1='file1.txt';
my $file2='file2.txt';
my %mstmap = map { split /\s+/, s/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+)".*/$1 $2/r }
grep { /ref_gene_id.*Gene:/ } path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });
for your input files produces
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
It create a hash for the pairs: MSTRG.number => Gene:String (from the file2), and using the Path::Tiny module editing fucntion doing the replaces in the file1.
After #Borodin comments the above could be reduced to:
use 5.014;
use warnings;
use Path::Tiny '0.077';
my $file1='f1';
my $file2='f2';
my %mstmap = map {
/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+).*/
} path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });
There's no need to write impregnable code to achieve the result you want
This program reads through $file2 building hash %mstrg from all the lines that contain both an MSTRG. and a Gene: string. It then creates a regex in $re that will match any one of the MSTR. strings found
$file1 is opened, and that regex is used to replace any of the hash keys with the corresponding hash value it appears at the start of the line. The line is then printed
It isn't clear whether the first two fields of file1.txt are always the same, but I've opted to alter only the first field
I've used the autodie pragma to avoid having to explicitly check for success of any file IO operations
The program prints the output to STDOUT, so you can redirect it wherever you like on the command line
use strict;
use warnings 'all';
use autodie;
my ( $file1, $file2 ) = #ARGV;
my %mstrg;
{
open my $fh, '<', $file2;
while ( <$fh> ) {
$mstrg{$1} = $2 if /"(MSTRG.\d+)".*"(Gene:\w+)"/;
}
}
my $re = join '|', sort { length $b <=> length $a } keys %mstrg;
open my $fh, '<', $file1;
while ( <$fh> ) {
s/^($re)\b/$mstrg{$1}/;
print;
}
output
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
I have two files, which look like:
File1:
chr id position a0 a1
22 rs4820378:39869209:C:T 39869209 C T
22 22:16050075:A:G 16050075 A G
22 22:16050115:G:A 16050115 G A
22 rs199694733:39913976:C:CT 39913976 C CT
22 rs139408809:39937958:GC:G 39937958 GC G
File2:
SNP CHR BP A1 A2
rs4820378 22 39869209 C T
rs4821900 22 39869719 G A
rs1984662 22 39869997 T G
rs35629588 22 39913976 I2 D
rs139408809 22 39937958 D I2
I would like to find lines where
Fields 1 and 3 from File1 match fields 2 and 3 from File2
and also either
Fields 4 and 5 from File1 are matching fields 4 and 5 from File2
Field 4 from both files have more than 1 character
Field 5 from both files have more than 1 character
Then print out field 2 from File1, and fields 1 and 3 from File2
Code below
#! perl -w
use strict;
use warnings;
my %kglocus;
open( my $loci_in, "<", "File1" ) or die $!;
while ( <$loci_in> ) {
next if m/chr/;
my ( $CHR, $id, $BP, $A1, $A2 ) = split;
my $reg = "${CHR}_$BP";
$kglocus{$reg} = [ $CHR, $id, $BP, $A1, $A2 ];
}
close $loci_in;
my $filename = shift #ARGV;
open( my $input, "<", $filename ) or die $!;
while ( <$input> ) {
next if m/SNP/;
my ( $SNP, $CHR, $BP, $A1, $A2 ) = split;
my $reg = "${CHR}_$BP";
if ( $A1 eq $kglocus{$reg}->[3] and $A2 eq $kglocus{$reg}->[4] ) {
print "$kglocus{$reg}->[1] $SNP $BP\n";
}
elsif ( ( length( $A1 ) > 1 && length( $kglocus{$reg}->[3] ) > 1 ) ||
( length( $A2 ) > 1 && length( $kglocus{$reg}->[4] ) > 1 ) ) {
print "$kglocus{$reg}->[1] $SNP $BP\n";
}
}
close( $input );
I'm getting the error below for all input lines:
Use of uninitialized value in string eq at find_ID.hash.chr22.pl line 23
Use of uninitialized value in length at find_ID.hash.chr22.pl line 27
Can anyone point out the problem?
The problem is that the existence of the hash element $kglocus{$reg} forms the first test, that "Fields 1 and 3 from File1 match fields 2 and 3 from File2". But you are treating it as if that test always passes, and simply using it to access elements of the File1 record
You need something like a next unless $kglocus{$reg} in there to make it work correctly. I would also prefer to see that value pulled out as a separate variable to avoid indexing the hash over and over again
Here's a solution that will work for you
use strict;
use warnings;
use v5.10.1;
use autodie;
my %kglocus;
{
open my $in_fh, '<', 'File1';
while ( <$in_fh> ) {
next if /chr/;
my ( $chr, $id, $bp, $a1, $a2 ) = split;
my $key = "${chr}_$bp";
$kglocus{$key} = [ $chr, $id, $bp, $a1, $a2 ];
}
}
{
my ( $filename ) = #ARGV;
open my $in_fh, '<', $filename;
while ( <$in_fh> ) {
next if /SNP/;
my ( $snp, $chr, $bp, $a1, $a2 ) = split;
my $key = "${chr}_$bp";
next unless my $item = $kglocus{$key};
if ( $a1 eq $item->[3] and $a2 eq $item->[4]
or length $a1 > 1 and length $item->[3] > 1
or length $a2 > 1 and length $item->[4] > 1 ) {
print "$item->[1] $snp $bp\n";
}
}
}
I have written a small perl "hack" to replace 1's with alphabets in a range of columns in a tab delimited file. The file looks like this:
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
The gola is to take columns 11 through 16 and append letters A to Z if a value of 1 is seen in those columns. My code so far is producing an empty output and this is my first time doing regular expressions.
cat infile.txt \
| perl -ne '#alphabet=("A".."Z");
$is_known_intron = 0;
$is_known_donor = 1;
$is_known_acceptor = 1;
chomp;
$_ =~ s/^\s+//;
#d = split /\s+/, $_;
#d_bool=#d[$11-$16];
$ct=1;
$known_intron = $d[$10];
$num_of_overlapping_gene = $d[$9];
$known_acceptor = $d[$8];
$known_donor = $d[$7];
$k="";
if (($known_intron == $is_known_intron) and ($known_donor == $is_known_donor) and ($known_acceptor == $is_known_acceptor)) {
for ($i = 0; $i < scalar #d_bool; $i++){
$k.=$alphabet[$i] if ($d_bool[$i])
}
$alphabet_ct{$k}+=$ct;
}
END
{
foreach $k (sort keys %alphabet_ct){
print join("\t", $k, $alphabet_ct{$k}), "\n";
}
} '\
> Outfile.txt
What should I be doing instead?
Thanks!
* Edit *
Expected Output
ABCD 45
BCD 23
ABCDEF 1215
so on and so forth.
I converted your code into a script for ease of debugging. I've put comments in the code to point out dodgy bits:
use strict;
use warnings;
my %alphabet_ct;
my #alphabet = ( "A" .. "Z" );
my $is_known_intron = 0;
my $is_known_donor = 1;
my $is_known_acceptor = 1;
while (<DATA>) {
# don't process the first line
next unless /chr10/;
chomp;
# this should remove whitespace at the beginning of the line but is doing nothing as there is none
$_ =~ s/^\s+//;
my #d = split /\s+/, $_;
# the range operator in perl is .. (not "-")
my #d_bool = #d[ 10 .. 15 ];
my $known_intron = $d[9];
my $known_acceptor = $d[7];
my $known_donor = $d[6];
my $k = "";
# this expression is false for all the data in the sample you provided as
# $is_known_intron is set to 0
if ( ( $known_intron == $is_known_intron )
and ( $known_donor == $is_known_donor )
and ( $known_acceptor == $is_known_acceptor ) )
{
for ( my $i = 0; $i < scalar #d_bool; $i++ ) {
$k .= $alphabet[$i] if $d_bool[$i];
}
# it is more idiomatic to write $alphabet_ct{$k}++;
# $alphabet_ct{$k} += $ct;
$alphabet_ct{$k}++;
}
}
foreach my $k ( sort keys %alphabet_ct ) {
print join( "\t", $k, $alphabet_ct{$k} ) . "\n";
}
__DATA__
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
With $is_known_intron set to 1, the sample data gives the results:
ABCDE 1
ABCE 1
ACD 1
CE 2
i have a text file which looks like this.
Parameter 0:
Field 1 : 100
Field 2 : 0
Field 3 : 4
Parameter 1:
Field 1 : 873
Field 2 : 23
Field 3 : 89
I want to write a perl script that parses this file in the following format
Parameter Field1 Field2 Field3
0 100 0 4
1 873 23 89
Can anyone help me with this. Any help will be greatly appreciated.
i have tried the following so far
my %hash = ();
my $file = "sample.txt";
open (my $fh, "<", $file) or die "Can't open the file $file: ";
while (my $line =<$fh>)
{
chomp ($line);
my($key) = split(" : ", $line);
$hash{$key} = 1;
}
foreach my $key (sort keys %hash)
{
print "$key\n";
}
This Perl program does what you ask. It allows for any number of fields for each parameter (although there must be the same number of fields for every parameter) and takes the header labels for the fields from the data itself.
use strict;
use warnings;
my $file = 'sample.txt';
open my $fh, '<', $file or die qq{Can't open "$file" for input: $!};
my %data;
my #params;
my #fields;
while (<$fh>) {
next unless /\S/;
chomp;
my ($key, $val) = split /\s*:\s*/;
if ($val =~ /\S/) {
push #fields, $key if #params == 1;
push #{ $data{$params[-1]} }, $val if #params;
}
else {
die qq{Unexpected parameter format "$key"} unless $key =~ /parameter\s+(\d+)/i;
push #params, $1;
}
}
my #headers = ('Parameter', #fields);
my #widths = map length, #headers;
my $format = join(' ', map "%${_}s", #widths) . "\n";
printf $format, #headers;
for my $param (#params) {
printf $format, $param, #{ $data{$param} };
}
output
Parameter Field 1 Field 2 Field 3
0 100 0 4
1 873 23 89
use warnings; use strict;
my $file = "sample.txt";
open (my $fh, "<", $file) or die "Can't open the file $file: ";
print "Parameter Field1 Field2 Field3\n";
while (my $line=<$fh>) {
process_parameter($1) if $line =~ /Parameter (\d+):/;
}
sub process_parameter {
my $parameter = shift;
my ($field_1) = (<$fh> =~ /(\d+) *$/);
my ($field_2) = (<$fh> =~ /(\d+) *$/);
my ($field_3) = (<$fh> =~ /(\d+) *$/);
printf " %-2d %-6d %-6d %-6d\n", $parameter, $field_1, $field_2, $field_3;
}
#!/usr/bin/perl
my %hash = ();
my %fields;
my $param;
while ( chomp( my $line = <DATA> ) ) {
if ( $line =~ /Parameter (\d+):/ ) {
$param = $1;
}
next unless ( defined $param );
if ( my ( $f, $v ) = $line =~ /(Field \d+)[\s\t]*: (\d+)/ ) {
$hash{$param} ||= {};
$hash{$param}->{$f} = $v;
$fields{$f} ||= 1;
}
}
my #fields = sort keys %fields;
print join( ',', 'Parameter', #fields ), "\n";
foreach my $param ( sort { $a <=> $b } keys %hash ) {
print join( ',', $param, #{ $hash{$param} }{#fields} ), "\n";
}
__DATA__
Parameter 0:
Field 1 : 100
Field 2 : 0
Field 3 : 4
Parameter 1:
Field 1 : 873
Field 2 : 23
Field 3 : 89
Here is a way that accepts any number of fields for each parameter:
my $par;
my %out;
my $max = 0;
while(<DATA>) {
chomp;
next if /^\s*$/;
if (/Parameter\s*(\d+)/) {
$par = $1;
next;
}
my ($k, $v) = $_ =~/Field\s+(\d+)\s*:\s*(\d+)/;
$out{$par}[$k] = $v;
$max = $k if $k > $max;
}
my $cols = 'Param';
$cols .= "\tField $_" for (1..$max);
say $cols;
foreach my $par(sort (keys %out)) {
my $out = $par;
$out .= "\t".($out{$par}[$_]//' ') for (1..$max);
say $out;
}
__DATA__
Parameter 0:
Field 1 : 100
Field 2 : 0
Field 3 : 4
Field 5 :18
Parameter 1:
Field 1 : 873
Field 2 : 23
Field 3 : 89
Field 4 : 123
output:
Param Field 1 Field 2 Field 3 Field 4 Field 5
0 100 0 4 18
1 873 23 89 123
I have a file which looks like the following:
File
variableStep chrom=chr1 span=25
10076 0.84
10101 1
10126 1
10151 1
10176 1
10201 1
10226 1.72
variableStep chrom=chr1 span=25
10251 2
10276 1.16
10301 1
10326 1
10351 1
10376 1
10401 1
10426 0.28
11451 0.04
variableStep chrom=chr2 span=25
9781451 2
19781476 2
19781501 2
19781526 2
19781551 1
19781576 1
19781601 0.48
variableStep chrom=chr2 span=25
19781826 0.28
19781851 1
19781876 1
19781901 1
19781926 1
19781951 1.48
19781976 3.68
19782001 4.56
19782026 4
variableStep chrom=chr3 span=25
4813476 1
24813501 1
24813526 1
24813551 1
24813576 1.88
24813601 2
variableStep chrom=chr3 span=25
24813626 1.4
24813651 1.48
24813676 2
24813701 2
24813726 2
24813751 2
variableStep chrom=chr4 span=25
24815401 2.24
24815426 3
24815451 3
24815476 3
24815501 3
24815526 2.04
variableStep chrom=chr4 span=25
24815551 2
24815576 1.76
24815601 0.76
24815951 0.48
24815976 1
24816001 1
24816026 1
24816051 1
variableStep chrom=chr5 span=25
24817226 0.92
24817251 1.48
24817276 3
24817301 3
variableStep chrom=chr5 span=25
24817326 3
24817351 3
24817376 3
24817401 3.04
24817426 3.08
What Is Needed
What I need to do is, for all instances of say variableStep chrom=chr1 span=25, print out the subsequent n lines to an output file. n I must mention, is highly variable. It can vary anywhere from 300,000 to 500,000+ in the actual file.
Desired Output
1.Output_file_1_for_variableStep chrom=chr1 span=25
10076 0.84
10101 1
10126 1
10151 1
10176 1
10201 1
10226 1.72
10251 2
10276 1.16
10301 1
10326 1
10351 1
10376 1
10401 1
10426 0.28
11451 0.04
2._Output_file_2_for_variableStep chrom=chr2 span=25
9781451 2
19781476 2
19781501 2
19781526 2
19781551 1
19781576 1
19781601 0.48
19781826 0.28
19781851 1
19781876 1
19781901 1
19781926 1
19781951 1.48
19781976 3.68
19782001 4.56
19782026 4
3._Output_file_3_for_variableStep chrom=chr3 span=25
4813476 1
24813501 1
24813526 1
24813551 1
24813576 1.88
24813601 2
24813626 1.4
24813651 1.48
24813676 2
24813701 2
24813726 2
24813751 2
4._Output_file_4_for_variableStep chrom=chr4 span=25
24815401 2.24
24815426 3
24815451 3
24815476 3
24815501 3
24815526 2.04
24815551 2
24815576 1.76
24815601 0.76
24815951 0.48
24815976 1
24816001 1
24816026 1
24816051 1
5._Output_file_5_for_variableStep chrom=chr5 span=25
24817226 0.92
24817251 1.48
24817276 3
24817301 3
24817326 3
24817351 3
24817376 3
24817401 3.04
24817426 3.08
Background
I still consider myself a Perl newbie so the code I've written doesn't quite completely accomplish the task.
In fact the following code depicts the 3 ways in which I was trying to get it to work. For the code with pattern variableStep chrom=chr1 span=25 I tried to print the subsequent lines after the regex match manually.
From that I figured, I need a loop to run through all the subsequent lines,which is what i've written with the pattern variableStep chrom=chr1 span=25. But then, I realized I need an exit mechanism as otherwise all the subsequent lines were getting printed.
It is this exit pattern written as last if /^v.*$/which I need to figure out. As the one I have is currently printing only the very first instance of the specific pattern. There is no blank line either on which I could exit. Had I had a blank line, this piece of code is working perfectly fine( modifying to last if /^$/ ). I even tried using a non decimal character as /^\D.*$/, but it isn't working. What exit pattern should I use?
The remaining part of the code was my baby attempts to get the program to work and it only prints the single subsequent line after the pattern match.
Code
#Trial code to parse main file
use 5.014;
use warnings;
#Assign filename
my $file = 'trial.txt';
#Open filename
open my $fh, '<' , $file || die $!;
#Open output
open OUT1, ">Trial_chr1.out" || die $!;
open OUT2, ">Trial_chr2.out" || die $!;
open OUT3, ">Trial_chr3.out" || die $!;
open OUT4, ">Trial_chr4.out" || die $!;
open out5, ">Trial_chr5.out" || die $!;
#Read in file
while(<$fh>){
chomp;
if (/^variableStep chrom=chr1 span=25/){
my $nextline1 = <$fh>;#means next line after pattern match
my $nextline2 = <$fh>;
my $nextline3 = <$fh>;
my $nextline4 = <$fh>;
my $nextline5 = <$fh>;
my $nextline6 = <$fh>;
my $nextline7 = <$fh>;
print OUT1 $nextline1;
print OUT1 $nextline2;
print OUT1 $nextline3;
print OUT1 $nextline4;
print OUT1 $nextline5;
print OUT1 $nextline6;
print OUT1 $nextline7;
}elsif(/^variableStep chrom=chr2 span=25/){
my #grabbed_lines; #Initialize array to store lines after pattern match
while (<$fh>){ #Read subsequent lines while in a loop
last if /^v.*$/; #Break out of the loop if line encountered begins with v
push #grabbed_lines, $_;# As long as the above condition is false, push the lines into the array
}print OUT2 #grabbed_lines; # Print the grabbed lines
}elsif(/^variableStep chrom=chr3 span=25/){
my $nextline = <$fh>;
print OUT3 $nextline;
}elsif(/^variableStep chrom=chr4 span=25/){
my $nextline = <$fh>;
print OUT4 $nextline;
}elsif(/^variableStep chrom=chr5 span=25/){
my $nextline = <$fh>;
print out5 $nextline;
}
}
#Exit
exit;
Thank you for taking the time to go through my question. I'd be grateful for any tips and suggestions.
Ok I missunderstood the n part, it is different for each match, this is tested and working:
my $found = 0;
while (<$fh>) {
if ( $found && /^\d/ ) {
print $_;
}
else {
$found = 0;
}
if (/^variableStep chrom=chr2 span=25/) {
$found = 1;
}
}
this way it prints all following rows that start with a digit.
Explanation:
The problem here is, that each time you are calling <$fh> it reads the next row, so if you test the row content and your test fails, you should not do the next loop, because then the next row is read and you lost the row where the test failed.
So I came to this solution:
I use a flag to know in which mode I am, am I searching for rows to print or not?
The first if is only entered
if I have been in the second if if in a loop before and the flag has been set to "1"
AND the row start with a digit.
When this test fails, i.e. there is no row with a digit at the beginning, I reset the flag and have then the chance to look again at the same row if it starts with "variableStep ..."
Oneliner below should do the trick (assuming that output files doesn't already exist):
perl -lne '/variableStep/ && open($fh, ">>", $_) && next; print $fh $_;' input.txt
--
btw: || operator has high priority (man perlop), so command:
open OUT1, ">Trial_chr1.out" || die $!;
is understood by perl as
open OUT1, (">Trial_chr1.out" || die $!);
To do error checking you should use and operator instead or use parentheses to enforce desired behaviour
I love perl, but awk is more suitable in this case, see :
$ awk '
{if ($0 ~ /^variableStep/) {file="output_file_"++c"_"$1"_"$2"_"$3}
else{print $0 > file}}
' file.txt
$ ls -l output_file_*
Using perl and File::Slurp useful module :
use strict; use warnings;
use File::Slurp;
my ($c, $file);
while (<>) {
if (/^variableStep\s+chrom=\w+\s+span=\d+/) {
$c++;
$file = $&;
$file =~ s/\s/_/g;
$file = "output_file_${c}_" . $file;
}
else {
append_file $file, $_;
}
}
Usage :
$ perl ./script.pl file.txt
$ ls -l output_file_*