Pattern counter in fasta file

Pattern counter in fasta file - regex

I am trying to get the count of matching patterns in the fasta file. I am starting from a fasta file containing 57k sequences. I want to pull out the count of my matching pattern sequence and show the starting position of the pattern
Input file:
chr1 ATTAG**CAGAT**GTGACGTCGATGT**CAGAT**TG
chr2 TGAGCTG**CAGAT**CGTAGATGATTCTGCAGGAACCT
chr3 TCTTT**CAGAT**GCCTCTG**CAGAT**TC
Searching pattern "CAGAT"
Output Required:
chr Count P1 P2
chr1 -- 2 -- 6 -- 25
chr2 -- 1 -- 8
chr3 -- 2 -- 6 -- 19
Thanks in advance

I am assuming your file to be seperated with tab as delimiter.
open(my $in,"<:utf8","in.txt") or die "Cannot open FILE in.txt : $!\n";
while(<$in>) {
chomp($_);
my $cur = $_;
#$print "iam $cur\n";
my #tt = split(/\t/,$cur); #assuming you file tobe tab seperated
my $s1 = $tt[1];
my $s2 = "CAGAT";
my #val;
print "$cur\t";
while ($s1 =~ /($s2)/g) {
push(#val, $-[0]); #$-[0] is the offset of the start of the last successful match.
}
my $count = #val;
#val = join(",",#val);
print " No of Matches:$count Starting positions:#val\n";
}
in.txt
chr1 ATTAG**CAGAT**GTGACGTCGATGT**CAGAT**TG
chr2 TGAGCTG**CAGAT**CGTAGATGATTCTGCAGGAACCT
chr3 TCTTT**CAGAT**GCCTCTG**CAGAT**TC

Related

Perl:How to insert line in a file?

I am reading from two files. I'm trying to insert a line from file2 to file1 whenever column 1 contents matches.
##FILE1
1 wr 5769 78670002 fqefq
3 wr 5769 78650003 hfhhg
5 wr 5769 88990001 dfdsv
##FILE2
1 Step1
3 Step3
5 Step5
Desired Output:
1 wr 5769 78670002 fqefq
Step1
3 wr 5769 78650003 hfhhg
Step3
5 wr 5769 88990001 dfdsv
Step5
Code tried:
my $rk="rk.log";
open(my $tt, "<$rk" ) or die "Could not open file $trk: $!";
while (<$tt>) {
if ($_ =~ /^(\d+)\s+wr\s+5769\s+(\w+)\s+\.*/gm) {
open(p1,"<$temp1") or die "Could not open file $temp1: $!";
while (my $newl = <p1>) {
my #fs1 = split " ", $newl;
if ($fs1[0] eq $1){
print "#fs1\n";
print "step $2\n";
} else {
print "#fs1\n";
}
}
}
}
close p1;
close $tt;
Above code doesn't giving the desired output. Can anyone suggest me better way to do it?
Update ##FILE2
2 Step1
4 Step3
6 Step5

Hopefully, a bit of pseudocode will be enough to get you on the right track.
Read file2 into a hash (where the key is the integer and the value is the whole line)
Open file1
Read file1 a line at a time
Print the line from file1
Extract the integer from the start of the line from line1
If that integer exists in your hash
Print the line from file2

I believe the simplest method would be to import the two files into separate strings, then create a loop which:
Finds match in file 1. (Include line breaks in your matches)
Appends match to third string.
Deletes match from file 1. (Replace with nothing)
Finds match from file 2. (Include line breaks in your matches)
Appends match to third string.
Deletes match from file 2.
This way you will sequentially order all of your matches from the two files into a string that you can export as a file.

This is works for me:
use Tie::File;
my $fle1 = $ARGV[0]; my $fle2 = $ARGV[1];
open(FL2, $fle2) || die "Couldn't read file $fle2\: $!\n";
my $flecnt2 = do { local $/; <FL2>; };
close(FL2);
my #array;
tie #array, 'Tie::File', $fle1 || die "Error: Couldn't read and write \"$fle1\" file: $!";
my $str = join "\n", #array;
$str=~s#^([^\s]+)\s(.+)$# my $fulcnt=$&;
if($flecnt2=~m/^$1\s+(.+)$/m)
{
$fulcnt .= "\n$&";
}
($fulcnt);
#egm;
#array = split/\n/, $str;
untie #array;

search and replace between two files -post#2

I need to replace a variable in one file with information/regex found in another file. For example, I need to grab the first string/variable (that starts with MSTRG) from each row in file1:
MSTRG.5734 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
MSTRG.26588 MSTRG.26588 196 5 0 12 0 0
and use this to search in file2 that will look something like this:
Chr1 StringTie transcript 24039360 24041181 1000 - . gene_id "MSTRG.5734"; transcript_id "MSTRG.5734.1";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.5734"; transcript_id "Transcript:AT1G64700.1"; ref_gene_id "Gene:AT1G64700"
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.19266"; transcript_id "MSTRG.19266.1"; exon_number "2";
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.26588"; transcript_id "MSTRG.26588.1"; exon_number "2";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.26588"; transcript_id "Transcript:AT5G41000.1"; ref_gene_id "Gene:AT5G41000";
Ideally when e.g. MSTRG.5734 is found on a line in file2 that also contains the string e.g. Gene:AT1G64700, it will grab the information Gene:AT1G64700 and replace MSTRG.5734 in file1. So, every MSTRG on each row in file1 is unique and it will theoretically match a unique Gene in file2. If it does not match a Gene then I need the original row in file1 to be maintained.
File1 output should then look like:
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
My current perl code is:
use strict;
use warnings;
use vars qw($outfile #id $mstrg $gene);
open(SEARCH, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
open(FILE, $ARGV[1]) or die "Couldn't open $ARGV[1]: $!";
$outfile = "testout.txt";
open (OUT, ">$outfile") || die "Can't open $outfile for creation: $!\n";
my %mstrg;
while (<SEARCH>) {
chomp;
if (/^MSTRG/) {
chomp $_;
#id = split (/\t/, $_);
$mstrg{$id[1]}.="$id[1]";
}
}
while (<FILE>) # {#ffn=<FILE>};
{
chomp ($gene=$_);
if ($mstrg =~ /$gene/) {
print OUT "$id[1]\t$id[2]";}
else {
#print OUT "$_\n";
}
next;
}
close FILE;
This is unfortunately where I get stuck and not sure how to proceed?
Thanks for any help given, appreciated and apologies for the previous post where I did not include any code to those that saw.

I would do this somewhat like:
use 5.014; #needed min 5.014 because the /r modifier
use warnings;
use Path::Tiny '0.077'; #added the min. req. version
my $file1='file1.txt';
my $file2='file2.txt';
my %mstmap = map { split /\s+/, s/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+)".*/$1 $2/r }
grep { /ref_gene_id.*Gene:/ } path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });
for your input files produces
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
It create a hash for the pairs: MSTRG.number => Gene:String (from the file2), and using the Path::Tiny module editing fucntion doing the replaces in the file1.
After #Borodin comments the above could be reduced to:
use 5.014;
use warnings;
use Path::Tiny '0.077';
my $file1='f1';
my $file2='f2';
my %mstmap = map {
/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+).*/
} path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });

There's no need to write impregnable code to achieve the result you want
This program reads through $file2 building hash %mstrg from all the lines that contain both an MSTRG. and a Gene: string. It then creates a regex in $re that will match any one of the MSTR. strings found
$file1 is opened, and that regex is used to replace any of the hash keys with the corresponding hash value it appears at the start of the line. The line is then printed
It isn't clear whether the first two fields of file1.txt are always the same, but I've opted to alter only the first field
I've used the autodie pragma to avoid having to explicitly check for success of any file IO operations
The program prints the output to STDOUT, so you can redirect it wherever you like on the command line
use strict;
use warnings 'all';
use autodie;
my ( $file1, $file2 ) = #ARGV;
my %mstrg;
{
open my $fh, '<', $file2;
while ( <$fh> ) {
$mstrg{$1} = $2 if /"(MSTRG.\d+)".*"(Gene:\w+)"/;
}
}
my $re = join '|', sort { length $b <=> length $a } keys %mstrg;
open my $fh, '<', $file1;
while ( <$fh> ) {
s/^($re)\b/$mstrg{$1}/;
print;
}
output
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0

add plus or minus in awk if no match

I am trying to match all the lines in the below file to match. The awk will do that the problem is that the lines that do not match should be within plus or minus 10. I am not sure how to tell awk that the if a match is not found then use either plus or minus the coordinates in file. If no match is found after that then no match is in the file. Thank you :).
file
955763
957852
976270
bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1
awk
awk 'NR==FNR{A[$1];next}$3 in A' file bigfile > output
desired output (same as bigfile)
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2

If there's no difference between a row that matches and one that's close, you could just set all of the keys in the range in the array:
awk 'NR == FNR { for (i = -10; i <= 10; ++i) A[$1+i]; next }
$3 in A' file bigfile > output
The advantage of this approach is that only one lookup is performed per line of the big file.

You need to run a loop on array a:
awk 'NR==FNR {
a[$1]
next
}
{
for (i in a)
if (i <= $3+10 && i >= $3-10)
print
}' file bigfile > output

Your data already produces the desired output (all exact match).
$ awk 'NR==FNR{a[$1];next} $3 in a{print; next}
{for(k in a)
if((k-$3)^2<=10^2) {print $0, " --> within 10 margin"; next}}' file bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 976251 976261 chr1:976251-976261 AGRN-8|gc=57.1 --> within 10 margin
I added a fake 4th row to get the margin match

Regular expression not returning first number from 3 lines

Hi I am new with Perl programming, I wrote a code to store a first number from a scalar variable using regular expression but i am getting first number from last line but I need number from first line.
For example in the following code I need $num = 22 but code returns 656.
my $num ;
my $sample = "fd 22 sdf sdf 96
dsf6 66s sd6 7777 sd
656 dd 55 ";
my #sentences = split(/\n/, $sample);
for my $line(#sentences)
{
($num )= $line =~ /([0-9]+) .*/ ;
}
print $num;
Can some one tell me whats wrong with my logic?

Your code overwrites the first match in the following iterations of the loop, 22 matches but 666 replaces it. Just break after the first match:
($num )= $line =~ /([0-9]+) .*/ and last;
or remove the loop and match against the sample:
($num )= $sample =~ /([0-9]+)/;

I think the pattern as written won't filter out entries like "s67" in the following
my $sample = "fd 66s s67 22 sdf sdf 96
dsf6 66s 656 dd 55 ";
and so it needs something like
($num) = $line =~ /\b([0-9]+)\b.*/ and last;

Or try
($num) = $sample =~ /[0-9]+/g

How to print next N lines for all instances of a string pattern?

I have a file which looks like the following:
File
variableStep chrom=chr1 span=25
10076 0.84
10101 1
10126 1
10151 1
10176 1
10201 1
10226 1.72
variableStep chrom=chr1 span=25
10251 2
10276 1.16
10301 1
10326 1
10351 1
10376 1
10401 1
10426 0.28
11451 0.04
variableStep chrom=chr2 span=25
9781451 2
19781476 2
19781501 2
19781526 2
19781551 1
19781576 1
19781601 0.48
variableStep chrom=chr2 span=25
19781826 0.28
19781851 1
19781876 1
19781901 1
19781926 1
19781951 1.48
19781976 3.68
19782001 4.56
19782026 4
variableStep chrom=chr3 span=25
4813476 1
24813501 1
24813526 1
24813551 1
24813576 1.88
24813601 2
variableStep chrom=chr3 span=25
24813626 1.4
24813651 1.48
24813676 2
24813701 2
24813726 2
24813751 2
variableStep chrom=chr4 span=25
24815401 2.24
24815426 3
24815451 3
24815476 3
24815501 3
24815526 2.04
variableStep chrom=chr4 span=25
24815551 2
24815576 1.76
24815601 0.76
24815951 0.48
24815976 1
24816001 1
24816026 1
24816051 1
variableStep chrom=chr5 span=25
24817226 0.92
24817251 1.48
24817276 3
24817301 3
variableStep chrom=chr5 span=25
24817326 3
24817351 3
24817376 3
24817401 3.04
24817426 3.08
What Is Needed
What I need to do is, for all instances of say variableStep chrom=chr1 span=25, print out the subsequent n lines to an output file. n I must mention, is highly variable. It can vary anywhere from 300,000 to 500,000+ in the actual file.
Desired Output
1.Output_file_1_for_variableStep chrom=chr1 span=25
10076 0.84
10101 1
10126 1
10151 1
10176 1
10201 1
10226 1.72
10251 2
10276 1.16
10301 1
10326 1
10351 1
10376 1
10401 1
10426 0.28
11451 0.04
2._Output_file_2_for_variableStep chrom=chr2 span=25
9781451 2
19781476 2
19781501 2
19781526 2
19781551 1
19781576 1
19781601 0.48
19781826 0.28
19781851 1
19781876 1
19781901 1
19781926 1
19781951 1.48
19781976 3.68
19782001 4.56
19782026 4
3._Output_file_3_for_variableStep chrom=chr3 span=25
4813476 1
24813501 1
24813526 1
24813551 1
24813576 1.88
24813601 2
24813626 1.4
24813651 1.48
24813676 2
24813701 2
24813726 2
24813751 2
4._Output_file_4_for_variableStep chrom=chr4 span=25
24815401 2.24
24815426 3
24815451 3
24815476 3
24815501 3
24815526 2.04
24815551 2
24815576 1.76
24815601 0.76
24815951 0.48
24815976 1
24816001 1
24816026 1
24816051 1
5._Output_file_5_for_variableStep chrom=chr5 span=25
24817226 0.92
24817251 1.48
24817276 3
24817301 3
24817326 3
24817351 3
24817376 3
24817401 3.04
24817426 3.08
Background
I still consider myself a Perl newbie so the code I've written doesn't quite completely accomplish the task.
In fact the following code depicts the 3 ways in which I was trying to get it to work. For the code with pattern variableStep chrom=chr1 span=25 I tried to print the subsequent lines after the regex match manually.
From that I figured, I need a loop to run through all the subsequent lines,which is what i've written with the pattern variableStep chrom=chr1 span=25. But then, I realized I need an exit mechanism as otherwise all the subsequent lines were getting printed.
It is this exit pattern written as last if /^v.*$/which I need to figure out. As the one I have is currently printing only the very first instance of the specific pattern. There is no blank line either on which I could exit. Had I had a blank line, this piece of code is working perfectly fine( modifying to last if /^$/ ). I even tried using a non decimal character as /^\D.*$/, but it isn't working. What exit pattern should I use?
The remaining part of the code was my baby attempts to get the program to work and it only prints the single subsequent line after the pattern match.
Code
#Trial code to parse main file
use 5.014;
use warnings;
#Assign filename
my $file = 'trial.txt';
#Open filename
open my $fh, '<' , $file || die $!;
#Open output
open OUT1, ">Trial_chr1.out" || die $!;
open OUT2, ">Trial_chr2.out" || die $!;
open OUT3, ">Trial_chr3.out" || die $!;
open OUT4, ">Trial_chr4.out" || die $!;
open out5, ">Trial_chr5.out" || die $!;
#Read in file
while(<$fh>){
chomp;
if (/^variableStep chrom=chr1 span=25/){
my $nextline1 = <$fh>;#means next line after pattern match
my $nextline2 = <$fh>;
my $nextline3 = <$fh>;
my $nextline4 = <$fh>;
my $nextline5 = <$fh>;
my $nextline6 = <$fh>;
my $nextline7 = <$fh>;
print OUT1 $nextline1;
print OUT1 $nextline2;
print OUT1 $nextline3;
print OUT1 $nextline4;
print OUT1 $nextline5;
print OUT1 $nextline6;
print OUT1 $nextline7;
}elsif(/^variableStep chrom=chr2 span=25/){
my #grabbed_lines; #Initialize array to store lines after pattern match
while (<$fh>){ #Read subsequent lines while in a loop
last if /^v.*$/; #Break out of the loop if line encountered begins with v
push #grabbed_lines, $_;# As long as the above condition is false, push the lines into the array
}print OUT2 #grabbed_lines; # Print the grabbed lines
}elsif(/^variableStep chrom=chr3 span=25/){
my $nextline = <$fh>;
print OUT3 $nextline;
}elsif(/^variableStep chrom=chr4 span=25/){
my $nextline = <$fh>;
print OUT4 $nextline;
}elsif(/^variableStep chrom=chr5 span=25/){
my $nextline = <$fh>;
print out5 $nextline;
}
}
#Exit
exit;
Thank you for taking the time to go through my question. I'd be grateful for any tips and suggestions.

Ok I missunderstood the n part, it is different for each match, this is tested and working:
my $found = 0;
while (<$fh>) {
if ( $found && /^\d/ ) {
print $_;
}
else {
$found = 0;
}
if (/^variableStep chrom=chr2 span=25/) {
$found = 1;
}
}
this way it prints all following rows that start with a digit.
Explanation:
The problem here is, that each time you are calling <$fh> it reads the next row, so if you test the row content and your test fails, you should not do the next loop, because then the next row is read and you lost the row where the test failed.
So I came to this solution:
I use a flag to know in which mode I am, am I searching for rows to print or not?
The first if is only entered
if I have been in the second if if in a loop before and the flag has been set to "1"
AND the row start with a digit.
When this test fails, i.e. there is no row with a digit at the beginning, I reset the flag and have then the chance to look again at the same row if it starts with "variableStep ..."

Oneliner below should do the trick (assuming that output files doesn't already exist):
perl -lne '/variableStep/ && open($fh, ">>", $_) && next; print $fh $_;' input.txt
--
btw: || operator has high priority (man perlop), so command:
open OUT1, ">Trial_chr1.out" || die $!;
is understood by perl as
open OUT1, (">Trial_chr1.out" || die $!);
To do error checking you should use and operator instead or use parentheses to enforce desired behaviour

I love perl, but awk is more suitable in this case, see :
$ awk '
{if ($0 ~ /^variableStep/) {file="output_file_"++c"_"$1"_"$2"_"$3}
else{print $0 > file}}
' file.txt
$ ls -l output_file_*

Using perl and File::Slurp useful module :
use strict; use warnings;
use File::Slurp;
my ($c, $file);
while (<>) {
if (/^variableStep\s+chrom=\w+\s+span=\d+/) {
$c++;
$file = $&;
$file =~ s/\s/_/g;
$file = "output_file_${c}_" . $file;
}
else {
append_file $file, $_;
}
}
Usage :
$ perl ./script.pl file.txt
$ ls -l output_file_*

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Pattern counter in fasta file - regex

Related

Perl:How to insert line in a file?

search and replace between two files -post#2

add plus or minus in awk if no match

Regular expression not returning first number from 3 lines

How to print next N lines for all instances of a string pattern?

Categories

Resources