Find many matches in a nucleotide sequence with a regex

I have a gene sequence (see below), and I want to find all open reading frames (each starts with ATG and stops with TAG).
I have tried this:
my $file = ('ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG');
while($file =~ /(ATG\w+?TAG)/g){
print $1;
}
but it only gives
ATGGCCCATGATGCATCGACTAGATGTCGATCGATACAGCTAATAG
How can I get every one?

The trick to finding all occurrences is to use a zero-width assertion, which prevents the match from "eating" our characters: (?=ATG\w+?TAG).
The problem with this is that we'll get empty matches, so the solution is to use a capture group:
(?=(ATG\w+?TAG)). You will find all occurrences in group 1.
Group 1 output:
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
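As a minimal sketch of how the lookahead-with-capture idea could be used in a loop (the variable names `$seq` and `@orfs` are illustrative and not from the original post):

```perl
use strict;
use warnings;

my $seq = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';

# The lookahead itself matches zero characters, so after each match the
# regex engine only moves one position forward. Candidates that start
# inside an earlier candidate are therefore still found.
my @orfs;
while ($seq =~ /(?=(ATG\w+?TAG))/g) {
    push @orfs, $1;    # the text is only available through group 1
}

print "$_\n" for @orfs;
```

Collecting the candidates in an array first makes it easy to filter them afterwards, for example by length.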

Result is ok, simply separate them in output:
print "$1\n";

You are getting two matches. To see them, I suggest you print some separator between them:
print "$1\n";
Then we get the output:
ATGGCCCATGATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
If you want to find frames that also occur inside another, then you must make sure not to consume too many characters. Work around that with a lookahead:
/ATG(?=([ACTG]*?TAG))/g;
Then print "ATG$1\n". Output:
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG

If you want the start and stop codons to be in the same frame, don't forget to filter the results down to those whose length is a multiple of 3:
print "ATG$1\n" if (length($1)%3) == 0 ;
If you want to check the six frames available in one sequence, don't forget to check also the complementary chain:
$comp_chain = reverse($chain) ;
$comp_chain =~ tr/ATCG/TAGC/ ;
You will then obtain the open reading frames from the six reading frames available in a single sequence.
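Putting the length filter and the complementary chain together could look roughly like this. It is only a sketch: the helpers `reverse_complement` and `find_orfs` and the short test sequence are made up for illustration, and the pattern uses a non-greedy lookahead so that each ATG is paired with the nearest following TAG.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of a DNA string.
sub reverse_complement {
    my ($chain) = @_;
    my $rc = reverse $chain;    # scalar context reverses the string
    $rc =~ tr/ATCG/TAGC/;       # swap each base for its complement
    return $rc;
}

# Hypothetical helper: ATG...TAG candidates whose start and stop codons
# are in the same frame (length is a multiple of 3).
sub find_orfs {
    my ($chain) = @_;
    my @found;
    while ($chain =~ /ATG(?=([ACTG]*?TAG))/g) {
        push @found, "ATG$1" if length($1) % 3 == 0;
    }
    return @found;
}

# Made-up example sequence with one frame on each strand.
my $chain = 'CCATGAAATAGGCTATTTCATCC';

print "forward: $_\n" for find_orfs($chain);
print "reverse: $_\n" for find_orfs(reverse_complement($chain));
```

Because the modulo test checks the distance between ATG and TAG, scanning each strand once covers all three of its reading frames.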

Related

Get error result when search match for only one time

I wanted to check whether a string occurs an exact number of times in another string, but I ran into a problem.
use strict;
use warnings;
my $test="abc1234abc5678abcdef910";
my $cut_seq="abc";
print $test,"\tone time\n" if($test=~/$cut_seq{1}/);
print $test,"\tmore than one times\n" if($test=~/$cut_seq{1,}/);
I expected the result:
abc1234abc5678abcdef910 more than one times
But the result showed as:
abc1234abc5678abcdef910 one time
abc1234abc5678abcdef910 more than one times
I also tried this:
print $test,"\tone time\n" if($test=~/$cut_seq{0,1}/);
print $test,"\tone time\n" if($test=~/$cut_seq{1,1}/);
print $test,"\tmore than one times\n" if($test=~/$cut_seq{1,}/);
But nothing changed. I just wonder why it can't match exact times. If something matches two times it will also match one time, then what's the difference of {1}, {1,}, {1,1}, {1,2}. I don't get the point to create these different forms.
If something matches two times it also matches one time. That's why your "one time" match always kicks in.
The easiest approach, I think, is to simply split at your $cut_seq and check the number of resulting elements.
my $test="abc1234abc5678abcdef910";
my $cut_seq="abc";
my @elts= split /$cut_seq/, $test;
print scalar(@elts)-1," times\n";
P.S. This does not count $cut_seq at the end of the string, sorry! You'll have to append something that cannot be part of your sequence, like:
my $test="abc1234abc5678abcdef910";
my $cut_seq="abc";
my @elts= split /$cut_seq/, $test . chr(0);
print scalar(@elts)-1," times\n";
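An alternative to split, shown here only as a sketch and not as part of the original answer, is to count the matches directly with a global match in list context. The variable `$count` is introduced for illustration.

```perl
use strict;
use warnings;

my $test    = "abc1234abc5678abcdef910";
my $cut_seq = "abc";

# A /g match in list context returns every match. Assigning that list
# to () and then to a scalar yields the number of matches.
# \Q...\E quotes any regex metacharacters contained in $cut_seq.
my $count = () = $test =~ /\Q$cut_seq\E/g;

print "$test\tone time\n"           if $count == 1;
print "$test\tmore than one time\n" if $count > 1;
```

This also counts an occurrence at the very end of the string, so nothing has to be appended.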

Perl regex from file.txt, match columns greater than x

I have a file containing several rows of code, like this:
160101, 0100, 58.8,
160101, 0200, 59.3,
160101, 0300, 59.5,
160101, 0400, 59.1,
I'm trying to print out the third column with a regex, like this:
# Read the text file.
open( IN, "file.txt" ) or die "Can't read words file: $!";
# Print out.
while (<IN>) {
print "Number: $1\n"
while s/[^\,]+\,[^\,]+\,([^\,]+)\,/$1/g;
}
And it works fairly well, however, I'm trying to only fetch the numbers that are greater than or equal to 59 (that includes numbers like 59.1 and 59.0). I've tried several numeric regex combinations (the one below will not give me the right number, obviously, but just making a point), including:
while s/[^\,]+\,[^\,]+\,([^\,]+)\,^[0-9]{3}$/$1/g;
but none seem to work. Any ideas?
My first idea would be to split that line and then pick and choose
while (my $line = <IN>) {
my @nums = split ',\s*', $line;
print "$nums[2]\n" if $nums[2] >= $cutoff;
}
If you insist on doing it all in the regex then you may want to use /e modifier, so in the substitution part you can run code. Then you can test the particular match and print it there.
Assuming that the numbers can't reach 100 (three digits) you could use
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+)\,
which uses your regex except for the capture group, which captures the number 59 and its decimals, or two-digit numbers from 60 to 99 and their decimals.
Regards
Edit:
To go above 100 you can add another alternative in the capture group:
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+)\,
which allows larger numbers (>=100.0).
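A quick way to check the extended pattern is to run it over a few sample lines. This is a sketch; the lines containing 59.0 and 100.5 are invented test values, not data from the question.

```perl
use strict;
use warnings;

my $pattern = qr/[^,]+,[^,]+,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+),/;

# Sample lines; the last two values are made up for testing.
my @lines = (
    "160101, 0100, 58.8,",
    "160101, 0200, 59.3,",
    "160101, 0300, 59.0,",
    "160101, 0400, 100.5,",
);

my @kept;
for my $line (@lines) {
    push @kept, $1 if $line =~ $pattern;    # keep the third column only
}

print "Number: $_\n" for @kept;
```

Note that every alternative requires a decimal part, so a bare value such as 59 without ".0" would not be matched by this pattern.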
Why do you use while? Is it possible to have more than one third column on a line? A simple if will work the same, communicating the intent more clearly.
Also, if you want to extract, you don't need to substitute. Use m// instead of s///.
Regexes aren't the right tool for numeric comparisons. Use >= instead:
print "Number: $1\n" if /[^\,]+\,[^\,]+\,([^\,]+)\,/
&& $1 >= 59
Assuming the line ends with a comma:
print foreach map{s/.+?(\d+\.\d+),$/$1/;$_} <IN>;
In case there might be something after the rightmost comma:
print foreach map{s/.+?(\d+\.\d+),[^,]*$/$1/;$_} <IN>;
But I wouldn't use a regexp in that case:
print foreach map{(split ',')[-2]} <IN>;
I would suggest not using a regex when split is a better tool for the job. Likewise - regex is very bad at detecting numeric values - it works on text based patterns.
But how about:
while ( <> ) {
print ((split /,\s*/)[2],"\n");
}
If you want to test a conditional:
while ( <> ) {
my @fields = split /,\s*/;
print $fields[2],"\n" if $fields[2] >= 59;
}
Or perhaps:
print join "\n", grep { $_ >= 59 } map { (split /,\s*/)[2] } <>;
map takes your input, and extracts the third field (returning a list). grep then applies a filter condition to every element. And then we print it.
Note - in the above, I use <> which is the magic file handle (reads files specified on command line, or STDIN) but you can use your filehandle.
However, it's probably worth noting that the 3-argument form of open with a lexical file handle is recommended now.
open ( my $input, '<', 'file.txt' ) or die $!;
It has a number of advantages and is generally good style.
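Combining the split approach with a lexical file handle might look like the sketch below. To keep it self-contained it reads from an in-memory string (`$data`, a stand-in for file.txt); `$cutoff` and `@selected` are names chosen for illustration.

```perl
use strict;
use warnings;

# Stand-in for the contents of file.txt, using the rows from the question.
my $data = "160101, 0100, 58.8,\n"
         . "160101, 0200, 59.3,\n"
         . "160101, 0300, 59.5,\n"
         . "160101, 0400, 59.1,\n";

# Opening a reference to a scalar reads from that string. For the real
# file this would be: open(my $input, '<', 'file.txt') or die $!;
open(my $input, '<', \$data) or die "Can't open in-memory data: $!";

my $cutoff = 59;    # threshold taken from the question
my @selected;
while (my $line = <$input>) {
    chomp $line;
    my @fields = split /,\s*/, $line;
    # Compare numerically instead of encoding the range in a regex.
    push @selected, $fields[2]
        if defined $fields[2] && $fields[2] >= $cutoff;
}
close $input;

print "Number: $_\n" for @selected;
```

Changing the threshold then only means changing `$cutoff`, with no pattern to rewrite.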

Stopping regex at the first match, it shows two times

I am writing a perl script and I have a simple regex to capture a line from a data file. That line starts with IG-XL Version:, followed by the data, so my regex matches that line.
if($row =~/IG-XL Version:\s(.*)\;/)
{
print $1, "\n";
}
Let's say $1 prints out 9.0.0. That's my desired outcome. However, another part of the same data file contains the same IG-XL Version: line, so $1 now prints out the value 9.0.0 twice.
I only want it to match the first one so I can only get the one value. I have tried /IG-XL Version:\s(.*?)\;/ which is the most suggested solution by adding a ? so it'll be .*? but it still outputs two. Any help?
EDIT:
The value of $row is:
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
The desired value I want is 8.00.01_uflx (P7) which I did get, but two times.
The only way to do this while reading the file line by line is to keep a status flag that records whether you have already found that pattern. But if you are storing the data in a hash, as you were in your previous question, then it won't matter, as you will just overwrite the hash element with the same value.
if ( $row =~ /IG-XL Version:\s*([^;]+)/ and not $seen_igxl_vn ) {
print $1, "\n";
$seen_igxl_vn = 1;
}
Or, if the file is reasonably small, you could read the whole thing into memory and search for just the first occurrence of each item.
I suggest you post a question showing your complete program, your input data, and your required output, so that we can give you a complete solution rather than seeing your problem bit by bit.
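The in-memory variant could be sketched as follows. The string `$contents` stands in for the slurped file and simply repeats the two lines from the question; `$version` is an illustrative name.

```perl
use strict;
use warnings;

# Stand-in for the whole file read into one string.
my $contents = join "\n",
    'Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31',
    'Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31',
    '';

# Without the /g modifier, a match stops at the first occurrence,
# so the duplicate line further down is never looked at.
my $version;
if ($contents =~ /IG-XL Version:\s*([^;]+)/) {
    $version = $1;
    print "$version\n";
}
```

With a real file, the string would come from reading the handle in one go, which is only sensible when the file is reasonably small.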

regex matching after new line in perl

I am trying to use a regex in Perl to match different parts of a text which are not on the same line.
I have a file of approximately 200 MB in which all cases are similar to the following example:
rewfww
vfresrgt
rter
*** BLOCK 049 Aeee/Ed "ewewew"U 141202 0206
BLAH1
BLAH2
END
and I want to extract everything on the same line after the "***" into $1, BLAH1 into $2 and BLAH2 into $3.
I have tried the following without success:
open(archive, "C:/Users/g/Desktop/blahs.txt") or die "die\n";
while(< archive>){
if($_ =~ /^\*\*\*(.*)\n(.*)/s){
print $1;
print $2;
}
}
One more complication: I don't know how many BLAHs are in each case. One case may have only BLAH1, another may have BLAH1, BLAH2 and BLAH3, etc. The only thing that's certain is the final "END", which separates the cases.
Regards
\*\*\*([^\n]*)\n|(?!^)\G\s*(?!\bEND\b)([^\n]+)
Try this. See demo.
https://regex101.com/r/vN3sH3/17
How about:
#!/usr/bin/perl
use strict;
use warnings;
open(my $archive, '<', "C:/Users/g/Desktop/blahs.txt") or die "die: $!";
while(<$archive>){
if (/^\*{3}/ .. /END/) {
s/^\*{3}//;
print unless /END/;
}
}
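Since the number of BLAH lines varies, fixed capture groups such as $2 and $3 are awkward. One possible sketch, which keeps a small state variable between lines instead of using the range operator, collects each case into a header plus a list of lines. The names `@cases` and `$current`, and the in-memory sample text, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample input from the question, read from memory to stay self-contained.
my $text = <<'END_TEXT';
rewfww
vfresrgt
rter
*** BLOCK 049 Aeee/Ed "ewewew"U 141202 0206
BLAH1
BLAH2
END
END_TEXT

open(my $archive, '<', \$text) or die "Can't open in-memory data: $!";

my @cases;      # one entry per "***" ... "END" block
my $current;    # block being collected, undef outside a block
while (my $line = <$archive>) {
    chomp $line;
    if ($line =~ /^\*{3}\s*(.*)/) {
        # Header line: start a new case.
        $current = { header => $1, lines => [] };
    }
    elsif ($line =~ /^END\b/) {
        # END closes the case, however many lines it had.
        push @cases, $current if $current;
        undef $current;
    }
    elsif ($current) {
        push @{ $current->{lines} }, $line;
    }
}
close $archive;

for my $case (@cases) {
    print "header: $case->{header}\n";
    print "  line: $_\n" for @{ $case->{lines} };
}
```

Reading line by line like this keeps memory use low, which matters for a file of about 200 MB.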
As far as I understand your question the following works for me. Please update or provide feedback if you are looking for something more or less strict (or spot any mistakes!).
^(\*{3}.*\n{2})(([a-zA-Z])*([0-9]*)\n{2})*(END)$
^(\*{3}.*\n{2}) - Finds a line that starts with three *s, followed by the rest of that line and two newlines. You could repeat this group by adding * after its closing parenthesis if you want/need to check for a "false" start. It looks like you may have data in the file before this, but this is the start of the data you actually care about and want to capture.
(([a-zA-Z])*([0-9]*)\n{2})* - The desired word characters followed by a number (or numbers if your BLAH count is >9), followed by two trailing newlines. The * at the end denotes that this can repeat zero or more times, which accounts for the case where you have no data. If you want the match to fail when there is no data, use + instead of * to denote that it must repeat one or more times. This segment assumes you wanted to check for data in the format word+number. If that is not the case, this part can easily be modified to accept a wider range of data. Let me know if you want/need a more or less strict case.
(END)$ - The regex ends with sequence "END". If it is permissible for the data to continue and you just want to stop capture at this point do not include the $
I don't have permissions to post pics yet but a great site to check and to see a visual representation of your regex imo is https://www.debuggex.com/

find ORF with minimal size of 45 bases using perl regular expression - why this regex doesn't work

I am using Perl and a regular expression to find an ORF (open reading frame) with a minimal size of 45 bases.
Basically it means:
Find a substring of a string that is composed ONLY of the letters ATGC (no spaces or new lines) that:
Starts with "ATG"
ends with "TAG" or "TAA" or "TGA",
is at least 39 chars long
has a length divisible by 3
My first code was:
$CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATGA";
if($CDSString =~ m/(ATG.{45,}(TAG|TAA|TGA))/)
{
my $CDSCurrent = $1;
if ((length($CDSCurrent) % 3) == 0)
{
# do something
}
}
which works fine, but I thought there might be a better way.
So I tried:
$CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATGA";
if ($CDSString =~ m/ATG(...){13,}(TAG|TAA|TGA)/ )
{
# do something
}
but for some reason it doesn't match the string above it, and I can't figure out why.
Can anyone figure it out? Thank you in advance.
Your regex is not making sure that everything between the start and stop codons is in fact composed of the letters ATGC only. You should be using:
if ($CDSString =~ m/ATG(?:[ATGC]{3}){13,}(?:TAG|TAA|TGA)/i) {...}
(But your original regex works, too, it just won't reject invalid matches. So there may be another problem somewhere else.)
There is a problem with the code thus far. What you should be looking for is the FIRST instance of a stop codon. If your CDS is no good, it might contain internal stops. Internal stop codons make an invalid ORF, so you need something more finessed:
if($CDSString =~ m/ATG(?:[ATGC]{3}(?<!TAG|TAA|TGA)){13,}(?:TAG|TAA|TGA)/i) {...}
This will return a sequence without internal stops that has at least 13 codons between the start and the first stop.
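This portion of the code: (?:[ATGC]{3}(?<!TAG|TAA|TGA)) says "match three nucleotides that are not TAG, TAA, or TGA". The (?<!...) construct is a negative lookbehind: it rejects the three letters just matched if they spell a stop codon.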
Here's how it looks in action:
perl -e '$CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATAGTAGTAGTGA";if ($CDSString =~ m/(ATG(?:[ATGC]{3}(?<\!TAG|TAA|TGA)){13,}(TAG|TAA|TGA))/ ){print "$1\n"}'
ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATAG
Note, the last 3 stop codons (TAGTAGTGA) are not returned as part of the sequence.
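The same check can be written as a small script. This is a sketch: the input is built programmatically (a start codon, 13 CAC codons, then three stop codons) so its length is unambiguous, and the names `$cds`, `$orf_re` and `$orf` are illustrative.

```perl
use strict;
use warnings;

# Hypothetical input: ATG, 13 CAC codons, then stop codons TAG TAG TGA.
my $cds = 'ATG' . ('CAC' x 13) . 'TAGTAGTGA';

# Each codon is consumed three letters at a time; the negative
# lookbehind rejects a codon that turns out to be a stop codon, so the
# repetition ends at the first in-frame stop.
my $orf_re = qr/(ATG(?:[ATGC]{3}(?<!TAG|TAA|TGA)){13,}(?:TAG|TAA|TGA))/i;

my $orf;
if ($cds =~ $orf_re) {
    $orf = $1;
    printf "%s (%d bases)\n", $orf, length $orf;
}
```

The captured frame is 45 bases long here (3 + 39 + 3), which is the minimum the pattern accepts; a sequence with fewer than 13 codons before the first stop would not match.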