Please give me some advice on removing the newline characters before letters while ignoring the lines starting with >.
eg:
>gi|16802049|ref|NP_463534.1| chromosomal replication initiation protein [Listeria monocytogenes EGD-e]
MQSIEDIWQETLQIVKKNMSKPSYDTWMKSTTAHSLEGNTFIISAPNNFVRDWLEKSYTQFIANILQEIT
GRLFDVRFIDGEQEENFEYTVIKPNPALDEDGIEIGKHMLNPRYVFDTFVIGSGNRFAHAASLAVAEAPA
KAYNPLFIYGGVGLGKTHLMHAVGHYVQQHKDNAKVMYLSSEKFTNEFISSIRDNKTEEFRTKYRNVDVL
LIDDIQFLAGKEGTQEEFFHTFNTLYDEQKQIIISSDRPPKEIPTLEDRLRSRFEWGLITDITPPDLETR
IAILRKKAKADGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLVNKDITAGLAAEALKDIIPSSKS
QVITISGIQEAVGEYFHVRLEDFKAKKRTKSIAFPRQIAMYLSRELTDASLPKIGDEFGGRDHTTVIHAH
EKISQLLKTDQVLKNDLAEIEKNLRKAQNMF
>gi|16802050|ref|NP_463535.1| DNA polymerase III subunit beta [Listeria monocytogenes EGD-e]
MKFVIERDRLVQAVNEVTRAISARTTIPILTGIKIVVNDEGVTLTGSDSDISIEAFIPLIENDEVIVEVE
SFGGIVLQSKYFGDIVRRLPEENVEIEVTSNYQTNISSGQASFTLNGLDPMEYPKLPEVTDGKTIKIPIN
VLKNIVRQTVFAVSAIEVRPVLTGVNWIIKENKLSAVATDSHRLALREIPLETDIDEEYNIVIPGKSLSE
LNKLLDDASESIEMTLANNQILFKLKDLLFYSRLLEGSYPDTSRLIPTDTKSELVINSKAFLQAIDRASL
LARENRNNVIKLMTLENGQVEVSSNSPEVGNVSENVFSQSFTGEEIKISFNGKYMMDALRAFEGDDIQIS
FSGTMRPFVLRPKDAANPNEILQLITPVRTY
The sequence lines of each record should be joined into a single line, while the newline before lines starting with '>' should not be removed. I tried
\n^[a-z]
but it also removes the first letter of each line. Is it possible to do the same without removing the first letter of each line, while ignoring lines starting with '>'? Thanks in advance. I am looking for an expression that works in TextPad.
You can use this regex
[\r\n]+(?=[a-zA-Z])
and replace it with empty string
OR
[\r\n]+([a-zA-Z])
and replace it with \1 or $1 whichever works
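Note that both patterns also remove the newline between a '>' header line and the first sequence line whenever the sequence starts with a letter, so the header and its sequence end up on one line. If the header should stay on its own line, a line-by-line pass is one way to do it. The following is a minimal Perl sketch rather than a TextPad expression; the function name unwrap_fasta and the sample records are made up for illustration.

```perl
use strict;
use warnings;

# Join the wrapped sequence lines of each record into one line,
# keeping every '>' header on a line of its own.
# unwrap_fasta is a hypothetical helper name, not part of any library.
sub unwrap_fasta {
    my ($text) = @_;
    my $out         = '';
    my $in_sequence = 0;    # true while we are appending sequence lines

    for my $line (split /\r?\n/, $text) {
        next if $line eq '';
        if ($line =~ /^>/) {
            $out .= "\n" if $in_sequence;    # finish the previous sequence
            $out .= "$line\n";
            $in_sequence = 0;
        }
        else {
            $out .= $line;                   # no newline: keep joining
            $in_sequence = 1;
        }
    }
    $out .= "\n" if $in_sequence;            # finish the last sequence
    return $out;
}

# Made-up sample input with two short records.
print unwrap_fasta(">seq1 demo\nMQSIE\nDIWQE\n>seq2 demo\nMKFVI\nERDRL\n");
```

Each record then occupies exactly two lines: the header and the joined sequence.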
I have solved this by using regular expressions in Perl. For anyone who needs something like this in the future:
use strict;
use warnings;
print "Please enter the name of the file\n";
chomp(my $n = <STDIN>);     # remove the trailing newline from the file name
print "Please enter the name of the output file\n";
chomp(my $n1 = <STDIN>);
open(INFO, '<', $n) or die "cannot open $n: $!";
my @a = <INFO>;
#print @a;
foreach (@a)
{
    $_ =~ s/\n//g;      # remove the newline at the end of every line
    $_ =~ s/>/\n>/g;    # put a newline back in front of every '>'
}
#print @a;
open(MYFILE, '>', $n1) or die "cannot open $n1: $!";
print MYFILE @a;
close(MYFILE);
close(INFO);
It's extremely simple.
Related
I'm a regex newbie, and I am trying to use a regex to return a list of dates from a text file. The dates are in mm/dd/yy format, so a year such as 1955 appears as '55'. I am trying to return all entries from years '50' to '99'.
I believe the problem I am having is that once my regex finds a match on a line, it stops right there and jumps to the next line without checking the rest of the line. For example, I have the dates 12/12/12, 10/10/57, 10/09/66 all on one line in the text file, and it only returns 10/10/57.
Here is my code thus far. Any hints or tips? Thank you
open INPUT, "< dates.txt" or die "Can't open input file: $!";
while (my $line = <INPUT>){
if ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
A few points about your code
You must always use strict and use warnings 'all' at the top of all your Perl programs
You should prefer lexical file handles and the three-parameter form of open
If your regex pattern contains literal slashes then it is clearest to use a non-standard delimiter so that they don't need to be escaped
Although recent releases of Perl have fixed the issue, there used to be a significant performance hit when using $&, so it is best to avoid it, at least for now. Put capturing parentheses around the whole pattern and use $1 instead
This program will do as you ask
use strict;
use warnings 'all';
open my $fh, '<', 'dates.txt' or die "Can't open input file: $!";
while ( <$fh> ) {
print $1, "\n" while m{(\d\d/\d\d/[5-9][0-9])}g
}
output
10/10/57
10/09/66
You are printing $&, which only ever holds the most recent match.
To get every match on the line, run the match in list context and store the results in an array. Use a single pair of capturing parentheses around the whole date so that each array element is one complete date.
while (<$fh>) {
    my @dates = $_ =~ /(\d\d\/\d\d\/[5-9][0-9])/g;
    print "@dates\n" if (@dates);
}
You just need to change the 'if' to a 'while' and the regex will take up where it left off;
open(INPUT, '<', 'a.dat') or die "Can't open input file: $!";
while (my $line = <INPUT>){
while ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
# Output given line above
# 10/10/57
# 10/09/66
You could also capture the whole of the date into one capture variable and use a different regex delimiter to save escaping the slashes:
while ($line =~ m|(\d\d/\d\d/[5-9]\d)|g) {
print "$1\n" ;
}
...but that's a matter of taste, perhaps.
You can also use map to collect the dates with years 50 to 99 and store them in an array
open(INPUT, '<', 'dates.txt') or die "Can't open input file: $!";
@as = map{$_ =~ m/\d\d\/\d\d\/[5-9][0-9]/g} <INPUT>;
$, = "\n";
print @as;
Another way around it is removing the dates you don't want.
$line =~ s/\d\d\/\d\d\/[0-4]\d//g;
print $line;
I have two files: the first (file1) contains several regexes, while the other (file2) contains FASTA sequences. My intention is to use the regexes in file1 to check whether they match any FASTA sequences in file2, and to print any regexes that match at least one sequence, along with the number of sequences they match. I would have liked to provide my sample code but I couldn't even begin. Please help.
file1 is structured in such a way that each line has an ID, followed by '>>', then the regex;
e.g FGER_HWW_PRT >> ..DW[ALK]..[^P]..[VI]{2,4}
TKAR_GLW_NQW >> [^VKR]{0,2}..FP[D].T.N.Q.
etc...
file2 has the identifier of a sequence on one line and the sequence on the next line;
e.g >lac9_B: details details
GFVTSDRWPALKMSRWSLEMVWASRGYPLVNDRMWSWSDDDP
>serP_A: otherdetails details2
GFVLSDPPPPALKMSRWSLEMVWASRGYPLVNDPWQRTKRKRKDRTCWASNYIHDRP
etc...
Thanks in advance.
This might get you started. If you think it might be useful for you, let me know and I can explain what's going on:
#!/usr/bin/perl
use warnings;
use strict;
(Using your .fasta file as input):
my $infile = 'in.txt';
open my $input, '<', $infile or die "Can't open to $infile: $!";
my (@head, @seq, %hash);
Set a 'match' variable to test your headers for:
my $match = "details2";
while (<$input>) {
    chomp;
    push @head, $_ if /^>/;
    push @seq, $_ if /^[A-Z]/;
    @hash{@head} = @seq;
}
Cycle through the keys (headers) of your hash, and test print the header and sequence if they match your match variable:
foreach my $header (keys %hash){
if ($header =~ /$match/){
print "Name: $header\tcontains: '$match'\nSequence: $hash{$header}\n" ;
}
}
Output:
Name: >serP_A: otherdetails details2 contains: 'details2'
Sequence: GFVLSDPPPPALKMSRWSLEMVWASRGYPLVNDPWQRTKRKRKDRTCWASNYIHDRP
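The code above filters on the header text. For the task as originally stated (report each regex that matches at least one sequence, together with the number of sequences it matches), a rough sketch could look like the following. The helper names parse_patterns, parse_sequences and count_matches are invented for this example, the three PAT_* patterns are made-up test data rather than the ones from the question, and the data is given inline instead of being read from file1 and file2.

```perl
use strict;
use warnings;

# Turn "ID >> regex" lines into [ID, compiled regex] pairs.
sub parse_patterns {
    my @pairs;
    for my $line (@_) {
        next unless $line =~ /^\s*(\S+)\s*>>\s*(\S+)\s*$/;
        my ($id, $re) = ($1, $2);
        push @pairs, [ $id, qr/$re/ ];
    }
    return @pairs;
}

# Keep only the sequence lines of a FASTA listing (skip '>' headers and blanks).
sub parse_sequences {
    my @seqs = grep { !/^>/ && /\S/ } @_;
    chomp @seqs;
    return @seqs;
}

# Return a hash of ID => number of matching sequences,
# leaving out patterns that match nothing.
sub count_matches {
    my ($patterns, $sequences) = @_;
    my %count;
    for my $pair (@$patterns) {
        my ($id, $re) = @$pair;
        my $n = grep { $_ =~ $re } @$sequences;    # grep in scalar context counts
        $count{$id} = $n if $n > 0;
    }
    return %count;
}

# Hypothetical patterns in the "ID >> regex" layout of file1.
my @patterns = parse_patterns(
    "PAT_A >> ALK.S",
    "PAT_B >> W{2}",
    "PAT_C >> DP{4}",
);

# The two sample records from the question.
my @sequences = parse_sequences(
    ">lac9_B: details details",
    "GFVTSDRWPALKMSRWSLEMVWASRGYPLVNDRMWSWSDDDP",
    ">serP_A: otherdetails details2",
    "GFVLSDPPPPALKMSRWSLEMVWASRGYPLVNDPWQRTKRKRKDRTCWASNYIHDRP",
);

my %count = count_matches(\@patterns, \@sequences);
print "$_\t$count{$_}\n" for sort keys %count;
```

With real files, the inline lists would be replaced by lines read from the two file handles. Sequences wrapped over several lines would need to be joined per record first.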
I have been given some DNA sequences by collaborators in a word document that I'd like to convert into a series of fasta sequences in one file.
I've made it into a text file and figured I could use regular expressions to extract the gene name and the sequence:
use warnings;
use strict;
die "usage: make_fasta.pl <sequence file>" unless (@ARGV == 1);
my $seq_filename = shift;
my $fasta_db_name = $seq_filename . "_db.fa";
open(my $seq_file, '<', $seq_filename)
or die "can't open file $seq_filename, $!";
open(my $fasta_file, '>', $fasta_db_name)
or die "can't open file $fasta_db_name, $!";
while (my $line = <$seq_file>) {
chomp $line;
if ($line =~ /^[ATCG]+$/) { # if the line is entirely DNA sequence
print $fasta_file "$line\n";
} elsif ($line =~ /Full-length (\w+) cDNA/) { # if the line has gene info
print $fasta_file ">$1\n";
} else {
next;
}
}
But that just gave me the name of the first gene. Clearly I've done something wrong with the DNA regular expression but I can't for the life of me work it out. To my eyes it's exactly the same as other suggested DNA tests I've found on this site and others.
The file I'm trying to parse is configured like so:
Collaborators name
title of gene set
Full-length clock cDNA coding sequence
ATGGTAGGATGTGTAATGCGTACGTGATCGT
Full-length per cDNA coding sequence
ATGCTAGCTACGTACGTAGCTACGTAGTACG
I want the output to be a fasta file so:
>clock
ATGGTAGGATGTGTAATGCGTACGTGATCGT
>per
ATGCTAGCTACGTACGTAGCTACGTAGTACG
The first few lines of the actual input file are:
Dr Lin Zhang (Leicester University 10/2012)
Canonical clock genes
Full-length per cDNA coding seq (3693bp)
ATGGACACAGGAACACCCCATGAAGATGTGCCCTCAGAGGACCACACCTTGGAAGAAGGGGACAGCAAGAACCCCTCGTGCCAGCAAGAGTCAGCCTACGGCTCCCTCGAGTCATCCTCCAATGGACAGTCTCAGAAAAGTTTCGGAGGAAGTGGAAGCAAAAGCTTAAATAGTGGTTCGAGTCACAGCAGCGGCTTTGGGGACCAAAATGATTTCAAGGGTATCCATCTTCACGAAGCGAAACACATAGCGTTGAAGAAGAAGAAAACTGGGAAAGGAGGTGAAAAGGTAGCAGAAATCCCCTTTCAAACTGCCTCTGAGGCAGAACTGTCCTCCAAAGGAAACGAAACAGAAAAGGAGAAAGAAACAAGCCTCGAGGAGTCTCCTGCTGCAAAAGAGGAAGCAATTATCGAAAAGGAGTCTCGTTACATCCACCCGAGGAACT
Kind of hard to answer this question without seeing part of the actual input file.
There is a mismatch between your example input and your regex:
# looking for verbatim 'Full-length', then <space>, then one word of alphanumerics, then <space>, and then verbatim 'cDNA'
$line =~ /Full-length (\w+) cDNA/;
Your example input line has 'Full length' without a dash, multiple words for the gene name rather than just one, and no 'cDNA' at the end.
If your input line has 'Full-length gene name with multiple words cDNA', your REGEX can be:
$line=~/Full-length\s+(.*?)\s+cDNA/;
The problem is apparently with your input data. I modified the code you posted to produce the following program:
#!/usr/bin/env perl
use warnings;
use strict;
while (my $line = <DATA>) {
chomp $line;
if ($line =~ /^[ATCG]+$/) { # if the line is entirely DNA sequence
print "$line\n";
} elsif ($line =~ /Full-length (\w+) cDNA/) { # if the line has gene info
print ">$1\n";
}
}
__DATA__
Collaborators name
title of gene set
Full-length clock cDNA coding sequence
ATGGTAGGATGTGTAATGCGTACGTGATCGT
Full-length per cDNA coding sequence
ATGCTAGCTACGTACGTAGCTACGTAGTACG
and it produces the output you specified:
~$ src/tmp/cdna
>clock
ATGGTAGGATGTGTAATGCGTACGTGATCGT
>per
ATGCTAGCTACGTACGTAGCTACGTAGTACG
My modifications were only to make it self-contained and did not change any of the flow control or logic, aside from removing the useless else { next } clause.
Can you find and post a few lines of actual data which fails for you, since the dummy data provided seems to work correctly?
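Since the text came from a Word document, one possible cause (only a guess without seeing the failing lines) is that the sequence lines carry a trailing carriage return or spaces, lowercase letters, or ambiguity codes such as N, any of which makes /^[ATCG]+$/ fail. A more tolerant sketch is shown below. The function name to_fasta is invented here, and accepting N and lowercase input is an assumption about the data, not something the question confirms.

```perl
use strict;
use warnings;

# Convert the collaborator-style listing into FASTA text.
# to_fasta is a hypothetical helper name for this sketch.
sub to_fasta {
    my @lines = @_;    # work on a copy so the caller's data is untouched
    my $out   = '';

    for my $line (@lines) {
        $line =~ s/\s+\z//;    # drop "\n", "\r\n" and any trailing spaces

        if ($line =~ /^[ACGTN]+$/i) {
            # Assumption: lowercase bases and N are acceptable sequence characters.
            $out .= "$line\n";
        }
        elsif ($line =~ /Full-length\s+(\w+)\s+cDNA/) {
            $out .= ">$1\n";   # gene name becomes the FASTA header
        }
        # every other line (names, titles) is skipped
    }
    return $out;
}

# Lines modelled on the sample input, with Windows line endings added.
print to_fasta(
    "Canonical clock genes\r\n",
    "Full-length per cDNA coding seq (3693bp)\r\n",
    "ATGGACACAGGAACACCCCATG \r\n",
);
```

If this version produces the sequences and the original does not, dumping a failing line with a tool such as od -c should show which stray characters are present.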
I'm just learning perl and I'm trying to learn regular expressions at the same time. Basically I'm trying to open a log file and print out any lines that match user input to a new file. Using the following code I get no output at all if I type in the word "Clinton". But if I replace
print MYFILE if /\$string\;
with
print MYFILE if /\Clinton\;
it runs as expected. Any ideas? I know it is something simple that I am missing.
print "Enter a word to look up: ";
$string = <>;
print "You put $string";
open(LOG,"u_ex121011.log") or die "Unable to open logfile:$!\n";
open (MYFILE, '>>data2.txt');
while(<LOG>){
print MYFILE if /\Q($string)\E/;
}
close (MYFILE);
close(LOG);
print "Check data2.txt";
In Perl, unlike in some languages, the input operator doesn't silently remove a trailing newline. So your $string is actually "Clinton\n" rather than "Clinton". To fix it, use the chomp function:
$string = <>;
chomp $string;
print "You put $string\n";
You should also use the 3 argument version of open.
open( my $LOG, '<', 'u_ex121011.log' ) or die "Unable to open file:$!\n";
Here is documentation of open: http://perldoc.perl.org/functions/open.html
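In addition to what ruakh said, you can make the check explicit by using the $_ variable and the =~ operator. A bare /.../ already matches against $_, so this is equivalent but spells out what is being tested. The line still ends with its own newline, so print it as it is.
print MYFILE $_ if $_ =~ /\Q$string\E/;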
Going off of your comment, you can split the line up surprisingly enough using split.
Here is an example of what you could do:
my @lines = split( ' ', $_ );
print MYFILE "$lines[0] $lines[1] $lines[2] $lines[3]\n";
Here is documentation of split: http://perldoc.perl.org/functions/split.html
I am writing a script to read a LOG file. I want the user to type a word and then look it up and print the line (from a string) matching the word.
I'm just learning Perl so please be very specific and simple so that I can understand it.
print "Please Enter the word to find: ";
chomp ($userInput = <STDIN>);
while ($line = <INPUT>)
if ($line =~ /userInput/)
print $line;
I know that this is not perfect but I'm just learning.
You were close. You need to expand the variable in the pattern match.
print "Please Enter the word to find: ";
chomp ($userInput = <STDIN>);
while ($line = <INPUT>) {
if ($line =~ /$userInput/) { # note extra dollar sign
print $line;
}
}
Be aware that that is a pattern match, so you are searching with a string that potentially contains wildcards in it. If you want a literal string, put a \Q in front of the variable as you interpolate it: /\Q$userInput/.
Something like .*\bWORD\b.* might work (though it is not tested)
print $line if ($line =~ /.*\bWORD\b/);
@NewLearner
\b is for word boundaries
http://www.regular-expressions.info/wordboundaries.html
If you're doing just one lookup, using a while loop is fine. Though of course you'll need to fix your syntax.
You could also use grep:
print grep /$userInput/, <INPUT>;
If you want to do multiple lookups, you can either reopen the file handle (if the file is large), or store it in an array:
print grep /$userInput/, @array;
You'll have meta characters in your input, of course. This can be a good thing, or bad, depending on your users. For example, an experienced user would recognize the option to refine his search by entering a search term such as ^foo(?=bar), whereas other people may get very confused when they can't find the string foo+bar.
A way to escape meta characters is by using quotemeta on your input. Another is to use \Q ... \E inside your regex.
$userInput = quotemeta($userInput);
# or
print grep /\Q$userInput\E/, <INPUT>;
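To make the difference concrete, here is a small self-contained sketch that runs the same search term once as a regex and once as literal text. The helper names grep_as_regex and grep_as_literal and the two sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Treat the search term as a regular expression.
sub grep_as_regex {
    my ($term, @lines) = @_;
    return grep { /$term/ } @lines;
}

# Treat the search term as literal text by escaping its metacharacters.
sub grep_as_literal {
    my ($term, @lines) = @_;
    return grep { /\Q$term\E/ } @lines;
}

my @lines = ("foo+bar\n", "fooobar\n");

# As a regex, 'foo+bar' means "fo, one or more o, then bar".
print "regex:   ", grep_as_regex('foo+bar', @lines);

# Quoted, it only matches the literal characters f-o-o-+-b-a-r.
print "literal: ", grep_as_literal('foo+bar', @lines);
```

The unquoted search finds only "fooobar", while the quoted one finds only "foo+bar", which is the surprise described above.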
I believe if I were you, I would use a subroutine for the lookup. That way you can perform as many lookups as you like rather handily.
use strict;
use warnings; # ALWAYS use these
my $inputfile = 'input.log'; # assumed file name; set this to your log file
print "Please Enter the word to find: ";
chomp (my $userInput = <>); # <> is a more flexible handle
print lookup($userInput);
sub lookup {
    my $word = shift;
    open my $fh, "<", $inputfile or die $!;
    my @hits;
    while (<$fh>) {
        push @hits, $_ if /\Q$word\E/;
    }
    return @hits;
}