regular expression code - regex

I need to find matches between two tab-delimited files, like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2:
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
What I would like to do is retrieve the matching lines between the two files. What I would like to match is everything after the gene ID.
So far I have written this code, but unfortunately Perl keeps giving me the error:
Use of uninitialized value in pattern match (m//)
Could you please help me figure out where I am going wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot open gene file";
open (INB, $ARGV[1]) || die "cannot open coding_annotated.var files";
my @sample1 = <INA>;
my @sample2 = <INB>;
foreach my $line (@sample1) {
my @tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (@sample2){
my @fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}

On its surface your code seems fine (although I didn't debug it). If there isn't an error I cannot spot, it could be that the input data contains regex special characters, which will confuse the regular expression engine when you interpolate them as-is (e.g. if any of the variables contains a '$' character). It could also be that instead of tabs you have spaces somewhere, in which case the split will produce fewer fields than you expect, $fields[1] and friends will be undef, and you get exactly the warning you quoted.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little more idiomatic Perl. I like using the implicit $_, which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my @sample1 = <INA>;
my @sample2 = <INB>;
foreach (@sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (@sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point in splitting the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first file's strings (or compiled regexes) in a hash. Lookups are then effectively constant time, giving you roughly O(n+m) overall instead of O(n*m), where n and m are the number of lines in each file.
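A minimal sketch of that lookup idea (hypothetical sample data; assumes exactly one tab after the ID, as discussed above):

```perl
use strict;
use warnings;

# Hypothetical sketch: key the first file by everything after the ID,
# then look each line of the second file up in the hash.
# In-memory strings stand in for the two files here.
my $data1 = "ID1\t2 56788990 55678900 T A ACT1\n";
my $data2 = "ID2\t2 56788990 55678900 T A ACT1\n";
open my $fh1, '<', \$data1 or die $!;
open my $fh2, '<', \$data2 or die $!;

my %first;    # rest-of-line => ID from file 1
while (my $line = <$fh1>) {
    my ($id, $rest) = split /\t/, $line, 2;
    $first{$rest} = $id;
}
while (my $line = <$fh2>) {
    my ($id, $rest) = split /\t/, $line, 2;
    print "$first{$rest} - $id\t$rest" if exists $first{$rest};
}
```

Each lookup is a single hash probe, so the total work is proportional to the number of lines rather than to their product.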
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it makes sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.

I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-separated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my @sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, @fields1) = split /\t/, $s1;
my @regexens = map qr{\Q$_\E(.*)}, @fields1;
SAMPLE_2:
foreach my $s2 (@sample2) {
my (undef, @fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my @sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (@sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity.

How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = @ARGV;
my @lines_a = File::Slurp::read_file($ina);
my @lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (@lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespace; there could be multiple tabs or spaces
if (exists $table_b->{$rest}) {
push @{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (@lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join "\n ", map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } @{ $table_b->{$rest} }) . "\n";
}
}


Preventing "foo" from matching "foo-bar" with grep -w

I am using grep inside my Perl script and I am trying to grep the exact keyword that I am giving. The problem is that "-w" treats the "-" symbol as a word boundary, so the shorter keyword still matches inside the longer one.
example:
Let's say that I have these two records:
A1BG 0.0767377011073753
A1BG-AS1 0.233775553296782
if I give
grep -w "A1BG"
it returns both of them but I want only the exact one.
Any suggestions?
Many thanks in advance.
PS.
Here is my whole code.
The input file has two columns, tab-separated. So, I want to keep a unique value for each gene. In cases where I have more than one record, I calculate the average.
#!/usr/bin/perl
use strict;
use warnings;
#Find the average fc between common genes
sub avg {
my $total;
$total += $_ foreach @_;
return $total / @_;
}
my @mykeys = `cat G13_T.txt| awk '{print \$1}'| sort -u`;
foreach (@mykeys)
{
my @TSS = ();
my $op1 = 0;
my $key = $_;
chomp($key);
#print "$key\n";
my $command = "cat G13_T.txt|grep -E '([[:space:]]|^)$key([[:space:]]|\$)'";
#my $command = "cat Unique_Genes/G13_T.txt|grep -w $key";
my @belongs= `$command`;
chomp(@belongs);
my $count = scalar(@belongs);
if ($count == 1) {
print "$belongs[0]\n";
}
else {
for (my $i = 0; $i < $count; $i++) {
my @token = split('\t', $belongs[$i]);
my $lfc = $token[1];
push (@TSS, $lfc);
}
$op1 = avg(@TSS);
print $key ."\t". $op1. "\n";
}
}
If I got the clarifications in the comments right, the objective is to find the average of the values (second column) for unique names in the first column. Then there is no need for external tools.
Read the file line by line and add up the values for each name. Name uniqueness is granted by using a hash, with the names being keys. Along with this, also track their counts.
use warnings;
use strict;
use feature 'say';
my $file = shift // die "Usage: $0 filename\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my %results;
while (<$fh>) {
#my ($name, $value) = split /\t/;
my ($name, $value) = split /\s+/; # used for easier testing
$results{$name}{value} += $value;
++$results{$name}{count};
}
foreach my $name (sort keys %results) {
$results{$name}{value} /= $results{$name}{count}
if $results{$name}{count} > 1;
say "$name => $results{$name}{value}";
}
After the file is processed each accumulated value is divided by its count and overwritten by that, so by its average (/= divides and assigns), if count > 1 (as a small measure of efficiency).
If there is any use in knowing all values that were found for each name, then store them in an arrayref for each key instead of adding them
while (<$fh>) {
#my ($name, $value) = split /\t/;
my ($name, $value) = split /\s+/; # used for easier testing
push @{$results{$name}}, $value;
}
where now we don't need the count as it is given by the number of elements in the array(ref)
use List::Util qw(sum);
foreach my $name (sort keys %results) {
say "$name => ", sum(@{$results{$name}}) / @{$results{$name}};
}
Note that a hash built this way needs memory comparable to the file size (or may even exceed it), since all values are stored.
This was tested using the shown two lines of sample data, repeated and changed in a file. The code does not test the input in any way, but expects the second field to always be a number.
Notice that there is no reason to ever step out of our program and use external commands.
You may use a POSIX ERE regex with grep like this:
grep -E '([[:space:]]|^)A1BG([[:space:]]|$)' file
To return matches (not matching lines) only:
grep -Eo '([[:space:]]|^)A1BG([[:space:]]|$)' file
Details
([[:space:]]|^) - Group 1: a whitespace or start of line
A1BG - a substring
([[:space:]]|$) - Group 2: a whitespace or end of line

How to store array list items separated by commas and ending with a dot, using Perl

use strict;
use warnings;
my $tmp = join "\n", <DATA>;
my @biblables = ();
List items will be fetched and stored into @biblables in a while loop
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g)
{
push(@biblables, "\\cite{$2}, ");
}
print @biblables;
While printing this we are getting output like this:
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006},
However we need the output like this
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006}.
Hence we can use the postmatch variable ($') to insert a dot at the end of the last list item in the array
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g)
{
my $post = $';
if($post!~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/)
{ push(@biblables, "\\cite{$2}."); }
else { push(@biblables, "\\cite{$2}, "); }
}
print @biblables;
Could you please advise me if there is a shorter way to get this output?
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
You can add the comma and period after the fact:
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g){
push(@biblables, "\\cite{$2}");
}
print join ', ', @biblables;
print ".\n";
If you read from a filehandle you can use eof to determine that you are on the last line, at which point you replace the comma by the dot in the last element. This allows you to build the array completely in the loop, as required.
use warnings;
use strict;
open my $fh, '<', 'bibitems.txt';
my #biblabels;
while (<$fh>) {
push @biblabels, "\\cite{$2}," if /\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/;
$biblabels[-1] =~ tr/,/./ if eof;
}
print "$_ " for @biblabels;
print "\n";
This prints your desired output.
The eof returns true if the next read will return end-of-file. This means that you've just read the last line, which got put on the array if it matched. This function is rarely needed, but here it seems to find a fitting purpose. Note that eof and eof() behave a little differently. Please see the eof page.
If the other capture in the regex is meant to be used, change the above to if (...) { ... }. Note that what is in {} is called the citation key in LaTeX, while the (optional) labels are the things inside []. I'd go with the array name of @citkeys for clarity.
If you're determined to add the commas and dots to the elements while matching in the regex while loop, it can be done like this.
Since you don't know the total number of matches yet, just keep a reference to the most recently pushed element.
Then append the , or . as needed.
Code
use strict;
use warnings;
$/ = undef;
my $tmp = <DATA>;
my @biblables = ();
my $ref = undef;
while( $tmp =~ /\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g )
{
$$ref .= ", " if defined $ref;
$ref = \$biblables[ push(@biblables,"\\cite{$2}") - 1 ]; # push returns the new length, so subtract 1 for the last index
}
$$ref .= "." if defined $ref;
print @biblables;
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
Output
\cite{BuI2001}, \cite{BuI2002}, \cite{BuI2003}, \cite{BuI2004}, \cite{BuI2005}, \cite{BuI2006}.

Why does my program crash after one line of input?

I am writing a simple program which capitalizes each word in a sentence. It gets multi-line input. I then loop through the input lines, split each line into words, capitalize them, and then join the line again. This works fine if the input is one sentence, but as soon as I input two lines my program crashes (and if I wait too long my computer freezes).
Here is my code
@input = <STDIN>;
foreach(@input)
{
#reset @words
@words = ();
#readability
$lines =$_;
#split sentence
@words = split( / /, $lines );
#capitalize each word
foreach(@words){
$words[$k] = ucfirst;
$k++;
}
#join sentences again
$lines = join(' ', @words);
#create output line
$output[$i]=$lines;
$i++;
}
#print the result
print "\nResult:\n";
foreach(@output){
print $output[$j],"\n";
$j++;
}
Could someone please tell me why it crashes?
use strict (and be told about not properly handled variables like your indices)
use for my $var (@array) to get a usable item without an index (Perl isn't JavaScript)
What isn't there can't be wrong (e.g. push instead of an index)
In code:
use strict; # always!
my @input = <STDIN>; # the loop needs in- and output
my @output = ();
for my $line (@input) # for makes readability *and* terseness easy
{
chomp $line; # get rid of eol
#split sentence
my @words = split( / /, $line );
#capitalize each word
for my $word (@words){ # no danger of mishandling indices
$word = ucfirst($word);
}
#join sentences again
$line = join(' ', @words);
#create output line
push @output, $line;
}
#print the result
print "\nResult:\n";
for my $line (@output){
print $line, "\n";
}
The problem is that you are using global variables throughout, so they keep their values across iterations of the loop. You have reset @words to an empty list even though you didn't need to - it is overwritten when you assign the result of split to it - but $k is increasing endlessly.
$k is initially set to undef, which evaluates as zero, so for the first sentence everything is fine. But you leave $k set to the number of elements in @words, so it starts from there instead of from zero for the next sentence. Your loop over @words becomes endless because you are assigning to (and so creating) $words[$k], so the array is getting longer as fast as you are looping through it.
The same problem applies to $i and $j, but execution never gets as far as reusing those.
Although this was the only way of working in Perl 4, over twenty years ago, Perl 5 has made programs very much nicer to write and debug. You can now declare variables with my, and you can use strict, which (among other things) insists that every variable you use must be declared, otherwise your program won't compile. There is also use warnings, which is just as invaluable. In this case it would have warned you that you were using an undefined variable $k etc. to index the arrays.
If I apply use strict and use warnings, declare all of your variables and initialise the counters to zero then I get a working program. It's still not very elegant, and there are much better ways of doing it, but the error has gone away.
use strict;
use warnings;
my @input = <STDIN>;
my @output;
my $i = 0;
foreach (@input) {
# readability
my $lines = $_;
# split sentence
my @words = split ' ', $lines;
# capitalize each word
my $k = 0;
foreach (@words) {
$words[$k] = ucfirst;
$k++;
}
# join sentences again
$lines = join ' ', @words;
#create output line
$output[$i] = $lines;
$i++;
}
print "\nResult:\n";
my $j = 0;
foreach (@output) {
print $output[$j], "\n";
$j++;
}

Is there any better way to "grep" from a large file than using `grep` in perl?

The $rvsfile is the path to a file of about 200M. I want to count the number of lines which contain $userid. But using grep in a while loop seems very slow. So is there an efficient way to do this? Because $rvsfile is very large, I can't read it into memory using @tmp = <FILEHANDLE>.
while(defined($line = <SRCFILE>))
{
$line =~ /^([^\t]*)\t/;
$userid = $1;
$linenum = `grep '^$userid\$' $rvsfile | wc -l`;
chomp($linenum);
print "$userid $linenum\n";
if($linenum == 0)
{
print TARGETFILE "$line";
}
}
And how can I get the part before \t in a line without regex? For example, the line may like this:
2013123\tsomething
How can I get 2013123 without regex?
Yes, you are forking a shell on each loop iteration. This is slow. You also read the entire $rvsfile once for every user. This is too much work.
Read SRCFILE once and build a list of @userids.
Read $rvsfile once, keeping a running count of each user id as you go.
Sketch:
my @userids;
while(<SRCFILE>)
{
push @userids, $1 if /^([^\t]*)\t/;
}
my $regex = join '|', @userids;
my %count;
while (<RSVFILE>)
{
++$count{$1} if /^($regex)$/o;
}
# %count has everything you need...
Use hashes:
my %count;
while (<LARGEFILE>) {
chomp;
$count{$_}++;
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE
Or if you fear using too much memory for the hash (i.e. you're interested in 6 users, and there are 100K more in the large file), do it another way:
my %count;
while (<SMALLFILE>) {
/^(.*?)\t/ and $count{$1} = 0; # key the hash by the captured userid, not the whole line
};
while (<LARGEFILE>) {
chomp;
$count{$_}++ if defined $count{$_};
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE, *if* userid is in SMALLFILE
You can search for the location of the first \t using index, which will be faster. You can then use substr to get the part before it.
Suggest you benchmark various approaches.
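For illustration, index plus substr (substr rather than splice, since splice operates on arrays) could look like this, with a made-up input line:

```perl
use strict;
use warnings;

# Extract the part before the first tab without a regex.
my $line = "2013123\tsomething";
my $tab  = index $line, "\t";                     # position of first tab, or -1
my $userid = $tab >= 0 ? substr($line, 0, $tab) : $line;
print "$userid\n";                                # prints 2013123
```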
If I read you correctly you want something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $userid = 1246;
my $count = 0;
my $rsvfile = 'sample';
open my $fh, '<', $rsvfile;
while(<$fh>) {
$count++ if /$userid/;
}
print "$count\n";
or even (and someone correct me if I am wrong, but I don't think this reads the whole file in):
#!/usr/bin/perl
use strict;
use warnings;
my $userid = 1246;
my $rsvfile = 'sample';
open my $fh, '<', $rsvfile;
my $count = grep {/$userid/} <$fh>;
print "$count\n";
If <SRCFILE> is relatively small, you could do it the other way round. Read in the larger file one line at a time, and check each userid per line, keeping a count of each userid using a hash structure. Something like:
my %userids = map {($_, 0)} # use as hash key with init value of 0
grep {$_} # only return matches
map {/^([^\t]+)/} <SRCFILE>; # extract ID
while (defined($line = <LARGEFILE>)) {
for (keys %userids) {
++$userids{$_} if $line =~ /\Q$_\E/; # \Q...\E escapes special chars in $_
}
}
This way, only the smaller data is read repeatedly and the large file is scanned once. You end up with a hash of every userid, and the value is the number of lines it occurred in.
If you have a choice, try it with awk:
awk 'FNR==NR{a[$1];next} { for(i in a) { if ($0 ~ i) { print $0} } } ' $SRCFILE $rsvfile

Perl program to mimic restriction enzymes using references, hash tables and subs

I'm a student in an intro Perl class. I'm looking for suggestions on how to approach an assignment. My professor encourages forums. The assignment is:
Write a Perl program that will take two files from the command line, an enzyme file and a DNA file. Read the file with restriction enzymes and apply the restriction enzymes to the DNA file.
The output will be fragments of DNA arranged in the order they occur in the dna file. The name of the output files should be constructed by appending the name of the restriction enzyme to the name of the DNA file, with an underscore between them.
For example, if the enzyme is EcoRI and the DNA file is named BC161026, the output file should be named BC161026_EcoRI.
My approach is to create a main program and two subs as follows:
Main:
Not sure how to tie my subs together?
Sub program $DNA:
Take a DNA file and remove any new lines to make a single string
Sub program Enzymes:
Read and store the lines from the enzyme file which is from the command line
Parse the file in a way that it separates the enzyme acronym from the position of the cut.
Store the position of the cut as a regular expression in a hash table
Store the name of the acronym in a hash table
Note on enzyme file format:
The enzyme file follows a format known as Staden. Examples:
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
The enzyme acronym consists of the characters before the first slash (AatI, in the first example). The recognition sequence is everything between the first and second slash (AGG'CCT, again, in the first example). The cut point is denoted by an apostrophe in the recognition sequence.
There are standard abbreviations for DNA bases within enzymes, as follows:
R = G or A
B = not A (C or G or T)
etc...
Along with a recommendation for a main chunk, do you see any missing pieces that I've omitted? Can you recommend specific tools that you think would be useful in patching this program together?
Example input enzyme: TryII/RRR'TTT//
Example string to read: CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC
The output should be:
CCCCCCGGG
TTTCCCCCCCCCCCCAAA
TTTCCCCCCCCCCCCAGA
TTTCCCCCCCCCCGAG
TTTCCCCC
Ok, I know I shouldn't just do your homework, but there were some fun tricks to this one, so I played with it. Learn from this, not just copy. I didn't comment very well, so if there is something you don't understand, please ask. There is some slight magic in this that if you haven't covered it in your class, your prof will know, so be sure you understand.
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
my ($enzyme_file, $dna_file);
my $write_output = 0;
my $verbose = 0;
my $help = 0;
GetOptions(
'enzyme=s' => \$enzyme_file,
'dna=s' => \$dna_file,
'output' => \$write_output,
'verbose' => \$verbose,
'help' => \$help
);
$help = 1 unless ($dna_file && $enzyme_file);
help() if $help; # exits
# 'Main'
my $dna = getDNA($dna_file);
my %enzymes = %{ getEnzymes($enzyme_file) }; # A function cannot return a hash, so return a hashref and then store the referenced hash
foreach my $enzyme (keys %enzymes) {
print "Applying enzyme " . $enzyme . " gives:\n";
my $dna_holder = $dna;
my ($precut, $postcut) = ($enzymes{$enzyme}{'precut'}, $enzymes{$enzyme}{'postcut'});
my $R = qr/[GA]/;
my $B = qr/[CGT]/;
$precut =~ s/R/${R}/g;
$precut =~ s/B/${B}/g;
$postcut =~ s/R/${R}/g;
$postcut =~ s/B/${B}/g;
print "\tPre-Cut pattern: " . $precut . "\n" if $verbose;
print "\tPost-Cut pattern: " . $postcut . "\n" if $verbose;
#while(1){
# if ($dna_holder =~ s/(.*${precut})(${postcut}.*)/$1/ ) {
# print "\tFound section:" . $2 . "\n" if $verbose;
# print "\tRemaining DNA: " . $1 . "\n" if $verbose;
# unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $2;
# } else {
# unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $dna_holder;
# print "\tNo more cuts.\n" if $verbose;
# print "\t" . $_ . "\n" for @{ $enzymes{$enzyme}{'cut_dna'} };
# last;
# }
#}
unless ($dna_holder =~ s/(${precut})(${postcut})/$1'$2/g) {
print "\tHas no effect on given strand\n" if $verbose;
}
@{ $enzymes{$enzyme}{'cut_dna'} } = split(/'/, $dna_holder);
print "\t$_\n" for @{ $enzymes{$enzyme}{'cut_dna'} };
writeOutput($dna_file, $enzyme, $enzymes{$enzyme}{'cut_dna'}) if $write_output; # Note that $enzymes{$enzyme}{'cut_dna'} is an arrayref already
print "\n";
}
sub getDNA {
my ($dna_file) = @_;
open(my $dna_handle, '<', $dna_file) or die "Cannot open file $dna_file";
my @dna_array = <$dna_handle>;
chomp(@dna_array);
my $dna = join('', @dna_array);
print "Using DNA:\n" . $dna . "\n\n" if $verbose;
return $dna;
}
sub getEnzymes {
my ($enzyme_file) = @_;
my %enzymes;
open(my $enzyme_handle, '<', $enzyme_file) or die "Cannot open file $enzyme_file";
while(<$enzyme_handle>) {
chomp;
if(m{([^/]*)/([^']*)'([^/]*)//}) {
print "Found Enzyme " . $1 . ":\n\tPre-cut: " . $2 . "\n\tPost-cut: " . $3 . "\n" if $verbose;
$enzymes{$1} = {
precut => $2,
postcut => $3,
cut_dna => [] #Added to show the empty array that will hold the cut DNA sections
};
}
}
print "\n" if $verbose;
return \%enzymes;
}
sub writeOutput {
my ($dna_file, $enzyme, $cut_dna_ref) = @_;
my $outfile = $dna_file . '_' . $enzyme;
print "\tSaving data to $outfile\n" if $verbose;
open(my $outfile_handle, '>', $outfile) or die "Cannot open $outfile for writing";
print $outfile_handle $_ . "\n" for @{ $cut_dna_ref };
}
sub help {
my $filename = (split('/', $0))[-1];
my $enzyme_text = <<'END';
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
TryII/RRR'TTT//
Test/AAA'TTT//
END
my $dna_text = <<'END';
CCCCCCGGGTTTCCCCCCC
CCCCCAAATTTCCCCCCCCCCCCAGATTTC
CCCCCCCCCGAGTTTCCCCC
END
print <<END;
Usage:
$filename --enzyme (-e) <enzyme-filename> --dna (-d) <dna-filename> [options] (files may come in either order)
$filename -h (shows this help)
Options:
--verbose (-v) print additional (debugging) information
--output (-o) output final data to files
Files:
The DNA file contains one DNA string which may be broken over many lines. E.G.:
$dna_text
The enzymes file constains enzyme definitions, one per line. E.G.:
$enzyme_text
END
exit;
}
Edit: Added cut_dna initialization explicitly because this is the final result holder for each enzyme, so I thought it would be good to see it more clearly.
Edit 2: Added output routine, call, flag and corresponding help.
Edit 3: Changed main routine to incorporate the best of canavanin's method while removing loops. Now its a global replace to add temporary cut mark (') and then split on cut mark into array. Left old method as comment, new method is the 5 lines following.
Edit 4: Additional test case for writing to multiple files. (Below)
my @names = ('cat','dog','sheep');
foreach my $name (@names) { # $name is lexical, i.e. dies after each loop
open(my $handle, '>', $name); # open a lexical handle for the file, also dies each loop
print $handle $name; # write to the handle
# $handle closes automatically when it "goes out of scope"
}
Note that in Enzymes, when you store an enzyme in the hash, the name of the enzyme should be the key and the site should be the value (since in principle two enzymes could have identical sites).
In the Main routine, you can iterate through the hash; for each enzyme produce one output file. The most direct way is to translate the site to a regex (by means of other regexes) and apply it to the DNA sequence, but there are other ways. (It is probably worth splitting this off into at least one other sub.)
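A rough sketch of that translation step (hypothetical; only the R and B codes from the question are shown):

```perl
use strict;
use warnings;

# Turn a Staden recognition sequence such as RRR'TTT into a regex,
# capturing the parts before and after the cut point (').
my %iupac = (R => '[GA]', B => '[CGT]');   # extend with the remaining codes
my $site = "RRR'TTT";
(my $pattern = $site) =~ s/([RB])/$iupac{$1}/g;
$pattern =~ s/(.+)'(.+)/($1)($2)/;
print "$pattern\n";                        # prints ([GA][GA][GA])(TTT)
```

A global substitution with that pattern can then mark every cut site in the DNA string, as the answers here demonstrate.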
Here is how I have gone about trying to solve the problem (code below).
1) The file names are picked up from the arguments and respective filehandles are created.
2) A new file handle is created for the output file, which is named in the specified format
3) The "cut points" are extracted from the first file
4) The DNA Sequences in the second file are looped over the cut points obtained in step 3.
#!/usr/bin/perl
use strict;
use warnings;
my $file_enzyme=$ARGV[0];
my $file_dna=$ARGV[1];
open DNASEQ, ">$file_dna"."_"."$file_enzyme";
open ENZYME, "$file_enzyme";
open DNA, "$file_dna";
while (<ENZYME>) {
chomp;
if( /'(.*)\/\//) { # Extracts the cut point between ' & // in the enzyme file
my $pattern=$1;
while (<DNA>) {
chomp;
#print $pattern;
my @output = split /$pattern/, $_;
print DNASEQ shift(@output), "\n"; # first recognized sequence being output
foreach (@output) {
print DNASEQ "$pattern$_\n"; #prefixing the remaining sequences with the cut point pattern
}
}
}
}
close DNA;
close ENZYME;
close DNASEQ;
I know there have been several answers already, but hey... I just felt like trying my luck, so here's my suggestion:
#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;
my ($enz_file, $dna_file);
GetOptions( "e=s" => \$enz_file,
"d=s" => \$dna_file,
);
if (! $enz_file || ! $dna_file) {
# some help text
print STDERR<<EOF;
Usage: restriction.pl -e enzyme_file -d DNA_file
The enzyme_file should contain one enzyme entry per line.
The DNA_file may contain the sequence on one single or on
several lines; all lines will be concatenated to yield a
single string.
EOF
exit();
}
my %enz_and_patterns; # stores enzyme name and corresponding pattern
open ENZ, "<$enz_file" or die "Could not open file $enz_file: $!";
while (<ENZ>) {
if (m#^(\w+)/([\w']+)//$#) {
my $enzyme = $1; # could also remove those two lines and use
my $pattern = $2; # the match variables directly, but this is clearer
$enz_and_patterns{$enzyme} = $pattern;
}
}
close ENZ;
my $dna_sequence;
open DNA, "<$dna_file" or die "Could not open file $dna_file: $!";
while (my $line = <DNA>) {
chomp $line;
$dna_sequence .= $line; # append the current bit to the sequence
# that has been read so far
}
close DNA;
foreach my $enzyme (keys %enz_and_patterns) {
my $dna_seq_processed = $dna_sequence; # local copy so that we retain the original
# now translate the restriction pattern to a regular expression pattern:
my $pattern = $enz_and_patterns{$enzyme};
$pattern =~ s/R/[GA]/g; # use character classes
$pattern =~ s/B/[^A]/g;
$pattern =~ s/(.+)'(.+)/($1)($2)/; # remove the ', but due to the grouping
# we "remember" its position
$dna_seq_processed =~ s/$pattern/$1\n$2/g; # in effect we are simply replacing
# each ' with a newline character
my $outfile = "${dna_file}_$enzyme";
open OUT, ">$outfile" or die "Could not open file $outfile: $!";
print OUT $dna_seq_processed , "\n";
close OUT;
}
I've tested my code with your TryII example, which worked fine.
As this is a task which can be handled by writing just a few lines of non-repetitive code I didn't feel creating separate subroutines would have been justified. I hope I will be forgiven... :)