I am working on a program that takes user input for two file names. Unfortunately, the program can easily break if the user does not follow the specified format of the input. I want to write code that improves its resiliency against these types of errors. You'll understand when you see my code:
# Ask the user for the filename of the qseq file and barcode.txt file
print "Please enter the name of the qseq file and the barcode file separated by a comma:";
# user should enter filenames like this: sample1.qseq, barcode.txt
# remove the newline from the qseq filename
chomp ($filenames = <STDIN>);
# an empty array
my @filenames;
# remove the ',' and put the files into an array separated by spaces; indexes the files
push @filename, join(' ', split(',', $filenames))
# the qseq file
my $qseq_filename = shift @filenames;
# the barcode file.
my $barcode = shift @filenames;
Obviously this code can run into errors if the user enters the wrong type of filename (.tab file instead of .txt or .seq instead of .qseq). I want code that can do some sort of check to see that the user enters the appropriate file type.
Another error that could break the code is if the user enters too many spaces before the filenames. For example: sample1.qseq,(imagine 6 spaces here) barcode.txt (Notice the numerous spaces after the comma)
Another example: (imagine 6 spaces here) sample1.qseq,barcode.txt (This time notice the number of spaces before the first filename)
I also want lines of code that can remove extra spaces so that the program doesn't break. I think the user input has to be in the following kind of format: sample1.qseq, barcode.txt. The user input has to be in this format so that I can properly index the filenames into an array and shift them out later.
Thanks, any help or suggestions are greatly appreciated!
The standard way to deal with this kind of problem is utilising command-line options, not gathering input from STDIN. Getopt::Long comes with Perl and is serviceable:
use strict; use warnings FATAL => 'all';
use Getopt::Long qw(GetOptions);
my %opt;
GetOptions(\%opt, 'qseq=s', 'barcode=s') or die;
die <<"USAGE" unless exists $opt{qseq} and $opt{qseq} =~ /^sample\d[.]qseq$/ and exists $opt{barcode} and $opt{barcode} =~ /^barcode.*\.txt$/;
Usage: $0 --qseq sample1.qseq --barcode barcode.txt
       $0 -q sample1.qseq -b barcode.txt
USAGE
printf "q==<%s> b==<%s>\n", $opt{qseq}, $opt{barcode};
The shell will deal with any extraneous whitespace, try it and see. You need to do the validation of the file names, I made up something with regex in the example. Employ Pod::Usage for a fancier way to output helpful documentation to your users who are likely to get the invocation wrong.
There are dozens of more advanced Getopt modules on CPAN.
First, put use strict; at the top of your code and declare your variables.
Second, this:
# remove the ',' and put the files into an array separated by spaces; indexes the files
push @filename, join(' ', split(',', $filenames))
is not going to do what you want. split() takes a string and turns it into a list. join() takes a list of items and returns a string. You just want to split:
my @filenames = split(',', $filenames);
That will create an array like you expect.
This function will safely trim white space from the beginning and end of a string:
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}
Access it like this:
my $file = trim(shift @filenames);
Depending on your script, it might be easier to pass the strings as command line arguments. You can access them through the @ARGV array but I prefer to use Getopt::Long:
use strict;
use Getopt::Long;
my ($qseq_filename, $barcode);
GetOptions (
    'q|qseq=s' => \$qseq_filename,
    'b|bar=s' => \$barcode,
);
You can then call this as:
./script.pl -q sample1.qseq -b barcode.txt
And the variables will be properly populated without a need to worry about trimming white space.
You'll need to trim spaces before handling the filename data in your routine. You could check the file extension with yet another regular expression, as nicely described in Is there a regular expression in Perl to find a file's extension?. If it's the actual type of file that matters to you, then it might be more worthwhile to check for that instead with File::LibMagicType.
While I think your design is a little iffy, the following should work:
my @fileNames = split(',', $filenames);
foreach my $fileName (@fileNames) {
    if ($fileName =~ /\s/) {
        print STDERR "Invalid filename.";
        exit -1;
    }
}
my ($qsec, $barcode) = @fileNames;
And here is one more way you could do it with regex (if you are reading the input from STDIN):
# read a line from STDIN
my $filenames = <STDIN>;
# parse the line with a regex or die with an error message
my ($qseq_filename, $barcode) = $filenames =~ /^\s*(\S.*?)\s*,\s*(\S.*?)\s*$/
or die "invalid input '$filenames'";
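To tie the trimming and extension-checking suggestions together, here is a minimal sketch, assuming the input stays in the "qseq file, barcode file" form from the question. The helper name parse_filenames is made up for illustration, and the accepted extensions (.qseq and .txt) are taken from the question rather than from any real specification.

```perl
use strict;
use warnings;

# Hypothetical helper: split "a.qseq, b.txt" on the comma, trim each piece,
# and check the extensions. Returns both names, or an empty list on bad input.
sub parse_filenames {
    my ($input) = @_;
    my @names = split /,/, $input, -1;
    return unless @names == 2;
    for (@names) {
        s/^\s+//;    # trim leading whitespace
        s/\s+$//;    # trim trailing whitespace (this also removes the newline)
    }
    my ($qseq, $barcode) = @names;
    return unless $qseq    =~ /\.qseq\z/i;
    return unless $barcode =~ /\.txt\z/i;
    return ($qseq, $barcode);
}

# Demo: one sloppy but valid input, one with the wrong extensions
for my $input ("   sample1.qseq,      barcode.txt\n", "sample1.seq,barcode.tab\n") {
    if (my ($qseq, $barcode) = parse_filenames($input)) {
        print "ok: <$qseq> <$barcode>\n";
    }
    else {
        print "rejected\n";
    }
}
```

In a real script you would call parse_filenames on the line read from STDIN and re-prompt or die when it returns nothing.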
This code grabs a keyword 'fun' from text files that I have and then prints the 20 characters before and after the keyword. However, I also want it to print the previous 2 lines and the next two lines, and I'm not sure how to do that. I wasn't sure if it is easier to change the code with this or just read the whole file at one time.
{my $inputfile = "file";
$searchword = 'fun';
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while ($line1=<INPUT>)
{
#read in a line of the file
if ($line1 =~m/$searchword/i)
{print "searchword found\n";
$keepline = $line1;
$goodline =1;
$keepline =~/(.{1,20})(fun)(.{1,20})/gi;
}
if ($goodline==1)
{
$goodline =0;
}
}
Your code as it stands seems to:
Take 20 chars each side of 'pledge', not $searchword;
Have an unmatched '{' at the start;
Not print any file contents, save for &write_excel, which we can't examine; and
Have a logic problem in that if $searchword is found, $goodline is unconditionally set to '1', then tested to see if it's '1', and finally reset to '0'.
Putting that aside, the question as to whether to read in the whole file depends on your circumstances somewhat: how big are the files you're going to be searching, does your machine have plenty of memory, is the machine a shared resource, and so on. I'm going to presume you can read in the whole file as that's the more common position in my experience (those who disagree please keep in mind (a) I've acknowledged that it's debatable; and (b) it's very dependent on the circumstances that only the OP knows).
Given that, there are several ways to read in a whole file but the consensus seems to be to go with the module File::Slurp. Given those parameters, the answer looks like this:
#!/usr/bin/env perl
use v5.12;
use File::Slurp;
my $searchword = 'fun';
my $inputfile = "file.txt";
my $contents = read_file($inputfile);
my $line = '\N*\n';
if ( $contents =~ /(
    $line?
    $line?
    \N* $searchword \N* \n?
    $line?
    $line?
)/x) {
    say "Found:\n" . $1 ;
}
else {
    say "Not found."
}
File::Slurp prints a reasonable error message if the file isn't present (or something else goes wrong), so I've left out the typical or die.... Whenever working with regexes, particularly if you're trying to match stuff on multiple lines, it pays to use "extended mode" (by putting an 'x' after the final '/') to allow insignificant whitespace in the regex. This allows a clearer layout.
I've also separated out the definition of a line for added clarity; it consists of zero or more non-newline characters, \N*, followed by a newline, \n. However, if your target is on the first, second, second-last or last line I presume you still want the information, so the requested preceding and following pairs of lines are optionally matched with $line?.
Please note that regular expressions are pedantic and there are inevitably 'fine details' that affect the definition of a successful match vs an unwanted match, i.e. don't expect this to do exactly what you want in all circumstances. Expect that you'll have to experiment and tweak things a bit.
I'm not sure I understand your code block (what purpose does "pledge" have? what is &write_excel?), but I can answer your question itself.
First, is this grep command acceptable? It's far faster and cleaner:
grep -i -C2 --color "fun" "file"
The -C NUM flag tells grep to provide NUM lines of context surrounding each pattern match. Obviously, --color is optional, but it may help you find the matches on really long lines.
Otherwise, here's a bit of perl:
my $searchword = "fun";
my $inputfile = "file";
my $blue = "\e[1;34m"; # change output color to blue
my $green = "\e[1;32m"; # change output color to green
my $nocolor = "\e[0;0m"; # reset output to no color
my $prev1 = my $prev2 = my $result = "";
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while(<INPUT>) {
    if (/$searchword/i) {
        $result .= $prev2 . $prev1 . $_; # pick up last two lines
        $prev2 = $prev1 = ""; # prevent reusing last two lines
        for (1..2) { # for two more non-matching lines
            while (<INPUT>) { # parse them to ensure they don't match
                $result .= $_; # pick up this line
                last unless /$searchword/i; # reset counting if it matched
            }
        }
    } else {
        $prev2 = $prev1; # save last line as $prev2
        $prev1 = $_; # save current line as $prev1
    }
}
close INPUT; # close the filehandle, not the filename
exit 1 unless $result; # return with failure if without matches
$result =~ # add colors (okay to remove this line)
print "$result"; # print the result
print "\n" unless $result =~ /\n\Z/m; # add newline if there wasn't already one
Bug: this assumes that the two lines before and the two lines after are actually 20+ characters. If you need to fix this, it goes in the else stanza.
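For comparison, here is a minimal sketch of the same two-lines-of-context idea written as a reusable function, assuming the file is read line by line rather than slurped. The name grep_context and its parameters are hypothetical; it mimics what grep -C does but is not a drop-in replacement, and it leaves out the colouring and the 20-character trimming.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line within $context lines of a match,
# in file order, without repeating lines when matches sit close together.
sub grep_context {
    my ($fh, $pattern, $context) = @_;
    my @before;       # sliding window of recent lines not yet returned
    my $after = 0;    # how many trailing context lines are still owed
    my @out;
    while (my $line = <$fh>) {
        if ($line =~ $pattern) {
            push @out, @before, $line;
            @before = ();
            $after  = $context;
        }
        elsif ($after > 0) {
            push @out, $line;
            $after--;
        }
        else {
            push @before, $line;
            shift @before if @before > $context;
        }
    }
    return @out;
}

# Demo on an in-memory file: four plain lines, a match, four more plain lines
my $text = join '', map { "line $_\n" } 1 .. 4;
$text .= "this is fun\n";
$text .= join '', map { "line $_\n" } 6 .. 9;
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @hits = grep_context($fh, qr/fun/i, 2);
close $fh;
print @hits;
```

The demo prints lines 3 and 4, the matching line, then lines 6 and 7. Because only a small window is kept in memory, this works the same way on files too large to slurp.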
I am a first year grad student who's relatively new in computational biology. I recently started using Perl and it's not the easiest language to learn, at least not for me.
I need help applying my idea/logic the right way to figure out the solution to my problem.
I have a dna string and I want to split it at specific sites to get multiple fragments using information from an enzyme file that contains lines of recognition sites. Once the fragments are obtained, I want to output the list of dna fragments in an output file. I want to create an output file for every line in the enzyme file I am going to extract the information from, to apply it to the dna string.
Here's what I mean exactly:
Hypothetical scenario:
Enzyme.File contains:
abc/at'gtct// (abc is the name of the enzyme. (atgtct) is the recognition site.)
def/cgg'ataaa// ........
Suppose the dna string is: $dna = "accggttatgtctaaacggataaagtctcggataaattt" (recognition sites are bolded)
For line 1
When I extract the info from the first line/enzyme(abc) from the enzyme file and apply it to this string, the output should be:
(split between cgg'ataaa) the apostrophe represents the cut point
(note: Even though there is another gtct in the string, it does not split it because at ought to precede it.)
For line 2
$dna = accggttatgtctaaacggataaagtctcggataaattt (Info is applied to same dna string)
Info from line/enzyme 2 (def) would split the dna as follow:
accggttatgtctaaacgg (split between cgg'ataaa)
I want to put each output from the different lines in separate file with distinct names. (I can take care of assigning the names)
So in conclusion, this example would create two new files, one named "abc_whatever" and one named "def_whatever". Important: if the enzyme file had 8 lines with different enzymes, I would get 8 new output files with their distinct dna fragments.
Here's what I've tried so far:
use warnings;
use strict;
open(ENZ,$ARGV[0]) || die; # ENZ(file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>) {
if ( match pattern etc..) { # I took care of that and created captured groups of
$1 = holds "abc" # the info I needed from the line e.g. I captured
$2 = ..."at" # (abc)/(at)'(gtct)//, so they are stored in $1,$2,$3
$3 = ..."gtct" # respectively
        while (<$dna>){
            my @fragments_array = split(/$3/, $dna);
            open (OutFile, ">$dna"."_"."$1")
            print OutFile shift @fragments_array,"\n";
            foreach (@fragments_array) {
                print OutFile "$3$_\n";
                close OutFile;
            }
        }
    }
}
close ENZ;
I can only create an output for the 1st line in the Enzyme file. I want to create an output file for all the lines.
I am not properly cutting the dna. From other examples I have seen online, it looks like I am gonna have to use the following functions to properly apply the enzyme information on the dna. The functions include:
the for loop, length and substr(),
If you can, please demonstrate your work in the simplest form (no extravagant, impressing codes lol :-) since I am just learning this language)
Thanks in advance!
FIRST I can only create an output only for the 1st line in the Enzyme file. I want to create an output file for all the lines.
That's simply because you put close OutFile; into the foreach (@fragments_array) loop, instead of placing the close after the loop body.
SECOND I am not properly cutting the dna.
That's because you forgot to include $2, the head of the recognition site (e.g. the at of atgtct), in the split pattern as well as in the output.
The problem is solved more easily if we just insert the splitting newline character everywhere between the head and the tail:
use warnings;
use strict;
open(ENZ, $ARGV[0]) || die; # ENZ (file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>)
{
    if (m-(.*)/(.*)'(.*)//-)
    {
        my ($head, $tail) = ($2, $3); # $2$3 is the recognition site; save it
        open(OutFile, ">${dna}_$1");
        (my $fragments = $dna) =~ s/$head$tail/$head\n$tail/g; # insert NLs
        print OutFile $fragments, "\n";
        close OutFile;
    }
}
close ENZ;
I changed your code a bit; I hope it works now.
use warnings;
use strict;
open(ENZ, $ARGV[0]);
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
my ($enzyme, $first, $second) = ("", "", "");
for my $line (<ENZ>) {
    chomp($line); # remove \n at the end of string
    # split string into tokens (e.g. abc/at'gtct// => (abc, at, gtct));
    # split drops the empty trailing fields left by the final "//"
    my @elements = split(/\/|'/, $line);
    my ($firstPart, $secondPart) = ($elements[1], $elements[2]);
    if ($dna =~ /(.*)$firstPart$secondPart(.*)/) {
        $first = $1 . $firstPart;    # everything up to and including the head
        $second = $secondPart . $2;  # the tail and everything after it
        $enzyme = $elements[0];
        open(OUTPUT, ">$enzyme" . "_something");
        print OUTPUT "$first\n$second\n";
        close OUTPUT;
    }
}
close ENZ;
EDIT: this is the working version. I suggest you learn how to use regular expressions if you want to use Perl for your study. They are the strongest tool in Perl.
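As a cross-check on both answers, here is a minimal sketch that cuts at every recognition site and returns the fragments as a list, assuming each enzyme line has the name/head'tail// shape from the question. The helper name digest is invented for illustration; the cut is done with split on a zero-width pattern (lookbehind for the head, lookahead for the tail), which is a different technique from the substitution used above.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $dna at every point that has $head immediately
# before it and $tail immediately after it (the ' in at'gtct).
sub digest {
    my ($dna, $head, $tail) = @_;
    return split /(?<=\Q$head\E)(?=\Q$tail\E)/, $dna;
}

my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
for my $entry ("abc/at'gtct//", "def/cgg'ataaa//") {
    # layout assumed: name / head ' tail //
    my ($name, $head, $tail) = $entry =~ m{^([^/]+)/([^']+)'([^/]+)//} or next;
    print "$name:\n";
    print "  $_\n" for digest($dna, $head, $tail);
}
```

For the sample DNA this gives two fragments for abc and three for def. Writing each list to its own file is then a matter of opening one output handle per enzyme name inside the loop.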
I have a file with strings in each row as follows
"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1
the next line could look like
84545,X,2
I'm trying to parse this text in Perl. Note: quotes are present in the strings when there are several items in a row, but not present if there is only one item
I would like to parse each item into an array. I tried the following regex
@fields = ($_ =~ /(\d+\_\d+),*/g);
but it is missing the last 2714. How do I capture that edge case? Any help appreciated. Thanks in advance
It looks like you have a CSV File, so use an actual CSV parser for it like Text::CSV.
After you parse the columns, you can separate your first field into the array:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1 the next line could look like 84545,X,2};
if ($csv->parse($line)) {
    my @columns = $csv->fields();
    my @nums = split ',', $columns[0];
    print "@nums\n";
}
229269_2 190594_2 94552_2 266076_2 269628_2 165328_2 99319_2 263339_2 263300_2 99315_2 271509_2 2714
Why not a regex ?
Yes, of course it's possible to use a regex for practically anything. But what you need to understand is that this will make your code extremely fragile and difficult to maintain.
Even if you want to use a regular expression, you should STILL do this in two steps. First separate the initial column(s) of your CSV, and then process the specific column that you're worried about.
Because you're just working with the first column, you could use code like the following:
use strict;
use warnings;
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1 the next line could look like 84545,X,2};
if ($line =~ /^"(.*?)"|^([^,]*)/) {
    my $column0 = $1 // $2;
    my @nums = split ',', $column0;
    print "@nums\n";
}
The above happens to accomplish the same thing as the previous code. However, it has one big flaw: it's not nearly as obvious to the maintaining programmer what's going on.
Whenever a new coder, or even yourself in 6 months, views the first set of code, it is extremely obvious what format your data is in. You're working with a CSV file, and the first column is a list separated by commas. The second code also works, but the new maintainer must actually read the regex and figure out what's going on to understand both what format the data is in, and whether the code is actually doing it correctly.
Anyway, do whatever you will, but I strongly advise you to use an actual CSV Parser for parsing csv files.
If all you want is all but the last two fields...
my $string = qq("229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1);
$string =~ s/"//g; # delete the quotes
my @f = split (/,/, $string); # split on the comma
pop @f; pop @f; # jettison the last two columns
# @f contains what you're looking for
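If installing Text::CSV is not an option, a middle ground is the core module Text::ParseWords, whose parse_line function understands quoted fields. The sketch below is only an illustration: first_column_items is a made-up helper name, the sample rows are shortened versions of the ones in the question, and it assumes simple quoting with no embedded newlines.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: return the items packed into the first CSV column,
# whether or not that column is quoted.
sub first_column_items {
    my ($row) = @_;
    chomp $row;
    my @columns = parse_line(',', 0, $row);    # 0 = strip the quotes
    return unless @columns;
    return split /,/, $columns[0];
}

# Demo: one quoted multi-item row and one unquoted single-item row
for my $row (qq{"229269_2,190594_2,2714",A,1\n}, "84545,X,2\n") {
    my @items = first_column_items($row);
    print scalar(@items), " item(s): @items\n";
}
```

This keeps the two-step structure recommended above: first separate the columns, then split the one column you care about.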
scenario: I am a Jr. C# developer, but recently (3 days) began learning Perl for batch files. I have a requirement to parse through a text file, extract some key data, then output the key data to a new text file. As seems to always be the case, there are butt loads of fragmented examples on the net regarding how to 'read' from a file, 'write' to a file, 'store' line by line into an array, 'filter' this and that, yadda yadda, but nothing discussing the entire process of read, filter, write. Trying to splice examples from the net together is no good, because none seem to work together as coherent code. Coming from C#, Perl's syntax structure is hella confusing. I just need some advice on this process.
My objective is to parse a text file, single out all lines similar to the one below, by date, and output only the first 8 digits of the 2nd number group and 5 digits from the 3rd number group to a new text file.
11122 20100223454345 ....random text..... [keyword that identifies all the
entries I need]... random text 0.0034543345
I know regex is likely the best option, and have most of the expression written, but it does not work in Perl!
Question: Could someone please show a simple (dummy) example of how to read from, filter (using dummy regex) the file, then output the (dummy) results to a new file? I'm not concerned with functional details, I can learn those, I just need the syntax structure Perl uses. For example:
open(FH, '<', 'dummy1.txt')
open(NFH, '>', 'dummy2.txt')
@array; or $dumb;
filter each line [REGEX] and shove it into [@array or $dumb scalar]
print(join(',', @array)) to dummy2.txt
close FH;
close NFH;
Note: For various reasons, I cannot paste my source code in here, sorry. Any help is appreciated.
Much thanks to all those who provided insight into my issue. After reading through your replies, as well as conducting further research, I learned that there are dozens of ways to accomplish the same task in Perl (which I am not a fan of). In the end, this is how I solved the problem, and IMO it's the cleanest, and most succinct, solution for those having similar struggles. Thanks again for all the help.
# 1. READ FILE: inputFile.txt
# 2. CREATE FILE: outputFile.txt
# 3. WRITE TO: outputFile.txt IF line matches REGEX constraints
# 4. CLOSE FILES: outputFile.txt & inputFile.txt
$readFile = 'C:/.../.../inputFile.txt';
open(FH, '<', $readFile) or Error("Could not read file ($!)");
$writeFile = 'C:/.../.../outputFile.txt';
open(NFH, '>', $writeFile) or Error("Cannot write to file ($!)");
@lines = <FH>;
LINE: foreach $line (@lines)
{
    if ($line =~ m/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/)
    {
        $date = $1;
        $elapsedtime = $2;
        print NFH "$date,$elapsedtime\n";
    }
}
close NFH;
close FH;
perlfaq5 - How do I change, delete, or insert a line in a file, or append to the beginning of a file? covers most of the different scenarios for how to use files.
However, I will add to that by saying that you should always start your scripts with use strict; and use warnings;, and because you're doing file processing, use autodie; will serve you as well.
With that in mind, a quick stub would be the following:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'dummy1.txt';
open my $outfh, '>', 'dummy2.txt';
while (my $line = <$infh>) {
    chomp $line; # Remove \n
    if (Whatever magically processing here) {
        print $outfh "your new data"; # no comma after the filehandle
    }
}
# variable $_ contains the current line
if(m/regex_goes_here/) { #by default, the regex match operator m// attempts to match the default $_ variable
    #do actions
}
Also note, m/regex/ is the same as /regex/
Refer to:
For capturing variables from regex match, THIS might help
If you want a different variable than the default $_, as @Miller suggested, use while($line = <FH>) followed by if($line =~ m/regex_goes_here/)
=~ is the Binding Operator
One tip. Don't explicitly open filehandles to your input and output files. Instead read from STDIN and write to STDOUT. Your program will be far more flexible and easier to use as you'll be able to treat it like a Unix filter.
$ your_filter_program < your_input.txt > your_output.txt
And doing this actually makes your program simpler to write too.
while (<>) { # <> reads from STDIN
    # transform your data (which is in $_) in some way
    print; # prints $_ to STDOUT
}
You might find the first few chapters of Data Munging with Perl are useful.
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
    FILTER      => qr/regex_for_line_to_filter/,
    INPUT_FILE  => 'input.txt',   # placeholder file name
    OUTPUT_FILE => 'output.txt',  # placeholder file name
};
open my $in_fh, "<", INPUT_FILE;
open my $out_fh, ">", OUTPUT_FILE;
while ( my $line = <$in_fh> ) {
    chomp $line;
    next unless $line =~ FILTER;
    $line =~ s/regular_expression/replacement/;
    say {$out_fh} $line;
}
close $in_fh;
close $out_fh;
The $in_fh is your input file, and $out_fh is your output file. I basically open both, and loop through the input. The chomp removes the \n from the end. I always recommend doing that.
The next goes to the next iteration of the loop unless I match FILTER which is a regular expression matching lines you want to keep. This is identical to:
if ( $line !~ FILTER ) {
    next;
}
I then use the substitution command to get the parts of the line I want, and munge them into the output I want. I may be better off expanding this a bit, maybe using split to split up my line into various pieces, then only using the pieces I want. I could then use substr to pull out the substring from the selected pieces.
The say command is like print except it automatically adds in a NL on the end. This is how you write a line to a file.
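The split and substr idea mentioned above can be sketched as follows, assuming the line layout from the question: whitespace-separated fields, with the date-like number second and the decimal number last. The helper name extract_fields and the literal 'keyword' marker are placeholders, not anything from the asker's real data.

```perl
use strict;
use warnings;

# Hypothetical helper: for a line containing $keyword, return the first 8
# characters of the 2nd whitespace-separated field and the first 5 of the last.
sub extract_fields {
    my ($line, $keyword) = @_;
    return unless index($line, $keyword) >= 0;
    my @fields = split ' ', $line;    # also discards the trailing newline
    return unless @fields >= 3;
    return (substr($fields[1], 0, 8), substr($fields[-1], 0, 5));
}

my @sample = (
    "11122 20100223454345 ....random text..... [keyword here]... random text 0.0034543345\n",
    "11122 30100223454345 ....random text..... [nothing here]... random text 0.124543345\n",
);
for my $line (@sample) {
    my ($date, $value) = extract_fields($line, 'keyword') or next;
    print "$date,$value\n";
}
```

Only the first sample line contains the marker, so the demo prints a single line: 20100223,0.003. In the full program the print would go to the output filehandle instead of STDOUT.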
Now, get Learning Perl and read it. If you know any programming, it shouldn't take you more than a week to go through the first half of the book. That should be more than enough to be able to write a program like this. The more complex stuff like references and object orientation might take a bit longer.
Online documentation can be found at http://perldoc.perl.org. You can look up the use statements, which are called pragmas, over there. Documentation on the individual functions is also available.
If I understood well, this one liner will do the job:
perl -ane 'print substr($F[1],0,8),"\t",substr($F[-1],0,5),"\n" if /keyword/' in.txt
Assuming in.txt is:
11122 20100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.0034543345
11122 30100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.124543345
11122 40100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.65487
11122 50100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.6215
Output:
20100223	0.003
40100223	0.654
I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" every time it encounters the word BGA, but it prints nothing.... Any help would be appreciated.
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
while(my $line = <$info>){
    if ($line eq "/BGA/"){
        print <>,"\n";
    }
}
close $file;
if ($line =~ /BGA\s(\d+)/){
    #your code
    print "BGA number $1 \n";
    #your code
}
And $1 variable will have the number you want
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
    while ( $line =~ /BGA\s(\d+)/g ) {
        print "$1\n";
    }
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory.
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests if the line literally consists of the string "/BGA/". But that can never be true, because the line will at least have the input record separator (i.e. the contents of $/) at the end, since you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere, and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.
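In the meantime, here is a rough sketch of the overall goal (carry the BGA number forward and append it to each data row). Since the real file layout is not shown, it assumes that a data row is any line beginning with a digit, and that the repeated "BGA n" header simply sets the same value twice. The helper name tag_rows is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remember the most recent "BGA <number>" header and
# append that number to every data row that follows it.
# ASSUMPTION: a data row is any line that starts with a digit; the real file
# layout is not shown in the question.
sub tag_rows {
    my ($in_fh, $out_fh) = @_;
    my $bga;
    while (my $line = <$in_fh>) {
        chomp $line;
        if ($line =~ /BGA\s+(\d+)/) {
            $bga = $1;    # seeing the header twice just sets the same value again
        }
        elsif (defined $bga and $line =~ /^\s*\d/) {
            print {$out_fh} "$line\t$bga\n";
        }
    }
    return;
}

# Demo on an in-memory file, writing to STDOUT
my $input = "header BGA 7\n1.0 2.0\nBGA 7\n3.0 4.0\nBGA 12\n5.0 6.0\n";
open my $in, '<', \$input or die "cannot open input: $!";
tag_rows($in, \*STDOUT);
close $in;
```

Reading line by line like this keeps memory use flat, so the size of the file does not matter. For the real data, pass a handle opened on alldata.txt and a second handle opened for writing.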