Perl multiple line matching while reading from file, line by line - regex

Let's say I open a file like this:
#!/usr/bin/perl
open FILE, "8882099";
while ($line = <FILE>) {
if ($line =~ /accepted by(.*?)\./s) {
print "accepted by: $1";
}
}
The regex itself works, but since the file is read line by line, how should I go about matching a string that continues onto a new line?
Thank you

It's often simplest just to read the entire file at once.
my $file;
{
local $/;
$file = <$fh>;
}
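As a minimal sketch of how the slurped string could then be searched: the sub name find_acceptors and the sample text are made up for illustration, and an in-memory filehandle stands in for the real file. The pattern uses \s+ so that "accepted by" may itself be split across lines, which is an assumption about the input rather than something the question states.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp everything from a filehandle and return
# every name that follows "accepted by", even when it spans lines.
sub find_acceptors {
    my ($fh) = @_;
    my $text = do { local $/; <$fh> };   # undefined $/ => read whole file
    return unless defined $text;
    my @names;
    # \s+ lets "accepted by" itself be split across lines;
    # [^.]+ also matches newlines, so no /s modifier is needed
    while ($text =~ /accepted\s+by\s+([^.]+)\./g) {
        (my $name = $1) =~ s/\s+/ /g;    # fold line breaks into spaces
        push @names, $name;
    }
    return @names;
}

# Made-up sample data, read through an in-memory filehandle
my $sample = "accepted by Tim.\naccepted by The\nFinancial\nDirector.\nToday\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "accepted by: $_\n" for find_acceptors($fh);
close $fh;
```

Slurping trades memory for simplicity, so this is only reasonable when the file comfortably fits in memory.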

First of all, you must always use strict and use warnings at the top of every program. That way simple mistakes you would otherwise overlook will be pointed out to you.
Secondly, you should use lexical file handles, the three-parameter form of open, and always check the status of every open call.
To solve your problem, you should simply look for a line that contains the prefix accepted by, and then append lines to the string you have found until you see the complete string match. It is also better to use an explicit [^.]+ than the non-greedy .*? to avoid backtracking.
Note that you should reinstate the file open that I have commented out, and remove the assignment to $file, as I have written the program this way for test purposes.
This solution will have a problem if accepted by is split over multiple lines. If you expect this then something slightly different will have to be coded.
use strict;
use warnings;
# open my $file, '<', '8882099' or die $!;
my $file = \*DATA;
my $line;
while ($line = <$file>) {
if ($line =~ /accepted by/) {
$line .= <$file> until $line =~ /accepted by\s*([^.]*)\./;
print "accepted by: $1\n";
}
}
__DATA__
accepted by Tim.
accepted by The
Financial
Director.
Today
output
accepted by: Tim
accepted by: The
Financial
Director

Related

Perl capture and add to end of string

I have a file with a lot of lines like this:
ChrVIII_A_nidulans_FGSC_A4 AspGD gene 3861520 3863875 . + . ID=AN0338;Name=AN0338;Gene=CYP680A1;Note=Putative%20cytochrome%20P450;orf_classification=Uncharacterized;Alias=ANIA_00338,ANID_00338
My region of interest is ;Gene=_____; -- the stuff between the = and ;.
If this region exists, I want to append it to the end of the line with a , attached to the front. If it does not exist I want to print the line anyway!
ChrVIII_A_nidulans_FGSC_A4 AspGD gene 3861520 3863875 . + . ID=AN0338;Name=AN0338;Gene=CYP680A1;Note=Putative%20cytochrome%20P450;orf_classification=Uncharacterized;Alias=ANIA_00338,ANID_00338,CYP680A1
This is what I tried in Perl and I don't know why it doesn't work.
use strict;
use warnings;
open(SOURCE,"<annotation.gff") or die "Source file not found!\n";
my $line1;
foreach $line1(<SOURCE>) #iterating over SOURCE file
{
if($line1=~/Gene\=([a-zA-Z0-9\-]+)\;/)
printf "$line1,$1";
}
else {printf "$line1";}
}
Can anyone show me what I am doing wrong?
Let's go through your code:
use strict;
use warnings;
Good. However, trying to run your code gives:
syntax error at ss.pl line 9, near ")
printf"
syntax error at ss.pl line 11, near "else"
which means you did not post the code you ran, so we can't really trust it. Don't do that. Reduce your problem to a small, self-contained script others can run.
open(SOURCE,"<annotation.gff") or die "Source file not found!\n";
Don't use bareword filehandles such as SOURCE. Instead, use lexical filehandles.
Don't hard code the name of the file you are trying to open. Doing so makes it hard to accurately convey the name of the file your program failed to open in case of a failure.
In the error message, include the actual error your program encountered, rather than hardcoding your unwarranted assumptions.
Don't use the two argument form of open, especially if you are going to want the flexibility to specify file names as command line arguments instead of having to edit the script every time you get a new input file. That is, use
my $annotation_file = 'annotation.gff';
open my $source, '<', $annotation_file
or die "Failed to open annotation source '$annotation_file': $!";
Don't declare the iteration variable for a loop outside the scope of the loop. That is, instead of:
my $line1;
foreach $line1 ( ... )
use
foreach my $line1 ( ... )
But, of course, you should not use a for loop to iterate over the contents of a file because doing so makes your program slurp (i.e. read the entire contents of) the file into memory as a list of lines. This makes the memory footprint of your program depend on the size of its input instead of the size of the longest line. Also, drop the 1 suffix: You are iterating through every line in the file, not just the first one.
while (my $line = <$source>) {
Don't use printf if you are just printing plain strings. That is, instead of printf "$line1,$1", use print "$line,$1\n".
And, that brings us to another problem. When you read the line, you never remove the newline off its end. Therefore, the string you print is "...\n..." which creates the effect of prepending the captured string to the beginning of the following line.
That brings us to something that works:
use strict;
use warnings;
my $annotation_file = 'annotation.gff';
open my $source, '<', $annotation_file
or die "Cannot open annotation source '$annotation_file': $!";
while (my $line = <$source>) {
if( $line =~ /Gene = ( [^;]+ ) ;/x ) {
chomp $line;
print join(',' => $line, $1), "\n";
}
else {
print $line;
}
}
Try this:
use strict;
use warnings;
open(my $fh, '<', 'annotation.gff') or die $!;
while (<$fh>) {
chomp;
/Gene=([a-zA-Z0-9\-]+)\;/ and $_ .= ",$1";
print "$_\n";
}
close $fh;

perl regex: searching thru entire line of file

I'm a regex newbie, and I am trying to use a regex to return a list of dates from a text file. The dates are in mm/dd/yy format, so the year would be '55' for '1955', for example. I am trying to return all entries from years '50' to '99'.
I believe the problem I am having is that once my regex finds a match on a line, it stops right there and jumps to the next line without checking the rest of the line. For example, I have the dates 12/12/12, 10/10/57, 10/09/66 all on one line in the text file, and it only returns 10/10/57.
Here is my code thus far. Any hints or tips? Thank you
open INPUT, "< dates.txt" or die "Can't open input file: $!";
while (my $line = <INPUT>){
if ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
A few points about your code
You must always use strict and use warnings 'all' at the top of all your Perl programs
You should prefer lexical file handles and the three-parameter form of open
If your regex pattern contains literal slashes then it is clearest to use a non-standard delimiter so that they don't need to be escaped
Although recent releases of Perl have fixed the issue, there used to be a significant performance hit when using $&, so it is best to avoid it, at least for now. Put capturing parentheses around the whole pattern and use $1 instead
This program will do as you ask
use strict;
use warnings 'all';
open my $fh, '<', 'dates.txt' or die "Can't open input file: $!";
while ( <$fh> ) {
print $1, "\n" while m{(\d\d/\d\d/[5-9][0-9])}g
}
output
10/10/57
10/09/66
You are printing $& which gets updated whenever any new match is encountered.
But in this case you need to keep all the previous matches as well as the latest one, so you can use an array to store all the matches.
while(<$fh>) {
@dates = $_ =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g;
print "@dates\n" if(@dates);
}
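One detail worth knowing about the array approach: when a pattern has several capture groups, a /g match in list context returns every group as a separate element, so month, day and year arrive as individual items. A minimal sketch with a single capture group, which returns one element per date, might look like this (the sub name late_dates is hypothetical, and the input line is the one from the question):

```perl
use strict;
use warnings;

# Hypothetical helper: return every mm/dd/yy date in a line whose
# two-digit year is 50-99. With a single capture group, a /g match in
# list context returns one element per date.
sub late_dates {
    my ($line) = @_;
    my @dates = $line =~ m{(\d\d/\d\d/[5-9]\d)}g;
    return @dates;
}

my @found = late_dates("12/12/12, 10/10/57, 10/09/66");
print "$_\n" for @found;   # prints 10/10/57 and 10/09/66, one per line
```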
You just need to change the 'if' to a 'while' and the regex will take up where it left off:
open INPUT, "< a.dat" or die "Can't open input file: $!";
while (my $line = <INPUT>){
while ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
# Output given line above
# 10/10/57
# 10/09/66
You could also capture the whole of the date into one capture variable and use a different regex delimiter to save escaping the slashes:
while ($line =~ m|(\d\d/\d\d/[5-9]\d)|g) {
print "$1\n" ;
}
...but that's a matter of taste, perhaps.
You can use map also to get year range 50 to 99 and store in array
open INPUT, "< dates.txt" or die "Can't open input file: $!";
@as = map{$_ =~ m/\d\d\/\d\d\/[5-9][0-9]/g} <INPUT>;
$, = "\n";
print @as;
Another way around it is removing the dates you don't want.
$line =~ s/\d\d\/\d\d\/[0-4]\d//g;
print $line;

open file for each element in array and check against a regex Perl

I have an array filled with 4 digit numbers (@nums) that correspond
to conf files which use the numbers as the file name, like so: 0000.conf
I am reading a file foreach element in the array and checking
the file for a pattern like this :
use strict;
use warnings;
foreach my $num (@nums) {
open my $fh, "<", "$num.conf"
or warn "cannot open $num.conf : $!";
while(<$fh>) {
if (/^SomePattern=(.+)/) {
print "$num : $1\n";
}
}
}
I am extracting the part of the pattern I want using () and the
special var $1.
This seems to be working except it only prints the results of the last file
that is opened, instead of printing the results each time the foreach loop
passes and opens a file, which is what I expected.
I am still learning Perl, so any detailed explanations of what I am missing here
will be greatly appreciated.
use v5.16;
use strict;
use warnings;
my @nums = qw/ 0000 0200 /;
for my $num (@nums){
open my $fh, "<", "$num.conf" or die;
while (<$fh>) {
chomp;
if( /^somePattern=(.+)/ ) {
say "$1";
}
}
close $fh;
}
This seems to be working for me. You are missing the close $fh; in your code, so maybe that is what is wrong. Secondly, maybe only one of your files matches your regex; check the content for typos. I don't use foreach myself, so maybe you are missing 'my' before $num. Depending on your regex, it might be useful to trim newline characters from the end of each line with 'chomp'.
Your code is excellent for a learner.
The problem is that, using "$num.conf", you are trying to open files named 0.conf etc. instead of 0000.conf.
You should also use the value of $! in your die string so that you know why the open failed.
Write this instead
my $file = sprintf '%04d.conf', $num;
open my $fh, '<', $file or die "Unable to open '$file': $!";
I have left my previous answer as it may be useful to someone. But I missed your opening "I have an array filled with 4 digit numbers".
Doubtless you are populating your array wrongly.
If you are reading from a file then most usually you have forgotten to chomp the newline from the end of the lines you have read.
You may also have non-printable characters (usually tabs or spaces) in each number.
You should use Data::Dumper or the better and more recent Data::Dump to reveal the contents of your array.
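As a rough sketch of that check with Data::Dumper: setting $Data::Dumper::Useqq makes the dump show newlines and tabs as escapes, so stray whitespace becomes visible. The array contents and the helper clean_num are made up for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Show control characters as escapes (\n, \t) instead of raw bytes
$Data::Dumper::Useqq = 1;

# Hypothetical helper: strip all leading and trailing whitespace
sub clean_num {
    my ($raw) = @_;
    (my $clean = $raw) =~ s/^\s+|\s+$//g;
    return $clean;
}

# Made-up values of the kind left behind by a forgotten chomp
my @nums = ("0000\n", "0200 ", "\t0300");
print Dumper(\@nums);

# After cleaning, each element should be a bare four-digit string
my @clean = map { clean_num($_) } @nums;
print Dumper(\@clean);
```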

help with perl code to parse a file

I am new to Perl and have a question about the syntax. I received this code for parsing a file containing specific information. I was wondering what the if (/DID/) part of the subroutine get_number is doing? Is this leveraging regular expressions? I'm not quite sure because regular-expression matches look like $_ =~ /some expression/. Finally, is the while loop in the get_number subroutine necessary?
#!/usr/bin/env perl
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
# store the name of all the OCR file names in an array
my @file_list=qw{
blah.txt
};
# set the scalar index to zero
my $file_index=0;
# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
or die "Can't open output file\n";
while($file_index < 1){
# open the OCR file and store it in the filehandle IN_FILE
open(IN_FILE, '<', "$file_list[$file_index]")
or die "Can't read source file!\n";
print "Processing file $file_list[$file_index]\n";
while(<IN_FILE>){
my $citing_pat=get_number();
get_country($citing_pat);
}
$file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;
The definition of get_number is below.
sub get_number {
while(<IN_FILE>){
if(/DID/){
my @fields=split / /;
chomp($fields[3]);
if($fields[3] !~ /\D/){
return $fields[3];
}
}
}
}
Perl has a variable $_ that is sort of the default dumping ground for a lot of things.
In get_number, while(<IN_FILE>){ is reading a line into $_, and the next line is checking if $_ matches the regular expression DID.
It's also common to see chomp; which also operates on $_ when no argument is given.
In that case, if (/DID/) by default searches the $_ variable, so it is correct. However, it is a rather loose regex, IMO.
The while loop in the sub may be necessary, it depends on what your input looks like. You should be aware that the two while loops will cause some lines to get completely skipped.
The while loop in the main program will take one line, and do nothing with it. Basically, this means that the first line in the file, and every line directly following a matching line (e.g. a line that contains "DID" and the 4th field is a number), will also be discarded.
In order to answer that question properly, we'd need to see the input file.
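To see the skipping in isolation, here is a simplified sketch of the two nested read loops. It is an assumption-laden reduction: the inner loop stops at any line containing DID (the real get_number also requires a numeric fourth field), the sample data is invented, and skipped_lines is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical reduction of the two nested read loops in the question:
# returns the lines the outer loop reads and then throws away.
sub skipped_lines {
    my ($fh) = @_;
    my @skipped;
    while (my $outer = <$fh>) {          # outer loop consumes a line...
        chomp $outer;
        push @skipped, $outer;           # ...that nothing ever inspects
        while (my $inner = <$fh>) {      # inner loop, as in get_number
            last if $inner =~ /DID/;     # stop at the first matching line
        }
    }
    return @skipped;
}

# Made-up input, read through an in-memory filehandle
my $data = "header\nDID 1\nDID 2\nother\nDID 3\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print "never examined: $_\n" for skipped_lines($fh);
```

With this input the lines "header" and "DID 2" are consumed by the outer loop and never tested against the pattern.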
There are a number of issues with this code, and if it works as intended, it's probably due to a healthy amount of luck.
Below is a cleaned up version of the code. I kept the modules in, since I do not know if they are used elsewhere. I also kept the output file, since it might be used somewhere you have not shown. This code will not attempt to use undefined values for get_country, and will simply do nothing if it does not find a suitable number.
use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
my @file_list=qw{ blah.txt };
open(my $outfile, '>', 'outputfile.txt') or die "Can't open output file: $!";
for my $file (@file_list) {
open(my $in_file, '<', $file) or die "Can't read source file: $!";
print "Processing file $file\n";
while (my $citing_pat = get_number($in_file)) {
get_country($citing_pat);
}
}
close $outfile;
sub get_number {
my $fh = shift;
while(<$fh>) {
if (/DID/) {
my $field = (split)[3];
if($field =~ /^\d+$/){
return $field;
}
}
}
return undef;
}

How to return only lines that do not match any values of an array?

I'm attempting to compare each line in a CSV file to each and every element (strings) I have stored in an array using Perl. I want to return/print-to-file the line from the CSV file only if it is not matched by any of the strings in the array. I've tried numerous kinds of loops to achieve this, but have not only not found a solution, but none of my attempts is really giving me clues as to where I'm going wrong. Below are a few samples of the loops I've tried:
while (<CSVFILE>) {
foreach $i (@lines) {
print OUTPUTFILE $_ if $_ !~ m/$i/;
}; #foreach
}; #while
AND:
foreach $i (@lines) {
open (CSVFILE , "< $csv") or die "Can't open $csv for read: $!";
while (<CSVFILE>) {
if ($_ !~ m/$i/) {
print OUTPUTFILE $_;
}; #if
}; #while
close (CSVFILE) or die "Cannot close $csv: $!";
}; #foreach
Here is a sample of the CSV file I am attempting:
1,c.03_05delAAG,null,71...
2,c.12T>G,null,24T->G,5...
3,c.87C>T,null,96C->T,82....
And the array elements (with regex escape characters):
c\.12T\>G
c\.97A\>C
Assuming only the above as input data, I would hope to get back:
1,c.03_05delAAG,null,71...
3,c.87C>T,null,16C->T....
since they do not contain any of the elements from the array. Is this a situation where hashes come into play? I don't have a great handle on them yet, aside from the standard "dictionary" definition. If anyone could help me get my head around this problem it would be greatly appreciated. At this point I might just do it manually, as there aren't that many and I need this out of the way ASAP, but since I wasn't able to find any answers searching anywhere else I figured it was worthwhile asking.
Use Perl 5.10.1 or better, so you can apply smart matching.
Also, don't use the implicit $_ when you're dealing with two loops, it gets too confusing and is error prone.
The following code (untested) might do the trick:
use 5.010;
use strict;
use warnings;
use autodie;
...
my @regexes = map { qr{$_} } @lines;
open my $out, '>', $outputfile;
open my $csv, '<', $csvfile;
while (my $line = <$csv>) {
print $out $line unless $line ~~ @regexes;
}
close $csv;
close $out;
The reason your code doesn't work, by the way, is that it will print a line if any of the elements in @lines don't match, and that will always be the case.
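Since the smart match operator has been marked experimental since Perl 5.18 and deprecated in 5.38, here is a sketch of the same filter written with a nested grep instead. The sub name unmatched_lines is hypothetical, and the rows are shortened versions of the sample data in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the rows matched by none of the patterns
sub unmatched_lines {
    my ($patterns, @rows) = @_;
    my @regexes = map { qr{$_} } @$patterns;
    # The inner grep, in boolean context, counts the regexes that match;
    # a row is kept only when that count is zero
    return grep { my $row = $_; !grep { $row =~ $_ } @regexes } @rows;
}

# Patterns as given in the question; rows shortened for illustration
my @patterns = ('c\.12T\>G', 'c\.97A\>C');
my @rows = (
    "1,c.03_05delAAG,null\n",
    "2,c.12T>G,null\n",
    "3,c.87C>T,null\n",
);
print unmatched_lines(\@patterns, @rows);   # rows 1 and 3 survive
```

The key difference from the loops in the question is that the decision to print is made once per row, after every pattern has been tried, rather than once per row and pattern pair.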