Perl, problems scrubbing a file with regex efficiently - regex

First I will explain the problem my program attempts to solve. I have two input files, one contains lines of "good" numbers:
100000
100001
100002
100003
100004
The other file contains "raw" numbers. I want to check each line and see whether it contains one of the "good" numbers above followed by 4 more digits; the additional 4 digits can be anything. So if the file containing the raw numbers is:
8881000001234
1000014321
999991000021234567
00234100001
1000041234
100002123
1000029876
after scrubbing with regex the matched numbers would be
1000001234
1000014321
1000021234
1000041234
1000029876
The way I have done this so far is to store the "good" numbers in an array and then slurp the "raw" numbers into a scalar:
my $FH;
my @good_nums;
open $FH, '<', 'good_numbers' or die $!;
while (<$FH>) { chomp; push @good_nums, $_; }
close $FH;
open $FH, '<', 'raw_numbers' or die $!;
my $raw_nums = do { local $/; <$FH> };
close $FH;
then with those I can do this :
my @matches;
foreach my $num (@good_nums) {
    push @matches, $raw_nums =~ /$num\d{4}/g;
}
So @matches contains the correct matches and this has been working well.
But now I also need to capture the lines from the "raw" numbers that did not match. I can capture the non-matching lines by putting the "raw" numbers into an array (instead of slurping them) and joining the @good_nums array into a regex:
my $QRnums = '(?:' . (join '|', @good_nums) . ')';
$QRnums = qr/$QRnums/;
my @raw_nums;
open my $FH, '<', 'raw_numbers' or die $!;
while (<$FH>) { chomp; push @raw_nums, $_; }
my @matches;
my @junk;
for (@raw_nums) {
    if ($_ =~ /($QRnums\d{4})/g) {
        push @matches, $1;
    } else {
        push @junk, $_;
    }
}
This is working, but when I increase the number of lines in each file to 150,000 or more, the latter solution takes 4 or 5 times longer than the former solution. I know there must be another Perl solution that can solve my problem efficiently but I am at a loss. I am not very good with intermediate Perl and beyond. Is there a better way to do this, or can my first solution be rewritten so that I can obtain the non-matches in an array too? Apart from needing to solve the problem explained in the opening of my post, I am open to anything.

Compile (cache) your regex only once, and then compare using $_ =~ $QRnums.
Additionally, there's no need to slurp your other file; just process it line by line instead.
my $QRnums = '(?:' . (join '|', @good_nums) . ')';
$QRnums = qr/($QRnums\d{4})/;
my @matches;
my @junk;
open my $FH, '<', 'raw_numbers' or die $!;
while (<$FH>) {
    chomp;
    if ($_ =~ $QRnums) {
        push @matches, $1;
    } else {
        push @junk, $_;
    }
}
Also, if your regex should be bounded to the start of the string ^, then I would suggest that you add that: $QRnums = qr/^($QRnums\d{4})/;
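As a self-contained check of the approach above, here is a minimal sketch that runs the same logic against the sample data from the question. The in-memory arrays stand in for the two input files, and the quotemeta call is an added precaution (an assumption, not part of the original answer) in case a "good" value ever contains regex metacharacters.

```perl
use strict;
use warnings;

# Sample data from the question; in the real program these come from files.
my @good_nums = qw(100000 100001 100002 100003 100004);
my @raw_nums  = qw(
    8881000001234 1000014321 999991000021234567 00234100001
    1000041234 100002123 1000029876
);

# Build one alternation and compile it once with qr//.
my $alternation = join '|', map { quotemeta } @good_nums;
my $QRnums      = qr/((?:$alternation)\d{4})/;

my (@matches, @junk);
for my $line (@raw_nums) {
    if ($line =~ $QRnums) {
        push @matches, $1;      # the good number plus its 4 trailing digits
    } else {
        push @junk, $line;      # the whole line that failed to match
    }
}

print "matches: @matches\n";
print "junk:    @junk\n";
```

With this data, @matches holds the five numbers listed in the question and @junk holds 00234100001 and 100002123.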
Addendum
From perlop - Regexp Quote-Like Operators
Since Perl may compile the pattern at the moment of execution of the qr() operator, using qr() may have speed advantages in some situations, notably if the result of qr() is used standalone:
and later:
Precompilation of the pattern into an internal representation at the moment of qr() avoids a need to recompile the pattern every time a match /$pat/ is attempted. (Perl has many other internal optimizations, but none would be triggered in the above example if we did not use qr() operator.)
Basically, because your @good_nums list was potentially very large, it made sense to cache the regex so that it only needed to be compiled once.

Related

Why does this regex in perl work for one word but not another?

I'm new to perl so please excuse me if my question seems obvious. I made a small perl script that just examines itself to extract a particular substring I'm looking for and I'm getting results that I can't explain. Here is the script:
use 5.006;
use strict;
use warnings;
use File::Find;
my @files;
find(
    sub { push @files, $File::Find::name unless -d; },
    "."
);
my @filteredfiles = grep(/.pl/, @files);
foreach my $fileName (@filteredfiles)
{
    open (my $fh, $fileName) or die "Could not open file $fileName";
    while (my $row = <$fh>)
    {
        chomp $row;
        if ($row =~ /file/)
        {
            my ($substring) = $row =~ /file\(([^\)]*)\)/;
            print "$substring\n" if $substring;
        }
    }
    close $fh;
}
# file(stuff)
# directory(stuff)
Now, when I run this, I get the following output:
stuff
[^\
Why is it printing the lines out of order? Since the "stuff" line occurs later in the file, shouldn't it print later?
Why is it printing that second line wrong? It should be "\(([^\". It's missing the first 3 characters.
If I change my regex to the following: /directory\(([^\)]*)\)/, I get no output. The only difference is the word. It should be finding the second comment. What is going on here?
use 5.006 is kind of odd if you are just beginning to learn Perl. That is an ancient version.
You should not build a potentially huge list of all files in all locations under the current directory and then filter it. Instead, push only the files you want to the list.
Especially with escaped meta characters, regex patterns can become hard to read very quickly, so use the /x modifier to insert some whitespace into those patterns.
You do not have to match twice: Just check & capture at the same time.
If open fails, include the reason in the error message.
Your second question above does not make sense. You seem to expect your pattern to match the literal string file\(([^\)]*)\)/, but it cannot.
use strict;
use warnings;
use File::Find;
my @files;
find(
    sub {
        return if -d;
        return unless / [.] pl \z/x;
        push @files, $File::Find::name;
    },
    '.',
);
for my $file ( @files ) {
    open my $fh, '<', $file
        or die "Could not open file $file: $!";
    while (my $line = <$fh>) {
        if (my ($substring) = ($line =~ m{ (?:file|directory) \( ([^\)]*) \) }x)) {
            print "$substring\n";
        }
    }
    close $fh;
}
# file(stuff)
# directory(other)
Output:
stuff
other
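To see why the answer filters with / [.] pl \z/x rather than the question's /.pl/, here is a small sketch. The file names are hypothetical and chosen only to show the difference: an unescaped dot matches any character, and without \z the match can occur anywhere in the name.

```perl
use strict;
use warnings;

# Hypothetical file names, not taken from the question.
my @names = ('script.pl', 'apple.txt', 'notes.pl.bak', 'README');

# Unescaped dot: any character followed by "pl", anywhere in the name.
my @loose = grep { /.pl/ } @names;

# Literal dot, and "pl" must be the very end of the name.
my @exact = grep { / [.] pl \z/x } @names;

print "loose: @loose\n";    # loose: script.pl apple.txt notes.pl.bak
print "exact: @exact\n";    # exact: script.pl
```

Any extra file picked up by the looser pattern is opened and scanned too, so it is worth checking which files actually end up in the list.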

Having trouble capturing and storing strings in Perl regular expression?

So I've been working around with this problem for a while now.
I have a file with one hundred FASTA sequences arranged like this:
>gi|192567|gb|AAA37417.1| cystic fibrosis transmembrane conductance regulator [Mus musculus]
MQKSPLEKASFISKLFFSWTTPILRKGYRHHLELSDIYQAPSADSADHLSEKLEREWDREQASKKNPQLIHALRRCFFWRFLFYGILLYLGEVTKAVQPVLLGRIIASYDPENKVERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHRIGMQMRTAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFIWIAPLQVTLLMGLLWDLLQFSAFCGLGLLIILVIFQAILGKMMVKYRDQRAAKINERLVITSEIIDNIYSVKAYCWESAMEKMIENLREVELKMTRKAAYMRFFTSSAFFFSGFFVVFLSVLPYTVINGIVLRKIFTTISFCIVLRMSVTRQFPTAVQIWYDSFGMIRKIQDFLQKQEYKVLEYNLMTTGIIMENVTAFWEEGFGELLQKAQQSNGDRKHSSDENNVSFSHLCLVGNPVLKNINLNIEKGEMLAITGSTGLGKTSLLMLILGELEASEGIIKHSGRVSFCSQFSWIMPGTIKENIIFGVSYDEYRYKSVVKACQLQQDITKFAEQDNTVLGEGGVTLSGGQRARISLARAVYKDADLYLLDSPFGYLDVFTEEQVFESCVCKLMANKTRILVTSKMEHLRKADKILILHQGTSYFYGTFSELQSLRPSFSSKLMGYDTFDQFTEERRSSILTETLRRFSVDDSSAPWSKPKQSFRQTGEVGEKRKNSILNSFSSVRKISIVQKTPLCIDGESDDLQEKRLSLVPDSEQGEAALPRSNMIATGPTFPGRRRQSVLDLMTFTPNSGSSNLQRTRTSIRKISLVPQISLNEVDVYSRRLSQDSTLNITEEINEEDLKECFLDDVIKIPPVTTWNTYLRYFTLHKGLLLVLIWCVLVFLVEVAASLFVLWLLKNNPVNSGNNGTKISNSSYVVIITSTSFYYIFYIYVGVADTLLALSLFRGLPLVHTLITASKILHRKMLHSILHAPMSTISKLKAGGILNRFSKDIAILDDFLPLTIFDFIQLVFIVIGAIIVVSALQPYIFLATVPGLVVFILLRAYFLHTAQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFRRQTYFETLFHKALNLHTANWFMYLATLRWFQMRIDMIFVLFFIVVTFISILTTGEGEGTAGIILTLAMNIMSTLQWAVNSSIDTDSLMRSVSRVFKFIDIQTEESMYTQIIKELPREGSSDVLVIKNEHVKKSDIWPSGGEMVVKDLTVKYMDDGNAVLENISFSISPGQRVGLLGRTGSGKSTLLSAFLRMLNIKGDIEIDGVSWNSVTLQEWRKAFGVITQKVFIFSGTFRQNLDPNGKWKDEEIWKVADEVGLKSVIEQFPGQLNFTLVDGGYVLSHGHKQLMCLARSVLSKAKIILLDEPSAHLDPITYQVIRRVLKQAFAGCTVILCEHRIEAMLDCQRFLVIEESNVWQYDSLQALLSEKSIFQQAISSSEKMRFFQGRHSSKHKPRTQITALKEETEEEVQETRL
I've written a subroutine that opens the file, and reads each sequence one at a time. For each sequence I would like to add both the gi number at the beginning and the long sequence in capital letters as strings to a growing array. However, I'm having trouble writing a regular expression to store these values. Here is my current subroutine, which I tweaked to see if I was actually storing the gi number:
sub getFASTA {
    my ($filename) = @_;
    my @FASTA_arr;
    $/ = "\n\n";
    open (my $fh, '<', $filename) or
        die ("Could not open file: $filename");
    while (<$fh>) {
        chomp $_;
        $_ =~ /^>gi|(\d*?)|/s;
        say "$1";
    }
    close $fh;
    #say join(" ", @FASTA_arr);
}
However, trying to run this returns:
Use of uninitialized value $1 in string at sequenceAlignment.pl line 30, <$fh> chunk 1.
This is returned for each sequence, so 100 times in total.
So any idea of what is wrong? I'm almost certain that it is a problem with the regular expression, because when I changed it to "$_ =~ /(>gi|)/s;", it worked correctly, just with 100 ">gi|"s printing out.
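| means OR (alternation) in a regex. Escape it as \| to match a literal pipe. As written, /^>gi|(\d*?)|/ is an alternation whose first branch, ^>gi, matches without ever reaching the capture group, so $1 stays undefined. (In /(>gi|)/ the alternation sits inside the capture group, so the group always captures something.)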
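A minimal sketch of the escaped version, assuming each record is one header line followed by the sequence. The header is the one from the question; the sequence is truncated here for brevity, and the second capture group for the sequence is an illustration rather than something the answer above spelled out.

```perl
use strict;
use warnings;

# One record: the header from the question plus a truncated sequence.
my $record = ">gi|192567|gb|AAA37417.1| cystic fibrosis transmembrane conductance regulator [Mus musculus]\n"
           . "MQKSPLEKASFISKLFFSWT\n";

my @FASTA_arr;

# \| matches a literal pipe; (\d+) is the gi number; (.+) with /s is
# everything after the header line, newlines included.
if (my ($gi, $seq) = $record =~ /\A>gi\|(\d+)\|[^\n]*\n(.+)/s) {
    $seq =~ s/\s+//g;    # join wrapped sequence lines, drop the trailing newline
    push @FASTA_arr, $gi, $seq;
}

print join(" ", @FASTA_arr), "\n";    # 192567 MQKSPLEKASFISKLFFSWT
```

Testing the match in an if also avoids printing a stale or undefined $1 when a record does not have the expected header.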

perl regex: searching thru entire line of file

I'm a regex newbie, and I am trying to use a regex to return a list of dates from a text file. The dates are in mm/dd/yy format, so for years it would be '55' for '1955', for example. I am trying to return all entries from years '50' to '99'.
I believe the problem I am having is that once my regex finds a match on a line, it stops right there and jumps to the next line without checking the rest of the line. For example, I have the dates 12/12/12, 10/10/57, 10/09/66 all on one line in the text file, and it only returns 10/10/57.
Here is my code thus far. Any hints or tips? Thank you
open INPUT, "< dates.txt" or die "Can't open input file: $!";
while (my $line = <INPUT>){
if ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
A few points about your code
You must always use strict and use warnings 'all' at the top of all your Perl programs
You should prefer lexical file handles and the three-parameter form of open
If your regex pattern contains literal slashes then it is clearest to use a non-standard delimiter so that they don't need to be escaped
Although recent releases of Perl have fixed the issue, there used to be a significant performance hit when using $&, so it is best to avoid it, at least for now. Put capturing parentheses around the whole pattern and use $1 instead
This program will do as you ask
use strict;
use warnings 'all';
open my $fh, '<', 'dates.txt' or die "Can't open input file: $!";
while ( <$fh> ) {
print $1, "\n" while m{(\d\d/\d\d/[5-9][0-9])}g
}
output
10/10/57
10/09/66
You are printing $& which gets updated whenever any new match is encountered.
But in this case you need to store all the previous matches and the updated one too, so you can use an array for storing all the matches.
while (<$fh>) {
    my @dates = $_ =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g;
    print "@dates\n" if (@dates);
}
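One detail worth checking with the list-context form: with /g, the match returns every capture group of every match, so three groups per date yield the month, day and year as separate elements. The sketch below uses the example line from the question and contrasts that with a single group around the whole date.

```perl
use strict;
use warnings;

my $line = "12/12/12, 10/10/57, 10/09/66";

# Three capture groups: the list holds month, day and year for each match.
my @parts = $line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g;

# One capture group around the whole date: the list holds whole dates.
my @dates = $line =~ m{(\d\d/\d\d/[5-9][0-9])}g;

print "parts: @parts\n";    # parts: 10 10 57 10 09 66
print "dates: @dates\n";    # dates: 10/10/57 10/09/66
```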
You just need to change the 'if' to a 'while' and the regex will take up where it left off;
open INPUT, "< a.dat" or die "Can't open input file: $!";
while (my $line = <INPUT>){
while ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
# Output given line above
# 10/10/57
# 10/09/66
You could also capture the whole of the date into one capture variable and use a different regex delimiter to save escaping the slashes:
while ($line =~ m|(\d\d/\d\d/[5-9]\d)|g) {
print "$1\n" ;
}
...but that's a matter of taste, perhaps.
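The "takes up where it left off" behaviour comes from the position that a scalar-context /g match records on the string, which pos() reports. A short sketch using the example line from the question:

```perl
use strict;
use warnings;

my $line = "12/12/12, 10/10/57, 10/09/66";
my @seen;

# Each successful match in scalar context resumes from pos($line).
while ($line =~ m{(\d\d/\d\d/[5-9]\d)}g) {
    push @seen, sprintf "%s ends at offset %d", $1, pos($line);
}

print "$_\n" for @seen;
# 10/10/57 ends at offset 18
# 10/09/66 ends at offset 28
```

When the match finally fails, the position is reset, which is why the loop terminates instead of starting over.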
You can also use map to get the year range 50 to 99 and store the results in an array:
open INPUT, "< dates.txt" or die "Can't open input file: $!";
my @as = map { $_ =~ m/\d\d\/\d\d\/[5-9][0-9]/g } <INPUT>;
$, = "\n";
print @as;
Another way around it is removing the dates you don't want.
$line =~ s/\d\d\/\d\d\/[0-4]\d//g;
print $line;

open file for each element in array and check against a regex Perl

I have an array filled with 4 digit numbers (@nums) that correspond
to conf files which use the numbers as the file name, like so: 0000.conf
I am reading a file foreach element in the array and checking
the file for a pattern like this :
use strict;
use warnings;
foreach my $num (@nums) {
    open my $fh, "<", "$num.conf"
        or warn "cannot open $num.conf : $!";
    while (<$fh>) {
        if (/^SomePattern=(.+)/) {
            print "$num : $1\n";
        }
    }
}
I am extracting the part of the pattern I want using () and the
special var $1.
This seems to be working except it only prints the results of the last file
that is opened, instead of printing the results each time the foreach loop
passes and opens a file, which is what I expected.
I am still learning Perl, so any detailed explanations of what I am missing here
will be greatly appreciated.
use v5.16;
use strict;
use warnings;
my @nums = qw/ 0000 0200 /;
for my $num (@nums) {
    open my $fh, "<", "$num.conf" or die;
    while (<$fh>) {
        chomp;
        if ( /^somePattern=(.+)/ ) {
            say "$1";
        }
    }
    close $fh;
}
This seems to be working for me. You are missing the close $fh; in your code; maybe that is the problem. Secondly, maybe only one of your files matches your regex, so check the content for typos. I myself don't use foreach; maybe you are missing 'my' before $num. Depending on your regex, it might be useful to trim newline characters from the end of each line with 'chomp'.
Your code is excellent for a learner.
The problem is that, using "$num.conf", you are trying to open files named 0.conf etc. instead of 0000.conf.
You should also use the value of $! in your die string so that you know why the open failed.
Write this instead
my $file = sprintf '%04d.conf', $num;
open my $fh, '<', $file or die "Unable to open '$file': $!";
I have left my previous answer as it may be useful to someone. But I missed your opening "I have an array filled with 4 digit numbers".
Doubtless you are populating your array wrongly.
If you are reading from a file then most usually you have forgotten to chomp the newline from the end of the lines you have read.
You may also have non-printable characters (usually tabs or spaces) in each number.
You should use Data::Dumper or the better and more recent Data::Dump to reveal the contents of your array.
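A minimal sketch of that check with Data::Dumper, assuming the array was filled from a file without chomp. The array contents are hypothetical. Setting $Data::Dumper::Useqq makes the dump print control characters as visible escapes.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # show control characters as escapes such as \n

# Hypothetical contents, as if read from a file without chomp.
my @nums = ("0000\n", "0200\n");
my $before = Dumper(\@nums);    # elements appear as "0000\n", "0200\n"

chomp @nums;                    # chomp on an array strips every element
my $after = Dumper(\@nums);     # elements now appear as "0000", "0200"

print $before, $after;
```

If the first dump shows a trailing \n on each element, then "$num.conf" expands to a name containing a newline and the open fails for that reason.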

How to return only lines that do not match any values of an array?

I'm attempting to compare each line in a CSV file to each and every element (strings) I have stored in an array using Perl. I want to return/print-to-file the line from the CSV file only if it is not matched by any of the strings in the array. I've tried numerous kinds of loops to achieve this, but have not only not found a solution, but none of my attempts is really giving me clues as to where I'm going wrong. Below are a few samples of the loops I've tried:
while (<CSVFILE>) {
    foreach $i (@lines) {
        print OUTPUTFILE $_ if $_ !~ m/$i/;
    }; #foreach
}; #while
AND:
foreach $i (@lines) {
    open (CSVFILE , "< $csv") or die "Can't open $csv for read: $!";
    while (<CSVFILE>) {
        if ($_ !~ m/$i/) {
            print OUTPUTFILE $_;
        }; #if
    }; #while
    close (CSVFILE) or die "Cannot close $csv: $!";
}; #foreach
Here is a sample of the CSV file I am attempting:
1,c.03_05delAAG,null,71...
2,c.12T>G,null,24T->G,5...
3,c.87C>T,null,96C->T,82....
And the array elements (with regex escape characters):
c\.12T\>G
c\.97A\>C
Assuming only the above as input data, I would hope to get back:
1,c.03_05delAAG,null,71...
3,c.87C>T,null,16C->T....
since they do not contain any of the elements from the array. Is this a situation where Hashes come into play? I don't have a great handle on them yet, aside from the standard "dictionary" definition. If anyone could help me get my head around this problem it would be greatly appreciated. At this point I might just do it manually as there aren't that many and I need this out of the way ASAP, but since I wasn't able to find any answers searching anywhere else I figured it was worthwhile asking.
Use Perl 5.10.1 or better, so you can apply smart matching.
Also, don't use the implicit $_ when you're dealing with two loops, it gets too confusing and is error prone.
The following code (untested) might do the trick:
use 5.010;
use strict;
use warnings;
use autodie;
...
my @regexes = map { qr{$_} } @lines;
open my $out, '>', $outputfile;
open my $csv, '<', $csvfile;
while (my $line = <$csv>) {
    print $out $line unless $line ~~ @regexes;
}
close $csv;
close $out;
The reason your code doesn't work, by the way, is that it prints a line whenever any one of the elements in @lines doesn't match, and that will always be the case.
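Smart matching has been flagged as experimental since Perl 5.18, so here is a sketch of the same idea using plain grep instead. It swaps in a different technique from the answer above: a row is kept only when none of the patterns match it. The rows and patterns are the samples from the question, held in memory rather than read from files.

```perl
use strict;
use warnings;

# Sample rows and patterns from the question.
my @rows = (
    "1,c.03_05delAAG,null,71...\n",
    "2,c.12T>G,null,24T->G,5...\n",
    "3,c.87C>T,null,96C->T,82....\n",
);
my @lines   = ('c\.12T\>G', 'c\.97A\>C');
my @regexes = map { qr{$_} } @lines;

my @kept;
for my $row (@rows) {
    # grep in scalar context counts how many patterns match this row.
    my $hits = grep { $row =~ $_ } @regexes;
    push @kept, $row unless $hits;
}

print @kept;    # rows 1 and 3
```

The decision to print is made once per row, after all patterns have been tried, which is the step the nested loops in the question were missing.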