Perl read a large file for use with multi line regex - regex

I have a 4GB text file with highly variable length lines. This is only a sample file; production files will be much larger. I need to read the file and apply a multi line regex.
What is the best way to read such a large file for the multi line regex?
If I read it line by line, I don't think my multi line regex will work correctly. When I use the read function in 3 argument form, my regex results vary as I change the length I specify in the read statement. I believe that the file's size makes it too large to be read into an array or into memory.
Here is my code
package main;
use strict;
use warnings;
our $VERSION = 1.01;
my $buffer;
my $INFILE;
my $OUTFILE;
open $INFILE, '<', ... or die "Bad Input File: $!";
open $OUTFILE, '>',... or die "Bad Output File: $!";
while ( read $INFILE, $buffer, 512 ) {
if ($buffer =~ /(?m)(^[^\r\n]*\R+){1}^(B|BREAK|C|CLOSE|D|DO(?! NOT)|E|ELSE|F|FOR|G|GOTO|H|HALT|HANG|I|IF|J|JOB|K|KILL|L|LOCK|M|MERGE|N|O|OPEN|Q|QUIT|R|READ|S|SET|TC|TRE|TRO|TS|U|USE|V|VIEW|W|WRITE|X|XECUTE)( |:).*[^\r\n]/) {
print $OUTFILE $&;
print $OUTFILE "\n";
}
}
close( $INFILE );
close( $OUTFILE );
1;
Here is some sample data:
^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2"),^%Z("ELONG"):$L(%L)>245 S %N="E1" Q:$L(%L)>255 X ^%ZOSF("EON") S DX=0,DY=%EY,X=%RM+1 X ^%ZOSF("RM"),XY K %EX,%EY,%E1,%E2,DX,DY,%N Q
^%Z("F12")
S %A=$P(^DIC(9.8,0),"^",3)+1,%C=$P(^(0),"^",4)+1 X "F %=0:0 Q:'$D(^DIC(9.8,%A,0)) S %A=%A+1" S $P(^DIC(9.8,0),"^",3,4)=%A_"^"_%C,^DIC(9.8,%A,0)=%X_"^R",^DIC(9.8,"B",%X,%A)=""
^%Z("F2")
S %=$H>21549+$H-.1,%Y=%\365.25+141,%=%#365.25\1,%D=%+306#(%Y#4=0+365)#153#61#31+1,%M=%-%D\29+1,%DT=%Y_"00"+%M_"00"+%D,%D=%M_"/"_%D_"/"_$E(%Y,2,3)
The lines above are paired, syntactically (line 1 and 2 go together, 3 and 4, etc). I need to find specific pairs, in the above data that's all of the pairs except for:
^%Z("RT")
This is data that I don't want the regex to find

The question is apparently about parsing a DSL, and it seems that in general regex isn't the right tool for that. A quick search did not yield an easy list of accepted approaches, except for pages of CPAN modules and posts like this article. Finding out the best approach is indeed the first step.
However, below is an answer to the question as stated in the title and in the clear description: how to parse a very large file where units to be processed spread over an unknown number of lines.
Keep assembling a 'buffer' and checking it. Once you find a match, process and clear it.
For instance, append a line to a variable and check it (try to match, if you use a regex). Keep going, and once it does match, process and clear the variable.
my $unit;
while (<$fh>) {
# chomp; # if suitable, and then add a space
# $unit .= ' '.$_; # as a separator that newline was
$unit .= $_;
if ( test_unit($unit) ) {
# process ...
$unit = undef;
}
}
The test_unit() sub is a placeholder for code that would decide whether the assembled unit should be processed. If that is a regex it can be defined before the loop, my $re = qr/.../; (see qr in perlop), and then tested in the loop with if ($unit =~ $re).
A note in the question states that lines to be processed come in pairs, but it is clarified in a comment that subsequent lines don't always pair up. Thus we can't process pairs of lines.
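As a rough sketch of the same idea for the question's data, the loop below keeps only the previous line in memory and emits a two-line unit whenever the current line starts with a command keyword. The shortened keyword list, the sub name matching_pairs, and the in-memory sample are assumptions made for illustration; the question's regex also tolerates blank lines between the two lines, which this sketch does not.

```perl
use strict;
use warnings;

# Hypothetical, shortened command list; the question's regex has many more.
my $cmd_re = qr/^(?:S|SET|X|XECUTE|Q|QUIT)(?: |:)/;

# Return every "previous line + command line" unit read from a filehandle.
sub matching_pairs {
    my ($fh) = @_;
    my ( @units, $prev );
    while ( my $line = <$fh> ) {
        $line =~ s/\R\z//;    # strip any style of line ending
        push @units, "$prev\n$line" if defined $prev && $line =~ $cmd_re;
        $prev = $line;        # only one line is ever held in memory
    }
    return @units;
}

# Demo on an in-memory filehandle built from lines in the sample data.
my $sample = join "\n", '^%Z("EUD")', 'S %L=%LO,%N="E1"', '^%Z("RT")',
    q{This is data that I don't want the regex to find}, '';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @units = matching_pairs($fh);
print "$_\n" for @units;
```

Because memory use is bounded by the longest line rather than by the file size, this approach should not depend on a chunk length the way read does.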

Related

Matching fields in a log file and transforming results

First a quick intro. I'm new here, so if I screw up a post, please let me know and I'll fix it.
I've been trying to accomplish my goal using perl, but I'm stuck. I don't need to use perl to accomplish it, but I figure it's that, or Excel and I like perl better. If you have a better method please share.
I start with a file (output from a log file). It is 1 line, with fields delimited by colons. Here is an example of the file:
RmDenySumm:SGID=46244:Req=15000:tsid=46244:AllocBw=38332:BwList=12456/12500/3750/5876/3750:tsid=63042:AllocBw=38750:BwList=15000/12500/3750/3750/3750:tsid=63043:AllocBw=36717:BwList=14706/12500/3750/5761:tsid=63044:AllocBw=37011:BwList=15000/12500/5761/3750:tsid=61741:AllocBw=38450:BwList=12339/3750/6501/12502/3357:tsid=61721:AllocBw=37460:BwList=12500/15000/4200/5760:tsid=2072:AllocBw=31975:BwList=12136/12339/3750/3750:tsid=2073:AllocBw=24260:BwList=14634/5876/3750:tsid=30842:AllocBw=38453:BwList=14634/12500/5761/5557:tsid=30843:AllocBw=37105:BwList=15000/15000/3750/3355:tsid=30844:AllocBw=38295:BwList=14706/12339/3750/3750/3750:tsid=30845:AllocBw=25601:BwList=5762/12339/3750/3750:tsid=30846:AllocBw=38455:BwList=15000/12136/5761/5557:tsid=30847:AllocBw=26974:BwList=14634/12339:tsid=30848:AllocBw=29634:BwList=14634/15000:tsid=30849:AllocBw=37338:BwList=14838/15000/3750/3750:tsid=60958:AllocBw=36898:BwList=12339/12500/6501/5557:tsid=60959:AllocBw=37178:BwList=12339/12500/12339:tsid=60960:AllocBw=27339:BwList=12339/15000:tsid=60962:AllocBw=34839:BwList=12339/3750/15000/3750:tsid=60963:AllocBw=37500:BwList=15000/15000/3750/3750:tsid=60964:AllocBw=38346:BwList=15000/3754/15000/4592:tsid=60965:AllocBw=24626:BwList=15000/5876/3750:tsid=60966:AllocBw=34513:BwList=12502/12500/5761/3750
I need to grab all of "AllocBW=######" fields, separate the number part from the "AllocBW", add them all together then subtract them from a set value.
In perl, I have this:
#!/usr/bin/perl -w
use Data::Dumper;
#
#
my $file = "/home/nick/perl/svcgroup.txt";
my @asplit;
my $c = 0;
open (FILE, "<", $file) or die "Can't open file".$!."\n";
while (<FILE>) {
$_ =~ s/\n//g;
push(@asplit, split (":", $_));
#print Dumper @asplit;
}
foreach $splits (@asplit) {
if ($splits =~ m/AllocBw/) {
print $splits."\n";
}
}
#print Dumper @asplit;
print "\n\n";
close FILE;
exit;
Which leaves me with:
AllocBw=38332
AllocBw=38750
AllocBw=36717
AllocBw=37011
AllocBw=38450
AllocBw=37460
AllocBw=31975
AllocBw=24260
AllocBw=38453
AllocBw=37105
AllocBw=38295
AllocBw=25601
AllocBw=38455
AllocBw=26974
AllocBw=29634
AllocBw=37338
AllocBw=36898
AllocBw=37178
AllocBw=27339
AllocBw=34839
AllocBw=37500
AllocBw=38346
AllocBw=24626
AllocBw=34513
This is where I get stuck. I'm not sure how to strip these values down to the number and add them up.
If someone can assist, I'd be grateful. If this is more easily accomplished using something other than Perl, that's fine too. My programming scope is limited, as I only make small scripts to accomplish small repetitive tasks at work.
EDIT FOR BORODIN
ie (not formatted like this, this is just for illustration):
AllocBw 12575+
AllocBw 12568+
AllocBw 12358 = TotAllocBw 37501
MaxBw 38800*3=116400
116400(MaxBw) - 37501(TotAllocBw) = TotAvaiBw 78899
This would just be a big bonus. The script you wrote works perfectly well for my purposes and I can adapt it as I need. Thanks again! Much appreciated. I was able to follow everything you did differently in the script and learned some new stuff.. Thanks for that as well.
It is simplest to use a global regular expression match to find all occurrences of AllocBw=... in each line of your input file.
This program's outer while loop iterates over all the lines in the input file; since the file contains a single line, its body should execute only once.
The inner while iterates over all instances of the regex pattern AllocBw=(\d+) (AllocBw= followed by any number of decimal digits) and captures the numeric value into $1.
The captured number is added to $total each time, and can simply be printed at the end.
use strict;
use warnings;
my $file = '/home/nick/perl/svcgroup.txt';
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my $total = 0;
while ( <$fh> ) {
$total += $1 while /AllocBw=(\d+)/g;
}
printf "Total: %d\n", $total;
output
Total: 826049
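For the follow-up in the question's edit (subtracting the total from a set value), a minimal sketch could look like the following. The sub name total_alloc_bw and the shortened input line are assumptions made for illustration; the three AllocBw values and the MaxBw figure of 38800*3 are the ones from the question's own example.

```perl
use strict;
use warnings;

# Sum every AllocBw=<digits> value found in a string.
sub total_alloc_bw {
    my ($text) = @_;
    my $total = 0;
    $total += $1 while $text =~ /AllocBw=(\d+)/g;
    return $total;
}

# Shortened, made-up line using the three values from the question's edit.
my $line  = 'tsid=1:AllocBw=12575:tsid=2:AllocBw=12568:tsid=3:AllocBw=12358';
my $max   = 38800 * 3;                 # MaxBw from the question's illustration
my $total = total_alloc_bw($line);

# Prints: TotAllocBw 37501, TotAvaiBw 78899
printf "TotAllocBw %d, TotAvaiBw %d\n", $total, $max - $total;
```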

Perl - How to Read, Filter & Output results

scenario: I am a Jr. C# developer, but recently (3 days) began learning Perl for batch files. I have a requirement to parse through a text file, extract some key data, then output the key data to a new text file. As seems to always be the case, there are butt loads of fragmented examples on the net regarding how to 'read' from a file, 'write' to a file, 'store' line by line into an array, 'filter' this and that, yadda yadda, but nothing discussing the entire process of read, filter, write. Trying to splice examples from the net together is no good, because none seem to work together as coherent code. Coming from C#, Perl's syntax structure is hella confusing. I just need some advice on this process.
My objective is to parse a text file, single out all lines similar to the one below, by date, and output only the first 8 digits of the 2nd number group and 5 digits from the 3rd number group to a new text file.
11122 20100223454345 ....random text..... [keyword that identifies all the
entries I need]... random text 0.0034543345
I know regex is likely the best option, and have most of the expression written, but it does not work in Perl!
Question: Could someone please show a simple (dummy) example of how to read from, filter (using dummy regex) the file, then output the (dummy) results to a new file? I'm not concerned with functional details, I can learn those, I just need the syntax structure Perl uses. For example:
open(FH, '<', 'dummy1.txt')
open(NFH, '>', 'dummy2.txt')
@array; or $dumb;
while(<FH>)
{
filter each line [REGEX] and shove it into [@array or $dumb scalar]
}
print(join(',', @array)) to dummy2.txt
close FH;
close NFH;
Note: For various reasons, I cannot paste my source code in here, sorry. Any help is appreciated.
UPDATE: ANSWER:
Much thanks to all those who provided insight into my issue. After reading through your replies, as well as conducting further research, I learned that there are dozens of ways to accomplish the same task in Perl (which I am not a fan of). In the end, this is how I solved the problem, and IMO it's the cleanest, and most succinct, solution for those having similar struggles. Thanks again for all the help.
#======================================================================
# 1. READ FILE: inputFile.txt
# 2. CREATE FILE: outputFile.txt
# 3. WRITE TO: outputFile.txt IF line matches REGEX constraints
# 4. CLOSE FILES: outputFile.txt & inputFile.txt
#==========================================================================
#1
$readFile = 'C:/.../.../inputFile.txt';
open(FH, '<', $readFile) or Error("Could not read file ($!)");
#2
$writeFile = 'C:/.../.../outputFile.txt';
open(NFH, '>', $writeFile) or Error("Cannot write to file ($!)");
#3
@lines = <FH>;
LINE: foreach $line (@lines)
{
if ($line =~ m/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/)
{
$date = $1;
$elapsedtime = $2;
print NFH "$date,$elapsedtime\n";
}
}
#4
close NFH;
close FH;
perlfaq5 - How do I change, delete, or insert a line in a file, or append to the beginning of a file? covers most of the different scenarios for how to use files.
However, I will add to that: always start your scripts with use strict; and use warnings;, and because you're doing file processing, use autodie; will serve you well too.
With that in mind, a quick stub would be the following:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'dummy1.txt';
open my $outfh, '>', 'dummy2.txt';
while (my $line = <$infh>) {
chomp $line; # Remove \n
if ( $line =~ /regex_goes_here/ ) { # placeholder for whatever processing you need
print $outfh "your new data"; # note: no comma after the filehandle
}
}
while(<FH>)
{
# variable $_ contains the current line
if(m/regex_goes_here/) #by default, the regex match operator m// attempts to match the default $_ variable
{
#do actions
}
}
Also note, m/regex/ is the same as /regex/
Refer to:
http://perldoc.perl.org/perlvar.html#General-Variables
http://perldoc.perl.org/perlre.html
For capturing variables from regex match, THIS might help
EDIT
If you want a different variable than the default $_, as @Miller suggested, use while($line = <FH>) followed by if($line =~ m/regex_goes_here/)
=~ is the Binding Operator
One tip. Don't explicitly open filehandles to your input and output files. Instead read from STDIN and write to STDOUT. Your program will be far more flexible and easier to use as you'll be able to treat it like a Unix filter.
$ your_filter_program < your_input.txt > your_output.txt
And doing this actually makes your program simpler to write too.
while (<>) { # <> reads from STDIN (or from files named on the command line)
# transform your data (which is in $_) in some way
...
print; # prints $_ to STDOUT
}
You might find the first few chapters of Data Munging with Perl are useful.
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
INPUT_FILE => "NAME_OF_INPUT_FILE",
OUTPUT_FILE => "NAME_OF_OUTPUT_FILE",
FILTER => qr/regex_for_line_to_filter/,
};
open my $in_fh, "<", INPUT_FILE;
open my $out_fh, ">", OUTPUT_FILE;
while ( my $line = <$in_fh> ) {
chomp $line;
next unless $line =~ FILTER;
$line =~ s/regular_expression/replacement/;
say {$out_fh} $line;
}
close $in_fh;
close $out_fh;
Here $in_fh is your input filehandle, and $out_fh is your output filehandle. I basically open both, and loop through the input. The chomp removes the \n from the end. I always recommend doing that.
The next goes to the next iteration of the loop unless I match FILTER which is a regular expression matching lines you want to keep. This is identical to:
if ( $line !~ FILTER ) {
next;
}
I then use the substitution command to get the parts of the line I want, and munge them into the output I want. I may be better off expanding this a bit, maybe using split to split up my line into various pieces, then only using the pieces I want. I could then use substr to pull out the substring from the selected pieces.
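The split and substr variant mentioned above might be sketched like this. The sub name extract_fields is made up for illustration, and the sample line is adapted from the question; it assumes the fields are whitespace-separated, with the date-like number second and the decimal number last.

```perl
use strict;
use warnings;

# Return the first 8 characters of field 2 and the first 5 of the last field.
sub extract_fields {
    my ($line) = @_;
    my @fields = split ' ', $line;    # split on runs of whitespace
    return ( substr( $fields[1], 0, 8 ), substr( $fields[-1], 0, 5 ) );
}

# Sample line adapted from the question.
my $line = '11122 20100223454345 ....random text..... keyword ... random text 0.0034543345';
my ( $date, $value ) = extract_fields($line);
print "$date,$value\n";    # prints 20100223,0.003
```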
The say command is like print except it automatically adds in a NL on the end. This is how you write a line to a file.
Now, get Learning Perl and read it. If you know any programming, it shouldn't take you more than a week to go through the first half of the book. That should be more than enough to be able to write a program like this. The more complex stuff like references and object orientation might take a bit longer.
Online documentation can be found at http://perldoc.perl.org. You can look up the use statements, which are called pragmas, over there. Documentation on the individual functions is also available.
If I understood well, this one liner will do the job:
perl -ane 'print substr($F[1],0,8),"\t",substr($F[-1],0,5),"\n" if /keyword/' in.txt
Assuming in.txt is:
11122 20100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.0034543345
11122 30100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.124543345
11122 40100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.65487
11122 50100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.6215
output:
20100223 0.003
40100223 0.654

Perl Regex Match Text String and Extract Following Number

I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" everytime it encounters the word BGA, but it prints nothing.... Any help would be appreciated.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";
while(my $line = <$info>){
if ($line eq "/BGA/"){
print <>,"\n";
}
}
close $file;
if ($line =~ /BGA\s(\d+)/){
#your code
print "BGA number $1 \n";
#your code
}
And $1 variable will have the number you want
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
while ( $line =~ /BGA\s(\d+)/g ) {
print "$1\n";
}
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory.
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests if the line literally consists of the string "/BGA/". But that can never be true, because the line will at least have the input record separator (i.e. the contents of $/) at the end, since you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere, and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.
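Since no sample input was shown, the following is only a sketch of how the header number could be carried onto the data rows. It assumes that a "BGA <n>" header precedes each data set and that data rows start with a number; the sub name tag_rows and the in-memory sample are made up for illustration.

```perl
use strict;
use warnings;

# Append the most recent "BGA <n>" header value to each data row.
sub tag_rows {
    my ($fh) = @_;
    my ( $bga, @out );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /BGA\s+(\d+)/ ) {
            $bga = $1;    # remember the current section; a repeated header is harmless
        }
        elsif ( defined $bga && $line =~ /^\s*[-+]?\d/ ) {
            push @out, "$line\t$bga";    # assumed: data rows start with a number
        }
    }
    return @out;
}

# Made-up sample with two data sets.
my $sample = "BGA 1\n0.1 0.2\n0.3 0.4\nBGA 2\n0.5 0.6\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for tag_rows($fh);
```

Remembering the last header value seen means the duplicated "BGA 1" text in each section simply reassigns the same number.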

Regular expression statement inside a while loop only matching and printing one of several expected matches

I've been struggling with this for a while and I was wondering if there was something obvious I've missed.
As programming learning/practice, I'm trying to put together a simple script for calculating the components of a restriction enzyme digest mix. However, first I need to get a list of enzyme stock concentrations.
I pulled all the individual pages from the New England Biolabs enzyme page, and my goal with this current script is to pull out the name of the enzyme and the concentrations available from the company.
This example works with a local copy of EcoRI (link included at bottom of submission).
use warnings;
use strict;
open(FILE,'productR0101.asp');
my $line;
my $counter;
my $array1;
my $array2;
my $array3;
my $concentration;
my @array4;
$counter = 1;
while ($line = <FILE>) {
chomp($line);
if ($counter == 6 ){
$array1 = $line;
$counter++;
}
else{
$counter++;
}
if ($line =~ m/.{8}units.ml/g) {
(@array4) =$line =~ m/.{8}units.ml/g;
print @array4;
}
}
print "\n".$array1;
exit;
Every file has the enzyme name on the sixth line of the file, so I just pulled that whole line. However, the concentrations are in different locations, so my approach was to read in the file one line at a time, and match to the units/ml tag.
My thinking was that it should print out the match for each line, if there was one, every time the while loop runs, effectively resulting in a string of separate print statements.
This is where I get messed up. There are six different locations in this file with a units/ml tag: three for 20,000 and three for 100,000.
I was expecting six different results printed, but when I run this, only one 100,000 units/ml result is returned.
I've tried all sorts of fixes. I tried concatenating strings, I tried storing it as a string, I tried concatenating it onto another array that never gets touched by the (@array4) = $line =~ m/.{8}units.ml/g line, and it either breaks it or gives the same result.
And finally, I apologize for any weird conventions. I'm still learning Perl, and my first experience programming was with MATLAB.
Also, the $array1, $array2, etc. exist because I was trying to keep track of exactly what was getting put where; my intention is to clean it up once I get it functional.
So does anyone have any ideas about what I'm doing wrong?
EDIT: the data source is the source code to each individual enzyme page. For this example, if you view the page source you get the complete input file I gave to the script.
Are the 20,000 units/ml at the start of the line? Because in that case, .{8} would fail to match - the dot doesn't match newlines, and 20,000_ is only 7 characters.
We really need to see the data you are processing, but it looks like you are storing only the last occurrence of /units.ml/ in @array4 because you are reading the file line by line.
I will add to this answer if you supplement your question, but for now I need to know
What your data looks like
What the mysterious /.{8}/ is for
Are you aware that $array1, $array2, and $array3, are scalars, as well as being very bad names for variables?
For now, here is a rewrite of your code using idiomatic Perl, and the $. variable that evaluates to the line number of the file most recently read
use strict;
use warnings;
open my $file, '<', 'productR0101.asp' or die $!;
my $array1;
my @array4;
while (my $line = <$file>) {
chomp $line;
$array1 = $line if $. == 6;
if ($line =~ m/.{8}units.ml/) {
@array4 = $line =~ m/.{8}units.ml/g;
print "@array4\n";
}
}
print "\n".$array1;
I can't exactly reproduce the behavior you've reported of only getting one of the 100,000 units/ml results, as I'm not exactly sure what your input data is. However, I think the problem is with the regular expression not having any captures. You should put parentheses around the part of the regex match that you want to be returned to @array4. So instead of this:
@array4 = $line =~ m/.{8}units.ml/g;
Try this:
@array4 = $line =~ m/(.{8})units.ml/g;
@array4 = $line =~ /(.{8})units.ml/;
EDIT:
You also don't want to use the m/ and /g modifiers.
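One way to keep every match instead of only those from the last matching line is to push onto an array rather than assign to it. The sketch below is an illustration under assumptions: the sub name concentrations, the made-up HTML lines, and the [\d,]+ pattern (used in place of .{8}) are not from the original page.

```perl
use strict;
use warnings;

# Collect every number that precedes "units/ml", across all lines.
sub concentrations {
    my (@lines) = @_;
    my @found;
    for my $line (@lines) {
        # In list context, a /g match returns all captures on the line.
        push @found, $line =~ m{([\d,]+)\s*units/ml}g;
    }
    return @found;
}

# Made-up input lines standing in for the page source.
my @lines = (
    '<td>20,000 units/ml</td>',
    'no match here',
    '<td>100,000 units/ml</td> and 20,000 units/ml',
);
print join( "\n", concentrations(@lines) ), "\n";
```

Assigning with = replaces the array's contents on every matching line, which is consistent with seeing only one result at the end of the loop.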

help with perl code to parse a file

I am new to Perl and have a question about the syntax. I received this code for parsing a file containing specific information. I was wondering what the if (/DID/) part of the subroutine get_number is doing? Is this leveraging regular expressions? I'm not quite sure because regular-expression matches look like $_ =~ /some expression/. Finally, is the while loop in the get_number subroutine necessary?
#!/usr/bin/env perl
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
# store the name of all the OCR file names in an array
my @file_list=qw{
blah.txt
};
# set the scalar index to zero
my $file_index=0;
# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
or die "Can't open output file\n";
while($file_index < 1){
# open the OCR file and store it in the filehandle IN_FILE
open(IN_FILE, '<', "$file_list[$file_index]")
or die "Can't read source file!\n";
print "Processing file $file_list[$file_index]\n";
while(<IN_FILE>){
my $citing_pat=get_number();
get_country($citing_pat);
}
$file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;
The definition of get_number is below.
sub get_number {
while(<IN_FILE>){
if(/DID/){
my @fields=split / /;
chomp($fields[3]);
if($fields[3] !~ /\D/){
return $fields[3];
}
}
}
}
Perl has a variable $_ that is sort of the default dumping ground for a lot of things.
In get_number, while(<IN_FILE>){ is reading a line into $_, and the next line is checking if $_ matches the regular expression DID.
It's also common to see chomp; which also operates on $_ when no argument is given.
In that case, if (/DID/) by default searches the $_ variable, so it is correct. However, it is a rather loose regex, IMO.
The while loop in the sub may be necessary, it depends on what your input looks like. You should be aware that the two while loops will cause some lines to get completely skipped.
The while loop in the main program will take one line, and do nothing with it. Basically, this means that the first line in the file, and every line directly following a matching line (e.g. a line that contains "DID" and the 4th field is a number), will also be discarded.
In order to answer that question properly, we'd need to see the input file.
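To make the skipping concrete, here is a small self-contained demonstration. The three-line input is invented, and get_number is a simplified stand-in for the sub in the question that takes the filehandle as an argument.

```perl
use strict;
use warnings;

# Invented input: three matching lines.
my $data = "DID a b 111\nDID a b 222\nDID a b 333\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Simplified stand-in for the question's sub: read until a DID line
# whose 4th field is all digits, and return that field.
sub get_number {
    my ($in) = @_;
    while ( my $line = <$in> ) {
        next unless $line =~ /DID/;
        my $field = ( split ' ', $line )[3];
        return $field if defined $field && $field =~ /^\d+$/;
    }
    return;
}

my @seen;
while ( my $outer = <$fh> ) {    # the outer loop consumes a line and ignores it
    my $n = get_number($fh);     # the sub then reads the following line(s)
    push @seen, $n if defined $n;
}
print "@seen\n";    # prints 222
```

Only the second number is reported: the first and third lines are consumed by the outer loop and never examined.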
There are a number of issues with this code, and if it works as intended, it's probably due to a healthy amount of luck.
Below is a cleaned up version of the code. I kept the modules in, since I do not know if they are used elsewhere. I also kept the output file, since it might be used somewhere you have not shown. This code will not attempt to use undefined values for get_country, and will simply do nothing if it does not find a suitable number.
use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
my @file_list=qw{ blah.txt };
open(my $out_file, '>', 'outputfile.txt') or die "Can't open output file: $!";
for my $file (@file_list) {
open(my $in_file, '<', $file) or die "Can't read source file: $!";
print "Processing file $file\n";
while (my $citing_pat = get_number($in_file)) {
get_country($citing_pat);
}
}
close $out_file;
sub get_number {
my $fh = shift;
while(<$fh>) {
if (/DID/) {
my $field = (split)[3];
if($field =~ /^\d+$/){
return $field;
}
}
}
return undef;
}