Optimal way to split huge file - regex

I am trying to split a huge text file (~500 million lines of text) which is pretty regular and looks like this:
-- Start ---
blah blah
-- End --
-- Start --
blah blah
-- End --
...
where ... implies a repeating pattern and "blah blah" is of variable length ~ 2000 lines. I want to split off the first
-- Start --
blah blah
-- End --
block into a separate file and delete it from the original file in the FASTEST (runtime, given I will run this MANY times) possible way.
The ideal solution would cut the initial block from the original file and paste it into the new file without loading the tail of the huge initial file.
I attempted csplit in the following way:
csplit file.txt /End/+1
which is a valid way of doing this, but not very efficient in time.
EDIT: Is there a solution if we remove the last "start-end" block from file instead of the first one?

If you want the beginning removed from the original file, you have no choice but to read and write the whole rest of the file. To remove the end (as you suggest in your edit) it can be much more efficient:
use File::ReadBackwards;
use File::Slurp 'write_file';
my $fh = File::ReadBackwards->new( 'inputfile', "-- End --\n" )
or die "couldn't read inputfile: $!\n";
my $last_chunk = $fh->readline
or die "file was empty\n";
my $position = $fh->tell;
$fh->close;
truncate( 'inputfile', $position );
write_file( 'lastchunk', $last_chunk );

Perhaps something like the following will help you:
Split the file after every -- End -- marker. Create new files with a simple incremented suffix.
use strict;
use warnings;
use autodie;
my $file = shift;
my $i = 0;
my $fh;
open my $infh, '<', $file;
while (<$infh>) {
open $fh, '>', $file . '.' . ++$i if !$fh;
print $fh $_;
undef $fh if /^-- END --/;
}
Unfortunately, there is no truncate equivalent for removing data from the beginning of a file.
If you really wanted to do this in stages, then I would suggest that you simply tell the last place you read from, so you can seek when you're ready to output another file.

You could use the flip-flop Operator to get the content between this Pattern:
use File::Slurp;
my #text = read_file( 'filename' ) ;
foreach my $line (#text){
if ($line =~ /Start/ .. /End/) {
# do stuff with $line
print $line; # or so
}
}
When your file is large, be carefull with slurping the whole file at once!

Related

How to handle files that do not contain a pattern?

I need help with my Perl program. The idea is to pass in a pattern and a file list from the command line. If the file name matches the pattern, print the file name. Then if the file name doesn't match, it should look for instances of the pattern in the text of the file and print filename : first line of text that contained occurrence.
However should the user add the -i option at the beginning the opposite should occur. If the filename does not match print it. Then print any files that do not contain any instances of the pattern in their text.
This last part is where I'm struggling I'm not exactly sure how to get files that don't have the pattern in their text. For example in my code
#!/usr/bin/perl -w
die("\n Usage: find.pl [-i] <perlRegexPattern> <listOfFiles>\n\n") if(#ARGV<2);
my (#array,$pattern,#filesmatch,#files);
#I can separate files based on name match
($pattern,#array) = ($ARGV[0] eq "-i") ? (#ARGV[1 .. $#ARGV]) : (#ARGV);
foreach(#array){
($_ =~ m/.*\/?$pattern/) ? (push #filesmatch,$_) : (push #files, $_);
}
#and I can get files that contain a pattern match in their text
if($ARGV[0] ne "-i"){
for my $matches(#filesmatch){ #remove path print just file name
$matches =~s/.*\///; #/
print "$matches\n";
}
for my $file(#files){
open(FILE,'<',$file) or die("\nCould not open file $file\n\n");
while(my $line = <FILE>){
if($line =~ m/$pattern/){
$file =~ s/.*\///; #/ remove path print just file name
print "$file: $line";
next;
}
}
}
}
#however I'm not sure how to say this file dosen't have any matches so print it
else{
for my $matches(#files){ #remove path print just file name
$matches =~ s/.*\///;
print "$matches\n";
}
for my $file(#filesmatch){
open(FILE,'<',$file) or die("\nCould not open file $file\n\n");;
while(my $line = <FILE>){...
I'm not sure if something like grep could be used to do this but I'm having a hard time working with Perl's grep.
In order to decide whether to print or not a file based on its content you have to first read the file. With your criterion -- that a phrase does not exist -- you have to check the whole file.
A standard way is to use a separate variable ("flag") to record the condition then go back to print
my $has_match;
while (<$fh>) {
if (/$pattern/) {
$has_match = 1;
last;
}
}
if (not $has_match) {
seek $fh, 0, 0; # rewind to the beginning
print while <$fh>;
}
This can be simplified by reading the file into a variable first, and by using labels (also see perlsyn)
FILE: foreach my $file (#filesmatch) {
open my $fh, '<', $file or die "Can't open $file: $!";
my #lines = <$fh>;
for (#lines) {
next FILE if /$pattern/;
}
print for #lines;
}
Note that skipping an iteration in the middle of a loop isn't the cleanest way since one has to always keep in mind that the rest of the loop may not run.
Each file is read first so that we don't read it twice, but don't do that if any of the files can be huge.
If there is any command line processing it is better to use a module; Getopt::Long is nice.
use Getopt::Long;
my ($inverse, $pattern);
GetOptions('inverse|i' => \$inverse, 'pattern=s' => \$pattern)
or usage(), exit;
usage(), exit if not $pattern or not #ARGV;
sub usage { say STDERR "Usage: $0 ... " }
Call the program as progname [-i] --patern PATTERN files. The module provides a lot, please see docs. For example, in this case you can also just use -p PATTERN.
As GetOptions parses the command line the submitted options are removed from #ARGV and what remains in it are file names. And you have the $inverse variable to nicely make decisions.
Please have use warnings; (not -w) and use strict; at the top of every program.

Perl capture and add to end of string

I have a file with a lot of lines like this:
ChrVIII_A_nidulans_FGSC_A4 AspGD gene 3861520 3863875 . + . ID=AN0338;Name=AN0338;Gene=CYP680A1;Note=Putative%20cytochrome%20P450;orf_classification=Uncharacterized;Alias=ANIA_00338,ANID_00338
My region of interest is ;Gene=_____; -- the stuff between the = and ;.
If this region exists, I want to append it to the end of the line with a , attached to the front. If it does not exist I want to print the line anyway!
ChrVIII_A_nidulans_FGSC_A4 AspGD gene 3861520 3863875 . + . ID=AN0338;Name=AN0338;Gene=CYP680A1;Note=Putative%20cytochrome%20P450;orf_classification=Uncharacterized;Alias=ANIA_00338,ANID_00338,CYP680A1
This is what I tried in Perl and I don't know why it doesn't work.
use strict;
use warnings;
open(SOURCE,"<annotation.gff") or die "Source file not found!\n";
my $line1;
foreach $line1(<SOURCE>) #iterating over SOURCE file
{
if($line1=~/Gene\=([a-zA-Z0-9\-]+)\;/)
printf "$line1,$1";
}
else {printf "$line1";}
}
Can anyone show me what I am doing wrong?
Let's go through your code:
use strict;
use warnings;
Good. However, trying to run your code gives:
syntax error at ss.pl line 9, near ")
printf"
syntax error at ss.pl line 11, near "else"
which means you did not post the code you ran, so we can't really trust it. Don't do that. Reduce your problem to a small, self-contained script others can run.
open(SOURCE,"<annotation.gff") or die "Source file not found!\n";
Don't use bareword filehandles such as SOURCE. Instead, use lexical filehandles.
Don't hard code the name of the file you are trying to open. Doing so makes it hard to accurately convey the name of the file your program failed to open in case of a failure.
In the error message, include actual error your program encountered, rather than hardcoding your unwarranted assumptions.
Don't use the two argument form of open, especially if you are going to want the flexibility to specify file names as command line arguments instead of having to edit the script every time you get a new input file. That is, use
my $annotation_file = 'annotation.gff';
open my $source, '<', $annotation_file
or die "Failed to open annotation source '$annotation_file': $!";
Don't declare the iteration variable for a loop outside the scope of the loop.That is, instead of:
my $line1;
foreach $line1 ( ... )
use
foreach my $line1 ( ... )
But, of course, you should not use a for loop to iterate over the contents of a file because doing so makes your program slurp (i.e. read the entire contents of) the file into memory as a list of lines. This makes the memory footprint of your program depend on the size of its input instead of the size of the longest line. Also, drop the 1 suffix: You are iterating through every line in the file, not just the first one.
while (my $line = <$source>) {
Don't use printf if you are just printing plain strings. That is, instead of printf "$line1,$1", use print "$line,$1\n".
And, that brings us to another problem. When you read the line, you never remove the newline off its end. Therefore, the string you print is "...\n..." which creates the effect of prepending the captured string to the beginning of the following line.
That brings us to something that works:
use strict;
use warnings;
my $annotation_file = 'annotation.gff';
open my $source, '<', $annotation_file
or die "Cannot open annotation source '$annotation_file': $!";
while (my $line = <$source>) {
if( $line =~ /Gene = ( [^;]+ ) ;/x ) {
chomp $line;
print join(',' => $line, $1), "\n";
}
else {
print $line;
}
}
Try this:
use strict;
use warnings;
open(my $fh, '<', 'annotation.gff') or die $!;
while (<$fh>) {
chomp;
/Gene=([a-zA-Z0-9\-]+)\;/ and $_ .= ",$1";
print "$_\n";
}
close $fh;

perl regex: searching thru entire line of file

I'm a regex newbie, and I am trying to use a regex to return a list of dates from a text file. The dates are in mm/dd/yy format, so for years it would be '55' for '1955', for example. I am trying to return all entries from years'50' to '99'.
I believe the problem I am having is that once my regex finds a match on a line, it stops right there and jumps to the next line without checking the rest of the line. For example, I have the dates 12/12/12, 10/10/57, 10/09/66 all on one line in the text file, and it only returns 10/10/57.
Here is my code thus far. Any hints or tips? Thank you
open INPUT, "< dates.txt" or die "Can't open input file: $!";
while (my $line = <INPUT>){
if ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
A few points about your code
You must always use strict and use warnings 'all' at the top of all your Perl programs
You should prefer lexical file handles and the three-parameter form of open
If your regex pattern contains literal slashes then it is clearest to use a non-standard delimiter so that they don't need to be escaped
Although recent releases of Perl have fixed the issue, there used to be a significant performance hit when using $&, so it is best to avoid it, at least for now. Put capturing parentheses around the whole pattern and use $1 instead
This program will do as you ask
use strict;
use warnings 'all';
open my $fh, '<', 'dates.txt' or die "Can't open input file: $!";
while ( <$fh> ) {
print $1, "\n" while m{(\d\d/\d\d/[5-9][0-9])}g
}
output
10/10/57
10/09/66
You are printing $& which gets updated whenever any new match is encountered.
But in this case you need to store the all the previous matches and the updated one too, so you can use array for storing all the matches.
while(<$fh>) {
#dates = $_ =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g;
print "#dates\n" if(#dates);
}
You just need to change the 'if' to a 'while' and the regex will take up where it left off;
open INPUT, "< a.dat" or die "Can't open input file: $!";
while (my $line = <INPUT>){
while ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
# Output given line above
# 10/10/57
# 10/09/66
You could also capture the whole of the date into one capture variable and use a different regex delimiter to save escaping the slashes:
while ($line =~ m|(\d\d/\d\d/[5-9]\d)|g) {
print "$1\n" ;
}
...but that's a matter of taste, perhaps.
You can use map also to get year range 50 to 99 and store in array
open INPUT, "< dates.txt" or die "Can't open input file: $!";
#as = map{$_ =~ m/\d\d\/\d\d\/[5-9][0-9]/g} <INPUT>;
$, = "\n";
print #as;
Another way around it is removing the dates you don't want.
$line =~ s/\d\d\/\d\d\/[0-4]\d//g;
print $line;

Perl Regex Match Text String and Extract Following Number

I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" everytime it encounters the word BGA, but it prints nothing.... Any help would be appreciated.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";
while(my $line = <$info>){
if ($line eq "/BGA/"){
print <>,"\n";
}
}
close $file;
if ($line =~ /BGA\s(\d+)/){
#your code
print "BGA number $1 \n";
#your code
}
And $1 variable will have the number you want
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
while ( $line =~ /BGA\s(\d+)/g ) {
print "$1\n";
}
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory:
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests if the line literally consists of the string "/BGA/". But, that can never be true for the line with at least have the input record separator, i.e. the contents of $/ at the end because you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.

help with perl code to parse a file

I am new to Perl and have a question about the syntax. I received this code for parsing a file containing specific information. I was wondering what the if (/DID/) part of the subroutine get_number is doing? Is this leveraging regular expressions? I'm not quite sure because regular-expression matches look like $_ =~ /some expression/. Finally, is the while loop in the get_number subroutine necessary?
#!/usr/bin/env perl
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
# store the name of all the OCR file names in an array
my #file_list=qw{
blah.txt
};
# set the scalar index to zero
my $file_index=0;
# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
or die "Can't open output file\n";
while($file_index < 1){
# open the OCR file and store it in the filehandle IN_FILE
open(IN_FILE, '<', "$file_list[$file_index]")
or die "Can't read source file!\n";
print "Processing file $file_list[$file_index]\n";
while(<IN_FILE>){
my $citing_pat=get_number();
get_country($citing_pat);
}
$file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;
The definition of get_number is below.
sub get_number {
while(<IN_FILE>){
if(/DID/){
my #fields=split / /;
chomp($fields[3]);
if($fields[3] !~ /\D/){
return $fields[3];
}
}
}
}
Perl has a variable $_ that is sort of the default dumping ground for a lot of things.
In get_number, while(<IN_FILE>){ is reading a line into $_, and the next line is checking if $_ matches the regular expression DID.
It's also common to see chomp; which also operates on $_ when no argument is given.
In that case, if (/DID/) by default searches the $_ variable, so it is correct. However, it is a rather loose regex, IMO.
The while loop in the sub may be necessary, it depends on what your input looks like. You should be aware that the two while loops will cause some lines to get completely skipped.
The while loop in the main program will take one line, and do nothing with it. Basically, this means that the first line in the file, and every line directly following a matching line (e.g. a line that contains "DID" and the 4th field is a number), will also be discarded.
In order to answer that question properly, we'd need to see the input file.
There are a number of issues with this code, and if it works as intended, it's probably due to a healthy amount of luck.
Below is a cleaned up version of the code. I kept the modules in, since I do not know if they are used elsewhere. I also kept the output file, since it might be used somewhere you have not shown. This code will not attempt to use undefined values for get_country, and will simply do nothing if it does not find a suitable number.
use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
my #file_list=qw{ blah.txt };
open(my $outfile, '>', 'outputfile.txt') or die "Can't open output file: $!";
for my $file (#file_list) {
open(my $in_file, '<', $file) or die "Can't read source file: $!";
print "Processing file $file\n";
while (my $citing_pat = get_number($in_file)) {
get_country($citing_pat);
}
}
close $out_file;
sub get_number {
my $fh = shift;
while(<$fh>) {
if (/DID/) {
my $field = (split)[3];
if($field =~ /^\d+$/){
return $field;
}
}
}
return undef;
}