perl script to copy file content which is between certain lines - regex

I am new to perl scripting and need help regarding a given problem.
I have many files with details of persons.
I want to print the contents of each file that lie after one particular line and before another particular line.
Example (one of the files contains the following details):
My name is XYZ.
Address: ***
ID:12414
Country:USA
End XYZ
Another file contains details like:
My name is ABC.
Address: ###
ID:124344
Country:Singapore
End ABC
I want to print the lines from the first file after My name is XYZ and before End XYZ into my new file. Similarly, I want to print the contents from the second file after My name is ABC and before End ABC, into my new file.
I wrote the logic as below, but I am not sure of the perl syntax to print after and below a particular line.
while(<file1>)
{
if () # if we read the phrase "My name" in file1 start printing after this line
{
print #print the contents into file3(output file)
if() # if we read the phrase "End" in file1 stop printing the content into file3
}
}
I hope my question is clear. Any help is appreciated.

OK. I believe your question is about the Perl syntax for printing to the output file. I will try to give you a slightly more complete solution based on your description of what you are trying to do. This is just a quick, very simple code example. (For some reference you may also want to look at http://perlmaven.com/slurp.)
First, let's call your new file "newfile.txt".
Then let's call your source file(s) "sourcefile.txt". Here is some code with comments:
use strict;
use warnings;
# First I would turn on autoflush so output is not held back in a buffer
# (note that $| applies to the currently selected handle, STDOUT by default)
$|++;
# Now open newfile.txt for writing the information you want
open my $NEWFILE, '>', 'newfile.txt' or die "Cannot open newfile.txt: $!";
# Now open sourcefile.txt (or iterate over a list of them)
open my $SOURCEFILE, '<', 'sourcefile.txt' or die "Cannot open sourcefile.txt: $!";
# Now go through the sourcefile and get the info you want to
# add to your newfile
# set a variable to print data to newfile - initialize to
# N or false
my $data_wanted = "N";
# start reading lines from the sourcefile
while (<$SOURCEFILE>) {
    # "My name" opens a block; the line itself is not printed
    if ($_ =~ /^My name/ ) {
        $data_wanted = "Y";
        next;
    }
    # "End" closes the block; the line itself is not printed
    elsif ($_ =~ /^End/ ) {
        $data_wanted = "N";
        next;
    }
    # skip any other lines you do not want, even inside a block
    elsif ($_ =~ /^STUFF TO OMIT/) {
        next;
    }
    if ( $data_wanted eq "Y" ) {
        print $NEWFILE $_;
    }
    # you don't really need this but
    # it will show you how this works in perl
    next;
} # end of while
# finish by closing the files
close $SOURCEFILE;
close $NEWFILE;
##########################################
Hope this helps ;-)

You can get the lines between My name is <name>. and End <name> with one of several regexes.
Lazy:
My name is ([^\n]+)\.(.*?)End \1
Greedy:
My name is ([^\n]+)\.(.*)End \1
Optimized:
My name is ([^\s]+)\.((?:[^\n]*(?!End \1)\n)+)End \1
Either way, you'll need the s modifier. If more than one thing needs to be parsed in a file, you will need the g modifier.
The back-references ensure a match without needing to know the name. This means that the content you want will be in capture group 2.
What's the difference between the three regexes? Speed! Depending on how many files you need to parse, you may need the speed.
The optimized one is the best if there is significant variance in what you are parsing. It works the same way as this other regex I wrote. (You should do some testing if speed is important.)
It should be fairly straightforward to write the code from this.
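As a rough sketch of how the lazy regex could be used, assuming the whole file has already been read into one string: the helper name extract_blocks and the sample text are made up for illustration, and only the pattern comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text found between each
# "My name is <name>." line and the matching "End <name>" line.
sub extract_blocks {
    my ($text) = @_;
    my @blocks;
    # /s lets . match newlines; /g finds every block in the text.
    while ($text =~ /My name is ([^\n]+)\.(.*?)End \1/sg) {
        my $body = $2;          # capture group 2 holds the wanted content
        $body =~ s/^\n//;       # drop the newline that ended the header line
        push @blocks, $body;
    }
    return @blocks;
}

# Sample data taken from the question.
my $sample = "My name is XYZ.\nAddress: ***\nID:12414\nCountry:USA\nEnd XYZ\n";
print extract_blocks($sample);
```

Copying $2 into a lexical before running the second substitution matters, because any later successful match overwrites the capture variables.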

Is this what you are looking for?
while (<>) {
    if ( /^My name / .. /^End / ) {
        if ( /^My name / ) {
            # Do nothing, or anything you would like for this line.
        } elsif ( /^End / ) {
            # Do nothing, or anything you would like for this line.
        } else {
            print $_;
        }
    }
}


Resolve Perl error: "Use of uninitialized value"

To clarify the following post, we have an automation requirement to send shipping information to an online platform so users can track their orders. We receive a daily .csv file through email, we have to extract the unique Shopify order reference from a field (last 10 digits of a field), save the amended .csv file and upload to an FTP site so tracking references can be matched to the specific order.
A previous colleague wrote an application in Perl to handle this, however it has not worked and I have no experience with Perl at all!
The program is called by a "Watcher" monitoring for files, the code for this is as follows:
use strict;
use warnings;
use Datatools::Watcher;
my $hotfolder = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT';
my $process = '"C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl"';
my @backup = ('\\gen-svr-01\users\DATA\MW\DMO_Report_IO\ARCHIVE');
watcher($hotfolder,$process,\@backup);
The main code (PERL PROGRAM) is:
use strict;
use warnings;
use File::Copy;
use Datatools::Watcher;
my $output = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\OUTPUT';
my $desthotfolder = '\\gen-svr-01\users\DATA\MW\Data_TO_MWS_FTP_TEST';
my $shopifyPos = 0;
my $shopifyNew = "";
my $header = 1;
my $inputfile = $ARGV[0];
my ($path,$file,$extention) = $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
my $outputfilename = $file . "_FORMATTED" . $extention;
$outputfilename =~ s/.~#~//;
my $outputfile = "$output\\$outputfilename";
open (INPUT, $inputfile) or die "Could not open input file: $inputfile\n";
open (OUTPUT, ">$outputfile") or die "Could not open output file: $outputfile\n";
while (my $record = <INPUT>){
chomp $record;
my @field = parse_csv($record);
if ($header == 1){
print OUTPUT $record . "\n";
$header = 0;
next;
} else {
$shopifyNew = substr $field[$shopifyPos], -10;
splice (@field, 0, 1, $shopifyNew);
print OUTPUT join(',',@field) . "\n";
next;
}
}
close INPUT;
close OUTPUT;
my $destfile = "$desthotfolder\\$outputfilename";
move $outputfile, $destfile or die "Could not move output file: $outputfile\nto: $destfile\n";
print "\nProcessing complete\n";
sub parse_csv {
my ($shift) = @_;
my $text = $shift; # record containing comma-separated values
my @new = ();
push(@new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text, -1,1) eq ',';
return @new; # list of values that were comma-separated
}
When the program runs, the "Watcher" details the following:
File Seen, Processing File \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv
Use of uninitialized value $file in concatenation (.) or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Use of uninitialized value $extention in concatenation (.) or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Processing complete
Line 47 refers to the following code:
my $outputfilename = $file . "_FORMATTED" . $extention;
In the output folder, there is a file with the name "_FORMATTED" (no file extensions)
I have looked for a solution, and from my limited understanding I don't think the variables: file and extension are being defined, but I have no idea how to correct!

It would help to know which is line 47 in this code. I assume it's this line:
my $outputfilename = $file . "_FORMATTED" . $extention;
So, at this point, $file and $extention are both uninitialised. They are both supposed to be initialised in the previous line:
my ($path,$file,$extention) =
$inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
So it seems that your $inputfile doesn't match the regex. This leaves us with two options:
$inputfile isn't being set at all (which would mean it isn't being passed to the program).
$inputfile isn't in the correct format to match the regex.
To work out which of the problems we have here, add the following validation lines before the line which tries to set $file and $extention:
die "No input file given\n" unless $inputfile;
die "Input file name ($inputfile) is the wrong format\n"
unless $inputfile =~ / \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
Update: From recent updates to your question, I can see that you are running the program and passing it the filename \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv.
Let's take a closer look at your regex.
m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms
The /x option at the end means that the regex compiler ignores any literal whitespace in the string. So we can do the same. Let's break down what the individual parts are trying to match:
\A : matches the start of the string
(.+\/) : matches anything up to and including the last / in your string. It captures the matched substring into $1. This is what is stored in $path in your code. It's the directory that your file is in.
(.+\d\d\d\d) : This matches one or more of any character followed by four digits. This is stored in $2 and in your code it ends up in $file. It's the main part of the filename.
.+ : Matches one or more characters. Any characters. Your code does nothing with these characters.
([.]\w{3}) : Matches a dot followed by three "word" characters (basically alphanumerics). This is captured into $3 and ends up in your $extention variable.
\z : Matches the end of the string.
Putting all that together, you have a regex that looks for filenames and splits them into three parts - the path, the name and the extension. The only complication is that the filename section needs to contain four consecutive digits. And your filename is OrderTracking - which doesn't contain those required digits. So the regex doesn't match and your variables don't get set.
When this program was written, it was assumed that the filenames would contain four digits. The files that you are trying to process do not contain digits, so the program fails.
We can't suggest how you fix this. You need to speak to the people who supply your input files and find out why they have started to send you files with a different name format. Once you know that, you can decide on the best approach to work around the problem.
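To see the effect of the four-digit requirement in isolation, here is a small sketch that runs the regex from the question against two names. The helper split_name and the second file name (Orders2024_final.csv) are invented for illustration; only the pattern itself comes from the program.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the regex from the question and returns
# (path, file, extension), or an empty list when the name does not match.
sub split_name {
    my ($inputfile) = @_;
    return $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
}

# The first name has no four-digit run, the second (made up) one does.
for my $name ('INPUT/OrderTracking.csv', 'INPUT/Orders2024_final.csv') {
    my ($path, $file, $extention) = split_name($name);
    if (defined $file) {
        print "$name: path=$path file=$file extension=$extention\n";
    } else {
        print "$name: no match, so \$file and \$extention stay undefined\n";
    }
}
```

Checking the result of the match before using the captured values is what turns the confusing "uninitialized value" warnings into an explicit error.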

Using Perl to print multiple lines

This code grabs a keyword 'fun' from text files that I have and then prints the 20 characters before and after the keyword. However, I also want it to print the previous 2 lines and the next two lines, and I'm not sure how to do that. I wasn't sure if it is easier to change the code with this or just read the whole file at one time.
{my $inputfile = "file";
$searchword = 'fun';
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while ($line1=<INPUT>)
{
#read in a line of the file
if ($line1 =~m/$searchword/i)
{print "searchword found\n";
$keepline = $line1;
$goodline =1;
$keepline =~/(.{1,20})(fun)(.{1,20})/gi;
if ($goodline==1)
{&write_excel};
$goodline =0;
}

Your code as is seems to
Take 20 chars each side of 'pledge' not $searchword;
Have an unmatched '{' at the start;
Doesn't print any file contents save for &write_excel which we can't examine; and
Has a logic problem in that if $searchword is found, $goodline is unconditionally set to '1', then tested to see if it's '1', and finally reset to '0'.
Putting that aside, whether to read in the whole file depends somewhat on your circumstances: how big are the files you're going to be searching, does your machine have plenty of memory, is the machine a shared resource, and so on. I'm going to presume you can read in the whole file, as that's the more common position in my experience (those who disagree, please keep in mind that (a) I've acknowledged that it's debatable; and (b) it's very dependent on circumstances that only the OP knows).
Given that, there are several ways to read in a whole file, but the consensus seems to be to go with the module File::Slurp. Given those parameters, the answer looks like this:
#!/usr/bin/env perl
use v5.12;
use File::Slurp;
my $searchword = 'fun';
my $inputfile = "file.txt";
my $contents = read_file($inputfile);
my $line = qr/\N*\n/; # qr// keeps the pattern grouped, so "$line?" makes the whole line optional
if ( $contents =~ /(
$line?
$line?
\N* $searchword \N* \n?
$line?
$line?
)/x) {
say "Found:\n" . $1 ;
}
else {
say "Not found."
}
File::Slurp prints a reasonable error message if the file isn't present (or something else goes wrong), so I've left out the typical or die.... Whenever working with regexes, particularly if you're trying to match stuff on multiple lines, it pays to use "extended mode" (by putting an 'x' after the final '/') to allow insignificant whitespace in the regex. This allows a clearer layout.
I've also separated out the definition of a line for added clarity; it consists of zero or more non-newline characters, \N*, followed by a newline, \n. However, if your target is on the first, second, second-last or last line, I presume you still want the information, so the requested preceding and following pairs of lines are optionally matched with $line?.
Please note that regular expressions are pedantic and there are inevitably 'fine details' that affect the definition of a successful match vs an unwanted match, i.e. don't expect this to do exactly what you want in all circumstances. Expect that you'll have to experiment and tweak things a bit.

I'm not sure I understand your code block (what purpose does "pledge" have? what is &write_excel?), but I can answer your question itself.
First, is this grep command acceptable? It's far faster and cleaner:
grep -i -C2 --color "fun" "file"
The -C NUM flag tells grep to provide NUM lines of context surrounding each pattern match. Obviously, --color is optional, but it may help you find the matches on really long lines.
Otherwise, here's a bit of perl:
#!/usr/bin/perl
my $searchword = "fun";
my $inputfile = "file";
my $blue = "\e[1;34m"; # change output color to blue
my $green = "\e[1;32m"; # change output color to green
my $nocolor = "\e[0;0m"; # reset output to no color
my $prev1 = my $prev2 = my $result = "";
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while(<INPUT>) {
if (/$searchword/i) {
$result .= $prev2 . $prev1 . $_; # pick up last two lines
$prev2 = $prev1 = ""; # prevent reusing last two lines
for (1..2) { # for two more non-matching lines
while (<INPUT>) { # parse them to ensure they don't match
$result .= $_; # pick up this line
last unless /$searchword/i; # reset counting if it matched
}
}
} else {
$prev2 = $prev1; # save last line as $prev2
$prev1 = $_; # save current line as $prev1
}
}
close INPUT; # close the file handle, not the file name
exit 1 unless $result; # return with failure if without matches
$result =~ # add colors (okay to remove this line)
s/([^\e]{0,20})($searchword)([^\e]{0,20})/$blue$1$green$2$blue$3$nocolor/g;
print "$result"; # print the result
print "\n" unless $result =~ /\n\Z/m; # add newline if there wasn't already one
Bug: this assumes that the two lines before and the two lines after are actually 20+ characters. If you need to fix this, it goes in the else stanza.
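If reading the whole file into memory is acceptable, the context lines can also be picked by index instead of by regex. The following is only a sketch under that assumption; lines_with_context and the generated sample lines are hypothetical names and data, not part of either answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every line that lies within $context lines
# of a line matching $word (case-insensitive), in file order.
sub lines_with_context {
    my ($word, $context, @lines) = @_;
    my %keep;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ /\Q$word\E/i;
        # Clamp the window so it never runs past either end of the file.
        my $from = $i - $context < 0       ? 0       : $i - $context;
        my $to   = $i + $context > $#lines ? $#lines : $i + $context;
        $keep{$_} = 1 for $from .. $to;
    }
    # Overlapping windows are merged because each index is kept only once.
    return map { $lines[$_] } sort { $a <=> $b } keys %keep;
}

# Made-up sample: ten lines, with the keyword on the sixth.
my @lines = map { "line $_\n" } 1 .. 10;
$lines[5] = "this is fun\n";
print lines_with_context('fun', 2, @lines);
```

In a real script the @lines array would come from reading the input file, for example with my @lines = <INPUT>;.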

Create multiple output files and cut dna with enzymes - Perl

I am a first year grad student who's relatively new in computational biology. I recently started using Perl and it's not the easiest language to learn, at least not for me.
I need help applying my idea/logic the right way to figure out the solution to my problem.
I have a dna string and I want to split it at specific sites to get multiple fragments using information from an enzyme file that contains lines of recognition sites. Once the fragments are obtained, I want to output the list of dna fragments in an output file. I want to create an output file for every line in the enzyme file I am going to extract the information from, to apply it to the dna string.
Here's what I mean exactly:
Hypothetical scenario:
Enzyme.File contains:
abc/at'gtct// (abc is the name of the enzyme. (atgtct) is the recognition site.)
def/cgg'ataaa// ........
Suppose the dna string is: $dna = "accggttatgtctaaacggataaagtctcggataaattt" (recognition sites are bolded)
For line 1
When I extract the info from the first line/enzyme(abc) from the enzyme file and apply it to this string, the output should be:
accggttat
gtctaaacggataaagtctcggataaattt
(split between at'gtct; the apostrophe represents the cut point)
(note: Even though there is another gtct in the string, it does not split it because at ought to precede it.)
For line 2
$dna = accggttatgtctaaacggataaagtctcggataaattt (Info is applied to same dna string)
Info from line/enzyme 2 (def) would split the dna as follow:
accggttatgtctaaacgg (split between cgg'ataaa)
ataaagtctcgg
ataaattt
I want to put each output from the different lines in separate file with distinct names. (I can take care of assigning the names)
So in conclusion, this example would create two new files, one named "abc_whatever" and one named "def_whatever". Important: If the enzyme file had 8 lines with different enzymes, I would get 8 new output files with their distinct dna fragments.
Here's what I've tried so far:
#!/usr/bin/perl;
use warnings;
use strict;
open(ENZ,$ARGV[0]) || die; # ENZ(file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>) {
if ( match pattern etc..) { # I took care of that and created captured groups of
$1 = holds "abc" # the info I needed from the line e.g. I captured
$2 = ..."at" # (abc)/(at)'(gtct)//, so they are stored in $1,$2,$3
$3 = ..."gtct" # respectively
}
while (<$dna>){
my @fragments_array = split(/$3/, $dna);
open (OutFile, ">$dna"."_"."$1")
print OutFile shift @fragments_array,"\n";
foreach (@fragments_array) {
print OutFile "$3$_\n";
close OutFile;
}
}
}
close ENZ;
FIRST
I can only create an output for the 1st line in the Enzyme file. I want to create an output file for all the lines.
SECOND
I am not properly cutting the dna. From other examples I have seen online, it looks like I am gonna have to use the following functions to properly apply the enzyme information on the dna. The functions include:
the for loop, length and substr(),
If you can, please demonstrate your work in the simplest form (no extravagant, impressing codes lol :-) since I am just learning this language)
Thanks in advance!

FIRST I can only create an output only for the 1st line in the Enzyme file. I want to create an output file for all the lines.
That's simply because you put close OutFile; into the foreach (@fragments_array) loop, instead of placing the close after the loop body.
SECOND I am not properly cutting the dna.
That's because you forgot to include $2, the head of the recognition site (e. g. the at of atgtct) in the split pattern as well as in the output.
The problem is solved easier if we just insert the splitting new-line character everywhere between the head and the tail:
#!/usr/bin/perl
use warnings;
use strict;
open(ENZ, $ARGV[0]) || die; # ENZ (file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>)
{
    if (m-(.*)/(.*)'(.*)//-)
    {
        my ($head, $tail) = ($2, $3); # $2$3 is the recognition site; save it
        open(OutFile, ">${dna}_$1") || die; # one output file per enzyme line
        (my $fragments = $dna) =~ s/$head$tail/$head\n$tail/g; # insert NLs
        print OutFile $fragments, "\n";
        close OutFile;
    }
}
close ENZ;

I changed your code a bit, hope it works now
#!/usr/bin/perl
use warnings;
use strict;
open(ENZ, $ARGV[0]) or die "Cannot open enzyme file: $!";
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
my ($enzyme, $first, $second) = ("", "", "");
for my $line (<ENZ>) {
    chomp($line); # remove \n at the end of string
    # split string into tokens (e.g. abc/at'gtct// => (abc, at, gtct));
    # split drops the empty trailing fields produced by the final "//"
    my @elements = split(/\/|'/, $line);
    my ($firstPart, $secondPart) = ($elements[1], $elements[2]);
    # (.*) is greedy, so this cuts only at the last recognition site
    if ($dna =~ /(.*)$firstPart$secondPart(.*)/) {
        $first = $1 . $firstPart;     # everything up to the cut point
        $second = $secondPart . $2;   # everything after the cut point
        $enzyme = $elements[0];
        open(OUTPUT, ">$enzyme" . "_something") or die "Cannot open output file: $!";
        print OUTPUT "$first\n$second\n";
        close(OUTPUT);
    }
}
close ENZ;
EDIT: this is the working version. I suggest you learn how to use Regular Expression if you want to use Perl for your study. It is the strongest tool in Perl.
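For cutting at every recognition site rather than a single one, one possible sketch is to split on a zero-width position that is preceded by the head and followed by the tail of the site. The helper name cut_dna is made up here; the dna string and the two sites are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: cuts $dna between $head and $tail at every
# occurrence of the recognition site "$head$tail".
sub cut_dna {
    my ($dna, $head, $tail) = @_;
    # Zero-width split point: preceded by $head and followed by $tail,
    # so no characters of the dna are consumed by the split.
    return split /(?<=\Q$head\E)(?=\Q$tail\E)/, $dna;
}

my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
print join("\n", cut_dna($dna, 'at', 'gtct')), "\n\n";    # enzyme abc
print join("\n", cut_dna($dna, 'cgg', 'ataaa')), "\n";    # enzyme def
```

Each returned list can then be written to its own output file inside the loop over the enzyme file.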

Regex: How to remove extra spaces between strings in Perl

I am working on a program that take user input for two file names. Unfortunately, the program can easily break if the user does not follow the specified format of the input. I want to write code that improves its resiliency against these types of errors. You'll understand when you see my code:
# Ask the user for the filename of the qseq file and barcode.txt file
print "Please enter the name of the qseq file and the barcode file separated by a comma:";
# user should enter filenames like this: sample1.qseq, barcode.txt
# remove the newline from the qseq filename
chomp ($filenames = <STDIN>);
# an empty array
my @filenames;
# remove the ',' and put the files into an array separated by spaces; indexes the files
push @filename, join(' ', split(',', $filenames))
# the qseq file
my $qseq_filename = shift @filenames;
# the barcode file.
my barcode = shift @filenames;
Obviously this code can run into errors if the user enters the wrong type of filename (.tab file instead of .txt or .seq instead of .qseq). I want code that can do some sort of check to see that the user enters the appropriate file type.
Another error that could break the code is if the user enters too many spaces before the filenames. For example: sample1.qseq,(imagine 6 spaces here) barcode.txt (Notice the numerous spaces after the comma)
Another example: (imagine 6 spaces here) sample1.qseq,barcode.txt (This time notice the number of spaces before the first filename)
I also want lines of code that can remove extra spaces so that the program doesn't break. I think the user input has to be in the following kind of format: sample1.qseq, barcode.txt. The user input has to be in this format so that I can properly index the filenames into an array and shift them out later.
Thanks any help or suggestions are greatly appreciated!

The standard way to deal with this kind of problem is to use command-line options rather than gathering input from STDIN. Getopt::Long comes with Perl and is serviceable:
use strict; use warnings FATAL => 'all';
use Getopt::Long qw(GetOptions);
my %opt;
GetOptions(\%opt, 'qseq=s', 'barcode=s') or die;
die <<"USAGE" unless exists $opt{qseq} and $opt{qseq} =~ /^sample\d[.]qseq$/ and exists $opt{barcode} and $opt{barcode} =~ /^barcode.*\.txt$/;
Usage: $0 --qseq sample1.qseq --barcode barcode.txt
$0 -q sample1.qseq -b barcode.txt
USAGE
printf "q==<%s> b==<%s>\n", $opt{qseq}, $opt{barcode};
The shell will deal with any extraneous whitespace, try it and see. You need to do the validation of the file names, I made up something with regex in the example. Employ Pod::Usage for a fancier way to output helpful documentation to your users who are likely to get the invocation wrong.
There are dozens of more advanced Getopt modules on CPAN.

First, put use strict; at the top of your code and declare your variables.
Second, this:
# remove the ',' and put the files into an array separated by spaces; indexes the files
push @filename, join(' ', split(',', $filenames))
Is not going to do what you want. split() takes a string and turns it into an array. Join takes a list of items and returns a string. You just want to split:
my @filenames = split(',', $filenames);
That will create an array like you expect.
This function will safely trim white space from the beginning and end of a string:
sub trim {
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
}
Access it like this:
my $file = trim(shift @filenames);
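Putting the split and the trim together, a minimal sketch might look like the following. The wrapper parse_filenames and the hard-coded input string are hypothetical; trim is the function shown above.

```perl
use strict;
use warnings;

# trim() as defined in the answer above.
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

# Hypothetical helper: splits the user input on commas and trims each part,
# so extra spaces (and the trailing newline) no longer matter.
sub parse_filenames {
    my ($input) = @_;
    return map { trim($_) } split /,/, $input;
}

# Stand-in for a line read from STDIN, with stray spaces on both sides.
my ($qseq_filename, $barcode) = parse_filenames("   sample1.qseq,      barcode.txt\n");
print "qseq=<$qseq_filename> barcode=<$barcode>\n";
```

Because \s also matches a newline, trim makes a separate chomp unnecessary in this sketch.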
Depending on your script, it might be easier to pass the strings as command line arguments. You can access them through the @ARGV array but I prefer to use Getopt::Long:
use strict;
use Getopt::Long;
Getopt::Long::Configure("bundling");
my ($qseq_filename, $barcode);
GetOptions (
'q|qseq=s' => \$qseq_filename,
'b|bar=s' => \$barcode,
);
You can then call this as:
./script.pl -q sample1.qseq -b barcode.txt
And the variables will be properly populated without a need to worry about trimming white space.

You'll need to trim spaces before handling the filename data in your routine. You could check the file extension with yet another regular expression, as nicely described in Is there a regular expression in Perl to find a file's extension?. If it's the actual type of file that matters to you, then it might be more worthwhile to check for that instead with File::LibMagicType.

While I think your design is a little iffy, the following will work?
my @fileNames = split(',', $filenames);
foreach my $fileName (@fileNames) {
    if($fileName =~ /\s/) {
        print STDERR "Invalid filename.";
        exit -1;
    }
}
my ($qsec, $barcode) = @fileNames;
And here is one more way you could do it with regex (if you are reading the input from STDIN):
# read a line from STDIN
my $filenames = <STDIN>;
# parse the line with a regex or die with an error message
my ($qseq_filename, $barcode) = $filenames =~ /^\s*(\S.*?)\s*,\s*(\S.*?)\s*$/
or die "invalid input '$filenames'";

How do I match a word followed by new line then grab the next line up to its new line?

I'm editing a bunch of SQL files and I need to remove date references in the queries. However the way the files are written is that logical operators like, OR and AND are on lines by themselves and the rest of the associated argument are on another line. Like so:
OR
field.lastupdate > DATE_SUB(CURDATE(), INTERVAL 31 DAY))
AND
*some more code*
I want to remove the OR (and it can be an AND too) up to the newline character, in this example, after the second parenthesis. However I want to leave the rest of the code intact.
I think the regex should be straightforward except how do I ignore the newline after the OR but stop at the following newline?
I should note that some of the date lines I want to remove end with a ";" which I do not want to remove.
Here's a more complete example that I hope clears things up:
OR
x.is_deleted = 0
OR
x.lastupd > DATE_SUB(CURDATE(), INTERVAL 31 DAY))
AND
(j.active = 1
OR
j.is_deleted = 0
OR
j.lastupd > DATE_SUB(CURDATE(), INTERVAL 31 DAY));
So you see I want to keep the first "OR" and its following line,
delete the second "OR" and the line that follows it.
Keep the "AND" and the line that follows it as well as the following "OR" and its corresponding line.
And then delete the final "OR" and its line while leaving the final ";".

$sql =~ s/\b(?:OR|AND)[ \t]*[\n\r]+(?=.*DATE).*(?<![;\s])//mg;
Removes the OR (or AND) and the content on the following line (if it contains DATE), except the possible ending ;.
Note that such a simple regex will not work with your updated example, because there are closing parentheses on the removed line which belong to other lines.
Example at http://ideone.com/0Lbxp
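To try the substitution on a small string, it can be wrapped in a function as in the sketch below. The name strip_date_conditions and the shortened sample query are assumptions for illustration; the regex is the one from the answer, unchanged.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above: removes an
# OR/AND line together with the following line when that line mentions DATE,
# but leaves a trailing ";" in place.
sub strip_date_conditions {
    my ($sql) = @_;
    $sql =~ s/\b(?:OR|AND)[ \t]*[\n\r]+(?=.*DATE).*(?<![;\s])//mg;
    return $sql;
}

# Shortened, made-up query in the layout described in the question.
my $sql = "x.is_deleted = 0\n"
        . "OR\n"
        . "x.lastupd > DATE_SUB(CURDATE(), INTERVAL 31 DAY);\n";
print strip_date_conditions($sql);
```

The negative lookbehind is what makes the greedy .* give back the final ";", so the statement terminator survives on a line of its own.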
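
Well, I'm not sure whether there is only one sentence after the sentence containing OR/AND.
The idea is to keep track of a flag that tells you whether you came across an OR/AND in the previous sentence.
You can probably do something like this.
use strict;
use warnings;
my ($infile, $outfile) = ("infilename", "outfilename");
open(my $in, '<', $infile)
    or die "\nCan't open $infile for reading: $!\n";
open(my $out, '>', $outfile)
    or die "\nCan't open $outfile for writing: $!\n";
my $held; # acts as the flag: holds the OR/AND line seen on the previous pass
while (my $line = <$in>)
{
    if (defined $held) {
        # previous line was OR/AND: drop both lines if this one is a date test
        # (adjust the pattern to match your sentence)
        print $out $held, $line unless $line =~ /DATE_SUB/;
        undef $held;
    }
    elsif ($line =~ m/^\s*(?:OR|AND)\s*$/) {
        $held = $line; # decide what to do once the next line is read
    }
    else {
        print $out $line;
    }
}
print $out $held if defined $held; # OR/AND on the last line of the file
close($in);
close($out);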

Sometimes the simpler solutions are the best. This script will only (re)print lines that do not match the description of the lines you wanted removed. It will print a trailing semi-colon ; if it finds one. It will preserve the lines as read.
It relies on no lines being empty, and that no wanted lines contain the word DATE_SUB.
Usage:
$ script.pl input.txt > output.txt
Code:
use strict;
use warnings;
use ARGV::readonly;
while (my $line1 = <>) {
    if ($line1 =~ /^\s*(OR|AND)\s*$/) {
        my $line2 = <>;
        if ($line2 =~ /DATE_SUB/) {
            if ($line2 =~ /;\s*$/) {
                print ";\n";
            }
        } else {
            print $line1, $line2;
        }
    } else {
        print $line1;
    }
}