Using Perl to print multiple lines - regex

This code grabs a keyword 'fun' from text files that I have and then prints the 20 characters before and after the keyword. However, I also want it to print the previous 2 lines and the next two lines, and I'm not sure how to do that. I wasn't sure if it is easier to change the code with this or just read the whole file at one time.
{my $inputfile = "file";
$searchword = 'fun';
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while ($line1=<INPUT>)
{
#read in a line of the file
if ($line1 =~m/$searchword/i)
{print "searchword found\n";
$keepline = $line1;
$goodline =1;
$keepline =~/(.{1,20})(fun)(.{1,20})/gi;
if ($goodline==1)
{&write_excel};
$goodline =0;
}

Your code as it stands seems to:
Take 20 chars each side of 'pledge', not $searchword;
Have an unmatched '{' at the start;
Not print any file contents, save for &write_excel, which we can't examine; and
Have a logic problem: if $searchword is found, $goodline is unconditionally set to '1', then tested to see if it's '1', and finally reset to '0'.
Putting that aside, whether to read in the whole file depends somewhat on your circumstances: how big are the files you're going to be searching, does your machine have plenty of memory, is the machine a shared resource, and so on. I'm going to presume you can read in the whole file, as that's the more common position in my experience (those who disagree, please keep in mind that (a) I've acknowledged that it's debatable; and (b) it's very dependent on circumstances that only the OP knows).
Given that, there are several ways to read in a whole file but the consensus seems to be to go with the module File::Slurp. Given those parameters, the answer looks like this;
#!/usr/bin/env perl
use v5.12;
use File::Slurp;
my $searchword = 'fun';
my $inputfile = "file.txt";
my $contents = read_file($inputfile);
my $line = qr/\N*\n/; # qr// keeps the pattern grouped, so "$line?" makes the whole line optional
if ( $contents =~ /(
$line?
$line?
\N* $searchword \N* \n?
$line?
$line?
)/x) {
say "Found:\n" . $1 ;
}
else {
say "Not found."
}
File::Slurp prints a reasonable error message if the file isn't present (or something else goes wrong), so I've left out the typical or die.... Whenever you're working with regexes, particularly if you're trying to match stuff on multiple lines, it pays to use "extended mode" (by putting an 'x' after the final '/') to allow insignificant whitespace in the regex. This allows a clearer layout.
I've also separated out the definition of a line for added clarity: it consists of zero or more non-newline characters, \N*, followed by a newline, \n. However, if your target is on the first, second, second-last or last line, I presume you still want the information, so the requested preceding and following pairs of lines are matched optionally, with $line?.
Please note that regular expressions are pedantic and there are inevitably 'fine details' that affect the definition of a successful match vs an unwanted match, i.e. don't expect this to do exactly what you want in all circumstances. Expect that you'll have to experiment and tweak things a bit.

I'm not sure I understand your code block (what purpose does "pledge" have? what is &write_excel?), but I can answer your question itself.
First, is this grep command acceptable? It's far faster and cleaner:
grep -i -C2 --color "fun" "file"
The -C NUM flag tells grep to provide NUM lines of context surrounding each pattern match. Obviously, --color is optional, but it may help you find the matches on really long lines.
Otherwise, here's a bit of perl:
#!/usr/bin/perl
my $searchword = "fun";
my $inputfile = "file";
my $blue = "\e[1;34m"; # change output color to blue
my $green = "\e[1;32m"; # change output color to green
my $nocolor = "\e[0;0m"; # reset output to no color
my $prev1 = my $prev2 = my $result = "";
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while(<INPUT>) {
if (/$searchword/i) {
$result .= $prev2 . $prev1 . $_; # pick up last two lines
$prev2 = $prev1 = ""; # prevent reusing last two lines
for (1..2) { # for two more non-matching lines
while (<INPUT>) { # parse them to ensure they don't match
$result .= $_; # pick up this line
last unless /$searchword/i; # reset counting if it matched
}
}
} else {
$prev2 = $prev1; # save last line as $prev2
$prev1 = $_; # save current line as $prev1
}
}
close INPUT;
exit 1 unless $result; # return with failure if without matches
$result =~ # add colors (okay to remove this line)
s/([^\e]{0,20})($searchword)([^\e]{0,20})/$blue$1$green$2$blue$3$nocolor/g;
print "$result"; # print the result
print "\n" unless $result =~ /\n\Z/m; # add newline if there wasn't already one
Bug: this assumes that the two lines before and the two lines after are actually 20+ characters. If you need to fix this, it goes in the else stanza.
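For comparison, here is a minimal sketch of the same "two lines of context" idea done with line indexes instead of one large regex or a hand-rolled look-back buffer. The helper name context_lines and its arguments are made up for illustration; it assumes the lines are already in an array and that overlapping windows should be merged, much as grep -C does.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line within $n lines of a line
# matching $word (case-insensitive), in file order, without duplicates.
sub context_lines {
    my ($lines, $word, $n) = @_;
    my %keep;    # indexes of the lines to output
    for my $i (0 .. $#$lines) {
        next unless $lines->[$i] =~ /\Q$word\E/i;
        my $lo = $i - $n < 0        ? 0        : $i - $n;
        my $hi = $i + $n > $#$lines ? $#$lines : $i + $n;
        $keep{$_} = 1 for $lo .. $hi;
    }
    return map { $lines->[$_] } sort { $a <=> $b } keys %keep;
}

# Made-up sample data standing in for the contents of the input file.
my @lines = map { "$_\n" } ('one', 'two', 'three', 'this is fun', 'five', 'six', 'seven');
print context_lines(\@lines, 'fun', 2);
```

Because the window is clamped to the array bounds, a match on the first or last line simply yields fewer context lines rather than an error.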

Related

Perl: Trying to see if the Nth column of line X matches the Nth column of line X-1

I have a perl script that reads text file line by line and splits the line into 4 different columns (shown by dashes & referred to as $cols[0-3] in code; important parts are bolded). For each distinct value before the decimal point in column 0, it should randomly generate a hex color.
Essentially, I need to compare if the Xth column in the current line matches that of the previous line.
A----last_column----221----18
A----last_column----221----76
A----last_column----221----42
B----last_column----335----18
C----last_column----467----83
So far, I am randomly generating a new #random_hex_color for every line, but desired output is below:
221.18-------#EB23AE1-------#$some/random/path/A.txt-------last_column
221.76-------#EB23AE1-------#$some/random/path/A.txt-------last_column
221.42-------#EB23AE1-------#$some/random/path/A.txt-------last_column
335.18-------#AC16D6E-------#$some/random/path/B.txt-------last_column
467.83-------#FD89A1C-------#$some/random/path/C.txt-------last_column
my @cols;
my $row;
my $color = color_gen();
my $path = "\t#\some_random_path/";
my $newvar = dir_contents();
my @array = ($color, $path, $newvar);
my %hash;
while ($row = <$fh>){
next if $row =~ /^(#|\s|\t)/; #skip lines beginning with comments and spaces
@cols = split(" ", $row);
%hash = (
"$cols[2]" => ["$color", "$path", "$newvar"]
);
say Dumper (\%hash);
print("$cols[2].$cols[3]\t#");
print(color_gen());
printf("%-65s", $path.dir_contents());
print("\t\t$cols[0]_"."$cols[1]"." 1 1\n");
}
Use a hash to store, and thus be able to check for, the distinct values in the first column.
I assume that color_gen() returns a new random color at each invocation. The desired output is unclear to me so it is only indicated in the code.
use warnings;
use strict;
my $file = shift @ARGV;
die "Usage: $0 filename\n" if not $file or not -f $file;
open my $fh, '<', $file or die "Can't open $file: $!";
my %c0;
while (<$fh>) {
next if /^(?:\s*$|\s*#)/; # skip: spaces only or empty, comment
my @cols = split;
my ($num) = $cols[0] =~ /^([0-9]+)/;
if (not exists $c0{$num}) { # this number not seen yet; assign color
$c0{$num} = color_gen();
}
# write line of output, with $c0{$num} and @cols
}
The value "before the decimal point in column 0" is extracted using regex as the leading number in that string and stored in $num. The parens around are needed to provide the list context for the match operator, in which case it returns the captured values. See perlretut.
This number is stored as a key in a hash with its value being the generated color. Unless it already exists that is, in which case it has been seen and a color for it generated. This way you can keep track of distinct numbers in that column. Then you can write output using $c0{$num}.
This can be written far more compactly but I hoped for clarity.
The skipped lines here aren't those "beginning with comments and spaces" but are ones with only spaces (or empty), or comments. If you really mean to skip lines that merely start with whitespace (or #) then indeed use /^(?:\s|#)/, where ?: makes () only group and not capture.
A few comments on the code
Always have use warnings; and use strict; at the beginning of each program
The \s in regex matches most types of whitespace; no need for a separate pattern for tab
A variable can be declared right in the while condition, which makes it perfectly scoped -- to that loop. However, you can also omit it and use $_
If the while condition has only the input read, such as <$fh>, then the value is assigned to the $_ variable; also see I/O in perlop.
I use that here since then the regex is simpler (match on $_ by default) and so is split
The split without arguments has default of split ' ', $_;, where ' ' stands for any amount of any whitespace (and leading spaces are removed before splitting)
Please provide exact samples of input and desired output for a more complete example.
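To make the hash idea concrete, here is a small sketch that runs on its own. Several things in it are assumptions rather than the asker's real code: color_gen() is a stand-in (the real one was not shown), the rows are made-up data in the question's layout, and the number "before the decimal point" is taken from the third input column, since that is where the sample data appears to keep it.

```perl
use strict;
use warnings;

# Stand-in for the question's color_gen(); the real one was not shown.
sub color_gen {
    return sprintf('#%06X', int(rand(0x1000000)));
}

my %color_for;    # leading number => color assigned the first time it is seen

# Hypothetical rows in the question's layout: name, label, number, fraction.
my @rows = (
    'A last_column 221 18',
    'A last_column 221 76',
    'B last_column 335 18',
);

for my $row (@rows) {
    my @cols = split ' ', $row;
    my $num  = $cols[2];                 # assumed: the value before the decimal point
    $color_for{$num} //= color_gen();    # generate a color only for unseen numbers
    print "$cols[2].$cols[3]\t$color_for{$num}\t$cols[1]\n";
}
```

Both 221 rows print the same color because the second lookup finds the key already present and skips color_gen().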

Resolve Perl error: "Use of uninitialized value"

To clarify the following post, we have an automation requirement to send shipping information to an online platform so users can track their orders. We receive a daily .csv file through email, we have to extract the unique Shopify order reference from a field (last 10 digits of a field), save the amended .csv file and upload to an FTP site so tracking references can be matched to the specific order.
A previous colleague wrote an application in Perl to handle this, however it has not worked and I have no experience with Perl at all!
The program is called by a "Watcher" monitoring for files, the code for this is as follows:
use strict;
use warnings;
use Datatools::Watcher;
my $hotfolder = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT';
my $process = '"C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl"';
my @backup = ('\\gen-svr-01\users\DATA\MW\DMO_Report_IO\ARCHIVE');
watcher($hotfolder,$process,\@backup);
The main code (PERL PROGRAM) is:
use strict;
use warnings;
use File::Copy;
use Datatools::Watcher;
my $output = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\OUTPUT';
my $desthotfolder = '\\gen-svr-01\users\DATA\MW\Data_TO_MWS_FTP_TEST';
my $shopifyPos = 0;
my $shopifyNew = "";
my $header = 1;
my $inputfile = $ARGV[0];
my ($path,$file,$extention) = $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
my $outputfilename = $file . "_FORMATTED" . $extention;
$outputfilename =~ s/.~#~//;
my $outputfile = "$output\\$outputfilename";
open (INPUT, $inputfile) or die "Could not open input file: $inputfile\n";
open (OUTPUT, ">$outputfile") or die "Could not open output file: $outputfile\n";
while (my $record = <INPUT>){
chomp $record;
my @field = parse_csv($record);
if ($header == 1){
print OUTPUT $record . "\n";
$header = 0;
next;
} else {
$shopifyNew = substr $field[$shopifyPos], -10;
splice (@field, 0, 1, $shopifyNew);
print OUTPUT join(',',@field) . "\n";
next;
}
}
close INPUT;
close OUTPUT;
my $destfile = "$desthotfolder\\$outputfilename";
move $outputfile, $destfile or die "Could not move output file: $outputfile\nto: $destfile\n";
print "\nProcessing complete\n";
sub parse_csv {
my ($shift) = @_;
my $text = $shift; # record containing comma-separated values
my @new = ();
push(@new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text, -1,1) eq ',';
return @new; # list of values that were comma-separated
}
When the program runs, the "Watcher" details the following:
File Seen, Processing File \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csvUse of uninitialized value $file in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Use of uninitialized value $extention in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Processing complete
Line 47 refers to the following code:
my $outputfilename = $file . "_FORMATTED" . $extention;
In the output folder, there is a file with the name "_FORMATTED" (no file extensions)
I have looked for a solution, and from my limited understanding I don't think the variables: file and extension are being defined, but I have no idea how to correct!
It would help to know which is line 47 in this code. I assume it's this line:
my $outputfilename = $file . "_FORMATTED" . $extention;
So, at this point, $file and $extention are both uninitialised. They are both supposed to be initialised in the previous line:
my ($path,$file,$extention) =
$inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
So it seems that your $inputfile doesn't match the regex. This leaves us with two options:
$inputfile isn't being set at all (which would mean it isn't being passed to the program).
$inputfile isn't in the correct format to match the regex.
To work out which of the problems we have here, add the following validation lines before the line which tries to set $file and $extention:
die "No input file given\n" unless $inputfile;
die "Input file name ($inputfile) is the wrong format\n"
unless $inputfile =~ / \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
Update: From recent updates to your question, I can see that you are running the program and passing it the filename \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv.
Let's take a closer look at your regex.
m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms
The /x option at the end means that the regex compiler ignores any literal whitespace in the string. So we can do the same. Let's break down what the individual parts are trying to match:
\A : matches the start of the string
(.+\/) : matches anything up to and including the last / in your string. It captures the matched substring into $1. This is what is stored in $path in your code. It's the directory that your file is in.
(.+\d\d\d\d) : This matches one or more of any character followed by four digits. This is stored in $2 and in your code it ends up in $file. It's the main part of the filename.
.+ : Matches one or more characters. Any characters. Your code does nothing with these characters.
([.]\w{3}) : Matches a dot followed by three "word" characters (basically alphanumerics). This is captured into $3 and ends up in your $extention variable.
\z : Matches the end of the string.
Putting all that together, you have a regex that looks for filenames and splits them into three parts - the path, the name and the extension. The only complication is that the filename section needs to contain four consecutive digits. And your filename is OrderTracking - which doesn't contain those required digits. So the regex doesn't match and your variables don't get set.
When this program was written, it was assumed that the filenames would contain four digits. The files that you are trying to process do not contain digits, so the program fails.
We can't suggest how you fix this. You need to speak to the people who supply your input files and find out why they have started to send you files with a different name format. Once you know that, you can decide on the best approach to work round the problem.
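The breakdown above can be checked with a short sketch that runs the program's own pattern against two names: the one the Watcher reported, and a made-up name that contains four consecutive digits. The helper split_name and the second filename are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Same pattern as the program; returns (path, file, extension),
# or an empty list when the name does not match.
sub split_name {
    my ($name) = @_;
    return $name =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
}

# The name the Watcher passed, and a made-up name containing four digits.
my @reported = split_name('\\\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv');
my @made_up  = split_name('C:/INPUT/OrderTracking2023_daily.csv');

print 'reported name: ', (@reported ? 'match' : 'no match'), "\n";
print 'made-up name:  ', (@made_up ? "match, file part is $made_up[1]" : 'no match'), "\n";
```

Only the made-up name matches, which is consistent with $file and $extention being left undefined for OrderTracking.csv.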

perl script to copy file content which is between certain lines

I am new to perl scripting and need help regarding a given problem.
I have many files with details of persons.
I want to print the contents from each of the file after a particular line and before a particular line.
Example: (one of the file contains following details:)
My name is XYZ.
Address: ***
ID:12414
Country:USA
End XYZ
Another file contains details like:
My name is ABC.
Address: ###
ID:124344
Country:Singapore
End ABC
I want to print the lines from the first file after My name is XYZ and before End XYZ into my new file. Similarly, I want to print the contents from the second file after My name is ABC and before End ABC, into my new file.
I wrote the logic as below, but I am not sure of the perl syntax to print after and below a particular line.
while(<file1>)
{
if () # if we read the phrase "My name" in file1 start printing after this +line
{
print #print the contents into file3(output file)
if() # if we read the phrase "End" in file1 stop printing the content into +file3
}
}
I hope my question is clear. Any help is appreciated.
OK. I believe your question is about the perl syntax to print to the output file. I will try to give you a slightly more complete solution based on the description of what you are trying to do. This is just a quick, very simple code example. (For some reference you may also want to look at http://perlmaven.com/slurp.)
First, let's call your new file "newfile.txt".
Then let's call your source file(s) "sourcefile.txt". Here
is some code with comments:
# First I would turn off output buffering (this applies to the
# currently selected output handle, STDOUT by default)
$|++;
# Now open newfile.txt for writing the information you want
open my $NEWFILE, '>', 'newfile.txt' or die "Can't open newfile.txt: $!";
# Now open sourcefile.txt (or iterate over a list of them)
open my $SOURCEFILE, '<', 'sourcefile.txt' or die "Can't open sourcefile.txt: $!";
# Now go through the sourcefile and get info you want to
# add to your newfile
# set a variable to print data to newfile - initialize to
# N or false
$data_wanted = "N";
# open sourcefile and start reading lines
while (<$SOURCEFILE>) {
# Test to see if data is between "My name" and "End"
if ($_ =~ /^My name/ ) {
$data_wanted = "Y"; # start printing after this line
next;
}
elsif ($_ =~ /^End/ ) {
$data_wanted = "N"; # stop printing
next;
}
elsif ($_ =~ /^STUFF TO OMIT/) {
next; # skip unwanted lines inside the block
}
if ( $data_wanted eq "Y" ) {
print $NEWFILE $_;
}
# you don't really need this but
# it will show you how this works in perl
next;
} # end of while
# finish by closing the files
close $SOURCEFILE;
close $NEWFILE;
##########################################
Hope this helps ;-)
You can get the lines between My name is <name>. and End <name> with one of several regexes.
Lazy:
My name is ([^\n]+)\.(.*?)End \1
Greedy:
My name is ([^\n]+)\.(.*)End \1
Optimized:
My name is ([^\s]+)\.((?:[^\n]*(?!End \1)\n)+)End \1
Either way, you'll need the s modifier. If more than one thing needs to be parsed in a file, you will need the g modifier.
The back-references ensure a match without needing to know the name. This means that the content you want will be in capture group 2.
What's the difference between the three regexes? Speed! Depending on how many files you need to parse, you may need the speed.
The optimized one is the best if there is significant variance in what you are parsing. It works the same way as this other regex I wrote. (You should do some testing if speed is important.)
It should be fairly straightforward to write the code from this.
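As a rough illustration, here is the lazy pattern applied to the first sample file held in a string. The helper name between_markers is made up, and the sketch assumes the whole file has already been read into one scalar.

```perl
use strict;
use warnings;

# Hypothetical helper: return (name, body) using the lazy pattern.
# The /s modifier lets "." match newlines so the body can span lines.
sub between_markers {
    my ($text) = @_;
    return $text =~ /My name is ([^\n]+)\.(.*?)End \1/s;
}

my $text = "My name is XYZ.\nAddress: ***\nID:12414\nCountry:USA\nEnd XYZ\n";
my ($name, $body) = between_markers($text);
die "no match\n" unless defined $body;
$body =~ s/\A\n//;    # drop the newline left over from the "My name" line
print $body;
```

The back-reference \1 is what ties "End XYZ" to the name captured on the first line, so the same code works for the ABC file without changes.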
Is this what you are looking for?
while (<>) {
if ( /^My name / .. /^End / ) {
if ( /^My name / ) {
# Do nothing, or anything you would like for this line.
} elsif ( /^End / ) {
# Do nothing, or anything you would like for this line.
} else {
print $_;
}
}
}

Perl Regex Match Text String and Extract Following Number

I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which doesn't actually work. I want it to at least print "BGA" every time it encounters the word BGA, but it prints nothing. Any help would be appreciated.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";
while(my $line = <$info>){
if ($line eq "/BGA/"){
print <>,"\n";
}
}
close $file;
if ($line =~ /BGA\s(\d+)/){
#your code
print "BGA number $1 \n";
#your code
}
And $1 variable will have the number you want
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
while ( $line =~ /BGA\s(\d+)/g ) {
print "$1\n";
}
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory.
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests if the line literally consists of the string "/BGA/". But that can never be true, because each line will at least have the input record separator (i.e. the contents of $/) at the end, since you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere, and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.
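Pulling the corrections together, a minimal sketch of the extraction step might look like the following. The helper name bga_numbers and the sample text are made up, since the real header layout was not shown; the pattern is the one suggested in the answers above.

```perl
use strict;
use warnings;

# Hypothetical helper: every number that follows "BGA" in a chunk of text.
# With /g in list context the match returns all captured groups at once.
sub bga_numbers {
    my ($text) = @_;
    my @numbers = $text =~ /BGA\s(\d+)/g;
    return @numbers;
}

# Made-up sample; note that "BGA 1" appears twice, as in the question.
my $sample = "header BGA 1\nmore header BGA 1\n0.5 0.7\nheader BGA 20000\n";
print join(',', bga_numbers($sample)), "\n";
```

Because each header repeats its BGA number, a real script would probably want to remember only the most recent number and attach it to the data rows that follow.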

How do I match a word followed by new line then grab the next line up to its new line?

I'm editing a bunch of SQL files and I need to remove date references in the queries. However, the files are written so that logical operators like OR and AND are on lines by themselves and the rest of the associated argument is on another line. Like so:
OR
field.lastupdate > DATE_SUB(CURDATE(), INTERVAL 31 DAY))
AND
*some more code*
I want to remove the OR (and it can be an AND too) up to the newline character, in this example, after the second parenthesis. However I want to leave the rest of the code intact.
I think the regex should be straightforward except how do I ignore the newline after the OR but stop at the following newline?
I should note that some of the date lines I want to remove end with a ";" which I do not want to remove.
Here's a more complete example that I hope clears things up:
OR
x.is_deleted = 0
OR
x.lastupd > DATE_SUB(CURDATE(), INTERVAL 31 DAY))
AND
(j.active = 1
OR
j.is_deleted = 0
OR
j.lastupd > DATE_SUB(CURDATE(), INTERVAL 31 DAY));
So you see I want to keep the first "OR" and its following line,
delete the second "OR" and the line that follows it.
Keep the "AND" and the line that follows it as well as the following "OR" and its corresponding line.
And then delete the final "OR" and its line while leaving the final ";".
$sql =~ s/\b(?:OR|AND)[ \t]*[\n\r]+(?=.*DATE).*(?<![;\s])//mg;
Removes the OR (or AND) and the content on the following line (if it contains DATE), except the possible ending ;.
Note that such simple regex will not work with your updated example, because there are closing parenthesis on the removed line which belong to other lines.
Example at http://ideone.com/0Lbxp
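A small sketch of that substitution on a made-up snippet (the first sample from the question, with a trailing ";" added to show that it survives):

```perl
use strict;
use warnings;

# Made-up input: a lone OR, a DATE line ending in ";", then an AND block.
my $sql = "OR\nfield.lastupdate > DATE_SUB(CURDATE(), INTERVAL 31 DAY));\nAND\n*some more code*\n";

# Copy, then strip the operator line plus the DATE line that follows it.
# The lookbehind (?<![;\s]) makes the match stop short of a trailing ";".
(my $cleaned = $sql) =~ s/\b(?:OR|AND)[ \t]*[\n\r]+(?=.*DATE).*(?<![;\s])//mg;

print $cleaned;
```

The AND line is left alone because the line after it does not contain DATE, so the lookahead fails there.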
Well, I'm not sure whether there is only one sentence after the sentence containing OR/AND.
The idea is to keep track of a flag which will tell you that you came across an OR/AND in the previous sentence.
Probably you can do something like this.
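my $infile = "infilename";
my $outfile = "outfilename";
open(FPTR, '<', $infile)
or die "\nCan't open $infile for reading: $!\n";
open(OUT, '>', $outfile)
or die "\nCan't open $outfile for writing: $!\n";
my $st=0; # 1 when the previous line was a lone OR/AND
my $held=''; # the OR/AND line, held back until we see the next line
while(<FPTR>)
{
if($st==0 && ($_ =~ m/^\s*OR\s*$/ || $_ =~ m/^\s*AND\s*$/)) {
$st=1;
$held=$_;
}
elsif($st==1 && $_ =~ m/DATE_SUB/) { # replace DATE_SUB with a match to your sentence
$st=0;
next;#since you want to remove the line followed by line containing OR/AND
}
else {
print OUT $held if $st==1; # the held OR/AND turned out to be wanted
$st=0;
print OUT $_;
}
}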
close(FPTR);
close(OUT);
Sometimes the simpler solutions are the best. This script will only (re)print lines that do not match the description of the lines you wanted removed. It will print a trailing semi-colon ; if it finds one. It will preserve the lines as read.
It relies on no lines being empty, and that no wanted lines contain the word DATE_SUB.
Usage:
$ script.pl input.txt > output.txt
Code:
use strict;
use warnings;
use ARGV::readonly;
while (my $line1 = <>) {
if ($line1 =~ /^\s*(OR|AND)\s*$/) {
my $line2 = <>;
if ($line2 =~ /DATE_SUB/) {
if ($line2 =~ /;\s*$/) {
print ";\n";
}
} else {
print $line1, $line2;
}
} else {
print $line1;
}
}