Comparing files line by line using a simple pattern match - regex

I have two files: in the first file each line has some labels associated with it; the second file contains the labels which fall under certain categories.
File1 - labelled lines:
I have never had an issue. L_102 ----- L_127
I travel overseas and offer a lot of services that are very useful to me L_105 ----- L_134 ----- L_148
Expense to have L_522
Great benefits L_148
prestige L_118
File2 - categories under which the labels fall:
Issues:113,114,115,116,127
Benefits:105,220,154,543,590
General:148,134,154
I have written a Perl script to fetch labels from the first file.
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift || "INPUTFILE";
my $outputfile = shift || "OUTPUTFILE";
open my $fh, '<', $file or die "Can not open '$file': $!";
open( OUTFILE, ">", $outputfile) or die "Can not open '$outputfile': $!";
while(my $w = <$fh>) {
my #matches = $w =~ m/(L_[0-9][0-9][0-9])/g;
for(#matches){s/L_//g;
s/\s+/\t/g;
}
print OUTFILE "#matches\n";
}
The output from this first script is:
102 127
105 134 148
522
148
118
I have a second Perl script to fetch the levels from second file (which contains the categories):
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift || "INPUTFILE";
my $outputfile = shift || "OUTPUTFILE";
my $patern = shift ||"Issues:"
open my $fh, '<', $file or die "Can not open '$file': $!";
open( OUTFILE, ">", $outputfile) or die "Can not open '$outputfile': $!";
while(my $var = <$fh>) {
if(my #matches =$var=~/(.*$patern.*)/)
{
for(#matches){s/$patern//g;s/\,/\t/g}
print OUTFILE "#matches\n";
}
}
The second output from the second script is:
113 114 115 116 127
Now I want to match the first output with the second output line by line.
The results I want are: if the any of the numbers in the second output matches with any of the lines in the first output then I want to print 1; or else if there is no match print -1 for that line.
The output from the above would be as below:
1
-1
-1
-1
-1

This combines your two scripts into one. It reads through the $inputfile file handle that is pointing at "INPUTFILE.txt"looking for matches based on either a regular expression ($regexp) or the existence of a search key in the %patterns hash.
Since the match is simple, the regular expression we use is built up using join, |, and the required search strings. In the alternative approach (which is commented out here) we use the hash keys themselves to check whether a search pattern exists.
I have changed the variables and file names in the open statements somewhat since the capitalized file names made them seem like old style file handles:
#!perl -l
my $inputfile = "INPUTFILE.txt";
my $outputfile = "OUTPUTFILE.txt";
my $matchfile = "MATCHFILE.txt";
open my $inputfh, '<', $inputfile or die "No file '$inputfile': $!";
open my $matchfh, '<', $matchfile or die "No file '$matchfile': $!\n";
open my $outfh, '>', $outputfile or die "No file '$outputfile': $!\n";
my %patterns;
while (<$matchfh>) {
$patterns{$_} = () for map { split /,/, $_ } /Issues:(.*)/;
}
my $regex = join "|", keys %patterns;
$regex = qr/$regex/; # create a regex from %patterns
print "Search patterns : ", join " ", keys %patterns;
print "Regex : $regex \n";
while (my $line = <$inputfh>) {
chomp $line;
# Print "1" for 3 digits matching search pattern; "-1" otherwise:
#print exists $patterns{$_} ? "1" : "-1" for $line =~ m/(\d\d\d)/g;
# Print "1" if a matching pattern is on a line; -1 otherwise:
if (grep /$regex/, $line) { #
print "1 - $line";
}
else {
print "-1 - $line";
}
}
The above script should work. You can remove - $line from the last print statements and add a file handle destination ($outfh) to direct the output to a file.
Since there are five lines in the inputfile, the output is:
Search patterns : 127 116 114 115 113
Regex : (?^:127|116|114|115|113)
1 - I have never had an issue. L_102 ----- L_127
-1 - I travel overseas ... very useful to me L_105 ----- L_134 ----- L_148
-1 - Expense to have L_522
-1 - Great benefits L_148
-1 - prestige L_118
NB the final if ... else blocks could be shortened using the "ternary operator"(<cond> ? 1 : 0) to:
print $line =~ /$regex/ ? '1' : '-1';
so that "1" will printed if $line =~ /$regex/ evaluates to "true" (or "1") ; and "-1" will be printed if it evaluates to "false" (or "0").
If you read from your two files and simply redirect the output with your shell, the short version of all this would be:
#!perl -l
my $inputfile = "INPUTFILE.txt";
my $matchfile = "MATCHFILE.txt";
open my $inputfh, '<', $inputfile or die "No '$inputfile': $!";
open my $matchfh, '<', $matchfile or die "No '$matchfile': $!\n";
my %patterns;
while (<$matchfh>) {
$patterns{$_} = () for map { split /,/, $_ } /Issues:(.*)/;
}
my $regex = join "|", keys %patterns;
$regex = qr/$regex/;
while (my $line = <$inputfh>) {
chomp $line;
print $line =~ $regex ? '1' : '-1';
}

Related

How to match 2 array?

I have 2 files I need to match.
File1.txt contains:
-----------------------------------------------
Words | Keyword | Sentence
-----------------------------------------------
Lunch >WORDS> when do you want to have lunch?.
Hate >WORDS> I hate you.
Other >WORDS> Other than that?
File2.txt contains:
I love you.
Other than that?.
I like you.
when do you want to have lunch?.
File1 will do the word matching with File2, after this keyword >WORDS>. Meaning File1 and File2 just compare the word "Other than that?" and "when do you want to have lunch?". So the result will take the same word after the keywords >WORDS>. I use array to do.
The expected output will print:
Other >WORDS> Other than that?.
Lunch >WORDS> when do you want to have lunch?.
CODE:
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
use 5.010;
my $new= File1.txt; #read File1
my $old= File2.txt; #read File2
my $string1;
my $string2;
my #new_array;
my #old_array;
my $string11;
my #array1;
#---------------------------------------------------------------
# Main
#---------------------------------------------------------------
open(NEW_FILE,"<", $new) || die "Cannot open file $new to Read! - $!";
open(OLD_FILE,"<", $old) || die "Cannot open file $old to Read! - $!";
while (<NEW_FILE>) {
my $string1= $_;
my $string11= $_;
if ($string1=~ m/WORDS/){ #matching the Keyword >WORDS>
$string1 = $'; #string1 will take after >WORDS>
$string11 = $_; #string11 will take the full.
push (#new_array, ($string1)); #string1 = #new_array
push (#array1, ($string11)); }} #string11 = #array1
while (<OLD_FILE>) {
my $string2= $_;
if ($string2 =~ m/WORDS/){ #matching the Keyword >WORDS>
$string2 = $'; #string2 will take after >WORDS>
push (#old_array, ($string2)); #string2 = #old_array
}}
#------Do comparison between new file and old file. (only after WORDS)
my #intersection =();
my #unintersection = ();
my %hash1 = map{$_ => 1} #old_array;
foreach (#new_array){
if (defined $hash1{$_}){
push #intersection, $_; #this one will take the same array between new and old
}
else {
push #unintersection, $_; #this one will take the new array only. So, will read this one.
}}
Until this part, if I print the #unintersection, it will produce:
other than that?
when do you want to have lunch?.
Do comparison between#unintersection (result after WORDS) and (#array1).
my #same();
my #not_same= ();
my %hash2 = map{$_ => 1} #unintersection;
foreach (#array1) {
if (#array1 = m/WORDS/){
#array1 = $';
if (defined $hash2{$_}) {
#array1 = $_;
push #same, $_;
}
else {
push #not_same, $_;}}}
print #same;
print #not_same;
close(NEW_FILE);
close(OLD_FILE);
close(NEW_OUTPUT_FILE);
The result that I produce only 1. have lunch?"
Other >WORDS> Other than that?
Should be got 2 output. "Other >WORDS> Other than that?" and "Lunch >WORDS> when do you want to have lunch?"
The problem can be solved with a lookup table (implemented as hashref) build on information provided in File1.txt (words_lookup.dat).
Once we have lookup table at our disposal read File2.txt (words_data.dat) and compare with lookup table. If the input line matches lookup table then output stored value ($lookup->{$1}{line}) to the console.
use strict;
use warnings;
use feature 'say';
my($fh, $lookup);
my $fname_lookup = 'words_lookup.dat'; # File1.txt
my $fname_data = 'words_data.dat'; # File2.txt
my $re_lookup = qr/(\S+)\s+>WORDS>\s+(.*)/;
open $fh, '<', $fname_lookup
or die "Couldn't open $fname_lookup";
while( <$fh> ) {
chomp;
next unless /$re_lookup/;
$lookup->{$1}{sentence} = $2;
$lookup->{$1}{line} = $_;
}
close $fh;
open $fh, '<', $fname_data
or die "Couldn't open $fname_data";
while( my $line = <$fh> ) {
$line =~ /$lookup->{$_}{sentence}/ && say $lookup->{$_}{line} for keys $lookup->%*;
}
close $fh;
exit 0;
Output
Other >WORDS> Other than that?
Lunch >WORDS> when do you want to have lunch?.

Perl Script – Look at the first number of each line and alert if 0 exists

Currently the script only looks at the first character of a txt file and emails if that value =0 using regex. I'm trying to update the script so it looks at each line until the end of the file and alert of any of the lines have the number 0. If all lines have 1 then do nothing. Any help would be greatly appreciated.
Example of an alert
1
1
1
0 -since there is a 0 an email alert would be generated
1
1
code below:
use warnings;
use strict;
my $file = '/users/tneal01/SPOOL/output.txt';
my $mark = 0;
my $cont = do {
open my $fh, '<', $file or die "Can't open $file -- $!";
local $/;
<$fh>;
};
# Pull the first number
my ($num) = $cont =~ /^(\d+)/;
if ($num == $mark)
{
my $body = "status $num has been recorded ";
my $cmd_email = "echo $body | " .
"mailx -s \"error occurring\" tneal01\#gmail.com";
system($cmd_email) == 0 or die "Error sending email -- $!";
}
I'd probably go with something like:
#!/usr/bin/env perl
use strict;
use warnings;
my $file = '/users/tneal01/SPOOL/output.txt';
my $mark = '0';
my $search = qr/^$mark\b/;
open my $fh, '<', $file or die "Can't open $file -- $!";
while (<$fh>) {
#line starts with 0. Or check other regex.
if (m/$search/) {
my $body = "status $mark has been recorded ";
my $cmd_email =
"echo $body | " . "mailx -s \"error occurring\" tneal01\#gmail.com";
system($cmd_email) == 0 or die "Error sending email -- $!";
#bail out the loop - assume you don't want more than one email per thing.
last;
}
}
close ( $fh );
This solution only reads one line at a time... there are simple solutions, but would need to load the entire file into memory...
I'm also assuming you want to know how many occurencies of $mark there are in the file.
#!/usr/bin/perl
use strict;
my $file = 'file.txt';
my $mark = '0';
open my $f, "<$file" or die "Error open file: $!\n";
my $counter=0;
while(my $line = <$f>) {
if($line =~ /$mark/) {
$counter++;
}
}
if($counter) {
my $body = "status $mark has been recorded $counter times";
my $cmd_email = "echo $body | mailx -s \"error occurring\" tneal01\#gmail.com";
system($cmd_email) == 0 or die "Error sending email -- $!";
}
This does a couple of things differently to other solutions here.
It reads from whatever filename is given on the command line
It stops checking once the first error is found
I don't know how useful those improvements are to you.
use warnings;
use strict;
my $mark = 0;
while (<>) {
my ($num) = /^(\d)/;
if ($num == $mark) {
my $body = "status $num has been recorded ";
my $cmd_email = "echo $body | " .
'mailx -s "error occurring" tneal01\#gmail.com';
system($cmd_email) == 0 or die "Error sending email -- $!";
last; # stop checking after the first error
}
}
(Oh, and I switched some double-quotes to single-quotes so you don't have to escape embedded double quotes.)
This version uses minimal changes to your existing script. I've commented the changes.
use warnings;
use strict;
my $file = '/users/tneal01/SPOOL/output.txt';
my $mark = 0;
my $cont = do {
open my $fh, '<', $file or die "Can't open $file -- $!";
local $/;
<$fh>;
};
# Pull the first number <-- comment not needed
# my ($num) = $cont =~ /^(\d+)/; # <-- delete this line
if ($cont =~ /^$mark/m) # <-- change the condition to this regex
{
my $body = "status $mark has been recorded "; # replace $num with $mark
my $cmd_email = "echo $body | " .
"mailx -s \"error occurring\" tneal01\#gmail.com";
system($cmd_email) == 0 or die "Error sending email -- $!";
}
What the change does is use a regular expression to check if any lines begin with the value defined in $mark.
I deleted the $num variable, but its contents was the same as $mark so we can just use $mark instead.
A breakdown of $cont =~ /^$mark/m:
$cont =~ Apply the following regex to the string contained in $cont
/ Start the regular expression
^ Match the beginning of a line
$mark Match the string specified in the $mark variable
/ End regular expression
m Flag to tell the regex to treat $cont as a multiple-line string (which it is)

Parsing a file and use perl regex to match fields in the file

I have a file with the format bellow and I need to use Perl and regex to get the values to insert a database. I try to use regex, but I have errors. Follow the script I use to get the first value:
# line 1
open(my $fin, "<", "input.txt")
or die "cannot open < input.txt: $!";
my $line1 = <$fin>;
chomp $line1;
print "$line\n";
my ($codigo, $nome) = ($line1 =~ m/^((\d)+)\s[\S]\s(([\s\d\w])+)$/);
print "#$codigo#$nome#\n";
# line 2
$line2 = <$fin>;
chomp $line2;
..
But I don´t see the correct value of "codigo" and "nome".
The format of file is:
84404167 - NAME ONE OF SILVA
R NONONONO, 143334, HOUSE - REGION - CITY - 81280400
Res: (22)5555.4543 Cel: (33)5555.8659 Ou: 
Início: 17/12/2013 - Data de aniv.: 23/02/1955
Crédito: 1440 - Crédito Disponível: 1152 - Status: ATIVA 
96071311 - NAME SECOND OF JOSE
R SECRET ADDRESS NONONNO, 433, ap 232 b azaleia - fazendinha - CURITIBA - 81320420
Res:  Cel: (22)5555.9776 Ou: (33)5555.2352
Início: 22/01/2015 - Data de aniv.: 10/05/1981
Crédito: 764 - Crédito Disponível: 516 - Status: ATIVA 
Instead of regex you can use split function:
#!/usr/bin/perl
use warnings;
use strict;
open my $fin, "<", "abc.txt" or die "cannot open input.txt: $!";
while (my $line1 = <$fin>)
{
chomp $line1;
if ($. == 1) # line 1
{
my ($codigo, $nome) = split (/\s*-\s*/, $line1);
print "codigo: $codigo nome: $nome\n";
}
}
Output:
codigo: 84404167 nome: NAME ONE OF SILVA

How to get the next line of file if pattern match in file using Perl script

I am trying to find pattern Pattern String , once it found , I need to get the next line of pattern, which contains page number, I need extract the page number 2 in below sample text file Page: 2 of 5. Here is my try:
my $filename="sample.txt";
$i=1;
open(FILE, "<$filename") or die "File couldn't be matched $filename\n";
#array = <FILE>;
foreach $line(#array){
chomp($line);
if ($array[$i]=~/(\s+)Pattern String(\s+)/) {
if ($array[$i]=~/(\s+)Page:(\s+)(.*) of (.*)/) {
$page = $3;
}
}
Here is my sample text file :
Pattern String
MCN: 349450A0 NCP Account ID: 999 600-0089 Page: 2 of 5
=============================================================================
Customer Name: PCS HEALTH SYSTEMS
Customer Number: 349450A0
What about this? Is that what you want? After a match and if next line is not empty then show the line. Let me know if worked for you.
# Perl:
my $filename="sample.txt";
my $match = undef;
my $line = "";
open(my $fh, "<", $filename) or die "Failed to open file: $!";
foreach (<$fh>) {
$line = $_;
if ( $line =~ /.*Pattern\sString.*/ ) {
$match = 1;
next;
}
if (($match == "1") && ($line !~ /^$/)){
print $line;
$match = undef;
}
}
I think this will solve the problem (I'm assuming that the sample files will always have the same format). I hope this will help you, please let me know if it worked.
my $filename="sample.txt";
my $count = 0;
my $tgline = 0;
open(my $fh, "<", $filename) or die "Failed to open file: $!";
my #lines = <$fh>;
foreach (#lines) {
if ( $_ =~ /.*Pattern\sString.*/ ) {
$tgline = $count + 2;
if ( $lines[$tgline] =~ /.*Page\:\s(\d+)\sof\s(\d+)$/ ) {
print "Current page: " . $1 . "\n";
print "Total page #: " . $2 . "\n";
}
}
$count+=1;
}
I don't know why are you matching Pattern String, if your target is achieveing 2 from Page: 2 of 5 from your input file. This is a way to get this:
use warnings;
use strict;
my $filename = "sample.txt";
open my $fh, "<","$filename" or die "Couldn't open $filename: $!";
while (my $line = <$fh>)
{
if($line =~ m/.*Page:\s(\d+)\sof\s(\d+)$/)
{
print "$1\n";
}
}
sample.txt:
Pattern String
MCN: 349450A0 NCP Account ID: 999 600-0089 Page: 2 of 5
=============================================================================
Customer Name: PCS HEALTH SYSTEMS
Customer Number: 349450A0
Output:
2

Find and get the location of a list of words in a text

I'm sure this is simple but I just can't figure out what to do...
I have a text file with a bunch of words in it (let's call it "wordlist") organized in a single column. Then I have a big text file (let's call it "essay"). What I want to do is to look in the "essay" file for the words in my "wordlist".
The trick is that I want to know the position of the matched word in the "essay" (meaning, match found after X characters).
I'm actually able to do it when I look for a single word (so wordlist containing just 1 word) but I can't get it to work when working with a list of words...
Any advice ?
thanks a lot
Ok so I just realized it would just tell me "no match found" anyway...Here is the code
use strict;
use warnings;
open (my $wordlist, "<", "/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "/essay.txt")
or die "cannot open < essay.txt $!";
while (<$essay>) { print "match found\n" if ($essay =~ m/$wordlist/) ; }
{ print "no match found\n" if ($essay !~ m/$wordlist/) ; }
Help please...?
perl index function basically matches substring which does not ensure the match of a full string. A regular expression based match is more useful here imho.
Explanation:
Read whole text of essay in a string. => $essay
For each word from wordlist.txt => $_
-- Keep matching $_ within $essay with proper regex. The one used here is b$_\b
-- For each match, collect the value of #-[0]
\b: is the word boundary character here which ensures that it only matches with complete words not substrings.
#-: is a special variable that contains the start position of the last regex match.
Here is a sample code:
use strict;
use warnings;
use 5.010;
my $wordlist_file = 'wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
my %pos;
my $essay_file = 'essay.txt';
my $essay = do {
local $/ = undef;
open my $fh, "<", $essay_file
or die "could not open $essay_file: $!";
<$fh>;
};
while (<$wordlist_fh>) {
chomp;
$pos{$_} = [] unless $pos{$_};
while($essay =~ m/\b$_\b/g){
push #{$pos{$_}}, #-;
}
}
use Data::Dumper;
print Dumper(\%pos);
the wordlist file and essay files are similar as mentioned by ThisSuitIsBlackNot.
wordlist.txt
I
Perl
hacker
essay.txt
I want to be just another Perl hacker when I grow up
I want to be just another Perl hacker when I grow up
The %pos hash now contains all the positions of your each word. I just showed them through dumper
$VAR1 = {
'hacker' => [
'31',
'84'
],
'Perl' => [
'26',
'79'
],
'I' => [
'0',
'43',
'53',
'96'
]
};
Note that the counts are including the newline characters at the end of each line.
Maybe you can use index() function.
Here is the link: Using the Perl index() function
This is my sample. The performance may be not too well. Hope it helps~:)
open (my $wordlist, "<", "files/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "files/essay.txt")
or die "cannot open < essay.txt $!";
my $words = {};
while (<$wordlist>) {
chomp($_);
$words->{$_} = 1;
}
my $row_count = 0;
while (<$essay>) {
$row_count++;
chomp($_);
foreach my $word (keys %{$words}) {
my $offset = 0;
my $r = index($_, $word, $offset);
while ($r != -1) {
print "Found [$word] in line $row_count at $r\n";
$offset = $r + 1;
$r = index($_, $word, $offset);
}
}
}
In your code, $essay and $wordlist are both filehandles. When you say
print "match found\n" if ($essay =~ m/$wordlist/);
You're trying to match the stringification of one filehandle to the stringification of another filehandle. When a filehandle is stringified, it looks something like this:
GLOB(0x9a26c38)
So your code actually does something like:
print "match found\n" if ('GLOB(0x9a26c38)' =~ m/GLOB(0x94bbc38)/);
This is not what you want. You need to read the contents of your files and compare those, not the filehandles themselves.
Essay words each on their own line
The following code assumes that your "essay" consists of one word per line. We read the contents of the essay file into a hash of arrays, with the lines as keys and an array of positions as values. We use an array in case the same word appears multiple times in the file. The position of the first word is zero. We then loop through the word list file, printing the word and the first matching position, if there is one.
use strict;
use warnings;
use 5.010;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
my $pos = 0;
my %essay;
while (<$essay_fh>) {
chomp;
push #{ $essay{$_} }, $pos;
$pos += length $_;
}
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: $essay{$_}[0]" if exists $essay{$_};
}
essay.txt
I
want
to
be
just
another
Perl
hacker
when
I
grow
up
wordlist.txt
I
Perl
hacker
Output
I: 0
Perl: 20
hacker: 24
Note that I'm ignoring newline characters when computing the position values. You can adjust this as necessary.
Essay words more than one per line
If your essay file can have more than one word per line, we can use a regex to check for matches:
use strict;
use warnings;
use 5.010;
# Slurp entire essay file into a variable
my $essay = do {
local $/;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
<$essay_fh>;
};
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: ", pos($essay) - length($_) if $essay =~ /\b$_\b/g;
}
essay.txt
I want to be just another Perl hacker when I grow up
wordlist.txt
I
Perl
hacker
hack
Output
I: 0
Perl: 26
hacker: 31
Note that the results are a little bit different from our other program, because now there are spaces between words. Also note that there is no output for the word hack, since we're only checking for whole word matches.