How to match 2 array? - regex

I have 2 files I need to match.
File1.txt contains:
-----------------------------------------------
Words | Keyword | Sentence
-----------------------------------------------
Lunch >WORDS> when do you want to have lunch?.
Hate >WORDS> I hate you.
Other >WORDS> Other than that?
File2.txt contains:
I love you.
Other than that?.
I like you.
when do you want to have lunch?.
File1 will do the word matching with File2, after this keyword >WORDS>. Meaning File1 and File2 just compare the word "Other than that?" and "when do you want to have lunch?". So the result will take the same word after the keywords >WORDS>. I use array to do.
The expected output will print:
Other >WORDS> Other than that?.
Lunch >WORDS> when do you want to have lunch?.
CODE:
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
use 5.010;
my $new= File1.txt; #read File1
my $old= File2.txt; #read File2
my $string1;
my $string2;
my #new_array;
my #old_array;
my $string11;
my #array1;
#---------------------------------------------------------------
# Main
#---------------------------------------------------------------
open(NEW_FILE,"<", $new) || die "Cannot open file $new to Read! - $!";
open(OLD_FILE,"<", $old) || die "Cannot open file $old to Read! - $!";
while (<NEW_FILE>) {
my $string1= $_;
my $string11= $_;
if ($string1=~ m/WORDS/){ #matching the Keyword >WORDS>
$string1 = $'; #string1 will take after >WORDS>
$string11 = $_; #string11 will take the full.
push (#new_array, ($string1)); #string1 = #new_array
push (#array1, ($string11)); }} #string11 = #array1
while (<OLD_FILE>) {
my $string2= $_;
if ($string2 =~ m/WORDS/){ #matching the Keyword >WORDS>
$string2 = $'; #string2 will take after >WORDS>
push (#old_array, ($string2)); #string2 = #old_array
}}
#------Do comparison between new file and old file. (only after WORDS)
my #intersection =();
my #unintersection = ();
my %hash1 = map{$_ => 1} #old_array;
foreach (#new_array){
if (defined $hash1{$_}){
push #intersection, $_; #this one will take the same array between new and old
}
else {
push #unintersection, $_; #this one will take the new array only. So, will read this one.
}}
Until this part, if I print the #unintersection, it will produce:
other than that?
when do you want to have lunch?.
Do comparison between#unintersection (result after WORDS) and (#array1).
my #same();
my #not_same= ();
my %hash2 = map{$_ => 1} #unintersection;
foreach (#array1) {
if (#array1 = m/WORDS/){
#array1 = $';
if (defined $hash2{$_}) {
#array1 = $_;
push #same, $_;
}
else {
push #not_same, $_;}}}
print #same;
print #not_same;
close(NEW_FILE);
close(OLD_FILE);
close(NEW_OUTPUT_FILE);
The result that I produce only 1. have lunch?"
Other >WORDS> Other than that?
Should be got 2 output. "Other >WORDS> Other than that?" and "Lunch >WORDS> when do you want to have lunch?"

The problem can be solved with a lookup table (implemented as hashref) build on information provided in File1.txt (words_lookup.dat).
Once we have lookup table at our disposal read File2.txt (words_data.dat) and compare with lookup table. If the input line matches lookup table then output stored value ($lookup->{$1}{line}) to the console.
use strict;
use warnings;
use feature 'say';
my($fh, $lookup);
my $fname_lookup = 'words_lookup.dat'; # File1.txt
my $fname_data = 'words_data.dat'; # File2.txt
my $re_lookup = qr/(\S+)\s+>WORDS>\s+(.*)/;
open $fh, '<', $fname_lookup
or die "Couldn't open $fname_lookup";
while( <$fh> ) {
chomp;
next unless /$re_lookup/;
$lookup->{$1}{sentence} = $2;
$lookup->{$1}{line} = $_;
}
close $fh;
open $fh, '<', $fname_data
or die "Couldn't open $fname_data";
while( my $line = <$fh> ) {
$line =~ /$lookup->{$_}{sentence}/ && say $lookup->{$_}{line} for keys $lookup->%*;
}
close $fh;
exit 0;
Output
Other >WORDS> Other than that?
Lunch >WORDS> when do you want to have lunch?.

Related

How to grep word from file

I want to grep some word inside a file from another file. My code is able to grep the word on last line of the file but not the word before it. I have no idea why and hope can get help here. Below is the perl script i using:
open(FILE1,"file1.txt") or die "Error, File1 could not open\n";
open(FILE2,"file2.txt") or die "Error, File2 could not open\n";
open(FILE3, ">file3.txt") or die "Error, File3 could not open\n";
use strict;
use warnings;
use List::MoreUtils qw(uniq);
my #file1=<FILE1>;
my #file2=<FILE2>;
my $j =0;
my $i =0;
my $zone =0;
for ($j=0; $j<=$#file2; $j++){
$zone = $file2[$j];
unless ( $zone =~ m/#(.*?)/ ) {
print "$zone";
my #fid = grep /$zone/ , #file1;
#fid = uniq(#fid);
s{^\s+|\s+$}{}g foreach #fid; #cancel leading space
for ($i=0; $i<=$#fid; $i++){
print FILE3 "$fid[$i]\n";
}
##fid=();
}
}
close(FILE3);
My file1.txt is something like this:
i am a dog
i am a cat
we are the fish
he is a boy
she is a girl
My file2.txt is like this:
is
am
But my file3 can only show those sentence contain am but no is, if i put is in second line and am in first line then my file3 only contain the sentences with is. I not very sure why my code can only grep the last row in my file2. Thanks for the help.
When reading from a file, the final newline is part of each line read. You can remove the newlines from the pattern array by chomping:
chomp( my #file2 = <FILE2> );
You can already do this with egrep :
egrep -f file2.txt file1.txt
The root of this problems is chomp - you're not removing linefeeds, so the matches aren't working.
But aside from that, there's a few problems with your code that could do with addressing:
opening files, you should use 3 arg open with lexical file handles, as it's better style: open (my $file1, '<', 'file1.txt' ) or die $!;
rather than a loop of loops, you'd probably be better off compiling up a 'match regex'.
Instead of reading all of a file into an array, you can iterate line by line, and you don't need to use the memory.
If you're iterating a loop, and only using the index to acccess the current element, you're far better off using foreach my $line ( #things ) { type syntax.
So your code actually could be reduced to something like:
#!/usr/bin/env perl
use strict;
use warnings;
open(my $data, '<',"file1.txt") or die $!;
open(my $search, '<', "file2.txt") or die $!;
open(my $output, '>', "file3.txt" ) or die $!;
chomp ( my #search_terms = <$search> );
#quotemeta is needed to avoid 'special' regex characters doing things.
my $search_regex = join "|", map { quotemeta }, #search_terms;
#note - '\b' denotes word boundary, which may not be what you want.
#means 'is' won't match 'fish'
#so get rid of them if that's not what you want.
$search_regex = qr/\b($search_regex)\b/;
print "Using: $search_regex\n";
select $output; #default print destination
while ( <$data> ) {
print if m/$search_regex/;
}
Output (in 'file3.txt'):
i am a dog
i am a cat
he is a boy
she is a girl
please try this .
use strict;
use warnings;
use List::MoreUtils qw(uniq);
open(FILE1,"file1.txt") or die "Error, File1 could not open\n";
open(FILE2,"file2.txt") or die "Error, File2 could not open\n";
open(FILE3, ">file3.txt") or die "Error, File3 could not open\n";
my #file1=<FILE1>;
my #file2=<FILE2>;
my $j =0;
my $i =0;
foreach my $main_line(#file1){
chomp($main_line);
foreach my $line(#file2){
chomp($line);
if ($main_line =~ /$line/i) {
print FILE3 "$main_line\n";
}
}
}
close(FILE3);
thanks,
praveenzx~

Multiple pattern match and replace

How to extract patterns from a file and replace the multiple patterns with a new pattern from a file?
For example:
Lets say the pattern file is pattern.txt, as follows with 2,000 lines.
a
b
d
e
f
....
...
...
File to replace pattens is replace.txt containing:
a,1
b,3
c,5
d,10
e,14
....
...
...
The intended final file content for file patterns.txt is:
a,1
b,3
d,10
e,14
....
...
...
Perl from command line,
perl -i -pe'
BEGIN{ local (#ARGV, $/, $^I) =pop; %h = split /[\s,]+/, <> }
s| (\S+)\K |,$h{$1}|x
' pattern.txt replace.txt
It slurps content of second file ($/ to undef), and temporarily disables in-place editing ($^I to undef), splits string on white-spaces/commas and populate %h hash in key/value manner. Then for every line of first file adds comma and value for current key.
With the possibility of arbitrary characters in your input, it might be safest to use Text::CSV. The benefit is that it will handle things like quoted delimiters, multiline strings, etc. The drawback is that it can break on non-csv content, so it sort of relies on your input being proper csv.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
my %s;
my ($input, $replace) = #ARGV;
open my $fh, "<", $replace or die "Cannot open $replace: $!";
while (my $row = $csv->getline($fh)) {
my ($key, $line) = #$row;
$s{$key} = $line;
}
open $fh, "<", $input or die "Cannot open $input: $!";
while (<$fh>) {
chomp;
$csv->print(*STDOUT, [$_, $s{$_}]);
}
Not sure this really needs a regex as you're not really altering your source, as much as 'just' printing based on key fields.
So I would approach it something like this:
#!/usr/bin/env perl
use strict;
use warnings;
open( my $replace, "<", "replace.txt" ) or die $!;
my %replacements;
while (<$replace>) {
chomp;
my ( $key, $value ) = split(/,/);
$replacements{$key} = $value;
}
close($replace);
open( my $input, "<", "input.txt" ) or die $!;
open( my $output, ">", "patterns.txt" ) or die $!;
while ( my $line = <$input> ) {
chomp $line;
if ( $replacements{$line} ) {
print {$output} $replacements{$line}, "\n";
}
}
close($input);
close($output);
It's not as concise as some of the other examples, but hopefully clearer what it's actually doing. This I call a good thing. (I can make it much more compact, in the way that perl is (in)famous for).

Comparing files line by line using a simple pattern match

I have two files: in the first file each line has some labels associated with it; the second file contains the labels which fall under certain categories.
File1 - labelled lines:
I have never had an issue. L_102 ----- L_127
I travel overseas and offer a lot of services that are very useful to me L_105 ----- L_134 ----- L_148
Expense to have L_522
Great benefits L_148
prestige L_118
File2 - categories under which the labels fall:
Issues:113,114,115,116,127
Benefits:105,220,154,543,590
General:148,134,154
I have written a Perl script to fetch labels from the first file.
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift || "INPUTFILE";
my $outputfile = shift || "OUTPUTFILE";
open my $fh, '<', $file or die "Can not open '$file': $!";
open( OUTFILE, ">", $outputfile) or die "Can not open '$outputfile': $!";
while(my $w = <$fh>) {
my #matches = $w =~ m/(L_[0-9][0-9][0-9])/g;
for(#matches){s/L_//g;
s/\s+/\t/g;
}
print OUTFILE "#matches\n";
}
The output from this first script is:
102 127
105 134 148
522
148
118
I have a second Perl script to fetch the levels from second file (which contains the categories):
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift || "INPUTFILE";
my $outputfile = shift || "OUTPUTFILE";
my $patern = shift ||"Issues:"
open my $fh, '<', $file or die "Can not open '$file': $!";
open( OUTFILE, ">", $outputfile) or die "Can not open '$outputfile': $!";
while(my $var = <$fh>) {
if(my #matches =$var=~/(.*$patern.*)/)
{
for(#matches){s/$patern//g;s/\,/\t/g}
print OUTFILE "#matches\n";
}
}
The second output from the second script is:
113 114 115 116 127
Now I want to match the first output with the second output line by line.
The results I want are: if the any of the numbers in the second output matches with any of the lines in the first output then I want to print 1; or else if there is no match print -1 for that line.
The output from the above would be as below:
1
-1
-1
-1
-1
This combines your two scripts into one. It reads through the $inputfile file handle that is pointing at "INPUTFILE.txt"looking for matches based on either a regular expression ($regexp) or the existence of a search key in the %patterns hash.
Since the match is simple, the regular expression we use is built up using join, |, and the required search strings. In the alternative approach (which is commented out here) we use the hash keys themselves to check whether a search pattern exists.
I have changed the variables and file names in the open statements somewhat since the capitalized file names made them seem like old style file handles:
#!perl -l
my $inputfile = "INPUTFILE.txt";
my $outputfile = "OUTPUTFILE.txt";
my $matchfile = "MATCHFILE.txt";
open my $inputfh, '<', $inputfile or die "No file '$inputfile': $!";
open my $matchfh, '<', $matchfile or die "No file '$matchfile': $!\n";
open my $outfh, '>', $outputfile or die "No file '$outputfile': $!\n";
my %patterns;
while (<$matchfh>) {
$patterns{$_} = () for map { split /,/, $_ } /Issues:(.*)/;
}
my $regex = join "|", keys %patterns;
$regex = qr/$regex/; # create a regex from %patterns
print "Search patterns : ", join " ", keys %patterns;
print "Regex : $regex \n";
while (my $line = <$inputfh>) {
chomp $line;
# Print "1" for 3 digits matching search pattern; "-1" otherwise:
#print exists $patterns{$_} ? "1" : "-1" for $line =~ m/(\d\d\d)/g;
# Print "1" if a matching pattern is on a line; -1 otherwise:
if (grep /$regex/, $line) { #
print "1 - $line";
}
else {
print "-1 - $line";
}
}
The above script should work. You can remove - $line from the last print statements and add a file handle destination ($outfh) to direct the output to a file.
Since there are five lines in the inputfile, the output is:
Search patterns : 127 116 114 115 113
Regex : (?^:127|116|114|115|113)
1 - I have never had an issue. L_102 ----- L_127
-1 - I travel overseas ... very useful to me L_105 ----- L_134 ----- L_148
-1 - Expense to have L_522
-1 - Great benefits L_148
-1 - prestige L_118
NB the final if ... else blocks could be shortened using the "ternary operator"(<cond> ? 1 : 0) to:
print $line =~ /$regex/ ? '1' : '-1';
so that "1" will printed if $line =~ /$regex/ evaluates to "true" (or "1") ; and "-1" will be printed if it evaluates to "false" (or "0").
If you read from your two files and simply redirect the output with your shell, the short version of all this would be:
#!perl -l
my $inputfile = "INPUTFILE.txt";
my $matchfile = "MATCHFILE.txt";
open my $inputfh, '<', $inputfile or die "No '$inputfile': $!";
open my $matchfh, '<', $matchfile or die "No '$matchfile': $!\n";
my %patterns;
while (<$matchfh>) {
$patterns{$_} = () for map { split /,/, $_ } /Issues:(.*)/;
}
my $regex = join "|", keys %patterns;
$regex = qr/$regex/;
while (my $line = <$inputfh>) {
chomp $line;
print $line =~ $regex ? '1' : '-1';
}

How to read string from a file and to split them to different array?

I am struggling with this part for my college exercise...
I need to read string from a file and put them into different variable...
Team, kindly review and please reply in your free moment...
Input File: (test_ts.txt)
Test1--12:45
Test2--1:30
Script:
use strict;
use warnings;
my $filename = "test_ts.txt";
my #name = ();
my #hrs=();
my #mins=();
open(my $fh, $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
chomp $row;
push(#name, $row);
print "$row\n";
}
Output:
Test1--12:45
Test2--1:30
Expected output:
Test1
Test2
*(Array should have the below values
name[0]=Test1
name[1]=Test2
hrs[0]=12
hrs[1]=1
mins[0]=45
mins[1]=30)*
Tried using Split:
while (my $row = <$fh>) {
chomp $row;
$row=split('--',$row);
print $row;
$row=split(':',$row);
print $row;
push(#name, $row);
print "$row\n";
}
Output which i got after trying split:
211
211
split returns a list; when you use it in a scalar context like $row = split(..., $row); then:
You only get the number of elements of the array assigned.
You destroy your $row in the input.
You need something more like:
while (my $row = <$fh>)
{
chomp $row;
my #bits = split /[-:]+/, $row;
print "#bits\n";
push(#name, $bits[0]);
…other pushes…
print "$row\n";
}
You will need to learn about scalar and array context sooner or later. In the mean time, assign the result of split to an array.
Here is the simple method split the row based on "--" and then time on basis of ":". Hope this help you.
use strict;
use warnings;
my $filename = "test_ts.pl";
my #name = ();
my #hrs=();
my #mins=();
open(my $fh, $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
chomp $row;
my ($a,$b) = split("--", $row);
my ($c, $d) = split (":", $b);
push(#name, $a);
push(#hrs, $c);
push(#mins, $d);
}
print "$name[0]\n";
print "$name[1]\n";
print "$hrs[0]\n";
print "$hrs[1]\n";
print "$mins[0]\n";
print "$mins[1]\n";
It is sometimes simpler to use a global regular expression than split. This short program works by finding all alphanumeric fields in the target string.
use strict;
use warnings;
use autodie;
open my $fh, '<', 'test_ts.txt';
my (#name, #hrs, #mins);
while (<$fh>) {
my ($name, $hrs, $mins) = /\w+/g;
push #name, $name;
push #hrs, $hrs;
push #mins, $mins;
print "$name\n";
}
print "\n";
print "Names: #name\n";
print "Hours: #hrs\n";
print "Minutes: #mins\n";
output
Test1
Test2
Names: Test1 Test2
Hours: 12 1
Minutes: 45 30

Perl: Comparing two files and printing data that match and don't match

For the Perl code below, I need to increase its efficiency since it's taking hours to process the input files (which contain millions of lines of data). Any ideas on how I can speed things up?
Given two files, I want to compare the data and print those lines that match and those that don't. Please note that two columns need to be compared interchangeably.
For example,
input1.txt
A B
C D
input2.txt
B A
C D
E F
G H
Please note:
Lines 1 and 2 match (interchangeably); Lines 3 and 4 don't match
Output:
B A match
C D match
E F don't match
G H don't match
Perl code:
#!/usr/bin/perl -w
use strict;
use warnings;
open INFH1, "<input1.txt" || die "Error\n";
open INFH2, "<input2.txt" || die "Error\n";
chomp (my #array=<INFH2>);
while (<INFH1>)
{
my #values = split;
next if grep /\D/, #values or #values != 2;
my $re = qr/\A$values[0]\s+$values[1]\z|\A$values[1]\s+$values[0]\z/;
foreach my $temp (#array)
{
chomp $_;
print "$_\n" if grep $_ =~ $re, $temp;
}
}
close INFH1;
close INFH2;
1;
Any ideas on how to increase the efficiency of this code is highly appreciated. Thanks!
If you have enough memory, use a hash. If symbols do not occur multiple times in input1.txt (i.e. if A B is in the file, A X is not), the following should work:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
open my $F1, '<', 'input1.txt' or die $!;
while (<$F1>) {
my #values = split / /;
#hash{#values} = reverse #values;
}
close $F1;
open my $F2, '<', 'input2.txt' or die $!;
while (<$F2>) {
my #values = split / /;
my $value = $hash{$values[0]};
if ($value and $value eq $values[1]) {
print "Matches: $_";
} else {
print "Does not match: $_";
}
}
close $F2;
Update:
For repeated values, I would use a hash of hashes. Just sort the symbols, the first one will be the key in the large hash, the second one will be the key in the subhash:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
open my $IN1, '<', 'input1.txt' or die $!;
while (<$IN1>) {
my #values = sort split;
undef $hash{$values[0]}{$values[1]};
}
close $IN1;
open my $IN2, '<', 'input2.txt' or die $!;
while (<$IN2>) {
chomp;
my #values = sort split;
if (exists $hash{$values[0]}{$values[1]}) {
print "$_ matches\n";
} else {
print "$_ doesn't match\n";
}
}
close $IN2;
for those interested in another solution that's independent of the amount of columns:
#!/usr/bin/perl -w
use strict;
use warnings;
use 5.010;
open INFH1, "<", input1.txt" || die "Error\n";
my #storage = map {[sort split]} <$IN1>; # store content as matrix (each row sorted)
close INFH1;
open INFH2, "<input2.txt" || die "Error\n";
while(<INFH2>) {
chomp;
if(#{$storage[$.]} ~~ sort split) { # if stored matrix row is elementwise-equal to current line (each row sorted)
say "$_ matches";
}
else {
say "$_ doesn't match";
}
}
close INFH2;