How to grep word from file - regex

I want to grep some word inside a file from another file. My code is able to grep the word on last line of the file but not the word before it. I have no idea why and hope can get help here. Below is the perl script i using:
open(FILE1,"file1.txt") or die "Error, File1 could not open\n";
open(FILE2,"file2.txt") or die "Error, File2 could not open\n";
open(FILE3, ">file3.txt") or die "Error, File3 could not open\n";
use strict;
use warnings;
use List::MoreUtils qw(uniq);
my #file1=<FILE1>;
my #file2=<FILE2>;
my $j =0;
my $i =0;
my $zone =0;
for ($j=0; $j<=$#file2; $j++){
$zone = $file2[$j];
unless ( $zone =~ m/#(.*?)/ ) {
print "$zone";
my #fid = grep /$zone/ , #file1;
#fid = uniq(#fid);
s{^\s+|\s+$}{}g foreach #fid; #cancel leading space
for ($i=0; $i<=$#fid; $i++){
print FILE3 "$fid[$i]\n";
}
##fid=();
}
}
close(FILE3);
My file1.txt is something like this:
i am a dog
i am a cat
we are the fish
he is a boy
she is a girl
My file2.txt is like this:
is
am
But my file3 can only show those sentence contain am but no is, if i put is in second line and am in first line then my file3 only contain the sentences with is. I not very sure why my code can only grep the last row in my file2. Thanks for the help.

When reading from a file, the final newline is part of each line read. You can remove the newlines from the pattern array by chomping:
chomp( my #file2 = <FILE2> );

You can already do this with egrep :
egrep -f file2.txt file1.txt

The root of this problems is chomp - you're not removing linefeeds, so the matches aren't working.
But aside from that, there's a few problems with your code that could do with addressing:
opening files, you should use 3 arg open with lexical file handles, as it's better style: open (my $file1, '<', 'file1.txt' ) or die $!;
rather than a loop of loops, you'd probably be better off compiling up a 'match regex'.
Instead of reading all of a file into an array, you can iterate line by line, and you don't need to use the memory.
If you're iterating a loop, and only using the index to acccess the current element, you're far better off using foreach my $line ( #things ) { type syntax.
So your code actually could be reduced to something like:
#!/usr/bin/env perl
use strict;
use warnings;
open(my $data, '<',"file1.txt") or die $!;
open(my $search, '<', "file2.txt") or die $!;
open(my $output, '>', "file3.txt" ) or die $!;
chomp ( my #search_terms = <$search> );
#quotemeta is needed to avoid 'special' regex characters doing things.
my $search_regex = join "|", map { quotemeta }, #search_terms;
#note - '\b' denotes word boundary, which may not be what you want.
#means 'is' won't match 'fish'
#so get rid of them if that's not what you want.
$search_regex = qr/\b($search_regex)\b/;
print "Using: $search_regex\n";
select $output; #default print destination
while ( <$data> ) {
print if m/$search_regex/;
}
Output (in 'file3.txt'):
i am a dog
i am a cat
he is a boy
she is a girl

please try this .
use strict;
use warnings;
use List::MoreUtils qw(uniq);
open(FILE1,"file1.txt") or die "Error, File1 could not open\n";
open(FILE2,"file2.txt") or die "Error, File2 could not open\n";
open(FILE3, ">file3.txt") or die "Error, File3 could not open\n";
my #file1=<FILE1>;
my #file2=<FILE2>;
my $j =0;
my $i =0;
foreach my $main_line(#file1){
chomp($main_line);
foreach my $line(#file2){
chomp($line);
if ($main_line =~ /$line/i) {
print FILE3 "$main_line\n";
}
}
}
close(FILE3);
thanks,
praveenzx~

Related

How to match 2 array?

I have 2 files I need to match.
File1.txt contains:
-----------------------------------------------
Words | Keyword | Sentence
-----------------------------------------------
Lunch >WORDS> when do you want to have lunch?.
Hate >WORDS> I hate you.
Other >WORDS> Other than that?
File2.txt contains:
I love you.
Other than that?.
I like you.
when do you want to have lunch?.
File1 will do the word matching with File2, after this keyword >WORDS>. Meaning File1 and File2 just compare the word "Other than that?" and "when do you want to have lunch?". So the result will take the same word after the keywords >WORDS>. I use array to do.
The expected output will print:
Other >WORDS> Other than that?.
Lunch >WORDS> when do you want to have lunch?.
CODE:
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
use 5.010;
my $new= File1.txt; #read File1
my $old= File2.txt; #read File2
my $string1;
my $string2;
my #new_array;
my #old_array;
my $string11;
my #array1;
#---------------------------------------------------------------
# Main
#---------------------------------------------------------------
open(NEW_FILE,"<", $new) || die "Cannot open file $new to Read! - $!";
open(OLD_FILE,"<", $old) || die "Cannot open file $old to Read! - $!";
while (<NEW_FILE>) {
my $string1= $_;
my $string11= $_;
if ($string1=~ m/WORDS/){ #matching the Keyword >WORDS>
$string1 = $'; #string1 will take after >WORDS>
$string11 = $_; #string11 will take the full.
push (#new_array, ($string1)); #string1 = #new_array
push (#array1, ($string11)); }} #string11 = #array1
while (<OLD_FILE>) {
my $string2= $_;
if ($string2 =~ m/WORDS/){ #matching the Keyword >WORDS>
$string2 = $'; #string2 will take after >WORDS>
push (#old_array, ($string2)); #string2 = #old_array
}}
#------Do comparison between new file and old file. (only after WORDS)
my #intersection =();
my #unintersection = ();
my %hash1 = map{$_ => 1} #old_array;
foreach (#new_array){
if (defined $hash1{$_}){
push #intersection, $_; #this one will take the same array between new and old
}
else {
push #unintersection, $_; #this one will take the new array only. So, will read this one.
}}
Until this part, if I print the #unintersection, it will produce:
other than that?
when do you want to have lunch?.
Do comparison between#unintersection (result after WORDS) and (#array1).
my #same();
my #not_same= ();
my %hash2 = map{$_ => 1} #unintersection;
foreach (#array1) {
if (#array1 = m/WORDS/){
#array1 = $';
if (defined $hash2{$_}) {
#array1 = $_;
push #same, $_;
}
else {
push #not_same, $_;}}}
print #same;
print #not_same;
close(NEW_FILE);
close(OLD_FILE);
close(NEW_OUTPUT_FILE);
The result that I produce only 1. have lunch?"
Other >WORDS> Other than that?
Should be got 2 output. "Other >WORDS> Other than that?" and "Lunch >WORDS> when do you want to have lunch?"
The problem can be solved with a lookup table (implemented as hashref) build on information provided in File1.txt (words_lookup.dat).
Once we have lookup table at our disposal read File2.txt (words_data.dat) and compare with lookup table. If the input line matches lookup table then output stored value ($lookup->{$1}{line}) to the console.
use strict;
use warnings;
use feature 'say';
my($fh, $lookup);
my $fname_lookup = 'words_lookup.dat'; # File1.txt
my $fname_data = 'words_data.dat'; # File2.txt
my $re_lookup = qr/(\S+)\s+>WORDS>\s+(.*)/;
open $fh, '<', $fname_lookup
or die "Couldn't open $fname_lookup";
while( <$fh> ) {
chomp;
next unless /$re_lookup/;
$lookup->{$1}{sentence} = $2;
$lookup->{$1}{line} = $_;
}
close $fh;
open $fh, '<', $fname_data
or die "Couldn't open $fname_data";
while( my $line = <$fh> ) {
$line =~ /$lookup->{$_}{sentence}/ && say $lookup->{$_}{line} for keys $lookup->%*;
}
close $fh;
exit 0;
Output
Other >WORDS> Other than that?
Lunch >WORDS> when do you want to have lunch?.

Search and replace multiple lines from a file

I'm trying to remove a part of a.txt file and replace with contents of b.txt file while also doing modification to other lines in a.txt using a Perl program.
file a.txt
line1
line2
replace from below line
replace from this line
bla bla...
bla bla...
to this line
line3
line4
file b.txt
replacement1
replacement2
replacementn
Below is my code which is not working.
#!/apps/perl/5.8.3/bin/perl -w
open (INPUT, "a.txt") or die $!;
open (REPLACE, "b.txt") or die $!;
open (OUTPUT, ">c.txt") or die $!;
my $replace_text;
{
local $/;
$replace_text = <REPLACE>;
}
close(REPLACE);
while (<INPUT>) {
s/line1/modified_line1/;
s/line2/modified_line2/;
if($_ =~ /replace from below line/){
while(<INPUT>){
{
local undef $/;
s/replace from this line.*to this line/$replace_text/smg;
}
s/line3/modified_line3/;
s/line4/modified_line4/;
print OUTPUT;
}
}
}
close(INPUT);
close(OUTPUT);
Expected output file c.txt
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
Can someone help me understand where I'm going wrong?
I don't think you need nested while loops to read your input file.
One way is to use a variable to control when you print to the output file:
use warnings;
use strict;
open (INPUT, "a.txt") or die $!;
open (REPLACE, "b.txt") or die $!;
open (OUTPUT, ">c.txt") or die $!;
my $replace_text;
{
local $/;
$replace_text = <REPLACE>;
}
close(REPLACE);
my $print = 1;
while (<INPUT>) {
s/line(\d)/modified_line$1/;
$print = 0 if /replace from below line/;
if (/to this line/) {
$print = 1;
$_ = $replace_text;
}
print OUTPUT if $print;
}
close(INPUT);
close(OUTPUT);
Output:
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
I also consolidated your 4 line substitutions into 1 using \d.
As much as I like perl, it's really not necessary here:
sed -e 's/line1/modified_line1/' \
-e 's/line2/modified_line2/' \
-e 's/line3/modified_line3/' \
-e 's/line4/modified_line4/' \
-e '/replace from below/rb.txt' \
-e '/replace from below/,/to this line/d' a.txt
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
If you did want to use perl, I'd just do:
#!/usr/bin/env perl
use strict;
use warnings;
open my $ah, '<', "a.txt" or die "a.txt: $!\n";
while(<$ah>) {
s/line1/modified_line1/;
s/line2/modified_line2/;
s/line3/modified_line3/;
s/line4/modified_line4/;
if( /replace from below/ ){
system "cat b.txt" and exit 1;
}
next if( /replace from below/ .. /to this line/);
print;
}
The problem description does not specify how big can be a.txt file. Posted code utilizes regular expression with modifier /smg what indicates that OP tries to work on multiline text.
Let's assume that input file is small enough to be read and processed in the memory.
For code manageability substitute placed into __DATA__ block which read in %substitute hash.
Build regular expression $re based on keys %substitute to utilize in substitution pattern.
Multiline substitution is based on original OP's code (is not applicable to line by line read of input data).
Two subroutines defined to read content of the file into variable and to store variable data into a file -- just to make the code easier to read and understand.
use strict;
use warnings;
use feature 'say';
my($fname_in,$fname_repl,$fname_out) = qw/a.txt b.txt c.txt/;
my %substitute = split(/[,\s]/, do{ local $/; <DATA>} );
my $re = '\b(' . join('|',keys %substitute) . ')\b';
my $data = read_file($fname_in);
my $replace_with = read_file($fname_repl);
$data =~ s/$re/$substitute{$1}/g;
$data =~ s/replace from below line.*?to this line/$replace_with/gsm;
save_file($fname_out,$data);
say $data;
exit 0;
sub read_file {
my $fname = shift;
my $data;
open my $fh, '<', $fname
or die "Couldn't open $fname";
$data = do { local $/; <$fh> };
close $fh;
return $data;
}
sub save_file {
my $fname = shift;
my $data = shift;
open my $fh, '>', $fname
or die "Couldn't open $fname";
say $fh $data;
close $fh;
}
__DATA__
line1,modified_line1
line2,modified_line2
line3,modified_line3
line4,modified_line4
Output
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4

Multiple pattern match and replace

How to extract patterns from a file and replace the multiple patterns with a new pattern from a file?
For example:
Lets say the pattern file is pattern.txt, as follows with 2,000 lines.
a
b
d
e
f
....
...
...
File to replace pattens is replace.txt containing:
a,1
b,3
c,5
d,10
e,14
....
...
...
The intended final file content for file patterns.txt is:
a,1
b,3
d,10
e,14
....
...
...
Perl from command line,
perl -i -pe'
BEGIN{ local (#ARGV, $/, $^I) =pop; %h = split /[\s,]+/, <> }
s| (\S+)\K |,$h{$1}|x
' pattern.txt replace.txt
It slurps content of second file ($/ to undef), and temporarily disables in-place editing ($^I to undef), splits string on white-spaces/commas and populate %h hash in key/value manner. Then for every line of first file adds comma and value for current key.
With the possibility of arbitrary characters in your input, it might be safest to use Text::CSV. The benefit is that it will handle things like quoted delimiters, multiline strings, etc. The drawback is that it can break on non-csv content, so it sort of relies on your input being proper csv.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
my %s;
my ($input, $replace) = #ARGV;
open my $fh, "<", $replace or die "Cannot open $replace: $!";
while (my $row = $csv->getline($fh)) {
my ($key, $line) = #$row;
$s{$key} = $line;
}
open $fh, "<", $input or die "Cannot open $input: $!";
while (<$fh>) {
chomp;
$csv->print(*STDOUT, [$_, $s{$_}]);
}
Not sure this really needs a regex as you're not really altering your source, as much as 'just' printing based on key fields.
So I would approach it something like this:
#!/usr/bin/env perl
use strict;
use warnings;
open( my $replace, "<", "replace.txt" ) or die $!;
my %replacements;
while (<$replace>) {
chomp;
my ( $key, $value ) = split(/,/);
$replacements{$key} = $value;
}
close($replace);
open( my $input, "<", "input.txt" ) or die $!;
open( my $output, ">", "patterns.txt" ) or die $!;
while ( my $line = <$input> ) {
chomp $line;
if ( $replacements{$line} ) {
print {$output} $replacements{$line}, "\n";
}
}
close($input);
close($output);
It's not as concise as some of the other examples, but hopefully clearer what it's actually doing. This I call a good thing. (I can make it much more compact, in the way that perl is (in)famous for).

Find and get the location of a list of words in a text

I'm sure this is simple but I just can't figure out what to do...
I have a text file with a bunch of words in it (let's call it "wordlist") organized in a single column. Then I have a big text file (let's call it "essay"). What I want to do is to look in the "essay" file for the words in my "wordlist".
The trick is that I want to know the position of the matched word in the "essay" (meaning, match found after X characters).
I'm actually able to do it when I look for a single word (so wordlist containing just 1 word) but I can't get it to work when working with a list of words...
Any advice ?
thanks a lot
Ok so I just realized it would just tell me "no match found" anyway...Here is the code
use strict;
use warnings;
open (my $wordlist, "<", "/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "/essay.txt")
or die "cannot open < essay.txt $!";
while (<$essay>) { print "match found\n" if ($essay =~ m/$wordlist/) ; }
{ print "no match found\n" if ($essay !~ m/$wordlist/) ; }
Help please...?
perl index function basically matches substring which does not ensure the match of a full string. A regular expression based match is more useful here imho.
Explanation:
Read whole text of essay in a string. => $essay
For each word from wordlist.txt => $_
-- Keep matching $_ within $essay with proper regex. The one used here is b$_\b
-- For each match, collect the value of #-[0]
\b: is the word boundary character here which ensures that it only matches with complete words not substrings.
#-: is a special variable that contains the start position of the last regex match.
Here is a sample code:
use strict;
use warnings;
use 5.010;
my $wordlist_file = 'wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
my %pos;
my $essay_file = 'essay.txt';
my $essay = do {
local $/ = undef;
open my $fh, "<", $essay_file
or die "could not open $essay_file: $!";
<$fh>;
};
while (<$wordlist_fh>) {
chomp;
$pos{$_} = [] unless $pos{$_};
while($essay =~ m/\b$_\b/g){
push #{$pos{$_}}, #-;
}
}
use Data::Dumper;
print Dumper(\%pos);
the wordlist file and essay files are similar as mentioned by ThisSuitIsBlackNot.
wordlist.txt
I
Perl
hacker
essay.txt
I want to be just another Perl hacker when I grow up
I want to be just another Perl hacker when I grow up
The %pos hash now contains all the positions of your each word. I just showed them through dumper
$VAR1 = {
'hacker' => [
'31',
'84'
],
'Perl' => [
'26',
'79'
],
'I' => [
'0',
'43',
'53',
'96'
]
};
Note that the counts are including the newline characters at the end of each line.
Maybe you can use index() function.
Here is the link: Using the Perl index() function
This is my sample. The performance may be not too well. Hope it helps~:)
open (my $wordlist, "<", "files/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "files/essay.txt")
or die "cannot open < essay.txt $!";
my $words = {};
while (<$wordlist>) {
chomp($_);
$words->{$_} = 1;
}
my $row_count = 0;
while (<$essay>) {
$row_count++;
chomp($_);
foreach my $word (keys %{$words}) {
my $offset = 0;
my $r = index($_, $word, $offset);
while ($r != -1) {
print "Found [$word] in line $row_count at $r\n";
$offset = $r + 1;
$r = index($_, $word, $offset);
}
}
}
In your code, $essay and $wordlist are both filehandles. When you say
print "match found\n" if ($essay =~ m/$wordlist/);
You're trying to match the stringification of one filehandle to the stringification of another filehandle. When a filehandle is stringified, it looks something like this:
GLOB(0x9a26c38)
So your code actually does something like:
print "match found\n" if ('GLOB(0x9a26c38)' =~ m/GLOB(0x94bbc38)/);
This is not what you want. You need to read the contents of your files and compare those, not the filehandles themselves.
Essay words each on their own line
The following code assumes that your "essay" consists of one word per line. We read the contents of the essay file into a hash of arrays, with the lines as keys and an array of positions as values. We use an array in case the same word appears multiple times in the file. The position of the first word is zero. We then loop through the word list file, printing the word and the first matching position, if there is one.
use strict;
use warnings;
use 5.010;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
my $pos = 0;
my %essay;
while (<$essay_fh>) {
chomp;
push #{ $essay{$_} }, $pos;
$pos += length $_;
}
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: $essay{$_}[0]" if exists $essay{$_};
}
essay.txt
I
want
to
be
just
another
Perl
hacker
when
I
grow
up
wordlist.txt
I
Perl
hacker
Output
I: 0
Perl: 20
hacker: 24
Note that I'm ignoring newline characters when computing the position values. You can adjust this as necessary.
Essay words more than one per line
If your essay file can have more than one word per line, we can use a regex to check for matches:
use strict;
use warnings;
use 5.010;
# Slurp entire essay file into a variable
my $essay = do {
local $/;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
<$essay_fh>;
};
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: ", pos($essay) - length($_) if $essay =~ /\b$_\b/g;
}
essay.txt
I want to be just another Perl hacker when I grow up
wordlist.txt
I
Perl
hacker
hack
Output
I: 0
Perl: 26
hacker: 31
Note that the results are a little bit different from our other program, because now there are spaces between words. Also note that there is no output for the word hack, since we're only checking for whole word matches.

How can I extract and save text using Perl?

No extracted data output to data2.txt? What goes wrong to the code?
MyFile.txt
ex1,fx2,xx1
mm1,nn2,gg3
EX1,hh2,ff7
This is my desired output in data2.txt:
ex1,fx2,xx1
EX1,hh2,ff7
#! /DATA/PLUG/pvelasco/Softwares/PERLINUX/bin/perl -w
my $infile ='My1.txt';
my $outfile ='data2.txt';
open IN, '<', $infile or die "Cant open $infile:$!";
open OUT, '>', $outfile or die "Cant open $outfile:$!";
while (<IN>) {
if (m/EX$HF|ex$HF/) {
print OUT $_, "\n";
print $_;
}
}
close IN;
close OUT;
This regex makes no sense:
m/EX$HF|ex$HF/
Is $HF supposed to be a variable? What are you trying to match?
Also, the second line in every Perl script you write should be:
use strict;
It will make Perl catch such mistakes and tell you about them, rather than silently ignoring them.
while (<IN>) {
if (m/^(EX|ex)\d.*/) {
print OUT "$_";
print $_;
}
}
Sorry if this seems like stating the bleeding obvious, but what's wrong with
grep -i ^ex < My1.txt > data2.txt
... or if you really want to do it in perl (and there's nothing wrong with that):
perl -ne '/^ex/i && print' < My1.txt > data2.txt
This assumes the purpose of the request is to find lines that start with EX, with case-insensitivity.
When I run your code, but name the input file My1.txt instead of MyFile.txt I get the desired output - except with empty lines, which you can remove by removing the , "\n" from the print statement.
The filenames don't match.
open(my $inhandle, '<', $infile) or die "Cant open $infile: $!";
open(my $outhandle, '>', $outfile) or die "Cant open $outfile: $!";
while(my $line = <$inhandle>) {
# Assumes that ex, Ex, eX, EX all are valid first characters
if($line =~ m{^ex}i) { # or if(lc(substr $line, 0 => 2) eq 'ex') {
print { $outhandle } $line;
print $line;
}
}
And yes, always always use strict;
You could also chomp $line and (if using perl 5.10) say $line instead of print "$line\n".