Multiple pattern match and replace - regex

How to extract patterns from a file and replace the multiple patterns with a new pattern from a file?
For example:
Lets say the pattern file is pattern.txt, as follows with 2,000 lines.
a
b
d
e
f
....
...
...
File to replace pattens is replace.txt containing:
a,1
b,3
c,5
d,10
e,14
....
...
...
The intended final file content for file patterns.txt is:
a,1
b,3
d,10
e,14
....
...
...

Perl from command line,
perl -i -pe'
BEGIN{ local (#ARGV, $/, $^I) =pop; %h = split /[\s,]+/, <> }
s| (\S+)\K |,$h{$1}|x
' pattern.txt replace.txt
It slurps content of second file ($/ to undef), and temporarily disables in-place editing ($^I to undef), splits string on white-spaces/commas and populate %h hash in key/value manner. Then for every line of first file adds comma and value for current key.

With the possibility of arbitrary characters in your input, it might be safest to use Text::CSV. The benefit is that it will handle things like quoted delimiters, multiline strings, etc. The drawback is that it can break on non-csv content, so it sort of relies on your input being proper csv.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
my %s;
my ($input, $replace) = #ARGV;
open my $fh, "<", $replace or die "Cannot open $replace: $!";
while (my $row = $csv->getline($fh)) {
my ($key, $line) = #$row;
$s{$key} = $line;
}
open $fh, "<", $input or die "Cannot open $input: $!";
while (<$fh>) {
chomp;
$csv->print(*STDOUT, [$_, $s{$_}]);
}

Not sure this really needs a regex as you're not really altering your source, as much as 'just' printing based on key fields.
So I would approach it something like this:
#!/usr/bin/env perl
use strict;
use warnings;
open( my $replace, "<", "replace.txt" ) or die $!;
my %replacements;
while (<$replace>) {
chomp;
my ( $key, $value ) = split(/,/);
$replacements{$key} = $value;
}
close($replace);
open( my $input, "<", "input.txt" ) or die $!;
open( my $output, ">", "patterns.txt" ) or die $!;
while ( my $line = <$input> ) {
chomp $line;
if ( $replacements{$line} ) {
print {$output} $replacements{$line}, "\n";
}
}
close($input);
close($output);
It's not as concise as some of the other examples, but hopefully clearer what it's actually doing. This I call a good thing. (I can make it much more compact, in the way that perl is (in)famous for).

Related

Search and replace multiple lines from a file

I'm trying to remove a part of a.txt file and replace with contents of b.txt file while also doing modification to other lines in a.txt using a Perl program.
file a.txt
line1
line2
replace from below line
replace from this line
bla bla...
bla bla...
to this line
line3
line4
file b.txt
replacement1
replacement2
replacementn
Below is my code which is not working.
#!/apps/perl/5.8.3/bin/perl -w
open (INPUT, "a.txt") or die $!;
open (REPLACE, "b.txt") or die $!;
open (OUTPUT, ">c.txt") or die $!;
my $replace_text;
{
local $/;
$replace_text = <REPLACE>;
}
close(REPLACE);
while (<INPUT>) {
s/line1/modified_line1/;
s/line2/modified_line2/;
if($_ =~ /replace from below line/){
while(<INPUT>){
{
local undef $/;
s/replace from this line.*to this line/$replace_text/smg;
}
s/line3/modified_line3/;
s/line4/modified_line4/;
print OUTPUT;
}
}
}
close(INPUT);
close(OUTPUT);
Expected output file c.txt
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
Can someone help me understand where I'm going wrong?
I don't think you need nested while loops to read your input file.
One way is to use a variable to control when you print to the output file:
use warnings;
use strict;
open (INPUT, "a.txt") or die $!;
open (REPLACE, "b.txt") or die $!;
open (OUTPUT, ">c.txt") or die $!;
my $replace_text;
{
local $/;
$replace_text = <REPLACE>;
}
close(REPLACE);
my $print = 1;
while (<INPUT>) {
s/line(\d)/modified_line$1/;
$print = 0 if /replace from below line/;
if (/to this line/) {
$print = 1;
$_ = $replace_text;
}
print OUTPUT if $print;
}
close(INPUT);
close(OUTPUT);
Output:
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
I also consolidated your 4 line substitutions into 1 using \d.
As much as I like perl, it's really not necessary here:
sed -e 's/line1/modified_line1/' \
-e 's/line2/modified_line2/' \
-e 's/line3/modified_line3/' \
-e 's/line4/modified_line4/' \
-e '/replace from below/rb.txt' \
-e '/replace from below/,/to this line/d' a.txt
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4
If you did want to use perl, I'd just do:
#!/usr/bin/env perl
use strict;
use warnings;
open my $ah, '<', "a.txt" or die "a.txt: $!\n";
while(<$ah>) {
s/line1/modified_line1/;
s/line2/modified_line2/;
s/line3/modified_line3/;
s/line4/modified_line4/;
if( /replace from below/ ){
system "cat b.txt" and exit 1;
}
next if( /replace from below/ .. /to this line/);
print;
}
The problem description does not specify how big can be a.txt file. Posted code utilizes regular expression with modifier /smg what indicates that OP tries to work on multiline text.
Let's assume that input file is small enough to be read and processed in the memory.
For code manageability substitute placed into __DATA__ block which read in %substitute hash.
Build regular expression $re based on keys %substitute to utilize in substitution pattern.
Multiline substitution is based on original OP's code (is not applicable to line by line read of input data).
Two subroutines defined to read content of the file into variable and to store variable data into a file -- just to make the code easier to read and understand.
use strict;
use warnings;
use feature 'say';
my($fname_in,$fname_repl,$fname_out) = qw/a.txt b.txt c.txt/;
my %substitute = split(/[,\s]/, do{ local $/; <DATA>} );
my $re = '\b(' . join('|',keys %substitute) . ')\b';
my $data = read_file($fname_in);
my $replace_with = read_file($fname_repl);
$data =~ s/$re/$substitute{$1}/g;
$data =~ s/replace from below line.*?to this line/$replace_with/gsm;
save_file($fname_out,$data);
say $data;
exit 0;
sub read_file {
my $fname = shift;
my $data;
open my $fh, '<', $fname
or die "Couldn't open $fname";
$data = do { local $/; <$fh> };
close $fh;
return $data;
}
sub save_file {
my $fname = shift;
my $data = shift;
open my $fh, '>', $fname
or die "Couldn't open $fname";
say $fh $data;
close $fh;
}
__DATA__
line1,modified_line1
line2,modified_line2
line3,modified_line3
line4,modified_line4
Output
modified_line1
modified_line2
replacement1
replacement2
replacementn
modified_line3
modified_line4

How to grep word from file

I want to grep some word inside a file from another file. My code is able to grep the word on last line of the file but not the word before it. I have no idea why and hope can get help here. Below is the perl script i using:
open(FILE1,"file1.txt") or die "Error, File1 could not open\n";
open(FILE2,"file2.txt") or die "Error, File2 could not open\n";
open(FILE3, ">file3.txt") or die "Error, File3 could not open\n";
use strict;
use warnings;
use List::MoreUtils qw(uniq);
my #file1=<FILE1>;
my #file2=<FILE2>;
my $j =0;
my $i =0;
my $zone =0;
for ($j=0; $j<=$#file2; $j++){
$zone = $file2[$j];
unless ( $zone =~ m/#(.*?)/ ) {
print "$zone";
my #fid = grep /$zone/ , #file1;
#fid = uniq(#fid);
s{^\s+|\s+$}{}g foreach #fid; #cancel leading space
for ($i=0; $i<=$#fid; $i++){
print FILE3 "$fid[$i]\n";
}
##fid=();
}
}
close(FILE3);
My file1.txt is something like this:
i am a dog
i am a cat
we are the fish
he is a boy
she is a girl
My file2.txt is like this:
is
am
But my file3 can only show those sentence contain am but no is, if i put is in second line and am in first line then my file3 only contain the sentences with is. I not very sure why my code can only grep the last row in my file2. Thanks for the help.
When reading from a file, the final newline is part of each line read. You can remove the newlines from the pattern array by chomping:
chomp( my #file2 = <FILE2> );
You can already do this with egrep :
egrep -f file2.txt file1.txt
The root of this problems is chomp - you're not removing linefeeds, so the matches aren't working.
But aside from that, there's a few problems with your code that could do with addressing:
opening files, you should use 3 arg open with lexical file handles, as it's better style: open (my $file1, '<', 'file1.txt' ) or die $!;
rather than a loop of loops, you'd probably be better off compiling up a 'match regex'.
Instead of reading all of a file into an array, you can iterate line by line, and you don't need to use the memory.
If you're iterating a loop, and only using the index to acccess the current element, you're far better off using foreach my $line ( #things ) { type syntax.
So your code actually could be reduced to something like:
#!/usr/bin/env perl
use strict;
use warnings;
open(my $data, '<',"file1.txt") or die $!;
open(my $search, '<', "file2.txt") or die $!;
open(my $output, '>', "file3.txt" ) or die $!;
chomp ( my #search_terms = <$search> );
#quotemeta is needed to avoid 'special' regex characters doing things.
my $search_regex = join "|", map { quotemeta }, #search_terms;
#note - '\b' denotes word boundary, which may not be what you want.
#means 'is' won't match 'fish'
#so get rid of them if that's not what you want.
$search_regex = qr/\b($search_regex)\b/;
print "Using: $search_regex\n";
select $output; #default print destination
while ( <$data> ) {
print if m/$search_regex/;
}
Output (in 'file3.txt'):
i am a dog
i am a cat
he is a boy
she is a girl
please try this .
use strict;
use warnings;
use List::MoreUtils qw(uniq);
open(FILE1,"file1.txt") or die "Error, File1 could not open\n";
open(FILE2,"file2.txt") or die "Error, File2 could not open\n";
open(FILE3, ">file3.txt") or die "Error, File3 could not open\n";
my #file1=<FILE1>;
my #file2=<FILE2>;
my $j =0;
my $i =0;
foreach my $main_line(#file1){
chomp($main_line);
foreach my $line(#file2){
chomp($line);
if ($main_line =~ /$line/i) {
print FILE3 "$main_line\n";
}
}
}
close(FILE3);
thanks,
praveenzx~

Search for a pattern in a file

I have a file eg.txt with contents of this sort :
....text...
....text...
COMP1 = ../../path1/path2/path3
COMP2 = ../../path4/path5/path6
and so on, for a large number of application names (the "COMP"s). I need to get the path -- the stuff including and after the second slash -- for a user-specified application.
This is the code I've been trying :
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while (<$fh>) {
$line = <$fh>;
if ( $line && $line =~ /($app)( = )(..\/)(..)(.*)/ ) {
print $5;
}
}
This prints the name of the user-input application, and does nothing else. Any help would be greatly appreciated!
There are two main problems with your program
The $app variable contains a newline at the end from the enter key you pressed when you typed it in. That will prevent the pattern from matching so you need to use chomp to remove it. The same applies to lines read from your file
The <$fh> in your while statement reads a line from your file into the default variable $_, and then $line = <$fh> reads another, so you are ignoring alternate lines from the file
Here is a version of your program that I think should work although I am unable to test it at present. I have dropped your $line variable altogether and hope that doesn't confuse you. $_ is the default variable for the pattern match so it isn't mentioned explicitly anywhere
use strict;
use warnings;
print "Enter the app: ";
my $app = <STDIN>;
chomp $app;
open my $fh, '<', 'eg.txt' or die "Cannot open: $!";
while ( <$fh> ) {
if ( /$app\s*=\s*(.+)/ ) {
my $path = $1;
$path =~ s/.*\.\.//;
print $path, "\n";
}
}
The input did not matched in regex because newlines were coming along with them, so better use chomp to trim them. In while loop you are displacing two times the file handle, I don't know why. So after corrections this should work:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
chomp($app = <STDIN>);
print "$app: ";
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while($line = <$fh>)
{
chomp $line;
if($line && $line =~ /($app)( = )(..\/)(..)(.*)/)
{
print "$5 \n";
}
}
close($fh);
Try this code:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
my #line = <$fh>;
my #fetch = map { /COMP\d+\s\=\s(\..\/\..\/.*)/g } #line ;
$, = "\n";
print #fetch;
and then please send your response.
You are accessing <$fh> twice in your loop. This will have the effect of interpreting only every other line. You might want to change the top of the loop to something like this:
while (defined(my $line = <$fh>)) {
and remove the my $line ... at the top of the program.
Also, you might want to consider chomping your input line so that you don't have to think about the trailing newline character:
while (defined(my $line = <$fh>)) {
chomp $line;
Your regular expression is also a bit dicey. You probably want to bind it to the beginning and end of the search space and escape the literal dots. You may also want $app to be interpreted as a string rather than a regexp, which can be done by wrapping it with \Q...\E. Also unless your file format specifies single spaces around the equals, I'd be tempted to make those flexible to zero or more occurrences. Also, if you aren't going to use the earlier captures, I would say don't do them, so:
if ($line && $line =~ /^\Q$app\E *= *\.\.\/\.\.(.*)$/)
{
print $1;
(Some may say you should use \A and \z rather than ^ and $. That choice is left as an exercise to the reader.)

perl regex to read contents between double quotes

I have a file which contains information something like this:
TAG1 "file1.txt"
some additional lines
TAG2 "file2.txt"
some more lines
TAG3 "file3.txt".
Now, I want to read what is inside the double quotes and assign it to variable ( something like $var1 = file1.txt $var2 = file2.txt $var3 = fil3.txt). Can anyone guild me how to do this.?
You could achieve your goal by
using regular expression
my #files;
while (my $line = <>) {
if (m/"([^"]+)"/) {
push #files, $1;
}
}
using split()
my #files;
while (my $line = <>) {
my (undef, $file, undef) = split /"/, $line, 3;
push #files, $file;
}
Except for the period after "file3.txt". (which I suspect is a artifact from posting the question), your data appears to be a CSV file with tabs.
If that's the case, I advise you to parse the file with Text::CSV
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new ( { sep_char => "\t" } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
my #files;
open my $fh, '<', 'file.csv';
while ( my $row = $csv->getline( $fh ) ) {
push #files; $row->[1];
}
$csv->eof or $csv->error_diag();
close $fh;
print "#files";

How to match exactly two empty lines

I have a question about regular expressions. I have a file and I need to parse it in such a way that I could distinguish some specific blocks of text in it. These blocks of text are separated by two empty lines (there are blocks which are separated by 3 or 1 empty lines but I need exactly 2). So I have a piece of code and this is \s*$^\s*$/ regular expression I think should match, but it does not.
What is wrong?
$filename="yu";
open($in,$filename);
open(OUT,">>out.text");
while($str=<$in>)
{
unless($str = /^\s*$^\s*$/){
print "yes";
print OUT $str;
}
}
close($in);
close(OUT);
Cheers,
Yuliya
By default, Perl reads files a line at a time, so you won't see multiple new lines. The following code selects text terminated by a double new line.
local $/ = "\n\n" ;
while (<> ) {
print "-- found $_" ;
}
New Answer
After having problems excluding >2 empty lines, and a good nights sleep here is a better method that doesn't even need to slurp.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my #blocks; #each element will be an arrayref, one per block
#that referenced array will hold lines in that block
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 0;
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
$block_num++; # move on to next block
$empty = 0;
} else {
$empty = 0;
}
push #{ $blocks[$block_num] }, $line;
}
#write out each block to a new file
my $file_num = 1;
foreach my $block (#blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out join("\n", #$block);
}
In fact rather than store and write later, you could simply write to one file per block as you go:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
open(my $fh, '<', $file);
my $empty = 0;
my $block_num = 1;
open(OUT, '>', $block_num . '.txt');
while (my $line = <$fh>) {
chomp($line);
if ($line =~ /^\s*$/) {
$empty++;
} elsif ($empty == 2) { #not blank and exactly 2 previous blanks
close(OUT); #just learned this line isn't necessary, perldoc -f close
open(OUT, '>', ++$block_num . '.txt');
$empty = 0;
} else {
$empty = 0;
}
print OUT "$line\n";
}
close(OUT);
use 5.012;
open my $fh,'<','1.txt';
#slurping file
local $/;
my $content = <$fh>;
close $fh;
for my $block ( split /(?<!\n)\n\n\n(?!\n)/,$content ) {
say 'found:';
say $block;
}
Deprecated in favor of new answer
justintime's answer works by telling perl that you want to call the end of a line "\n\n", which is clever and will work well. One exception is that this must match exactly. By the regex you are using it makes it seem like there might be whitespace on the "empty" lines, in which case this will not work. Also his method will split even on more than 2 linebreaks, which was not allowed in the OP.
For completeness, to do it the way you were asking, you need to slurp the whole file into a variable (if the file is not so large as to use all your memory, probably fine in most cases).
I would then probably say to use the split function to split the block of text into an array of chunks. Your code would then look something like:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'yu';
my $text;
open(my $fh, '<', $file);
{
local $/; enables slurp mode inside this block
$text = <$fh>;
}
close($fh);
my #blocks = split(
/
(?<!\n)\n #check to make sure there isn't another \n behind this one
\s*\n #first whitespace only line
\s*\n #second "
(?!\n) #check to make sure there isn't another \n after this one
/x, # x flag allows comments and whitespace in regex
$text
);
You can then do operations on the array. If I understand your comment to justintime's answer, you want to write each block out to a different file. That would look something like
my $file_num = 1;
foreach my $block (#blocks) {
open(my $out, '>', $file_num++ . ".txt");
print $out $block;
}
Notice that since you open $out lexically (with my) when it reaches the end of the foreach block, the $out variable dies (i.e. "goes out of scope"). When this happens to a lexical filehandle, the file is automatically closed. And you can do a similar thing to that with justintime's method as well:
local $/ = "\n\n" ;
my $file_num = 1;
while (<>) {
open(my $out, '>', $file_num++ . ".txt");
print $out $block;
}