How to grep capture a multiline pattern of a file in Perl - regex

I have a file that looks something like this:
Random words go here
/attribute1
/attribute2
/attribute3="all*the*things*I'm*interested*in*are*inside*here**
and*it*goes*into*the*next*line.*blah*blah*blah*foo*foo*foo*foo*
bar*bar*bar*bar*random*words*go*here*until*the*end*of*the*sente
nce.*I*think*we*have*enough*words"
I want to grep the file for the line containing /attribute3= and then save the string found inside the quotation marks to a separate variable.
Here's what I have so far:
#!/bin/perl
use warnings; use strict;
my $file = "data.txt";
open(my $fh, '<', $file) or die $!;
while (my $line = <$fh>) {
if ($line =~ /\/attribute3=/g){
print $line . "\n";
}
}
That's printing out /attribute3="all*the*things*I'm*interested*in*are*inside*here** but
I want all*the*things*I'm*interested*in*are*inside*here**and*it*goes*into*the*next*line.*blah*blah*blah*foo*foo*foo*foo*bar*bar*bar*bar*random*words*go*here*until*the*end*of*the*sentence.*I*think*we*have*enough*words.
So what I did next is:
#!/bin/perl
use warnings; use strict;
my $file = "data.txt";
open(my $fh, '<', $file) or die $!;
my $part_I_want;
while (my $line = <$fh>) {
if ($line =~ /\/attribute3=/g){
$line =~ /^/\attribute3=\"(.*?)/; # capture everything after the quotation mark
$part_I_want .= $1; # the capture group; save the stuff on line 1
# keep adding to the string until we reach the closing quotation marks
next (unless $line =~ /\"/){
$part_I_want .= $_;
}
}
}
The code above doesn't work. How do I grep capture a multiline pattern between two characters (in this case it's quotation marks)?

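my $str = do { local($/); <DATA> };   # undef $/ so the whole input is read at once
$str =~ /attribute3="([^"]*)"/ or die "attribute3 not found\n";
$str = $1;
$str =~ s/\n//g;   # join the wrapped lines; the sample breaks mid-word, so add no space
print "$str\n";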
__DATA__
Random words go here
/attribute1
/attribute2
/attribute3="all*the*things*I'm*interested*in*are*inside*here**
and*it*goes*into*the*next*line.*blah*blah*blah*foo*foo*foo*foo*
bar*bar*bar*bar*random*words*go*here*until*the*end*of*the*sente
nce.*I*think*we*have*enough*words"

Read the entire file into a single variable and use /attribute3=\"([^\"]*)\"/ms
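As a minimal sketch of that advice, assuming the file is small enough to hold in memory: the helper name extract_quoted and the in-memory sample below are illustrative only and do not come from the answers.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the quotes that follow
# "$name=", with the line breaks inside the value removed.
sub extract_quoted {
    my ($fh, $name) = @_;
    my $text = do { local $/; <$fh> };    # undef $/ reads the whole file at once
    return unless defined $text;
    my ($value) = $text =~ /\Q$name\E="([^"]*)"/ or return;
    $value =~ s/\r?\n//g;                 # the value wraps mid-word, so join without spaces
    return $value;
}

# Shortened stand-in for data.txt, read through an in-memory filehandle.
my $sample = <<'END';
Random words go here
/attribute2
/attribute3="first*part**
second*part"
END
open my $sample_fh, '<', \$sample or die "Cannot open in-memory file: $!";
print extract_quoted($sample_fh, '/attribute3'), "\n";
```

Because [^"]* cannot cross a quotation mark, the match ends at the first closing quote even if the file contains other quoted values later on.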

From the command line:
perl -n0e '/\/attribute3="(.*)"/s && print $1' foo.txt
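This is basically what you had, but the -0 switch sets the input record separator ($/) to the null character, so a text file that contains no null bytes is read as a single record, much as undef $/ would do within the code. (-0777 is the value that slurps the file unconditionally.) From the man page: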
-0[octal/hexadecimal]
specifies the input record separator ($/) as an octal or hexadecimal number. If there are no digits, the null character is the separator.

Related

Perl: Regex not grabbing multiline C style comments in code

I have a Perl program that:
Reads a SRC file written in C
Uses a regex match from SRC file to find specific formatted data to use as the Destination filename
Opens new Destination file
Performs another regex match to find all C style comments /* */ that contain a keyword abcd. Note: these comments can be 1 line or more than 1 line so the regex is looking for the first /* and then the keyword abcd and then any amount of text and space before it encounters a closing */
Writes the regex matches to the destination file
#!/usr/bin/perl
use warnings;
use strict;
my $src = 'D:\\Scripts\\sample.c';
my $fileName;
# open source file for reading
open(SRC_FH,'<',$src) or die $!;
while(my $row = <SRC_FH>){
if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
$fileName = $1;
}
}
my $des = "D:\\Scripts\\" . $fileName . ".txt";
# open destination file for writing
open(DES_FH,'>',$des) or die $!;
print("copying content from $src to $des\n");
seek SRC_FH, 0, 0;
while(my $row = <SRC_FH>){
if ($row =~ /(\/\*.*abcd.[\s\S]*?\*\/)/){
print DES_FH "$1\n";
}
}
# always close the filehandles
close(SRC_FH);
close(DES_FH);
print "File content copied successfully!\n";
My problem is, I think, that because of the way Perl executes the code, although my regex is correct, my destination file only gets the one-line comments written to it. Any C style comments that span more than one line are not written to the destination file. What am I missing in my 2nd if statement?
I checked the regex from my 2nd if statement at https://regexr.com/ and it works as it's supposed to, capturing multiline as well as single-line C style comments that contain the keyword abcd.
So I tried the 1st suggestion below by zdim. Here is what I used:
#!/usr/bin/perl
use warnings;
use strict;
my $src = 'D:\\Scripts\\sample.c';
my $fileName;
my @comments;
# open source file for reading
open(SRC_FH,'<',$src) or die $!;
while(my $row = <SRC_FH>){
if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
$fileName = $1;
}
}
my $des = "D:\\Scripts\\" . $fileName . ".txt";
# open destination file for writing
open(DES_FH,'>',$des) or die $!;
print("copying content from $src to $des\n");
#seek SRC_FH, 0, 0;
my $content = do {
#read whole file at once
local $/;
open (SRC_FH,'<', $src) or die $!;
<SRC_FH>;
};
#if($content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/sg){
# my @comments = $content;
# }
my @comments = $content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/sg;
foreach (@comments){
print DES_FH "$1\n";
}
#while(my $row = <SRC_FH>){
# if ($row =~ /(\/\*.*abcd.[\s\S]*?\*\/)/){
# print DES_FH "$1\n";
# }
#}
# always close the filehandles
close(SRC_FH);
close(DES_FH);
print "File content copied successfully!\n";
The result is that all the content of sample.c is copied to the destination file: a full 1:1 copy. What I am looking for is to pull only the comments, single-line and multiline, out of the C file.
Example 1:
/* abcd */
Example 2:
/* some text
* some more comments
abcd and some more comments */
Final Solution
#!/usr/bin/perl
use warnings;
use strict;
my $src = 'D:\\Scripts\\sample.c';
my $fileName;
# open source file for reading
open(SRC_FH,'<',$src) or die $!;
while(my $row = <SRC_FH>){
if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
$fileName = $1;
}
}
my $des = "D:\\Scripts\\" . $fileName . ".txt";
# open destination file for writing
open(DES_FH,'>',$des) or die $!;
print("copying content from $src to $des\n");
seek SRC_FH, 0, 0;
my $content = do{local $/; <SRC_FH>};
my @comments = $content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/g;
for(@comments){
print DES_FH "$_\n";
}
# always close the filehandles
close(SRC_FH);
close(DES_FH);
print "File content copied successfully!\n";
What am I missing in my 2nd if statement?
Well, nothing -- it's just that in a multiline C comment no single line contains both /* and */. Thus that regex just cannot match a multiline comment when the file is read line by line.
To catch such comments, either:
Read the whole file into a string ("slurp" it), and then add the /s modifier to the regex so that . matches a newline as well. Also use the /g modifier to capture all such patterns in the string. One way:
my $content = do {
local $/; # undef record separator so the whole file is read at once
open my $src_fh, '<', $src_file or die $!; # have to re-open
<$src_fh>; # reads it all
}; # lexical filehandle gets closed as we leave scope
# NOTE -- there may be difficulties in capturing comments in a C source file
my @comments = $content =~ /.../sg; # your regex
Or use a library to slurp a file, like
use Path::Tiny;
my $content = path($src_file)->slurp;
Or,
Set a flag when you see /*, get/print all lines until you hit the closing */, then unset the flag. Here is a rudimentary version of that
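use feature 'say';   # say() is not enabled by default
my $inside_comment = 0;
while (my $line = <$src_fh>) {
chomp $line;   # say adds the newline back
if (!$inside_comment and $line =~ m{(/\*.*)}) { # opening line for the comment
say $des_fh $1;
$inside_comment = $line !~ m{\*/}; # a one-line comment closes right away
}
elsif ($inside_comment and $line =~ m{(.*\*/)}) { # closing line for the comment
say $des_fh $1;
$inside_comment = 0;
}
elsif ($inside_comment) { say $des_fh $line }
}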
I tested all this but please check and improve. For one, this plays funny with leading spaces.
Note: Getting all comments out of a C program in general may be rather tricky.
Here is a one-line version of slurping
my $file_content = do { local (@ARGV, $/) = $file_name; <> };
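Once the file is in one string, one way to keep a match from running past the end of a comment is to extract every comment first and filter on the keyword afterwards. This is only a sketch: the helper name comments_with_keyword and the sample source are made up for illustration, and it does not handle /* appearing inside string literals or // comments.

```perl
use strict;
use warnings;

# Hypothetical helper: return every /* ... */ comment in $source that
# mentions $keyword. The non-greedy .*? with /s stops at the first */,
# so each match is exactly one comment, even across line breaks.
sub comments_with_keyword {
    my ($source, $keyword) = @_;
    my @comments = $source =~ m{(/\*.*?\*/)}sg;
    return grep { /\Q$keyword\E/ } @comments;
}

# Made-up C fragment built around the two example comments.
my $c_source = <<'END';
/* abcd */
int x = 1; /* unrelated note */
/* some text
 * some more comments
 abcd and some more comments */
END

print "$_\n" for comments_with_keyword($c_source, 'abcd');
```

Splitting the job in two avoids a greedy .* before the keyword, which with /s can stretch from the first /* in the file to the last occurrence of the keyword.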

how to solve "Use of uninitialized value $2 in concatenation (.) or string at"

Below is my code. I want to print the data $1 and $2 in one row and split it with ,. Why can't I print the data?
#!/usr/intel/bin/perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
my $input = "par_disp_fabric.all_max_lowvcc_qor.rpt.gz";
my $output = "par_disp_fabric.all_max_lowvcc_qor.txt";
gunzip $input => $output
or die "gunzip failed: $GunzipError\n";
open (FILE, '<',"$output") or die "Cannot open $output\n";
while (<FILE>) {
my $line = $_;
chomp ($line);
if ($line =~ m/^\s+Timing Path Group \'(\S+)\'/) {
$line = $1;
if ($line =~ m/^\s+Levels of Logic:\s+(\S+)/) {
$line = $2;
}
}
print "$1,$2\n";
}
close (FILE);
The meat of your program is here:
if ($line =~ m/^\s+Timing Path Group \'(\S+)\'/) {
$line = $1;
if ($line =~ m/^\s+Levels of Logic:\s+(\S+)/) {
$line = $2;
}
}
The regex capturing variables ($1, $2, etc) are set when you match a string against a regex that contains sets of capturing parentheses. The first capturing parentheses set the value of $1, the second capturing parentheses set the value of $2, and so on. In order for $2 to be given a value, you need to match against a regex that contains two sets of capturing parentheses.
Both of your regexes only contain a single set of capturing parentheses. Therefore only $1 will be set on each of your matches. $2 will never be given a value - leading to the warning that you are seeing.
You need to rethink the logic in your code. I'm not sure why you think $2 will have a value here. Your code is a little confusing, so I'm unable to offer a more specific solution.
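One possible reading of the intent is that each "Timing Path Group" line is followed, some lines later, by a "Levels of Logic" line, and that the two values should be printed together. Under that assumption, a sketch could look like the following; the helper name group_levels and the report lines are invented for illustration and may not match the real report.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each "Timing Path Group" name with the
# "Levels of Logic" value that follows it. Each regex has one capture
# group, so each match is read from $1 and saved in a named variable.
sub group_levels {
    my (@lines) = @_;
    my ($group, @rows);
    for my $line (@lines) {
        if ($line =~ /^\s+Timing Path Group '(\S+)'/) {
            $group = $1;                  # remember it until the levels line shows up
        }
        elsif (defined $group and $line =~ /^\s+Levels of Logic:\s+(\S+)/) {
            push @rows, "$group,$1";
            undef $group;
        }
    }
    return @rows;
}

# Made-up report lines, only to exercise the helper.
my @report = (
    "  Timing Path Group 'clk_a'",
    "  -----------------------------------",
    "  Levels of Logic:              12",
    "  Timing Path Group 'clk_b'",
    "  Levels of Logic:               7",
);
print "$_\n" for group_levels(@report);
```

The key point is that the first value is copied out of $1 before the second match runs, since every successful match overwrites the capture variables.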
I can, however, give you some more general advice:
Use lexical filehandles and the three-arg version of open().
open my $fh, '<', "$output"
There is no need for the quotes around $output.
open my $fh, '<', $output
I know why you're doing it, but $output is a potentially confusing name for a file that you read from. Consider changing it.
Always include $! in an open() error message.
open my $fh, '<', $output or die "Cannot open '$output': $!\n";
Your $line variable seems unnecessary. Why not just keep the row data in $_, which will simplify your code:
while (<$fh>) {
chomp; # works on $_ by default
if (/some regex/) { # works on $_ by default
# etc...
}
}

Find-Replace Multiple Occurrences of a string and append iterating number

How can I iterate over the code of an HTML file, find certain recurring text, and then append a word and an incrementing number to it?
So:
<!-- TemplateBeginEditable -->
<!-- TemplateBeginEditable -->
<!-- TemplateBeginEditable -->
etc...
Becomes :
<!-- TemplateBeginEditable Event=1 -->
<!-- TemplateBeginEditable Event=2 -->
<!-- TemplateBeginEditable Event=3 -->
etc...
I have tried Perl, thinking it would be the easiest/fastest, then went to jQuery, and then back to Perl.
It seems simple enough to find/replace many ways with REGEX and return an array of the occurrences, but getting the iterating variable tacked on proves to be more of a challenge.
Latest Example of what I have tried:
#!/usr/bin/perl -w
# Open input file
open INPUTFILE, "<", $ARGV[0] or die $!;
# Open output file in write mode
open OUTPUTFILE, ">", $ARGV[1] or die $!;
# Read the input file line by line
while (<INPUTFILE>) {
my @matches = ($_ =~ m/TemplateBeginEditable/g);
### what do I do with the matches array? ###
$_ =~ s/TemplateBeginEditable/TemplateBeginEditable Event=/g;
print OUTPUTFILE $_;
}
close INPUTFILE;
close OUTPUTFILE;
To perform a replacement, you don't need to match the pattern first; you can perform the replacement directly. Example with your code:
while (<INPUTFILE>) {
s/TemplateBeginEditable/TemplateBeginEditable Event=/g;
print OUTPUTFILE $_;
}
Now to add a counter incremented at each replacement, you can put a piece of code in the pattern itself using this syntax:
my $i;
while (<INPUTFILE>) {
s/TemplateBeginEditable(?{ ++$i })/TemplateBeginEditable Event=$i/g;
print OUTPUTFILE $_;
}
To make it shorter you can use the \K feature to change the start of the match result:
while (<INPUTFILE>) {
s/TemplateBeginEditable\K(?{ ++$i })/ Event=$i/g;
print OUTPUTFILE $_;
}
Or with a one-liner:
perl -pe 's/TemplateBeginEditable\K(?{++$i})/ Event=$i/g' file > output
If you have awk available, and the target text only occurs at most once per line, then Perl is overkill I think:
awk 'BEGIN{n=1}{n+=sub("TemplateBeginEditable","& Event="n)}1'
Some explanation: The sub function returns the number of substitutions performed (0 or 1); the & means "whatever matched"; "..."n is string concatenation (no operator in awk); the 1 is a "true" condition that invokes the default "action" of {print}.
Expanding on my one-liner in the comments:
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift or die "Usage: $0 <filename>\n";
open my $fh, '<', $file or die "Cannot open $file: $!\n";
open my $ofh, '>', "$file.modified" or die "Cannot open $file.modified: $!\n";
my $i = 1;
while (my $line = <$fh>) {
if ($line =~ s/TemplateBeginEditable/$& Event=$i/) {
$i++;
}
print $ofh $line;
}
__END__
Note that this assumes you will never have more than one instance of your desired text on a single line, as shown in your sample input.
I'd just do:
local $/=undef;
my $content = <FH>;
my $x = 0;
$content =~ s/(My expected pattern)/$1 . " time=" . (++$x)/ge;
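Applied to the question's input, the /e approach might look like the sketch below. The helper name number_markers is hypothetical, and the sample string simply repeats the line from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: append " Event=N" after every occurrence of
# $marker, numbering the occurrences across the whole text.
sub number_markers {
    my ($text, $marker) = @_;
    my $count = 0;
    # /e evaluates the replacement as Perl code, once per match
    $text =~ s/(\Q$marker\E)/$1 . ' Event=' . ++$count/ge;
    return $text;
}

my $html = "<!-- TemplateBeginEditable -->\n" x 3;
print number_markers($html, 'TemplateBeginEditable');
```

Because the counter is incremented once per match rather than once per line, this also copes with several markers on the same line.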

Issue with Perl Regex

new perl coder here.
When I copy and paste the text from a website into a text file and read from that file, my perl script works with no issues. When I use getstore to create a file from the website automatically which is what I want, the output is a bunch of |'s.
The text looks identical when I copy and paste, or download the text with getstore.. I'm unable to figure out the problem. Any help would be highly appreciated.
The output that I desire is as follows:
|www\.arkinsoftware\.in|www\.askmeaboutrotary\.com|www\.assculturaleincontri\.it|www\.asu\.msmu\.ru|www\.atousoft\.com|www\.aucoeurdelanature\.
Here is the code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
getstore("http://www.malwaredomainlist.com/hostslist/hosts.txt", "malhosts.txt");
open(my $input, "<", "malhosts.txt");
while (my $line = <$input>) {
chomp $line;
$line =~ s/.*\s+//;
$line =~ s/\./\\\./g;
print "$line\|";
}
The run of | characters you get comes from the comment lines at the beginning of the file, which don't fit your pattern. So the solution is to ignore all such lines.
So instead of
$line =~ s/.*\s+//;
use
next unless $line =~ s/^127.*\s+//;
so that you ignore every line except those starting with 127.
Here's what I'd do:
my $first = 1;
while (<$input>) {
/^127\.0\.0\.1\s+(.+?)\s*$/ or next;
print '|' if !$first;
$first = 0;
print quotemeta($1);
}
This matches your input in a more precise way, and quotemeta takes care of true regex escaping.
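To try that approach without downloading anything, here is a small sketch that runs on a handful of lines. The helper name hosts_alternation and the sample lines are assumptions made up for illustration, in the 127.0.0.1 layout that the answers rely on.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex alternation from hosts-file lines.
# Comment and blank lines fail the match and are skipped; \s*$ also
# swallows a trailing carriage return if the file has CRLF line endings.
sub hosts_alternation {
    my (@lines) = @_;
    my @hosts;
    for my $line (@lines) {
        next unless $line =~ /^127\.0\.0\.1\s+(\S+)\s*$/;
        next if $1 eq 'localhost';
        push @hosts, quotemeta $1;        # escapes every non-word character
    }
    return join '|', @hosts;
}

# Made-up sample lines, including a comment, a blank line and CRLF endings.
my @sample = (
    "# comment line\r\n",
    "\r\n",
    "127.0.0.1  localhost\r\n",
    "127.0.0.1  www.example.com\r\n",
    "127.0.0.1  a-b.example.org\r\n",
);
print hosts_alternation(@sample), "\n";
```

Joining the collected hosts at the end, instead of printing a | after each line, also avoids a stray separator at either end of the result.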
I'd probably go with something like:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
getstore( "http://www.malwaredomainlist.com/hostslist/hosts.txt",
"malhosts.txt" );
open( my $input, "<", "malhosts.txt" );
print join ( "|",
map { m/^\d/ && ! m/localhost/ ?
quotemeta ((split)[1]) : () } <$input> );
Gives:
0koryu0\.easter\.ne\.jp|1\-atraffickim\.tf|10\-trafficimj\.tf|109\-204\-26\-16\.netconnexion\.managedbroadband\.co\.uk|11\-atraasikim\.tf|11\.lamarianella\.info|12\-tgaffickvcmb\.tf| #etc.

System command execution using Perl

I have a Perl script which runs a perforce command and stores the result in a variable $command.
Then it is stored in a file log.txt, and by using a regex the relevant data is taken out.
When I run that command alone the following things pop out:
4680 p4exp/v68 PJIANG-015394 25:34:19 IDLE none
8869 unnamed p4-python R integration semiconductor-project-trunktip turbolinuxclient 01:33:52 IDLE none
8870 unnamed p4-python R integration remote-trunktip-osxclient 01:33:52
The code goes as follows:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
my $command = qx |p4 monitor show -ale|;
open FH, '>>', "log.txt";
print FH $command;
close FH;
open my $log_fh, '<', '/root/log.txt';
my %stat;
while (my $line = <$log_fh>) {
chomp $line;
next if not $line =~ /(\d+)\s+/;
my $killid = $1;
if ($line =~ /R\s+integration/ and $line =~ /IDLE\s+none$/) {
my $killid_details = $line;
$stat{$killid} = $killid_details;
}
}
close $log_fh;
my $killpro;
foreach my $kill (keys %stat) {
print "$kill\n";
}
This code gets the number 8869, but how can I do it without log.txt? Is using an array a better way to do it, or is a hash fine?
Please correct me as I am still learning.
Seems like your main stumbling block is getting line-by-line input for your loop?
Splitting on newlines should do the trick:
my $killid;
my @lines = split("\n", $command); #split on newlines
for my $line (@lines) {
next if not $line =~ /(\d+)\s+/;
my $id = $1;
if ($line =~ /R\s+integration/ and $line =~ /IDLE\s+none$/){
$killid = $id;
}
}
One caveat: you mentioned an output of 8870, but I'm getting 8869. The regexps you gave are looking for a line with "integration" and "IDLE none", and for your example input that appears to match 8869.
A hash is fine, though if you're using only one key in it (which seems to be the case), you might as well just use a single variable.
If you assign the result of a qx construct to an array instead of a scalar, then it will be split into lines automatically for you. This code demonstrates.
use strict;
use warnings;
my @lines = qx|p4 monitor show -ale|;
my %stat;
for my $line (@lines) {
chomp $line;
next unless $line =~ /(\d+)\s+/;
my $killid = $1;
if ($line =~ /R\s+integration/ and $line =~ /IDLE\s+none$/) {
$stat{$killid} = $line;
}
}
print "$_\n" for keys %stat;
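To try the filtering logic without a Perforce server at hand, the captured output can be replaced by a string. In this sketch the helper name idle_integrations is hypothetical, and the three sample lines from the question stand in for the real p4 output.

```perl
use strict;
use warnings;

# Hypothetical helper: map process id => full line for every line that
# mentions "R integration" and ends in "IDLE none".
sub idle_integrations {
    my ($output) = @_;
    my %stat;
    for my $line (split /\n/, $output) {
        next unless $line =~ /^\s*(\d+)\s+/;
        my $id = $1;                      # save it before the next match resets $1
        $stat{$id} = $line
            if $line =~ /R\s+integration/ and $line =~ /IDLE\s+none\s*$/;
    }
    return %stat;
}

# The sample lines from the question, used in place of qx|p4 monitor show -ale|.
my $captured = <<'END';
4680 p4exp/v68 PJIANG-015394 25:34:19 IDLE none
8869 unnamed p4-python R integration semiconductor-project-trunktip turbolinuxclient 01:33:52 IDLE none
8870 unnamed p4-python R integration remote-trunktip-osxclient 01:33:52
END
my %idle = idle_integrations($captured);
print "$_\n" for sort keys %idle;
```

A hash keeps the full line available for each id; if only the ids are needed, pushing them onto an array would work just as well.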