Perl: Regex not grabbing multiline C style comments in code

I have a Perl program that:
Reads a SRC file written in C
Uses a regex match from SRC file to find specific formatted data to use as the Destination filename
Opens new Destination file
Performs another regex match to find all C style comments /* */ that contain the keyword abcd. Note: these comments can span one line or several, so the regex looks for the first /*, then the keyword abcd, then any amount of text and whitespace until it encounters a closing */
Writes the regex matches to the destination file
#!/usr/bin/perl
use warnings;
use strict;
my $src = 'D:\\Scripts\\sample.c';
my $fileName;
# open source file for reading
open(SRC_FH,'<',$src) or die $!;
while(my $row = <SRC_FH>){
if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
$fileName = $1;
}
}
my $des = "D:\\Scripts\\" . $fileName . ".txt";
# open destination file for writing
open(DES_FH,'>',$des) or die $!;
print("copying content from $src to $des\n");
seek SRC_FH, 0, 0;
while(my $row = <SRC_FH>){
if ($row =~ /(\/\*.*abcd.[\s\S]*?\*\/)/){
print DES_FH "$1\n";
}
}
# always close the filehandles
close(SRC_FH);
close(DES_FH);
print "File content copied successfully!\n";
My problem, I think, is the way the Perl code executes: although my regex is correct, my destination file only gets the 1 line comments written to it. Any C style comments that span more than 1 line are not written to the destination file. What am I missing in my 2nd if statement?
I checked my 2nd if statement regex here https://regexr.com/ and it works as it's supposed to, capturing multi line C style comments as well as single line comments that contain the keyword abcd.
So I tried the 1st suggestion below by zdim. Here is what I used:
#!/usr/bin/perl
use warnings;
use strict;
my $src = 'D:\\Scripts\\sample.c';
my $fileName;
my @comments;
# open source file for reading
open(SRC_FH,'<',$src) or die $!;
while(my $row = <SRC_FH>){
if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
$fileName = $1;
}
}
my $des = "D:\\Scripts\\" . $fileName . ".txt";
# open destination file for writing
open(DES_FH,'>',$des) or die $!;
print("copying content from $src to $des\n");
#seek SRC_FH, 0, 0;
my $content = do {
#read whole file at once
local $/;
open (SRC_FH,'<', $src) or die $!;
<SRC_FH>;
};
#if($content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/sg){
# my @comments = $content;
# }
my @comments = $content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/sg;
foreach (@comments){
print DES_FH "$1\n";
}
#while(my $row = <SRC_FH>){
# if ($row =~ /(\/\*.*abcd.[\s\S]*?\*\/)/){
# print DES_FH "$1\n";
# }
#}
# always close the filehandles
close(SRC_FH);
close(DES_FH);
print "File content copied successfully!\n";
The result is that all the content from sample.c is copied to the destination file: a full 1:1 copy. What I am looking for is to pull only the comments, single line and multiline, out of the C file.
Example 1:
/* abcd */
Example 2:
/* some text
* some more comments
abcd and some more comments */
Final Solution
#!/usr/bin/perl
use warnings;
use strict;
my $src = 'D:\\Scripts\\sample.c';
my $fileName;
# open source file for reading
open(SRC_FH,'<',$src) or die $!;
while(my $row = <SRC_FH>){
if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
$fileName = $1;
}
}
my $des = "D:\\Scripts\\" . $fileName . ".txt";
# open destination file for writing
open(DES_FH,'>',$des) or die $!;
print("copying content from $src to $des\n");
seek SRC_FH, 0, 0;
my $content = do{local $/; <SRC_FH>};
my @comments = $content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/g;
for(@comments){
print DES_FH "$_\n";
}
# always close the filehandles
close(SRC_FH);
close(DES_FH);
print "File content copied successfully!\n";

What am I missing in my 2nd if statement?
Well, nothing -- it's just that in a multiline C comment no single line has both /* and */. Thus that regex cannot match a multiline comment when the file is read line by line.
To catch such comments either:
Read the whole file into a string ("slurp" it), then add the /s modifier to the regex so that . matches a newline as well. Also use the /g modifier to capture all such patterns in the string. One way:
my $content = do {
local $/; # undef record separator so the whole file is read at once
open my $src_fh, '<', $src_file or die $!; # have to re-open
<$src_fh>; # reads it all
}; # lexical filehandle gets closed as we leave scope
# NOTE -- there may be difficulties in capturing comments in a C source file
my @comments = $content =~ /.../sg; # your regex
Or use a library to slurp a file, like
use Path::Tiny;
my $content = path($src_file)->slurp;
Or,
Set a flag when you see /*, get/print all lines until you hit the closing */, then unset the flag. Here is a rudimentary version of that
my $inside_comment = 0;
while (<$src_fh>) {
if (m{(/\*.*)}) { #/ fix syntax hilite
say $des_fh $1; # needs "use feature 'say';"
$inside_comment = m{\*/} ? 0 : 1; # stay inside only if the comment does not close on this line
}
elsif (m{(.*\*/)}) { # closing line for the comment
say $des_fh $1;
$inside_comment = 0;
}
elsif ($inside_comment) { say $des_fh $_}
}
I tested all this but please check and improve. For one, this plays funny with leading spaces.
Note: Getting all comments out of a C program in general may be rather tricky.
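The flag idea above can be pushed a little further so that single-line comments close properly and only comments mentioning the keyword are kept. This is a minimal sketch, not a C parser: the helper name comments_with_keyword is made up for illustration, and it assumes that /* never appears inside a string literal or a // comment.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the lines with a "currently inside a comment"
# state and return every /* ... */ comment that contains $keyword.
sub comments_with_keyword {
    my ($keyword, @lines) = @_;
    my @found;
    my $buffer;                          # undef while outside a comment
    for my $line (@lines) {
        my $text = $line;
        while (length $text) {
            if (!defined $buffer) {
                my $start = index($text, '/*');
                last if $start < 0;      # no comment opens on the rest of this line
                $buffer = '/*';
                $text   = substr($text, $start + 2);
            }
            my $end = index($text, '*/');
            if ($end < 0) {              # comment continues on the next line
                $buffer .= $text;
                last;
            }
            $buffer .= substr($text, 0, $end + 2);
            push @found, $buffer if index($buffer, $keyword) >= 0;
            undef $buffer;               # back outside; keep scanning this line
            $text = substr($text, $end + 2);
        }
    }
    return @found;
}

my @sample = (
    "int x; /* abcd */\n",
    "/* some text\n",
    " * some more comments\n",
    "abcd and some more comments */\n",
    "/* no keyword */ int y;\n",
);
print "$_\n---\n" for comments_with_keyword('abcd', @sample);
```

Because the state lives in $buffer, a comment is only written out once its closing */ has been seen, which also makes it possible to test the whole comment for the keyword.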
Here is a one-line version of slurping
my $file_content = do { local (@ARGV, $/) = $file_name; <> };
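Once the file is slurped, one way to avoid the greedy .* problem (with /s, /\*.*abcd can run from the first /* in the file to the last abcd, which is likely why nearly the whole file was copied) is to match every comment non-greedily and filter for the keyword afterwards. A minimal sketch, with a made-up helper name keyword_comments and the same caveat that this is not a full C parser:

```perl
use strict;
use warnings;

# Hypothetical helper: return every C comment in $content that mentions $keyword.
sub keyword_comments {
    my ($content, $keyword) = @_;
    # /s lets . cross newlines; the non-greedy .*? stops at the first closing */
    my @comments = $content =~ m{(/\*.*?\*/)}sg;
    return grep { index($_, $keyword) >= 0 } @comments;
}

my $sample = <<'END_C';
int a; /* abcd */
/* unrelated */
/* some text
 * some more comments
abcd and some more comments */
END_C

print "$_\n---\n" for keyword_comments($sample, 'abcd');
```

Splitting the job into "find all comments" and "keep the ones with the keyword" keeps the regex simple and makes the keyword test independent of where in the comment the keyword appears.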

Related

Why does this regex in perl work for one word but not another?

I'm new to perl so please excuse me if my question seems obvious. I made a small perl script that just examines itself to extract a particular substring I'm looking for and I'm getting results that I can't explain. Here is the script:
use 5.006;
use strict;
use warnings;
use File::Find;
my @files;
find(
sub { push @files, $File::Find::name unless -d; },
"."
);
my @filteredfiles = grep(/.pl/, @files);
foreach my $fileName (@filteredfiles)
{
open (my $fh, $fileName) or die "Could not open file $fileName";
while (my $row = <$fh>)
{
chomp $row;
if ($row =~ /file/)
{
my ($substring) = $row =~ /file\(([^\)]*)\)/;
print "$substring\n" if $substring;
}
}
close $fh;
}
# file(stuff)
# directory(stuff)
Now, when I run this, I get the following output:
stuff
[^\
Why is it printing the lines out of order? Since the "stuff" line occurs later in the file, shouldn't it print later?
Why is it printing that second line wrong? It should be "\(([^\". It's missing the first 3 characters.
If I change my regex to the following: /directory\(([^\)]*)\)/, I get no output. The only difference is the word. It should be finding the second comment. What is going on here?
use 5.006 is kind of odd if you are just beginning to learn Perl ... That is an ancient version.
You should not build a potentially huge list of all files in all locations under the current directory and then filter it. Instead, push only the files you want to the list.
Especially with escaped meta characters, regex patterns can become hard to read very quickly, so use the /x modifier to insert some whitespace into those patterns.
You do not have to match twice: Just check & capture at the same time.
If open fails, include the reason in the error message.
Your second question above does not make sense. You seem to expect your pattern to match the literal string file\(([^\)]*)\)/, but it cannot.
use strict;
use warnings;
use File::Find;
my @files;
find(
sub {
return if -d;
return unless / [.] pl \z/x;
push @files, $File::Find::name;
},
'.',
);
for my $file ( @files ) {
open my $fh, '<', $file
or die "Could not open file $file: $!";
while (my $line = <$fh>) {
if (my ($substring) = ($line =~ m{ (?:file|directory) \( ([^\)]*) \) }x)) {
print "$substring\n";
}
}
close $fh;
}
# file(stuff)
# directory(other)
Output:
stuff
other

how to extract the whole word in the line when there is a match of my substring found perl regular expressions

I am parsing multiple log files, generated on different dates, with Perl on Windows.
Each file has only one line, which contains a pattern as below:
notepad/version_number
entry "notepad" is constant
version_number varies for each file
both are always separated by /
In my current line read from the log file, if the string notepad is present, I want to extract notepad/version_number into a variable (including /)
Can someone please help me out with this. Thank you.
foreach $file (@file_names)
{
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
while (my $row = <$fh>)
{
if($row =~ "notepad")
{
# here I want to extract the part of the line described above (notepad/version_number) into a variable.
}
}
}
Above is the code snippet from my script. Hope this helps.
Sample line from log file:
02/13/2014 22:39:51:464227 some_text notepad/v1.10.12 some_text
It's possible I've oversimplified your task, but I think this is all you would need:
foreach my $file (@file_names)
{
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
while (my $row = <$fh>)
{
my ($result) = $row =~ m|(notepad/v\d+\.\d+\.\d+)|;
if ($result) {
# we have a match
}
}
}
You already have a regex to find the match -- I wouldn't add a second one; I would just expand it to do the test and the capture in one step. If $result is empty, then you know the match failed. If it is not, then you have your text.
If the version text is anything other than vx.x.x, then you would of course need to tweak the regex to support the variations.
-- edit, per OP's comment --
"everything from notepad/ until the next immediate whitespace"
my ($result) = $row =~ m|(notepad/\S+)|;
matches "notepad/" and then all non-whitespace characters.
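As a quick check of that last pattern against the sample log line from the question, here is a small sketch. The helper name extract_notepad is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return "notepad/<version>" from a log line, or undef.
sub extract_notepad {
    my ($row) = @_;
    # list-context match: $result gets capture group 1, or undef on failure
    my ($result) = $row =~ m{(notepad/\S+)};
    return $result;
}

my $row = '02/13/2014 22:39:51:464227 some_text notepad/v1.10.12 some_text';
my $found = extract_notepad($row);
print defined $found ? "$found\n" : "no match\n";   # prints notepad/v1.10.12
```

Testing with defined rather than plain truth avoids treating a captured "0" as a failed match, which cannot happen with this pattern but is a common trap with others.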

How to grep capture a multiline pattern of a file in Perl

I have a file that looks something like this:
Random words go here
/attribute1
/attribute2
/attribute3="all*the*things*I'm*interested*in*are*inside*here**
and*it*goes*into*the*next*line.*blah*blah*blah*foo*foo*foo*foo*
bar*bar*bar*bar*random*words*go*here*until*the*end*of*the*sente
nce.*I*think*we*have*enough*words"
I want to grep the file for the line /attribute3= then I want to save the string found inside the quotation marks to a separate variable.
Here's what I have so far:
#!/bin/perl
use warnings; use strict;
my $file = "data.txt";
open(my $fh, '<', $file) or die $!;
while (my $line = <$fh>) {
if ($line =~ /\/attribute3=/g){
print $line . "\n";
}
}
That's printing out /attribute3="all*the*things*I'm*interested*in*are*inside*here** but
I want all*the*things*I'm*interested*in*are*inside*here**and*it*goes*into*the*next*line.*blah*blah*blah*foo*foo*foo*foo*bar*bar*bar*bar*random*words*go*here*until*the*end*of*the*sentence.*I*think*we*have*enough*words.
So what I did next is:
#!/bin/perl
use warnings; use strict;
my $file = "data.txt";
open(my $fh, '<', $file) or die $!;
my $part_I_want;
while (my $line = <$fh>) {
if ($line =~ /\/attribute3=/g){
$line =~ /^/\attribute3=\"(.*?)/; # capture everything after the quotation mark
$part_I_want .= $1; # the capture group; save the stuff on line 1
# keep adding to the string until we reach the closing quotation marks
next (unless $line =~ /\"/){
$part_I_want .= $_;
}
}
}
The code above doesn't work. How do I grep capture a multiline pattern between two characters (in this case it's quotation marks)?
my $str = do { local($/); <DATA> };
$str =~ /attribute3="([^"]*)"/;
$str = $1;
$str =~ s/\n//g; # join the wrapped lines; the value is split mid-word, so do not insert spaces
__DATA__
Random words go here
/attribute1
/attribute2
/attribute3="all*the*things*I'm*interested*in*are*inside*here**
and*it*goes*into*the*next*line.*blah*blah*blah*foo*foo*foo*foo*
bar*bar*bar*bar*random*words*go*here*until*the*end*of*the*sente
nce.*I*think*we*have*enough*words"
Read the entire file into a single variable and use /attribute3=\"([^\"]*)\"/ms
From the command line:
perl -n0e '/\/attribute3="(.*)"/s && print $1' foo.txt
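This is basically what you had, but the -0 flag (with no digits) sets the input record separator $/ to the null character, so a text file containing no null bytes is read as a single record. The exact equivalent of undef $/ within the code is -0777. From the man page: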
-0[octal/hexadecimal]
specifies the input record separator ($/) as an octal or hexadecimal number. If there are no digits, the null character is the separator.

How to store regex results in Perl for building substitution strings?

I looked at this question for starters, but I'm not sure I need a hash table to store intermediate results. If so great, but I'm new to Perl, so unsure.
It seems like this would have to be done in a loop, to store each result in a scalar and then apply, then move to the next line. But again I'm new to this.
Scan lines for pattern. In this case, HTML. Yes, I know about HTML and regex, but without regex, how can I build strings dynamically from a search pattern?
If pattern matches, use formed string A to derive new string form B.
Scan lines again and substitute B for A.
In other words:
$stringA = 'alias="#[found by $pattern]"'
$stringB = 'alias="#[prepended string] . [found by $pattern] . [appended string]"'
What I have so far:
my $pattern = 'alias="#(.*?)"';
my %seen = (); # ?
sub read_file {
my ($file) = @_;
open FILE, '<:encoding(UTF-8)', $file or die "Could not open '$file' for reading $!";
local $/ = undef;
while ( my $line = <FILE> ) {
if ( $line =~ /($pattern)/ ) {
$seen{$1}; # store results
return $line;
}
}
close FILE;
}
use Data::Dumper;
say Dumper( \%seen );
I think you want
$line =~ s/($pattern)/ transform($1) /eg;
where transform($1) is the code that derives B from A ($1).
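A minimal sketch of what that could look like, assuming a made-up transform that wraps the captured name in a pre_ prefix and a _post suffix (both invented for illustration). Note that here the pattern is used without the extra outer parentheses, so $1 is the text after the # rather than the whole alias="..." attribute.

```perl
use strict;
use warnings;

my $pattern = 'alias="#(.*?)"';

# Hypothetical transform: builds string B from the name captured in string A.
sub transform {
    my ($name) = @_;
    return 'alias="#' . "pre_${name}_post" . '"';
}

# Hypothetical wrapper: rewrite every alias attribute in a copy of the text.
sub rewrite_aliases {
    my ($text) = @_;
    # /e evaluates the replacement as Perl code, /g repeats for every match
    $text =~ s/$pattern/ transform($1) /eg;
    return $text;
}

my $html = qq{<foo alias="#foobar">it's foo!</foo>\n<bar alias="#barfoo">it's bar!</bar>\n};
print rewrite_aliases($html);
```

With /e there is no need to store the matches first and substitute in a second pass: each match is handed to the code in the replacement as it is found.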
As for a non-regex solution, XPaths can be used as a means of identifying HTML nodes, using a language that is even simpler than regex patterns.
my $xpath = '//@alias[starts-with(., "#")]';
my $doc = XML::LibXML->new->parse_html_file($qfn);
for my $node ($doc->findnodes($xpath)) {
transform($node);
}
$doc->toFile($qfn);
Several comments are in the code. Sample output is below.
Not sure if this does what you want, but hopefully something in it will help.
use strict;
use warnings;
my $pattern = 'alias="#(.*?)"';
my %seen = (); # defines an empty hash
sub read_file {
my ($file) = @_;
# open using lexical filehandle
open (my $fp, '<:encoding(UTF-8)', $file)
or die "Could not open '$file' for reading $!";
local $/ = undef; # effects 'slurp mode', that is, lets you read the entire file into one scalar.
my $line = <$fp>;
close ($fp); # it's all read in, so it can be safely closed here.
# loop and use the g modifier to process every match.
# see the perlre man page for full discussion of modifiers
while ( $line =~ /($pattern)/smg ) {
$seen{$1} = 0 if (!exists ($seen{$1}));
++$seen{$1};
}
}
# There was no call to read_file. This is just a "serving suggestion:"
my $filename = $ARGV[0] || die "USAGE: $0 filename\n";
read_file ($filename);
use Data::Dumper;
print Dumper( \%seen ); # use 'print', not 'say'
I ran it with some sample data as indicated by the egrep output:
$ egrep '<(foo|bar)' index.html
<foo alias="#foobar">it's foo!</foo>
<bar alias="#barfoo">it's bar!</bar>
And here is the result:
$ perl foo.pl index.html
$VAR1 = {
'alias="#foobar"' => 1,
'alias="#barfoo"' => 1
};
$

Why is this substitution involving the end of each line adding more to the beginning of the line?

I have some document into which I want to add something at the beginning and at the end of each line. The original document looks like this:
firstLine
secondline
I want to turn it into this:
put 'firstLine';
put 'secondline';
By using the following Perl script, I can only turn it into this:
put 'firstLine';
';put 'secondline';
It seems that there is a $ at the end of the first line and at the beginning of the second line. Could someone help me to figure out what is wrong with the following Perl script?
use File::Find;
use strict;
my ($filename, @lines, $oldterm, $newterm); #,$File::Find::name);
my $dir = ".";
open MYFILE, ">error.txt" or die $!;
find(\&edits, $dir);
sub edits() {
$filename = $File::Find::name;
if (grep(/\.txt$/, $filename)) { #only process the perl files
# open the file and read data
# die with grace if it fails
open(FILE, "<$filename") or die "Can't open $filename: $!\n";
@lines = <FILE>;
close FILE;
# open same file for writing, reusing STDOUT
open(STDOUT, ">$filename") or die "Can't open $filename: $!\n";
# walk through lines, putting into $_, and substitute 2nd away
for (@lines) {
s/(&.+)/' "$1" '/ig;
s/^/put '/ig;
s/$/';/ig;
print;
}
#Finish up
close STDOUT;
}
}
Don't use regular expressions at all: you already have the lines separated in the @lines array:
for ( @lines ) {
chomp; # remove newline at the end of the implicit variable $_
print "put '$_';\n";
}
If you do it in one step, without /g, you should have better luck. Something like:
s/^(.+)$/put '$1';/;
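The stray '; at the start of the next line comes from /g combined with $: on a line that still carries its newline, $ matches both just before the newline and at the very end of the string, so the replacement is inserted twice. A small sketch that reproduces this (the helper names append_buggy and append_once are made up for illustration):

```perl
use strict;
use warnings;

# Reproduces the reported symptom: with /g, $ matches twice on a line that
# still carries its newline (before the newline and at the very end).
sub append_buggy {
    my ($line) = @_;
    $line =~ s/$/';/g;
    return $line;
}

# Dropping /g replaces only the first position, just before the newline.
sub append_once {
    my ($line) = @_;
    $line =~ s/$/';/;
    return $line;
}

for my $result (append_buggy("firstLine\n"), append_once("firstLine\n")) {
    (my $visible = $result) =~ s/\n/\\n/g;   # show newlines literally
    print "$visible\n";
}
# prints:
#   firstLine';\n';
#   firstLine';\n
```

Chomping the line first, as in the loop above, sidesteps the issue entirely because there is no trailing newline left for $ to match in front of.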