Perl regex to find keywords and not variables - regex

I'm trying to create a regex as following :
print $time . "\n"; --> match only print because time is a variable ($ before)
$epoc = time(); --> match only time
My regex for the moment is /(?-xism:\b(print|time)\b)/g but it match time in $time in the first example.
Check here.
I tried things like [^\$] but then it doesn't match print anymore.
(I will have more keyword like print|time|...|...)
Thanks

Parsing perl code is a common and useful teaching tool since the student must understand both the parsing techniques and the code that they're trying to parse.
However, to do this properly, the best advice is to use PPI
The following script parses itself and outputs all of the barewords. If you wanted to, you could compare the list of barewords to the ones that you're trying to match. Note, this will avoid things within strings, comments, etc.
use strict;
use warnings;
use PPI;
#my $src = do {local $/; <DATA>}; # Could analyze the smaller code in __DATA__ instead
my $src = do {
local #ARGV = $0;
local $/;
<>;
};
# Load a document
my $doc = PPI::Document->new( \$src );
# Find all the barewords within the doc
my $barewords = $doc->find( 'PPI::Token::Word' );
for (#$barewords) {
print $_->content, "\n";
}
__DATA__
use strict;
use warnings;
my $time = time;
print $time . "\n";
Outputs:
use
strict
use
warnings
use
PPI
my
do
local
local
my
PPI::Document
new
my
find
for
print
content
__DATA__

What you need is a negative lookbehind (?<!\$), it's zero-width so it doesn't "consume" characters.
(?<!\$)a means match a if not preceded with a literal $. Note that we escaped $ since it means end of string (or line depending on the m modifier).
Your regex will look like (?-xism:\b(?<!\$)(print|time)\b).
I'm wondering why you are turning off the xism modifiers. They are off by default.So just use /\b(?<!\$)(?:print|time)\b/g as pattern.
Online demo
SO regex reference

Related

Perl how do you assign a varanble to a regex match result

How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my #Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push #Qmail, $msg->{Body};
}
}
for(#Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.
One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";
Use Below Perl variables to meet your requirements -
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = Contains the string matched by the last pattern match
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blockes that have been exited already. For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $MATCH (or $& for short) variable. This is covered in the manual page perlvar.
So say you wanted to store your last for loop's matches in an array called #matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
my #matches = ();
foreach (#Qmail) {
next unless /$regex|^\s*description/i;
push #matches_in_qmail $MATCH
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regexto the string ^\s*owner #.
Assign $sentence to value of running a match within $regex with the regular expression /^s*owner $/ (which won't match, if it did $sentence will be 1 but since it didn't it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.
I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge
You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see $hello contains "hello".
If you have older perl on your system like me, perl 5.18 or earlier, and you use $ $& $' like codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.

Matching multiline string in file using perl regex

I am reading in another perl file and trying to find all strings surrounded by quotations within the file, single or multiline. I've matched all the single lines fine but I can't match the mulitlines without printing the entire line out, when I just want the string itself. For example, heres a snippet of what I'm reading in:
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
so the output I'd like is
'Hello World!';
"chmod";
"This is a fun multiple line string, please match";
but I am getting:
'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
This is the code I am using to find the strings - all file content is stored in #contents:
my #strings_found = ();
my $line;
for(#contents) {
$line .= $_;
}
if($line =~ /(['"](.?)*["'])/s) {
push #strings_found,$1;
}
print #strings_found;
I am guessing I am only getting 'Hello World!'; correctly because I am using the $1 but I am not sure how else to find the others without looping line by line, which I would think would make it hard to find the multi line string as it doesn't know what the next line is.
I know my regex is reasonably basic and doesn't account for some caveats but I just wanted to get the basic catch most regex working before moving on to more complex situations.
Any pointers as to where I am going wrong?
Couple big things, you need to search in a while loop with the g modifier on your regex. And you also need to turn off greedy matching for what's inside the quotes by using .*?.
use strict;
use warnings;
my $contents = do {local $/; <DATA>};
my #strings_found = ();
while ($contents =~ /(['"](.*?)["'])/sg) {
push #strings_found, $1;
}
print "$_\n" for #strings_found;
__DATA__
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
Outputs
'Hello World!'
"chmod"
"This is a fun
multiple line string, please match"
You aren't the first person to search for help with this homework problem. Here's a more detailed answer I gave to ... well ... you ;) finding words surround by quotations perl
regexp matching (in perl and generally) are greedy by default. So your regexp will match from 1st ' or " to last. Print the length of your #strings_found array. I think it will always be just 1 with the code you have.
Change it to be not greedy by following * with a ?
/('"*?["'])/s
I think.
It will work in a basic way. Regexps are kindof the wrong way to do this if you want a robust solution. You would want to write parsing code instead for that. If you have different quotes inside a string then greedy will give you the 1 biggest string. Non greedy will give you the smallest strings not caring if start or end quote are different.
Read about greedy and non greedy.
Also note the /m multiline modifier.
http://perldoc.perl.org/perlre.html#Regular-Expressions

print the name gathered after a regular expression in perl

Ok so i need to gather the name of people in a txt file using perl, i can find the names and print them out with NAME: in front using a regular expression but i need to gather just the persons name I want to do this using regular expressions, because there is multiple different names to gather in each file.
Example of file input:
NAME: Bigelow, Patrick R DATE: 28 Apr 2014
code so far:
if (/NAME:/){
my #arr = /NAME:\s\S*\s\S*\s\S*/g or next;
print "$_\n" for #arr;
}
Better use two lines of code, one for stripping out the names(between Name: ... Date:), and then split the names for comma.
my ($tmp) = /(?<=NAME: )(.*?)(?=\s+DATE:)/g or next;
my #arr = split(/,\s+/, $tmp);
You can use s/regex/replace/ to do what you want. I avoid using post fix loops (easy to miss and are a pain to fix when you add more lines) and the use of $_ (does not improve readability, is inconstantly implemented, and is global in scope):
use strict;
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
TEXT_FILE => 'file.txt',
NAME_LINE => qr/^NAME:\s+/,
};
open my $name_fh, "<", TEXT_FILE; # Autodie eliminates need for testing if this worked
while ( my $line = <DATA> ) {
chomp $line;
next unless $line =~ NAME_LINE; # Skip lines if they don't contain name
( my $name = $line ) =~ s/NAME:\s+(.+?)\s+DATE.*/$1/;
say qq("$name");
}
close $name_fh;
The regular expresion s/NAME:\s+(.+?) +DATE.*/$1/ needs a bit of explaining:
The parentheses (...) are a capture group. I can refer to them by the $1 on the substitution side.
The +? is a non greedy qualifier. I am matching the smallest group I can that will match the regular expression. The trick is that \s+DATE.* will match as big as possible. Between the two, I eliminate all of the spaces from the end of the name to the DATE string this way. It's really not that efficient, but it works well.
The qq(...) is just a fancy way of doing double quotes. It makes it easy to put quotes in your quotes this way.
I use say instead of print because say will automatically add in the \n on the end. this becomes important where adding . "\n" to the end of something can cause problems.
For example:
print join ": ", #foo . "\n"; # Doesn't work.
say join ": ", #foo; # This works fine.

Perl regular expression - eliminate

Am a newbie here. I use glimpse in my Perl script to get the path of files.
For example
/home/user/Proj/A/Apps/App.pm
/home/user/Proj/B/Apps.pm
I need to fetch the part after Proj i.e; the output should be
A/Apps/App.pm
B/Apps.pm
If you want to use regex/replace you could do something like:
$str =~ s!.*/Proj/!!;
You have various options here. When it's always at /home/user/Proj/, I prefer the second way. If not, you can use the first way as well. The best way is a substr (when its a static length):
use 5.014;
use strict;
use warnings;
my $s_a = "/home/user/Proj/A/Apps/App.pm";
my $s_b = "/home/user/Proj/B/Apps.pm";
say $s_a =~ s{.*Proj/}{}r;
say $s_b =~ s{.*Proj/}{}r;
say $s_a =~ s{/home/user/Proj/}{}r;
say $s_b =~ s{/home/user/Proj/}{}r;
say substr $s_a, 16;
say substr $s_b, 16;
output:
A/Apps/App.pm
B/Apps.pm
A/Apps/App.pm
B/Apps.pm
A/Apps/App.pm
B/Apps.pm
If you want to modifiy an existing variable to remove the first part of the path then it's simple: just use the substitution operator s/// to remove the first part of the string up to /Proj/. I've used alternative delimiters s||| here to avoid having to escape the slashes in the pattern.
use strict;
use warnings;
my #paths = qw{
/home/user/Proj/A/Apps/App.pm
/home/user/Proj/B/Apps.pm
};
for my $path (#paths) {
$path =~ s|.*/Proj/||;
print $path, "\n";
}
output
A/Apps/App.pm
B/Apps.pm
But if you want to leave your path variable as it is and copy the tail portion to another variable, then I think it's best to use a regular expression to capture the wanted part, like this
for my $path (#paths) {
my ($tail) = $path =~ m|/Proj/(.+)|;
print $tail, "\n";
}
The output is identical.

Manipulating backreferences for substitution in perl

As a part of an attempt to replace scientific numbers with decimal numbers I want to save a backreference into a string variable, but it doesn't work.
My input file is:
,8E-6,
,-11.78E-16,
,-17e+7,
I then run the following:
open FILE, "+<C:/Perl/input.txt" or die $!;
open(OUTPUT, "+>C:/Perl/output.txt") or die;
while (my $lines = <FILE>){
$find = "(?:,)(-?)(0|[1-9][0-9]*)(\.)?([0-9]*)?([eE])([+\-]?)([0-9]+)(?:,)";
$noofzeroesbeforecomma = eval("$7-length($4)");
$replace = '"foo $noofzeroesbeforecomma bar"';
$lines =~ s/$find/$replace/eeg;
print (OUTPUT $lines);
}
close(FILE);
I get
foo bar
foo bar
foo bar
where I would have expected
foo 6 bar
foo 14 bar
foo 7 bar
$noofzeroesbeforecomma seems to be empty or non-existant.
Even with the following adjustment I get an empty result
$noofzeroesbeforecomma = $2;
Only inserting $2 directly in the replace string gives me something (which is then, unfortunately, not what I want).
Can anyone help?
I'm running Strawberry Perl (5.16.1.1-64bit) on a 64-bit Windows 7 machine, and quite inexperienced with Perl
Your main problem is not using
use strict;
use warnings;
warnings would have told you
Use of uninitialized value $7 in concatenation (.) or string at ...
Use of uninitialized value $4 in concatenation (.) or string at ...
I would recommend you try and find a module that can handle scientific notation, rather than trying to hack your own.
Your code, in a working order might look something like this. As you can see, I have put a q() around your eval string to avoid it being evaluated before $7 and $4 exists. I also removed the eval itself, since while double eval on an eval is somewhat excessive.
use strict;
use warnings;
while (my $lines = <DATA>) {
my $find="(?:,)(-?)(0|[1-9][0-9]*)(\.)?([0-9]*)?([eE])([+\-]?)([0-9]+)(?:,)";
my $noof = q|$7-length($4)|;
$lines =~ s/$find/$noof/eeg;
print $lines;
}
__DATA__
,8E-6,
,-11.78E-16,
,-17e+7,
Output:
6
14
7
As a side note, not using strict is asking for trouble. Doing it while using a variable name such as $noofzeroesbeforecomma is asking for twice the trouble, as it is rather easy to make typos.
This is not about backreferences but the original problem, transforming numbers from scientific notation. I'm sure there are some cases in which this fails:
#!/usr/bin/env perl
use strict;
use warnings;
use bignum;
for (<DATA>) {
next unless /([+-]?\d+(?:\.\d+)?)[Ee]([+-]\d+)/;
print $1 * 10 ** $2 . "\n";
}
__DATA__
,8E-6,
,-11.78E-16,
,-17e+7,
Output:
0.000008
-0.000000000000001178
-170000000
I suggest you use the Regexp::Common::number plugin for the Regexp::Common module which will find all real numbers for you and allow you to replace those that have an exponent marker
This code shows the idea. using the -keep option makes the module put each component into one of the $N variables. The exponent marker - e or E - is in $7, so the number can be transformed depending on whether this was present
use strict;
use warnings;
use Regexp::Common;
my $real_re = $RE{num}{real}{-keep};
while (<>) {
s/$real_re/ $7 ? sprintf '%.20f', $1 : $1 /eg;
print;
}
output
Given your example input, this code produces the following. The values can be tidied up further using additional code in the substitution
,0.00000800000000000000,
,-0.00000000000000117800,
,-170000000.00000000000000000000,
The problem is that Perl can handle all those types of expressions. And since the standard item of data in Perl is the string, you would only need to capture the expression to use it. So, take this expression:
/(-?\d+(?:.\d+)?[Ee][+-]?\d+)/
to extract it from the surrounding text and use sprintf to format it, like Borodin showed.
However, if it helps you to see a better case of what you tried to do, this works better
my ( $whole, $frac, $expon )
= $line =~ m/(?:,)-?(0|[1-9]\d*)(?:\.(\d*))?[eE]([+\-]?\d+)(?:,)/
;
my $num = $expon - length( $frac );
Why not capture the sign with the exponent anyway, if you're going to do arithmetic with it?
It's better to name your captures and eschew eval when it's not necessary.
The substitution--as is--doesn't make much sense.
Really, since neither the symbols or the digits can be case sensitive, just put a (?i) at the beginning, and avoid the E "character class" [Ee]:
/((?i)-?\d+(?:.\d+)?e[+-]?\d+)/