perl: refactor s/.../.../g -> while {}? - regex

I've got a monstrous eval-substitution; here's a simplified version
$ perl -wpe 's#(for )(\w+)#$1 . "user " . qx/id $2/#ge'
which replaces e.g.
Stats for root are bad
Stats for user uid=0(root) gid=0(root) groups=0(root)
are bad
Is there an idiom to turn the s/.../.../g into a loop? Something like
while (m#(for )(\w+)#) {
# explicitly replace match with expression computed over several LOCs
}
Or maybe somehow use map()?

This idiom is to use s///eg. It undeniably better than the alternative you are seeking.
s{pat}{ repl }eg;
is equivalent to
my $out = '';
my $last_pos = 0;
while (m{pat}g) {
$out .= substr($_, $last_pos, $-[0] - $last_pos) . do { repl };
$last_pos = $+[0];
}
$_ = $out . substr($_, $last_pos);
Because you hinted that there would be more than one statements to be executed in the replacement expression, I'd write your code as follows:
s{for \K(\w+)}{
...
...
}eg;
The advantage of curlies is that they can be nested.

Related

Perl Grepping from an Array

I need to grep a value from an array.
For example i have a values
#a=('branches/Soft/a.txt', 'branches/Soft/h.cpp', branches/Main/utils.pl');
#Array = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', branches/Main/utils.pl','branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
Now, i need to loop #a and find each value matches to #Array. For Example
It works for me with grep. You'd do it the exact same way as in the More::ListUtils example below, except for having grep instead of any. You can also shorten it to
my $got_it = grep { /$str/ } #paths;
my #matches = grep { /$str/ } #paths;
This by default tests with /m against $_, each element of the list in turn. The $str and #paths are the same as below.
You can use the module More::ListUtils as well. Its function any returns true/false depending on whether the condition in the block is satisfied for any element in the list, ie. whether there was a match in this case.
use warnings;
use strict;
use Most::ListUtils;
my $str = 'branches/Soft/a.txt';
my #paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
my $got_match = any { $_ =~ m/$str/ } #paths;
With the list above, containing the $str, the $got_match is 1.
Or you can roll it by hand and catch the match as well
foreach my $p (#paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
This does print out the match.
Note that the strings you show in your example do not contain the one to match. I added it to my list for a test. Without it in the list no match is found in either of the examples.
To test for more than one string, with the added sample
my #strings = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl');
my #paths = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl',
'branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
foreach my $str (#strings) {
foreach my $p (#paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
# Or, instead of the foreach loop above use
# my $match = grep { /$str/ } #paths;
# print "Matched for $str\n" if $match;
}
This prints
Found it: branches/Soft/a.txt
Found it: branches/Soft/h.cpp
Found it: branches/Main/utils.pl
When the lines with grep are uncommented and foreach ones commented out I get the corresponding prints for the same strings.
The slashes dot in $a will pose a problem so you either have to escape them it when doing regex match or use a simple eq to find the matches:
Regex match with $a escaped:
my #matches = grep { /\Q$a\E/ } #array;
Simple comparison with "equals":
my #matches = grep { $_ eq $a } #array;
With your sample data both will give an empty array #matches because there is no match.
This Solved My Question. Thanks to all especially #zdim for the valuable time and support
my #SVNFILES = ('branches/Soft/a.txt', 'branches/Soft/b.txt');
my #paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
foreach my $svn (#SVNFILES)
{
chomp ($svn);
my $m = grep { /$svn/ } (#paths);
if ( $m eq '0' ) {
print "Files Mismatch\n";
exit 1;
}
}
You should escape characters like '/' and '.' in any regex when you need it as a character.
Likewise :
$a="branches\/Soft\/a\.txt"
Retry whatever you did with either grep or perl with that. If it still doesn't work, tell us precisely what you tried.

How can I match these function calls, and extract the nmae of the function and the first argument?

I am trying to parse an array of elements. Those who match a pattern like the following:
Jim("jjanson", Customer.SALES);
I want to create a hash table like Jim => "jjanson"
How can I do this?
I can not match the lines using:
if($line =~ /\s*[A-Za-z]*"(.*),Customer.*\s*/)
You're not matching either the '(' after the name, nor the ' ' after the comma, before "Customer.".
I can get 'jjanson"' using this expression:
/\s*[A-Za-z]\(*"(.*), Customer.*\s*/
But I assume you don't want jjanson", so we need to modify it like so. (I tend to include the negative character class when I'm looking for simply-delimited stuff. So, in this case I'll make the expression "[^"]*"
/\s*[A-Za-z]\(*"([^"]+)", Customer.*\s*/
Also, I try not to depend upon whitespace, presence or number, I'm going to replace the space with \s*. That you didn't notice that you skipped the whitespace is a good illustration of the need to say "ignore a bunch of whitespace".
/\s*[A-Za-z]\(*"([^"]+)",\s*Customer.*\s*/
Now it's only looking for the sequence ',' + 'Customer' in the significant characters. Functionally, the same, if more flexible.
But since you only do one capture, I can't see what you'd map to what. So I'll do my own mapping:
my %records;
while ( my $line = $source->()) { # simply feed for a source of lines.
my ( $first, $user, $tag )
= $line = m/\s*(\p{Alpha}+)\s*\(\s*"([^"]+)",\s*Customer\.(\S+?)\)\/
;
$records{ $user }
= { first => $first
, username => $user
, tag => $tag
};
}
This is much more than you would tend to need in a one-off, quick solution. But I like to store as much of my input as seems significant.
Note that Jim("jjanson", Customer.SALES); matches the syntax of a function call with two arguments. You can thus abuse string eval:
#!/usr/bin/env perl
use strict;
use warnings;
use YAML::XS;
my $info = extract_first_arg(q{ Jim("jjanson", Customer.SALES);} );
print Dump $info;
sub extract_first_arg {
my $call = shift;
my ($name) = ($call =~ m{ \A \s* (\w+) }x);
unless ($name) {
warn "Failed to find function name in '$call'";
return;
}
my $username = eval sprintf(q{
package My::DangerZone;
no strict;
local *{ %s } = sub { $_[0] };
%s
}, $name, $call);
return { $name => $username };
}
Output:
---
Jim: jjanson
Or, you can abuse autoloading:
our $AUTOLOAD;
print Dump eval 'no strict;' . q{ Jim("jjanson", Customer.SALES); };
sub AUTOLOAD {
my ($fn) = ($AUTOLOAD =~ /::(\w+)\z/);
return { $fn => $_[0] };
}
I would not necessarily recommend using these methods, especially on input that is not in your control, and in a situation where this script has access to sensitive facilities.
On the other hand, I have, in the right circumstances, utilized this kind of thing in transforming one given set of information into something that can be used elsewhere.
Try this:
$line = 'Jim("jjanson", Customer.SALES)';
my %hashStore = (); #Jim("jjanson"
if($line=~m/^\s*([^\(\)]*)\(\"([^\"]*)\"/g) { $hashStore{$1} = $2; }
use Data::Dumper;
print Dumper \%hashStore;
Output:
$VAR1 = {
'Jim' => 'jjanson'
};

Expanding [optionals], groupings, and the | or operator in text

I am trying to expand sentences that incorporate [ ] to indicate optionals, ( ) to indicate grouping, and | to indicate the or operator and enumerate all possibilities. So for example:
"Hey [there] you [hood]." should return four sentences:
Hey there you hood.
Hey there you.
Hey you hood.
Hey you.
The end goal would look like:
Input: "(His|Her) dog was [very|extremely] confused."
Output: His dog was very confused.
His dog was extremely confused.
His dog was confused.
Her dog was very confused.
Her dog was extremely confused.
Her dog was confused.
I am doing it using regex matching and recursion. I have searched both CPAN and SO under the phrases:
Expanding text
expanding sentences
expanding conditionals
expanding optionals
expanding groupings
with no luck.
Thanks.
I have edited this question largely to better reflect its evolution and removed large portions which were made obsolete as the question evolved. The question above is the question that most of the answers below are attempting to address.
My current state is the following:
After wrestling with the problem above for a day I have two solutions very close to what I want. One is my own and the second is PLT's below. However, I have decided to try a fundamentally different approach.
Using regular expressions and manually parsing these sentences seems like a very ugly way of doing things. So I have decided to instead write a grammar for my "language" and use a parser-generator to parse it for me.
This gives me an additional layer of abstraction and avoids the following scenario described by Damian Conway in Perl Best Practices: [about regexps]
cut-and-paste-and-modify-slightly-and-oh-now-it-doesn't-work-at-all-so-let's-modify-it-some-more-and-see-if-that-helps-no-it-didn't-but-we're-commited-now-so-maybe-if-we-change-that-bit-instead-hmmmm-that's-closer-but-still-not-quite-right-maybe-if-I-made-that-third-repetition-non-greedy-instead-oops-now-it's-back-to-not-matching-at-all-perhaps-I-should-just-post-it-to-PerlMonks.org-and-see-if-they-know-what's-wrong
It also makes it much easier if the grammar of these expressions were to change and I needed to support other constructs later on.
Last update:
I solved my problem using an open source toolkit. This will transcribe a JSGF version of my input and generate a finite-state transducer. From there you can walk through the FST to generate all possible outcomes.
Ok, another complete revision of the answer. This will work as intended. :) It now also expands nested parens. Newline is still the delimeter, but I added a way to quickly change it to something more complicated if the need arises.
Basically, I started with replacing brackets with parens + pipe, since [word ] and (|word ) are equivalent.
I then extracted all the encapsulating parens, e.g. both (you |my friend) and (you |my (|friendly ) friend ). I then expanded the nested parens into regular parens, e.g. (you |my (|friendly ) friend ) was replaced with (you |my friendly friend |my friend ).
With that done, the words could be processed with the original subroutine.
Remains to be tested on more complicated expansions, but it works fine during my testing.
Here's the revised code:
use strict;
use warnings;
sub addwords {
my ($aref, #words) = #_;
my #total;
for my $start (#$aref) {
for my $add (#words) {
push #total, $start . $add;
}
}
return #total;
}
sub expand_words {
my $str = shift;
my #sentences = ('');
for my $word (word_split($str)) {
if ($word =~ /^([(])([^)]+)[)]$/) {
my #options = split /\|/, $2;
push #options, '' if ($1 eq '[');
#sentences = addwords(\#sentences, #options);
} else {
#sentences = addwords(\#sentences, $word);
}
}
return #sentences;
}
sub fix_parens {
my $str = shift;
$str =~ s/\[/(|/g;
$str =~ s/\]/)/g;
return $str;
}
sub fix_nested {
my #array = #_;
my #return;
for (my $i=0; $i <= $#array; ) {
my $inc = 1;
my ($co, $cc);
do {
$co = () = $array[$i] =~ /\(/g;
$cc = () = $array[$i] =~ /\)/g;
if ( $co > $cc ) {
$array[$i] .= $array[$i + $inc++];
}
} while ( $co > $cc );
push #return, expand_nest($array[$i]);
$i += $inc;
}
return #return;
}
sub expand_nest {
my $str = shift;
my $co = () = $str =~ /\(/g;
return $str unless ($co > 1);
while ($str =~ /([^|(]+\([^)]+\)[^|)]+)/) {
my $match = $1;
my #match = expand_words($match);
my $line = join '|', #match;
$match =~ s/([()|])/"\\" . $1/ge;
$str =~ s/$match/$line/ or die $!;
}
return $str;
}
sub word_split {
my $str = shift;
my $delimeter = "\n";
$str = fix_parens($str);
$str =~ s/([[(])/$delimeter$1/g;
$str =~ s/([])])/$1$delimeter/g;
my #tot = split /$delimeter/, $str;
#tot = fix_nested(#tot);
return #tot;
}
my $str = "Hey [there ](you|my [friendly ]friend) where's my [red|blue]berry?";
my #sentences = expand_words($str);
print "$_\n" for (#sentences);
print scalar #sentences . " sentences\n";
Will produce the output:
Hey you where's my berry?
Hey you where's my redberry?
Hey you where's my blueberry?
Hey my friend where's my berry?
Hey my friend where's my redberry?
Hey my friend where's my blueberry?
Hey my friendly friend where's my berry?
Hey my friendly friend where's my redberry?
Hey my friendly friend where's my blueberry?
Hey there you where's my berry?
Hey there you where's my redberry?
Hey there you where's my blueberry?
Hey there my friend where's my berry?
Hey there my friend where's my redberry?
Hey there my friend where's my blueberry?
Hey there my friendly friend where's my berry?
Hey there my friendly friend where's my redberry?
Hey there my friendly friend where's my blueberry?
18 sentences
Data::Generate. I found this while searching for combination which is the mathematical term of what you're doing with your sets of words there.
Here is a rather simple solution, if you get past some of the ugly regexps, due to collisions between your syntax and the regexp syntax. It allows for both the [] and the () syntax, which in fact are very similar, [foo] is the same as (foo| ).
The basis is to replace each alternation by a marker #0, #1, #2... while storing them in an array. then replace the last marker, generating several phrases, then replace the next-to last marker in each of those phrases... until all markers have been replaced. Attentive readers of Higher-order Perl will no doubt find a more elegant way to do this.
#!/usr/bin/perl
use strict;
use warnings;
while( my $phrase=<DATA>)
{ my $original= $phrase;
$phrase=~s{\[([^\]]*)\]}{($1| )}g; # replace [c|d] by (c|d| )
my $alts=[]; my $i=0;
while( $phrase=~ s{\(([^)]*)\)}{#$i}) # replace (a|b) ... (c|d| ) by #0 ... #1
{ push #$alts, [ split /\|/, $1 ]; $i++; # store [ ['a', 'b'], [ 'c', 'd', ' '] ]
}
my $expanded=[$phrase]; # seed the expanded list with the phrase
while( #$alts) { expand( $alts, $expanded); } # expand each alternation, until none left
print "$original - ", join( " - ", #$expanded), "\n\n";
}
exit;
# expand the last #i of the phrase in all the phrases in $expanded
sub expand
{ my( $alts, $expanded)=#_;
my #these_alts= #{pop(#$alts)}; # the last alternations
my $i= #$alts; # the corresponding index in the phrases
#$expanded= map { my $ph= $_;
map { my $ph_e= $ph;
$ph_e=~ s{#$i}{$_}; # replace the marker #i by one option
$ph_e=~ s{ +}{ }; # fix double spaces
$ph_e;
} #these_alts # for all options
} #$expanded # for all phrases stored so far
}
__DATA__
(His|Her) dog was [very|extremely

Perl regex replacement string special variable

I'm aware of the match, prematch, and postmatch predefined variables. I'm wondering if there is something similar for the evaluated replacement part of the s/// operator.
This would be particularly useful in dynamic expressions so they don't have to be evaluated a 2nd time.
For example, I currently have %regexs which is a hash of various search and replace strings.
Here's a snippet:
while (<>) {
foreach my $key (keys %regexs) {
while (s/$regexs{$key}{'search'}/$regexs{$key}{'replace'}/ee) {
# Here I want to do something with just the replaced part
# without reevaluating.
}
}
print;
}
Is there a convenient way to do it? Perl seems to have so many convenient shortcuts, and it seems like a waste to have to evaluate twice (which appears to be the alternative).
EDIT: I just wanted to give an example: $regexs{$key}{'replace'} might be the string '"$2$1"' thus swapping the positions of some text in the string $regexs{$key}{'search'} which might be '(foo)(bar)' - thus resulting in "barfoo". The second evaluation that I'm trying to avoid is the output of $regexs{$key}{'replace'}.
Instead of using string eval (which I assume is what's going on with s///ee), you could define code references to do the work. Those code references can then return the value of the replacement text. For example:
use strict;
use warnings;
my %regex = (
digits => sub {
my $r;
return unless $_[0] =~ s/(\d)(\d)_/$r = $2.$1/e;
return $r;
},
);
while (<DATA>){
for my $k (keys %regex){
while ( my $replacement_text = $regex{$k}->($_) ){
print $replacement_text, "\n";
}
}
print;
}
__END__
12_ab_78_gh_
34_cd_78_yz_
I'm pretty sure there isn't any direct way to do what you're asking, but that doesn't mean it's impossible. How about this?
{
my $capture;
sub capture {
$capture = $_[0] if #_;
$capture;
}
}
while (s<$regexes{$key}{search}>
<"capture('" . $regexes{$key}{replace}) . "')">eeg) {
my $replacement = capture();
#...
}
Well, except to do it really properly you'd have to shoehorn a little more code in there to make the value in the hash safe inside a singlequotish string (backslash singlequotes and backslashes).
If you do the second eval manually you can store the result yourself.
my $store;
s{$search}{ $store = eval $replace }e;
why not assign to local vars before:
my $replace = $regexs{$key}{'replace'};
now your evaluating once.

Does the 'o' modifier for Perl regular expressions still provide any benefit?

It used to be considered beneficial to include the 'o' modifier at the end of Perl regular expressions. The current Perl documentation does not even seem to list it, certainly not at the modifiers section of perlre.
Does it provide any benefit now?
It is still accepted, for reasons of backwards compatibility if nothing else.
As noted by J A Faucett and brian d foy, the 'o' modifier is still documented, if you find the right places to look (one of which is not the perlre documentation). It is mentioned in the perlop pages. It is also found in the perlreref pages.
As noted by Alan M in the accepted answer, the better modern technique is usually to use the qr// (quoted regex) operator.
/o is deprecated. The simplest way to make sure a regex is compiled only once is to use use a regex object, like so:
my $reg = qr/foo$bar/;
The interpolation of $bar is done when the variable $reg is initialized, and the cached, compiled regex will be used from then on within the enclosing scope. But sometimes you want the regex to be recompiled, because you want it to use the variable's new value. Here's the example Friedl used in The Book:
sub CheckLogfileForToday()
{
my $today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];
my $today_regex = qr/^$today:/i; # compiles once per function call
while (<LOGFILE>) {
if ($_ =~ $today_regex) {
...
}
}
}
Within the scope of the function, the value of $today_regex stays the same. But the next time the function is called, the regex will be recompiled with the new value of $today. If he had just used:
if ($_ =~ m/^$today:/io)
...the regex would never be updated. So, with the object form you have the efficiency of /o without sacrificing flexibility.
The /o modifier is in the perlop documentation instead of the perlre documentation since it is a quote-like modifier rather than a regex modifier. That has always seemed odd to me, but that's how it is. Since Perl 5.20, it's now listed in perlre simply to note that you probably shouldn't use it.
Before Perl 5.6, Perl would recompile the regex even if the variable had not changed. You don't need to do that anymore. You could use /o to compile the regex once despite further changes to the variable, but as the other answers noted, qr// is better for that.
In the Perl 5 version 20.0 documentation
http://perldoc.perl.org/perlre.html
it states
Modifiers
Other Modifiers
…
o - pretend to optimize your code, but actually introduce bugs
which may be a humorous way of saying it was supposed to perform some kind of optimisation, but the implementation is broken.
Thus the option might be best avoided.
This is an optimization in the case that the regex includes a variable reference. It indicates that the regex does not change even though it has a variable within it. This allows for optimizations that would not be possible otherwise.
Here are timings for different ways to call matching.
$ perl -v | grep version
This is perl 5, version 20, subversion 1 (v5.20.1) built for x86_64-linux-gnu-thread-multi
$ perl const-in-re-once.pl | sort
0.200 =~ CONST
0.200 =~ m/$VAR/o
0.204 =~ m/literal-wo-vars/
0.252 =~ m,#{[ CONST ]},o
0.260 =~ $VAR
0.276 =~ m/$VAR/
0.336 =~ m,#{[ CONST ]},
My code:
#! /usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw/ tv_interval clock_gettime gettimeofday /;
use BSD::Resource qw/ getrusage RUSAGE_SELF /;
use constant RE =>
qr{
https?://
(?:[^.]+-d-[^.]+\.)?
(?:(?: (?:dev-)? nind[^.]* | mr02 )\.)?
(?:(?:pda|m)\.)?
(?:(?:news|haber)\.)
(?:.+\.)?
yandex\.
.+
}x;
use constant FINAL_RE => qr,^#{[ RE ]}(/|$),;
my $RE = RE;
use constant ITER_COUNT => 1e5;
use constant URL => 'http://news.trofimenkov.nerpa.yandex.ru/yandsearch?cl4url=www.forbes.ru%2Fnews%2F276745-visa-otklyuchila-rossiiskie-banki-v-krymu&lr=213&lang=ru';
timeit(
'=~ m/literal-wo-vars/',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m{
^https?://
(?:[^.]+-d-[^.]+\.)?
(?:(?: (?:dev-)? nind[^.]* | mr02 )\.)?
(?:(?:pda|m)\.)?
(?:(?:news|haber)\.)
(?:.+\.)?
yandex\.
.+
(/|$)
}x
}
}
);
timeit(
'=~ m/$VAR/',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^$RE(/|$),
}
}
);
timeit(
'=~ $VAR',
ITER_COUNT,
sub {
my $r = qr,^$RE(/|$),o;
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ $r
}
}
);
timeit(
'=~ m/$VAR/o',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^$RE(/|$),o
}
}
);
timeit(
'=~ m,#{[ CONST ]},',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^#{[ RE ]}(/|$),
}
}
);
timeit(
'=~ m,#{[ CONST ]},o',
ITER_COUNT,
sub {
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ m,^#{[ RE ]}(/|$),o
}
}
);
timeit(
'=~ CONST',
ITER_COUNT,
sub {
my $r = qr,^$RE(/|$),o;
for (my $i = 0; $i < ITER_COUNT; ++$i) {
URL =~ FINAL_RE
}
}
);
sub timeit {
my ($name, $iters, $code) = #_;
#my $t0 = [gettimeofday];
my $t0 = (getrusage RUSAGE_SELF)[0];
$code->();
#my $el = tv_interval($t0);
my $el = (getrusage RUSAGE_SELF)[0] - $t0;
printf "%.3f\t%-17s\t%.9f\n", $el, $name, $el / $iters
}
Yep and Nope
I ran a simple comparison using the follow script:
perl -MBenchmark=cmpthese -E 'my #n = 1..10000; cmpthese(10000, {string => sub{"a1b" =~ /a\d+c/ for #n}, o_flag => sub{"a1b" =~ /a\d+c/o for #n}, qr => sub{my $qr = qr/a\d+c/; "a1b" =~ /$qr/ for #n } })'
Here are the results:
Rate qr string o_flag
qr 760/s -- -72% -73%
string 2703/s 256% -- -5%
o_flag 2833/s 273% 5% --
So, clearly the /o flag is much faster than using qr.
But apparently the /o flag may cause bugs:
Perl regex /o optimization or bug?
One thing it, mystifyingly, does not do is, allow a ONCE block, at least at 5.8.8.
perl -le 'for (1..3){
print;
m/${\(print( "between 1 and 2 only"), 3)}/o and print "matched"
}'