Perl regex substitution using external parameters - regex

Consider the following example:
my $text = "some_strange_thing";
$text =~ s/some_(\w+)_thing/no_$1_stuff/;
print "Result: $text\n";
It prints
"Result: no_strange_stuff"
So far so good.
Now, I need to get both the match and replacement patterns from external sources (user input, config file, etc).
Naive solution appears to be like this:
my $match = "some_(\\w+)_thing";
my $repl = "no_\$1_stuff";
my $text = "some_strange_thing";
$text =~ s/$match/$repl/;
print "Result: $text\n";
However:
"Result: no_$1_stuff".
What's wrong? How can I get the same outcome with externally supplied patterns?

Solution 1: String::Substitution
Use String::Substitution package:
use String::Substitution qw(gsub_modify);
my $find = 'some_(\w+)_thing';
my $repl = 'no_$1_stuff';
my $text = "some_strange_thing";
gsub_modify($text, $find, $repl);
print $text,"\n";
The replacement string only interpolates (term used loosely) numbered match vars (like $1 or ${12}). See "interpolate_match_vars" for more information.
This module does not save or interpolate $& to avoid the "considerable performance penalty" (see perlvar).
Solution 2: Data::Munge
This is a solution mentioned by Grinnz in the comments below.
The Data::Munge can be used the following way:
use Data::Munge;
my $find = qr/some_(\w+)_thing/;
my $repl = 'no_$1_stuff';
my $text = 'some_strange_thing';
my $flags = 'g';
print replace($text, $find, $repl, $flags);
# => no_strange_stuff
Solution 3: A quick'n'dirty way (if replacement won't contain double quotes and security is not considered)
DISCLAIMER: I provide this solution as this approach can be found online, but its caveats are not explained. Do not use it in production.
With this approach, you can't have a replacement string that includes a " double quotation mark and, since this is equivalent to handing whoever is writing the configuration file direct code access, it should not be exposed to Web users (as mentioned by Daniel Martin).
You can use the following code:
#!/usr/bin/perl
my $match = qr"some_(\w+)_thing";
my $repl = '"no_$1_stuff"';
my $text = "some_strange_thing";
$text =~ s/$match/$repl/ee;
print "Result: $text\n";
See IDEONE demo
Result:
Result: no_strange_stuff
You have to
Declare the replacement in '"..."' so as $1 could be later evaluated
Use /ee to force the double evaluation of the variables in the replacement.
A modifier available specifically to search and replace is the s///e evaluation modifier. s///e treats the replacement text as Perl code, rather than a double-quoted string. The value that the code returns is substituted for the matched substring. s///e is useful if you need to do a bit of computation in the process of replacing text.
You can use qr to instantiate pattern for the regex (qr"some_(\w+)_thing").

Essentially the same approach as the accepted solution, but I kept the initial lines the same as the problem statement, since I thought that might make it easier to fit into more situations:
my $match = "some_(\\w+)_thing";
my $repl = "no_\$1_stuff";
my $qrmatch = qr($match);
my $code = $repl;
$code =~ s/([^"\\]*)(["\\])/$1\\$2/g;
$code = qq["$code"];
if (!defined($code)) {
die "Couldn't find appropriate quote marks";
}
my $text = "some_strange_thing";
$text =~ s/$qrmatch/$code/ee;
print "Result: $text\n";
Note that this works no matter what is in $repl, whereas the naive solution has issues if $repl contains a double quote character itself, or ends in a backslash.
Also, assuming that you're going to run the three lines at the end (or something like it) in a loop, do make sure that you don't skip the qr line. It will make a huge performance difference if you skip the qr and just use s/$match/$code/ee.
Also, even though it's not as trivial to get arbitrary code execution with this solution as it is with the accepted one, it wouldn't surprise me if it's still possible. In general, I'd avoid solutions based on s///ee if the $match or $repl come from untrusted users. (e.g., don't build a web service out of this)
Doing this kind of replacement securely when $match and $repl are supplied by untrusted users should be asked as a different question if your use case includes that.

Related

Replacing with Named Captures and Precompiled Regular Expressions in Perl

I'm trying to compile a set of substitution regexes but I can't figure out how to delay interpolation of the capture variables in the replacement scalar I'm setting aside; here's a simple contrived example:
use strict;
use warnings;
my $from = "quick";
my $to = "zippy";
my $find = qr/${from} (?<a>(fox|dog))/;
my $repl = "$to $+{a}"; # Use of uninitialized value in concatenation (.) or string
my $s0 = "The quick fox...\n";
$s0 =~ s/${find}/${repl}/;
print($s0);
This doesn't work because repl is interpolated immediately and elicits "Use of uninitialized value in concatenation (.) or string"
If I use non-interpolating '' quotes it doesn't interpolate in the actual substitution so I get "The zippy $+{a}..."
Is there a trick to setting aside a replacement scalar that contains capture references?
You are getting the warning because you are using $+{a} before performing the match. qr// doesn't perform any matching; it's simply compiles the pattern. It's s/// that performs the match.
You presumably meant to use
my $repl = "$to \$+{a}";
But that simply outputs
The zippy \$+{a}...
You could use the following:
my $find = qr/quick (?<a>fox|dog)/;
my $s0 = "The quick fox...\n";
$s0 =~ s/$find/zippy $+{a}/;
print($s0);
But that hard codes the replacement expression. If you want this code to be dynamic, then what you are building is a template system.
I don't know of any template system with your specific desired syntax.
If you're ok with using the positional variables ($1) instead of named ones ($+{a}), you can use String::Substitution.
use String::Substitution qw( sub_modify );
my $find = qr/quick (?<a>fox|dog)/; # Or simply qr/\Q$from\E (fox|dog)/
my $repl = "zippy \$1";
my $s0 = "The quick fox...\n";
sub_modify($s0, $find, $repl);
print($s0);
The qr// only compiles a pattern. It does not perform a match, so it does not set anything in %+. Hence, the uninitialized warnings.
However, you can do that in the substitution so you don't need to prepare the replacement ahead of time:
s/$find/$to $+{a}/;
However, if you don't know what you want your replacement to be, you can eval code in the replacement side of the substitution that will then be the replacement. Here's a simple addition:
s/$find/ 2 + 2 /e;
You'd get the sum as the replacement:
The 4 jumped over the lazy dog
But here's the rub: That's code and it can do whatever code can do. How you construct that is very important and should never use unsanitized user input.
If you didn't know the string you wanted to put in there, you can construct it beforehand and store it in the variable you use in the replacement side. However, you are making Perl code to eval, so it needs to be a valid Perl string. The double quotes are part of the eval that you will eval later:
my $replacement = '"$to $+{a}"';
s/$find/$replacement/;
Like that, you get the literal string value from $replacement:
The "$to $+{a}" jumped over the lazy dog
Adding the /e means that we evaluate the replacement side as code:
s/$find/$replacement/e;
But, that code is $replacement, and ends up giving us the same result because it's just its string value:
The "$to $+{a}" jumped over the lazy dog
Now here's the fun part. We can eval again! Add another /e and the substitution will eval the first time, then take that result and eval it again:
$s0 =~ s/${find}/$replacement/ee;
The first round of the eval gets the literal text value of $replacement, which is "$to $+{a}" (including the double quotes). The second round takes "$to $+{a}" and evals that, filling in the variables with the values in the current lexical scope. The %+ is populated by the substitution already. Now you have your result:
The zippy fox jumped over the lazy dog
However, this isn't a trick you should pull out lightly. There might be a better way to attack your problem. You do this sort of thing when you bend anything else to your will.
You also have to be very careful that you do what you intend in the string that you construct. You are creating new Perl code. If you are using any sort of outside data that you didn't supply, someone can trick your program into running code that you didn't intend.
There are three good ways to do dynamic regex substitution at runtime:
String interpolation of variables s///
Callback for code execution s///e
Embedded code constructs in the regex.
See the examples below.
Normally a callback form, either via a function or Embedded regex code is used when logic is required to construct a replacement.
Otherwise, use a simple string interpolation on the replacement side.
use strict;
use warnings;
my $s0 = "";
my ($from, $to) = ("quick", "zippy") ;
sub getRepl {
my ($grp1, $grp2) = #_;
if ( $grp1 eq $from ) {
return "<$to $grp2>" }
else {
return "< $2>"
}
}
my $find = qr/(\Q${from}\E) (fox|dog)/;
# ======================================
# Substitution via string interpolation
$s0 = "The quick dog...\n";
$s0 =~ s/$find/[$to $2]/;
print $s0;
# ======================================
# Substitution via callback (eval)
$s0 = "The quick dog...\n";
$s0 =~ s/$find/ getRepl($1,$2) /e;
print $s0;
# ==================================================
# Substitution via regex embedded code constructs
my $repl = "";
my $RxCodeEmbed = qr/(\Q${from}\E)(?{$repl = '(' . $to}) (fox|dog)(?{$repl .= ' ' . $^N . ')'})/;
$s0 = "The quick dog...\n";
$s0 =~ s/$RxCodeEmbed/$repl/;
print $s0;
Outputs
The [zippy dog]...
The <zippy dog>...
The (zippy dog)...

Remove certain characters from a regex group

I have a string that looks like this (key":["value","value","value"])
"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]
and I use the following regex to select from the string. (the regex is setup in a way where it wont select a string that looks like this "key":[{"key":"value","key":"value"}] )
(?<=:\[").*?(?="])
Resulting Selection:
google.co.uk","google.com","google.com","google.com","google.co.uk
I want to remove the " in that select string, and i was wondering if there was an easy way to do this using the replace command. Desired result...
"emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]
How do I solve this problem?
If your string indeed has the form "key":["v1", "v2", ... "vN"], you can split off the part that needs to be changed, replace "," by a space in it, and re-assemble:
my #parts = split / (\["\s* | \s*\"]) /x, $string; #"
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', #parts;
The regex pattern for the separator in split is captured since in that case the separators are also in the returned list, what is helpful here for putting the string back together. Then, we need to change the third element of the array.
In this approach, we have to change a specific element in the array so if your format varies, even a little, this may not (or still may) be suitable.
This should of course be processed as JSON, using a module. If the format isn't sure, as indicated in a comment, it would be best to try to ensure that you have JSON. Picking bits and pieces like above (or below) is a road to madness once requirements slowly start evolving.
The same approach can be used in a regex, and this may in fact have an advantage to be able to scoop up and ignore everything preceding the : (with split that part may end up with multiple elements if the format isn't exactly as shown, what then affects everything)
$string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
my $e = $2;
my $n = $1 =~ s/",\s*"/ /gr;
$n.$e
}ex;
Here /e modifier makes it so that the replacement side is evaluated as code, where we do the same as with the split above. Notes on regex
Have to save away $2 first, since it gets reset in the next regex
The /r modifier†, which doesn't change its target but rather returns the changed string, is what allows us to use substitution operator on the read-only $1
If nothing gets captured for $2, and perhaps for $1, that means that there was no match and the outcome is simply that $string doesn't change, quietly. So if this substitution should always work then you may want to add handling of such unexpected data
Don't need a $n above, but can return ($1 =~ s/",\s*"/ /gr) . $e
Or, using lookarounds as attempted
$string =~ s{ (?<=:\[") (.+?) (?="\]) }{ $1 =~ s/",\s*"/ /gr }egx;
what does reduce the amount of code, but may be trickier to work with later.
While this is a direct answer to the question I think it's least maintainable.
†  This useful modifier, for "non-destructive substitution," appeared in v5.14. In earlier Perl versions we would copy the string and run regex on that, with an idiom
(my $n = $1) =~ s/",\s*"/ /g;
In the lookarounds-example we then need a little more
$string =~ s{...}{ (my $n = $1) =~ s/",\s*"/ /g; $n }gr
since s/ operator returns the number of substitutions made while we need $n to be returned from that whole piece of code in {} (the replacement side), to be used as the replacement.
You can use this \G based regex to start the match with :[" and further captures the values appropriately and replaces matched text so that only comma is retained and doublequotes are removed.
(:\[")|(?!^)\G([^"]+)"(,)"
Regex Demo
Your text is almost proper JSON, so it's really easy to go the final inch and make it so, and then process that:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say postderef/;
no warnings qw/experimental::postderef/;
use JSON::XS; # Install through your OS package manager or a CPAN client
my $str = q/"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]/;
my $json = JSON::XS->new();
my $obj = $json->decode("{$str}");
my $fixed = $json->ascii->encode({emailDomains =>
join(', ', $obj->{'emailDomains'}->#*)});
$fixed =~ s/^\{|\}$//g;
say $fixed;
Try Regex: " *, *"
Replace with: ,
Demo

How can I use a pattern that includes regex special characters in Perl's s/// operator?

The pattern I am using in my s/// contains regex special characters. How can I quote them so that the regex engine treats them as literal characters.
Can I call a subroutine (quotemeta) in the pattern like this?
$string =~ s/quotemeta($regex)/$new/g;
How can I do it please?
You mean you want to quote the pattern in a s/// operator.
You can do that using the \Q and \E escapes:
$s =~ /\Q[a-z]\E//;
would look for the literal string [a-z] instead of a single character among a ... z.
You can find this, and other useful information in perldoc perlreref which is installed on your computer alongside perl.
Perl comes with excellent documentation. One should periodically skim/read all of it.
few ways:
my $quotedregex = quotemeta($regex);
$string =~ s/$quotedregex/$new/g;
or
$string =~ s/${\quotemeta($regex)}/$new/g;
or
$string =~ s/\Q$regex\E/$new/g;
People may well come here hoping to find a way to insert the return value of a subroutine into a regex pattern in situ. Note that this also applies to constants implemented with the constant pragma, which are implemented as inlined subroutines
There are ways to do that, the traditional ones being to take a reference to the subroutine's return value and dereference the result. This works either as a scalar reference or an anonymous array
There is also the Interpolation module, which provides tied hashes that look very much like procedure calls and so make your code neater and easier to read
Given this basic program
use strict;
use warnings 'all';
use Interpolation E => 'eval', mysub => \&mysub;
my $string = '<<<OLD>>>';
my $new = 'NEW';
sub mysub { 'OLD' }
any of the following will replace OLD with NEW in $string
Dereferencing a reference to the subroutine's return value
$string =~ s/${\mysub($regex)}/$new/;
Dereferencing an anonymous array containing the subroutine's return value
$string =~ s/#{[mysub($regex)]}/$new/;
Using Interpolation, a simple eval (the safe sort)
$string =~ s/$E{mysub($regex)}/$new/;
Using Interpolation with a tied hash %mysub to call the subroutine directly
$string =~ s/$mysub{$regex}/$new/;
This demonstrates an answer the original question using Interpolation and quotemeta
use strict;
use warnings 'all';
use Interpolation Q => \&CORE::quotemeta;
my $string = '<<<+*.$^>>>';
my $regex = '+*.$^';
my $new = 'NEW';
$string =~ s/$Q{$regex}/XXX/;
print $string, "\n";
output
<<<XXX>>>
But note that the built-in escape \Q is clearer and intended for this purpose, so you should use
$string =~ s/\Q$regex\E/XXX/
in this specific case

Perl - how to get values of tokens

I am searching how to get tokens values in properties file with Perl.
Given the source property:
my $source="application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER";
And given the target property:
my $target="application.1.hostname=test.test.com:8080/WHATEVER";
I would like to get the following result:
{{DNS_APP}}=test.test.com
{{PORT_APP}}=8080
I have no trouble to get the tokens with :
my #matches= ( $source =~ /({{.*?}})/g );
But then, how to match with their values ?
Is there an easy way, with perl regexps to get these substitutions ?
Another difficulty (but they are execption, so it is not a big deal if this problem is not addressed) is that, sometimes, $target can be
my $target="application.1.hostname=test.test.com/WHATEVER";
Or
my $target="application.1.hostname=test.test.com:8080/SOMETHINGELSE";
Or even
my $target="application.1.hostname=test.test.com/SOMETHINGELSE";
How to deal with that ?
I thank you in advance for you answers.
Regards.
OK, at a basic level, you can turn your thing into a named capture for a regex. There's a caveat though - you might need to restrict character sets.
But something like this might work:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $source = "application.1.hostname={{DNS_APP}}:{{PORT_APP}}/WHATEVER";
my $target = "application.1.hostname=test.test.com:8080/WHATEVER";
$source =~ s|\Q{{\E(\w+)\Q}}\E|(?<$1>.*)|g;
$source = qr/$source/;
print "Using Regex:", $source,"\n";
$target =~ m/$source/;
#%+ is the special named-capture hash. You can access $+{DNS_APP} for example
print Dumper \%+;
Note though - that .* is a greedy match, and that will mean without delimitors/anchors between patterns, this will break. You could perhaps define a more narrow character class - I would think \w normally, but you also have . so perhaps [\w.]+ - or maybe even .*? for non greedy matching instead. This depends rather on what would 'fit' with the types of patterns you're trying to match. If you need to do so with arbitrary patterns, I think you're going to need to need ... something like regex to define the match criteria in the first place.
If your 'targets' are purely that pattern - e.g. trailing static words - you can trim you initial pattern with s/\w+$// which will reduce it to:
application.1.hostname={{DNS_APP}}:{{PORT_APP}}/
Which you then regex transform to:
(?^:application.1.hostname=(?<DNS_APP>.*):(?<PORT_APP>.*)/)
And then get %+ of:
$VAR1 = {
'DNS_APP' => 'test.test.com',
'PORT_APP' => '8080'
};
As you're on 5.8.8 - my first advice is upgrade it, because it's 7 year old software, and is long since end of life.
This variable was added in Perl v5.10.0.
However you should be able to work around by:
my #match_names = $source =~ m|\Q{{\E(\w+)\Q}}\E|g; #capture 'names' of matches
$source =~ s|\Q{{\E(\w+)\Q}}\E|(.*)|g;
$source = qr/$source/;
print "Using Regex:", $source, "\n";
my %results;
my #matches = $target =~ m/$source/;
#results{#match_names} = #matches;
print Dumper \%results;
I'm pretty sure there's a way of capturing what matched from the s pattern replacement. If I figure out what it was, I'll update.
(As it stands:
my ( #match_names ) = $source =~ s|\Q{{\E(\w+)\Q}}\E|\(.*\)|g;
doesn't seem to work as I want - #match_names contains the number of replacements. )

Perl regex to find keywords and not variables

I'm trying to create a regex as following :
print $time . "\n"; --> match only print because time is a variable ($ before)
$epoc = time(); --> match only time
My regex for the moment is /(?-xism:\b(print|time)\b)/g but it match time in $time in the first example.
Check here.
I tried things like [^\$] but then it doesn't match print anymore.
(I will have more keyword like print|time|...|...)
Thanks
Parsing perl code is a common and useful teaching tool since the student must understand both the parsing techniques and the code that they're trying to parse.
However, to do this properly, the best advice is to use PPI
The following script parses itself and outputs all of the barewords. If you wanted to, you could compare the list of barewords to the ones that you're trying to match. Note, this will avoid things within strings, comments, etc.
use strict;
use warnings;
use PPI;
#my $src = do {local $/; <DATA>}; # Could analyze the smaller code in __DATA__ instead
my $src = do {
local #ARGV = $0;
local $/;
<>;
};
# Load a document
my $doc = PPI::Document->new( \$src );
# Find all the barewords within the doc
my $barewords = $doc->find( 'PPI::Token::Word' );
for (#$barewords) {
print $_->content, "\n";
}
__DATA__
use strict;
use warnings;
my $time = time;
print $time . "\n";
Outputs:
use
strict
use
warnings
use
PPI
my
do
local
local
my
PPI::Document
new
my
find
for
print
content
__DATA__
What you need is a negative lookbehind (?<!\$), it's zero-width so it doesn't "consume" characters.
(?<!\$)a means match a if not preceded with a literal $. Note that we escaped $ since it means end of string (or line depending on the m modifier).
Your regex will look like (?-xism:\b(?<!\$)(print|time)\b).
I'm wondering why you are turning off the xism modifiers. They are off by default.So just use /\b(?<!\$)(?:print|time)\b/g as pattern.
Online demo
SO regex reference