$& not resolved as part of perl substitution - regex

I have a perl script which searches and replaces data in multiple files. Since more than one word can be replaced in a file, I wrote a function that accepts the search and replace patterns as arrays. I then loop over the arrays in this function and perform the substitution. It works well but just for one particular file, I need to append something in front of the matched string( character #). Hence, I pass "#\$&" as my replace pattern. Its received properly but somehow the $& is never resolved. Instead the operation replaces the matched string with literal value of '#$&'. The same thing works if I directly use #$& in my substituion command in the readFile function. I know we may be able to achieve the result in other ways, but I really want to know why the same replacement pattern works when passed directly while it doesn't work when read as an array element.
I have commented the substitution command that works well for reference. Can anyone please help me spot the problem here ?
my #search= ("host\\s*(replication|all)");
my #replace= ("#\$&");
my $sLine = scalar #search;
my $rLine = scalar #replace;
my $data = ???;
for ( my $i=0; $i < $sLine; $i++)
{
print("\n search = $search[$i] replace = $replace[$i] \n");
#$data =~ s/$search[$i]/#$&/g; ==> this works
$data =~ s/$search[$i]/$replace[$i]/g; #==> this doesn't
}
print($data);

The difference between the working solution and the non-working solution is the same as the difference between
print "#$&"; # Prints `#` and the value of `$&`.
and
print "$replace[$i]"; # Prints the value of `$replace[$i]`.
You can use the following:
use String::Substitution qw( gsub_modify );
for my $i (0..$#search) {
gsub_modify($data, $search[$i], $replace[$i]);
}
This is a more in-depth explanation.
s/$search[$i]/#$&/g
is short for
s/$search[$i]/ "#$&" /eg
which is equivalent to
s/$search[$i]/ "#" . $& /eg # Replaces with `#` and the value of `$&`.
/e causes the replacement expression to be evaluated as Perl code, using its result as the replacement string.
On the other hand,
s/$search[$i]/$replace[$i]/g
is short for
s/$search[$i]/ "$replace[$i]" /eg
which is equivalent to
s/$search[$i]/ $replace[$i] /eg # Replaces with the value of `$replace[$i]`.

Related

Replacing with Named Captures and Precompiled Regular Expressions in Perl

I'm trying to compile a set of substitution regexes but I can't figure out how to delay interpolation of the capture variables in the replacement scalar I'm setting aside; here's a simple contrived example:
use strict;
use warnings;
my $from = "quick";
my $to = "zippy";
my $find = qr/${from} (?<a>(fox|dog))/;
my $repl = "$to $+{a}"; # Use of uninitialized value in concatenation (.) or string
my $s0 = "The quick fox...\n";
$s0 =~ s/${find}/${repl}/;
print($s0);
This doesn't work because repl is interpolated immediately and elicits "Use of uninitialized value in concatenation (.) or string"
If I use non-interpolating '' quotes it doesn't interpolate in the actual substitution so I get "The zippy $+{a}..."
Is there a trick to setting aside a replacement scalar that contains capture references?
You are getting the warning because you are using $+{a} before performing the match. qr// doesn't perform any matching; it's simply compiles the pattern. It's s/// that performs the match.
You presumably meant to use
my $repl = "$to \$+{a}";
But that simply outputs
The zippy \$+{a}...
You could use the following:
my $find = qr/quick (?<a>fox|dog)/;
my $s0 = "The quick fox...\n";
$s0 =~ s/$find/zippy $+{a}/;
print($s0);
But that hard codes the replacement expression. If you want this code to be dynamic, then what you are building is a template system.
I don't know of any template system with your specific desired syntax.
If you're ok with using the positional variables ($1) instead of named ones ($+{a}), you can use String::Substitution.
use String::Substitution qw( sub_modify );
my $find = qr/quick (?<a>fox|dog)/; # Or simply qr/\Q$from\E (fox|dog)/
my $repl = "zippy \$1";
my $s0 = "The quick fox...\n";
sub_modify($s0, $find, $repl);
print($s0);
The qr// only compiles a pattern. It does not perform a match, so it does not set anything in %+. Hence, the uninitialized warnings.
However, you can do that in the substitution so you don't need to prepare the replacement ahead of time:
s/$find/$to $+{a}/;
However, if you don't know what you want your replacement to be, you can eval code in the replacement side of the substitution that will then be the replacement. Here's a simple addition:
s/$find/ 2 + 2 /e;
You'd get the sum as the replacement:
The 4 jumped over the lazy dog
But here's the rub: That's code and it can do whatever code can do. How you construct that is very important and should never use unsanitized user input.
If you didn't know the string you wanted to put in there, you can construct it beforehand and store it in the variable you use in the replacement side. However, you are making Perl code to eval, so it needs to be a valid Perl string. The double quotes are part of the eval that you will eval later:
my $replacement = '"$to $+{a}"';
s/$find/$replacement/;
Like that, you get the literal string value from $replacement:
The "$to $+{a}" jumped over the lazy dog
Adding the /e means that we evaluate the replacement side as code:
s/$find/$replacement/e;
But, that code is $replacement, and ends up giving us the same result because it's just its string value:
The "$to $+{a}" jumped over the lazy dog
Now here's the fun part. We can eval again! Add another /e and the substitution will eval the first time, then take that result and eval it again:
$s0 =~ s/${find}/$replacement/ee;
The first round of the eval gets the literal text value of $replacement, which is "$to $+{a}" (including the double quotes). The second round takes "$to $+{a}" and evals that, filling in the variables with the values in the current lexical scope. The %+ is populated by the substitution already. Now you have your result:
The zippy fox jumped over the lazy dog
However, this isn't a trick you should pull out lightly. There might be a better way to attack your problem. You do this sort of thing when you bend anything else to your will.
You also have to be very careful that you do what you intend in the string that you construct. You are creating new Perl code. If you are using any sort of outside data that you didn't supply, someone can trick your program into running code that you didn't intend.
There are three good ways to do dynamic regex substitution at runtime:
String interpolation of variables s///
Callback for code execution s///e
Embedded code constructs in the regex.
See the examples below.
Normally a callback form, either via a function or Embedded regex code is used when logic is required to construct a replacement.
Otherwise, use a simple string interpolation on the replacement side.
use strict;
use warnings;
my $s0 = "";
my ($from, $to) = ("quick", "zippy") ;
sub getRepl {
my ($grp1, $grp2) = #_;
if ( $grp1 eq $from ) {
return "<$to $grp2>" }
else {
return "< $2>"
}
}
my $find = qr/(\Q${from}\E) (fox|dog)/;
# ======================================
# Substitution via string interpolation
$s0 = "The quick dog...\n";
$s0 =~ s/$find/[$to $2]/;
print $s0;
# ======================================
# Substitution via callback (eval)
$s0 = "The quick dog...\n";
$s0 =~ s/$find/ getRepl($1,$2) /e;
print $s0;
# ==================================================
# Substitution via regex embedded code constructs
my $repl = "";
my $RxCodeEmbed = qr/(\Q${from}\E)(?{$repl = '(' . $to}) (fox|dog)(?{$repl .= ' ' . $^N . ')'})/;
$s0 = "The quick dog...\n";
$s0 =~ s/$RxCodeEmbed/$repl/;
print $s0;
Outputs
The [zippy dog]...
The <zippy dog>...
The (zippy dog)...

Remove certain characters from a regex group

I have a string that looks like this (key":["value","value","value"])
"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]
and I use the following regex to select from the string. (the regex is setup in a way where it wont select a string that looks like this "key":[{"key":"value","key":"value"}] )
(?<=:\[").*?(?="])
Resulting Selection:
google.co.uk","google.com","google.com","google.com","google.co.uk
I want to remove the " in that select string, and i was wondering if there was an easy way to do this using the replace command. Desired result...
"emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]
How do I solve this problem?
If your string indeed has the form "key":["v1", "v2", ... "vN"], you can split off the part that needs to be changed, replace "," by a space in it, and re-assemble:
my #parts = split / (\["\s* | \s*\"]) /x, $string; #"
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', #parts;
The regex pattern for the separator in split is captured since in that case the separators are also in the returned list, what is helpful here for putting the string back together. Then, we need to change the third element of the array.
In this approach, we have to change a specific element in the array so if your format varies, even a little, this may not (or still may) be suitable.
This should of course be processed as JSON, using a module. If the format isn't sure, as indicated in a comment, it would be best to try to ensure that you have JSON. Picking bits and pieces like above (or below) is a road to madness once requirements slowly start evolving.
The same approach can be used in a regex, and this may in fact have an advantage to be able to scoop up and ignore everything preceding the : (with split that part may end up with multiple elements if the format isn't exactly as shown, what then affects everything)
$string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
my $e = $2;
my $n = $1 =~ s/",\s*"/ /gr;
$n.$e
}ex;
Here /e modifier makes it so that the replacement side is evaluated as code, where we do the same as with the split above. Notes on regex
Have to save away $2 first, since it gets reset in the next regex
The /r modifier†, which doesn't change its target but rather returns the changed string, is what allows us to use substitution operator on the read-only $1
If nothing gets captured for $2, and perhaps for $1, that means that there was no match and the outcome is simply that $string doesn't change, quietly. So if this substitution should always work then you may want to add handling of such unexpected data
Don't need a $n above, but can return ($1 =~ s/",\s*"/ /gr) . $e
Or, using lookarounds as attempted
$string =~ s{ (?<=:\[") (.+?) (?="\]) }{ $1 =~ s/",\s*"/ /gr }egx;
what does reduce the amount of code, but may be trickier to work with later.
While this is a direct answer to the question I think it's least maintainable.
†  This useful modifier, for "non-destructive substitution," appeared in v5.14. In earlier Perl versions we would copy the string and run regex on that, with an idiom
(my $n = $1) =~ s/",\s*"/ /g;
In the lookarounds-example we then need a little more
$string =~ s{...}{ (my $n = $1) =~ s/",\s*"/ /g; $n }gr
since s/ operator returns the number of substitutions made while we need $n to be returned from that whole piece of code in {} (the replacement side), to be used as the replacement.
You can use this \G based regex to start the match with :[" and further captures the values appropriately and replaces matched text so that only comma is retained and doublequotes are removed.
(:\[")|(?!^)\G([^"]+)"(,)"
Regex Demo
Your text is almost proper JSON, so it's really easy to go the final inch and make it so, and then process that:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say postderef/;
no warnings qw/experimental::postderef/;
use JSON::XS; # Install through your OS package manager or a CPAN client
my $str = q/"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]/;
my $json = JSON::XS->new();
my $obj = $json->decode("{$str}");
my $fixed = $json->ascii->encode({emailDomains =>
join(', ', $obj->{'emailDomains'}->#*)});
$fixed =~ s/^\{|\}$//g;
say $fixed;
Try Regex: " *, *"
Replace with: ,
Demo

Best way to deal with "Unescaped braces in regex" inside Perl regex

I recently started learning Perl to automate some mindless data tasks. I work on windows machines, but prefer to use Cygwin. Wrote a Perl script that did everything I wanted fine in Cygwin, but when I tried to run it with Strawberry Perl on Windows via CMD I got the "Unescaped left brace in regex is illegal here in regex," error.
After some reading, I am guessing my Cygwin has an earlier version of Perl and modern versions of Perl which Strawberry is using don't allow for this. I am familiar with escaping characters in regex, but I am getting this error when using a capture group from a previous regex match to do a substitution.
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
my $fileContents = do { local $/; <$fh> };
my $i = 0;
while ($fileContents =~ /(.*Part[^\}]*\})/) {
$defParts[$i] = $1;
$i = $i + 1;
$fileContents =~ s/$1//;
}
Basically I am searching through a file for matches that look like:
Part
{
Somedata
}
Then storing those matches in an array. Then purging the match from the $fileContents so I avoid repeats.
I am certain there are better and more efficient ways of doing any number of these things, but I am surprised that when using a capture group it's complaining about unescaped characters.
I can imagine storing the capture group, manually escaping the braces, then using that for the substitution, but is there a quicker or more efficient way to avoid this error without rewriting the whole block? (I'd like to avoid special packages if possible so that this script is easily portable.)
All of the answers I found related to this error were with specific cases where it was more straightforward or practical to edit the source with the curly braces.
Thank you!
I would just bypass the whole problem and at the same time simplify the code:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i = $i + 1;
}
Here we simply do the substitution first. If it succeeds, it will still set $1 and return true (just like plain /.../), so there's no need to mess around with s/$1// later.
Using $1 (or any variable) as the pattern would mean you have to escape all regex metacharacters (e.g. *, +, {, (, |, etc.) if you want it to match literally. You can do that pretty easily with quotemeta or inline (s/\Q$1//), but it's still an extra step and thus error prone.
Alternatively, you could keep your original code and not use s///. I mean, you already found the match. Why use s/// to search for it again?
while ($fileContents =~ /(.*Part[^\}]*\})/) {
...
substr($fileContents, $-[0], $+[0] - $-[0], "");
}
We already know where the match is in the string. $-[0] is the position of the start and $+[0] the position of the end of the last regex match (thus $+[0] - $-[0] is the length of the matched string). We can then use substr to replace that chunk by "".
But let's keep going with s///:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i++;
}
$i = $i + 1; can be reduced to $i++; ("increment $i").
my #defParts;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
push #defParts, $1;
}
The only reason we need $i is to add elements to the #defParts array. We can do that by using push, so there's no need for maintaining an extra variable. This saves us another line.
Now we probably don't need to destroy $fileContents. If the substitution exists only for the benefit of this loop (so I doesn't re-match already extracted content), we can do better:
my #defParts;
while ($fileContents =~ /(.*Part[^\}]*\})/g) {
push #defParts, $1;
}
Using /g in scalar context attaches a "current position" to $fileContents, so the next match attempt starts where the previous match left off. This is probably more efficient because it doesn't have to keep rewriting $fileContents.
my #defParts = $fileContents =~ /(.*Part[^\}]*\})/g;
... Or we could just use //g in list context, where it returns a list of all captured groups of all matches, and assign that to #defParts.
my #defParts = $fileContents =~ /.*Part[^\}]*\}/g;
If there are no capture groups in the regex, //g in list context returns the list of all matched strings (as if there had been ( ) around the whole regex).
Feel free to choose any of these. :-)
As for the question of escaping, that's what quotemeta is for,
my $needs_escaping = q(some { data } here);
say quotemeta $needs_escaping;
what prints (on v5.16)
some\ \{\ data\ \}\ here
and works on $1 as well. See linked docs for details. Also see \Q in perlre (search for \Q), which is how this is used inside a regex, say s/\Q$1//;. The \E stops escaping (what you don't need).
Some comments.
Relying on deletion so that the regex keeps finding further such patterns may be a risky design. If it isn't and you do use it there is no need for indices, since we have push
my #defParts;
while ($fileContents =~ /($pattern)/) {
push #defParts, $1;
$fileContents =~ s/\Q$1//;
}
where \Q is added in the regex. Better yet, as explained in melpomene's answer the substitution can be done in the while condition itself
push #defParts, $1 while $fileContents =~ s/($pattern)//;
where I used the statement modifier form (postfix syntax) for conciseness.
With the /g modifier in scalar context, as in while (/($pattern)/g) { .. }, the search continues from the position of the previous match in each iteration, and this is a usual way to iterate over all instances of a pattern in a string. Please read up on use of /g in scalar context as there are details in its behavior that one should be aware of.
However, this is tricky here (even as it works) as the string changes underneath the regex. If efficiency is not a concern, you can capture all matches with /g in list context and then remove them
my #all_matches = $fileContents =~ /$patt/g;
$fileContents =~ s/$patt//g;
While inefficient, as it makes two passes, this is much simpler and clearer.
I expect that Somedata cannot possibly, ever, contain }, for instance as nested { ... }, correct? If it does you have a problem of balanced delimiters, which is far more rounded. One approach is to use the core Text::Balanced module. Search for SO posts with examples.

Perl regular expressions troubles

I have a variable $rowref->[5] which contains the string:
" 1.72.1.13.3.5 (ISU)"
I am using XML::Twig to build modify an XML file and this variable contains the information for the version number of something. So I want to get rid of the whitespaces and the (ISU). I tried to use a substitution and XML::Twig to set the attribute:
$artifact->set_att(version=> $rowref->[5] =~ s/([^0-9\.])//g)
Interestingly what I got in my output was
<artifact [...] version="9"/>
I don't understand what I am doing wrong. I checked with a regular expression tester and it seems fine. Can somebody spot my error?
The return value of s/// is the number of substitutions it made, which in your case is 9. If you are using at least perl 5.14, add the r flag to the substitution:
If the "/r" (non-destructive) option is used then it runs the
substitution on a copy of the string and instead of returning the
number of substitutions, it returns the copy whether or not a
substitution occurred. The original string is never changed when
"/r" is used. The copy will always be a plain string, even if the
input is an object or a tied variable.
Otherwise, go through a temporary variable like this:
my $version = $rowref->[5];
$version =~ s/([^0-9\.])//g;
$artifact->set_att(version => $version);
The regex substitution changes the varialbe in place but returns the number of substitutions it made (1 without the /g modifier, if it was succesful).
my $str = 'words 123';
my $ret = $str =~ s/\d/numbers/g;
say "Got $ret. String is now: $str";
You can do the substitution first, $rowref->[5] =~ s/...//;, and then use the changed variable.

Regex: Lookaround && some syntax

I am studying about regular expression and struck with the
lookaround concept
and
with few syntax.
After doing googling, I thought it is a right forum to ask for help.
Please help with this concept.
As I am not good with understanding the explanation.
It will be great if I get plenty of different examples to understand.
For me the modifer /e and || are new in regex please help me in understanding
the real use. Below is my Perl Script.
$INPUT1="WHAT TO SAY";
$INPUT2="SAY HI";
$INPUT3="NOW SAY![BYE]";
$INPUT4="SAYO NARA![BYE]";
$INPUT1=~s/SAY/"XYZ"/e; # /e What is this modifier is for
$INPUT2=~s/HI/"XYZ"/;
$INPUT3=~s/(?<=\[)(\w+)(?=])/ "123"|| $1 /e; #What is '||' is use for and what its name
$INPUT4=~s/BYE/"123"/e;
print "\n\nINPUT1 = $INPUT1 \n \n ";
print "\n\nINPUT2 = $INPUT2 \n \n ";
print "\n\nINPUT3 = $INPUT3 \n \n ";
print "\n\nINPUT4 = $INPUT4 \n \n ";
Have a read of perlrequick and perlretut.
The /e modifier of the s/// substitution operator treats the replacement as Perl code rather than as a string. For example:
$x = "5 10"
$x =~ s/(\d+) (\d+)/$1 + $2/e;
# $x is now 15
Instead of replacing $x with the string "$1 + $2", it evaluates the Perl code $1 + $2 - where $1 is 5 and $2 is 10 - and puts the result into $x.
The || is not a regex operator, it's a normal Perl operator. It is the logical-or operator: if the left-hand side is a true value (not 0 or ''), it returns the left side, otherwise it returns the right side. You can look up perl operators in perlop.
A standard substitution operator looks like this:
s/PATTERN/REPLACEMENT/
Where the PATTERN is matched, it is replaced with REPLACEMENT. REPLACEMENT is treated as a double-quoted string so that you can put variables in there and it will just work.
s/PATTERN/$var1/
You can use this to include pieces of the matched test in your replacement.
s/PA(TT)ERN/$1/
Sometimes, however, this isn't enough. Perhaps you want to process the text and run a subroutine to work out what the replacement is. Here's a really contrived example. Suppose you have text that contains floating point numbers and you want to replace them with integers. A first approach might look like this:
#!/usr/bin/perl
use strict;
use warnings;
$_ = '12.34 5.678';
s/(\d+\.\d+)/int($1)/g;
print "$_\n";
That doesn't work, of course. You end up with "int(12.34) int(5.678)". But that string is a piece of code which you want to run in order to get the correct answer. That's what the /e option does. It treats the replacement string as code, runs it and uses the output as the replacement.
Changing the line in the example above to
s/(\d+\.\d+)/int($1)/ge;
gives us the the required result.
Now that you understand /e I hope that you don't need an explanation of ||. It's just the standard or operator that you use all the time. In your example, it means "the replacement string is either '123' or the contents of $1'. Of course, that doesn't make much sense as '123' is always going to be true, so $1 will never be used. Perhaps you wanted it the other way round - $1 or '123'.