Double interpolation of regular expressions in Perl


I have a Perl program that stores regular expressions in configuration files. They are in the form:
regex = ^\d+$
Elsewhere, the regex gets parsed from the file and stored in a variable - $regex.
I then use the variable when checking the regex, e.g.
$lValid = ($valuetocheck =~ /$regex/);
I want to be able to include perl variables in the config file, e.g.
regex = ^\d+$stored_regex$
But I can't work out how to do it.
When Perl parses a regular expression, it processes it in two stages.
First the variables are expanded, and then the regular expression itself is parsed.
What I need is a three-stage process:
First interpolate $regex, then interpolate the variables it contains, and then parse the resulting regular expression.
Both of the first two interpolations need to be "regular expression aware", e.g. they should know that the string contains $ as an anchor, etc.
Any ideas?

You can define the regexp in your configuration file like this:
regex = ^\d+(??{$stored_regex})$
But you will need to disable a security check in the block where you're using the regexp by doing this in your Perl program:
use re 'eval';
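A minimal sketch of how this could look in a program, assuming the stored pattern lives in a package variable (our $stored_regex) so that the embedded code block can find it by name. The sample value '[a-z]+' and the helper is_valid are hypothetical, not part of the original question. Bear in mind that anyone who can edit the config file can then run arbitrary Perl code.

```perl
use strict;
use warnings;

# Package variable so the embedded code block can see it by name.
# The value is an illustrative assumption.
our $stored_regex = '[a-z]+';

# As read from the config file: single quotes keep it uninterpolated.
my $regex = '^\d+(??{$stored_regex})$';

# Hypothetical helper wrapping the match.
sub is_valid {
    my ($value) = @_;
    use re 'eval';    # permit code blocks in a pattern interpolated at run time
    return $value =~ /$regex/ ? 1 : 0;
}

print "123abc: ", is_valid('123abc'), "\n";
print "123:    ", is_valid('123'), "\n";
```

Because use re 'eval' is lexically scoped, placing it inside the helper keeps the relaxed check away from the rest of the program.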

Using eval can help you here. Take a look at the following code, which precompiles a regexp so that it is ready to be used later:
my $compiled_regexp;
my $regexp = '^\d+$stored_regexp$';
my $stored_regexp = 'a';
eval "\$compiled_regexp = qr/$regexp/;";
print "$compiled_regexp\n";
The qr// operator precompiles a regexp: it builds the pattern without executing it, so you can build your regexps first and use them later.

Your Perl variables are not in scope within your configuration file, and I think that's a good thing. eval is scary.
You would be better off implementing your own templating.
So in the config file:
regex = ^\d+__TEMPLATE_FIELD__$
In the config file reader:
# something like this for every template field you need
$regex =~ s/__TEMPLATE_FIELD__/$stored_regex/g;
When using:
$lValid = ($valuetocheck =~ m/$regex/);
Move these around depending on at what point you want the template substitution to apply.
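The templating idea above can be sketched with a lookup table instead of one substitution per field. The field name STORED_REGEX, its value and the helper expand_template are assumptions made for illustration; they do not come from the original program.

```perl
use strict;
use warnings;

# Hypothetical template fields; names and values are illustrative only.
my %template_fields = (
    STORED_REGEX => '[a-z]+',
);

# Replace each __NAME__ placeholder with its value. Unknown placeholders
# are left untouched so a typo in the config file stays visible.
sub expand_template {
    my ($raw, $fields) = @_;
    $raw =~ s/__([A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*)__/
        exists $fields->{$1} ? $fields->{$1} : "__${1}__"/gex;
    return $raw;
}

# The raw value as it would be read from the config file.
my $regex    = expand_template('^\d+__STORED_REGEX__$', \%template_fields);
my $compiled = qr/$regex/;

print "regex: $regex\n";
print "123abc ", ('123abc' =~ $compiled ? "matches" : "does not match"), "\n";
print "123 ",    ('123'    =~ $compiled ? "matches" : "does not match"), "\n";
```

No eval is involved, so the config file can only insert the values you have chosen to expose.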

A tangentially related gotcha: If you do double interpolation inline, and you also have substitution strings in variables, consider:
# the concat with doublequotes in the replacement string
# are to make them PART OF THE STRING, NOT THE STRING DELIMITERS,
# in other words, so the 2nd interpolation sees a double quoted string :
# eval eval $replace -> eval $1 hello world -> syntax error
# eval eval $replace -> eval "$1 hello world" -> works ok
# see: http://www.perlmonks.org?node_id=687031
if($line =~ s/$search/'"' . $replace . '"'/ee) {
# STUFF...
}

Related

Perl regular expressions troubles

I have a variable $rowref->[5] which contains the string:
" 1.72.1.13.3.5 (ISU)"
I am using XML::Twig to modify an XML file, and this variable contains the version number of something. So I want to get rid of the whitespace and the (ISU). I tried to use a substitution and XML::Twig to set the attribute:
$artifact->set_att(version=> $rowref->[5] =~ s/([^0-9\.])//g)
Interestingly what I got in my output was
<artifact [...] version="9"/>
I don't understand what I am doing wrong. I checked with a regular expression tester and it seems fine. Can somebody spot my error?

The return value of s/// is the number of substitutions it made, which in your case is 9. If you are using at least perl 5.14, add the r flag to the substitution:
If the "/r" (non-destructive) option is used then it runs the
substitution on a copy of the string and instead of returning the
number of substitutions, it returns the copy whether or not a
substitution occurred. The original string is never changed when
"/r" is used. The copy will always be a plain string, even if the
input is an object or a tied variable.
Otherwise, go through a temporary variable like this:
my $version = $rowref->[5];
$version =~ s/([^0-9\.])//g;
$artifact->set_att(version => $version);
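For comparison, here is a small sketch of both return values side by side. The sample string is an assumption copied from the question as displayed, so the count it prints depends on how much whitespace the real value holds.

```perl
use 5.014;    # /r needs Perl 5.14 or later; this also enables strict and say
use warnings;

# Sample value, assumed from the question.
my $raw = ' 1.72.1.13.3.5 (ISU)';

# Without /r: the original is modified and the return value is a count.
my $copy  = $raw;
my $count = ($copy =~ s/[^0-9.]//g);

# With /r: the return value is the modified copy and $raw is untouched.
my $version = $raw =~ s/[^0-9.]//gr;

say "count:   $count";
say "version: $version";
say "raw:     '$raw'";
```

With /r the substitution can be written directly inside the set_att call, since the expression now yields the cleaned string.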

The regex substitution changes the variable in place but returns the number of substitutions it made (1 without the /g modifier, if it was successful).
my $str = 'words 123';
my $ret = $str =~ s/\d/numbers/g;
say "Got $ret. String is now: $str";
You can do the substitution first, $rowref->[5] =~ s/...//;, and then use the changed variable.

Using string variables containing literal escapes in a Perl substitution

I'm new to Perl and I found behaviour which I don't understand and can't solve.
I'm making a small find and replace program and there are some things I need to do. I have bunch of files that I need to process. Then I have a list of find / replace rules in an external text file. In replacing there I need three special things:
Replacing utf-8 characters (Czech diacritics)
Work with adding/removing lines (so working in a slurp mode)
Use regular expressions
I want a program that works alone, so I wrote it so that it takes three arguments:
The file to work on
What to find
What to replace.
I'm sending parameters in a loop from a bash script which parses the rules list and loads other files.
My problem is when I have a "\n" string in a rules list and I send it to the Perl script. If it's in the first part of replacement (in the find section) it looks for a newline correctly, but when it's in the second part (the replace section) it just prints \n instead of a newline.
I tried hardcoding "\n" to the string right into the variable instead of passing it from the list and then it works fine.
What's the reason Perl doesn't interpret the "\n" string there, and how can I make it work?
This is my code:
list.txt - One line from the external replacement list
1\. ?\\n?NÁZEV PŘÍPRAVKU;\\n<<K1>> NÁZEV PŘÍPRAVKU;
farkapitoly.sh - The bash script for parsing list.txt and cycling through all of the files and calling the Perl script
...
FILE="/home/tmp.txt"
while read LINE
do
FIND=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $1}'`
REPLACE=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $2}'`
perl -CA ./pathtiny.pl "$FILE" "$FIND" "$REPLACE"
done < list.txt
...
pathtiny.pl - The Perl script for find and replace
#!/usr/bin/perl
use strict;
use warnings;
use Modern::Perl;
use utf8; # Enable typing Unicode in Perl strings
use open qw(:std :utf8); # Enable Unicode to STDIN/OUT/ERR and filehandles
use Path::Tiny;
my $file = path("$ARGV[0]");
my $searchStr = "$ARGV[1]";
my $replaceStr = "$ARGV[2]";
# $replaceStr="\n<<K1>> NÁZEV PRÍPRAVKU"; # if I hardcode it here \n is replaced right away
print("Search String:", "$searchStr", "\n");
print("Replace String:", "$replaceStr", "\n\n");
my $guts = $file->slurp_utf8;
$guts =~ s/$searchStr/$replaceStr/gi;
$file->spew_utf8($guts);
If it's important, I'm using Linux Mint 13 64-bit on VirtualBox (under Win 8.1) and I have Perl v5.14.2. Every file is UTF-8 with Linux endings.
Example files can be found on pastebin. this should end up like this.
But the examples vary a lot. I need a universal way to write a newline in a replacement string so that it is replaced correctly.

The problem is that the replacement string is read literally from the file, so if your file contains
xx\nyy
then you will read exactly those six characters. Also, the replacement part of a substitution is evaluated as if it were in double quotes. So your replacement string is "$replaceStr", which interpolates the variable and goes no further, so you will again have xx\nyy in the new string. (By the way, please avoid using capital letters in local Perl identifiers as in practice they are reserved for globals such as Module::Names.)
The answer lies in using eval, or its equivalent - the /e modifier on the substitution.
If I write
my $str = '<b>';
my $r = 'xx\nyy';
$str =~ s/b/$r/;
then the replacement string is interpolated to xx\nyy, as you have experienced.
A single /e modifier evaluates the replacement as an expression instead of just a double-quoted string, but of course $r as an expression is xx\nyy again.
What you need is a second /e modifier, which does the same evaluation as a single /e and then does an additional eval of the result on top. For this it is cleanest if you use qq{ .. } as you need two levels of quotation.
If you write
$str =~ s/b/qq{"$r"}/ee
then perl will evaluate qq{"$r"} as an expression, giving "xx\nyy", which, when evaluated again will give you the string you need - the same as the expression 'xx' . "\n" . 'yy'.
Here's a full program
use strict;
use warnings;
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/qq{"$r"}/ee;
print $s;
output
<xx
yy>
But don't forget that, if your replacement string contains any double quotes, like this
my $r = 'xx\n"yy"'
then they must be escaped before putting the string through the substitution, as the expression itself also uses double quotes.
All of this is quite hard to grasp, so you may prefer the String::Escape module, which has an unbackslash function that will change a literal \n (and any other escapes) within a string to its equivalent character "\n". It's not a core module so you will probably need to install it.
The advantage is that you no longer need a double evaluation, as the replacement string can be just unbackslash $r, which gives the right result when it is evaluated as an expression. It also handles double quotes in $r without any problem, as the expression doesn't use double quotes itself.
The code using String::Escape goes like this
use strict;
use warnings;
use String::Escape 'unbackslash';
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/unbackslash $r/e;
print $s;
and the output is identical to that of the previous code.
Update
Here is a refactoring of your original program that uses String::Escape. I have removed Path::Tiny as I believe it is best to use Perl's built-in inplace-edit extension, which is documented under the General Variables section of perlvar.
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use 5.010;
use open qw/ :std :utf8 /;
use String::Escape qw/ unbackslash /;
our @ARGV;
my ($file, $search, $replace) = @ARGV;
print "Search String: $search\n";
print "Replace String: $replace\n\n";
@ARGV = ($file);
$^I = '';
while (<>) {
s/$search/unbackslash $replace/eg;
print;
}

Your string contains \n as two characters (a backslash followed by n), not as a single newline.
Perl interprets \n as a newline only when it appears in a string literal, that is, in your code.
The quick fix would be:
my $replaceStr = eval qq("$ARGV[2]"); # eval parses the text as a double-quoted literal, turning \n into a newline
or, if you don't like eval, you can use the String::Escape CPAN module (its unbackslash function).

You're wanting a literal string to be treated as if it were a double quoted string. To do that you'll have to translate any backslash followed by another character.
The other experts have shown you how to do that over the entire string (which is risky since it uses eval with unvalidated data). Alternatively, you could use a module, String::Escape, which requires an install (not a high bar, but too high for some).
However, the following does a translation of the return value string itself in a safe way, and then it can be used like a normal value in your other search and replace:
use strict;
use warnings;
my $r = 'xx\nyy';
$r =~ s/(\\.)/qq{"$1"}/eeg; # Translate \. as a double quoted string would
print $r;
Outputs:
xx
yy
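If you would rather avoid /ee altogether, a lookup table gives similar results for the common escapes. This is only a sketch: the table below covers a handful of escapes chosen for illustration, and the helper name unescape_basic is hypothetical.

```perl
use strict;
use warnings;

# Escapes this sketch understands; anything else is left as written.
my %escape_for = (
    n    => "\n",
    t    => "\t",
    r    => "\r",
    '\\' => '\\',
    '"'  => '"',
);

# Translate backslash escapes with a table lookup instead of eval.
sub unescape_basic {
    my ($text) = @_;
    $text =~ s/\\(.)/exists $escape_for{$1} ? $escape_for{$1} : "\\$1"/ges;
    return $text;
}

my $r = 'xx\nyy';
print unescape_basic($r), "\n";
```

Since nothing is evaluated as code, the input cannot do more than select an entry from the table.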

Is it possible to use a Perl foreach loop to process different files with R?

I am trying to process several files as input to an R script. For this I use a foreach loop in Perl, but R gives me a warning:
Problem while running this R command:
a <- read.table(file="~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/$newquery")
Error:
file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/Users/cristianarleyvelandiahuerto/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/$newquery': No such file or directory
Execution halted
My code is:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;
use Data::Dumper;
my $R = Statistics::R->new();
my @query = (
'~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce.RF00001.txt',
'~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_60.RF00001.txt',
'~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_70.RF00001.txt',
'~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_80.RF00001.txt',
'~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_90.RF00001.txt'
);
foreach my $query (@query) {
my $newquery = $query;
$newquery =~ s/(.*)\/(dvex_all.*)(\.txt)/$2$3/g;
print "$newquery\n";
$R->run(q`a <- read.table(file="~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/$newquery")`);
$R->run(q`res <- summary(a$V2)`);
my $output_value = $R->get('res');
print "Statistical Summary = $output_value\n";
}
With the regex I changed the name of the input, but R doesn't recognize it as a file. Can I do that? Any suggestions? Thanks!

You have:
$R->run(q`...`);
i.e., you're using the q operator. String interpolation is not done with q. The immediate solution is to use
$R->run(qq`...`);
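The difference is easy to see without R at all. In this sketch the file name and the shortened All/ path are illustrative. Note that only the first command needs qq: under qq, R's own $ in a$V2 would have to be escaped, otherwise Perl would try to interpolate a variable named $V2.

```perl
use strict;
use warnings;

# Illustrative file name.
my $newquery = 'dvex_all_rRNA_ce.RF00001.txt';

# q`...` behaves like single quotes: $newquery reaches R unexpanded.
my $literal = q`a <- read.table(file="All/$newquery")`;

# qq`...` behaves like double quotes: Perl fills in the file name.
my $expanded = qq`a <- read.table(file="All/$newquery")`;

# Under qq, R's own $ must be escaped (or the command kept in q).
my $summary = qq`res <- summary(a\$V2)`;

print "$literal\n$expanded\n$summary\n";
```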

You used Perl's quote operator q(), which has the same semantics as a single quoted string — that is, no variable interpolation. If you don't want to use a double quoted string (or the equivalent qq() operator), then you have to concatenate the variable into your query:
use feature 'say';
my $workdir = "~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/";
for my $query (@query) {
(my $newquery = $query) =~ s{\A.*/(?=dvex_all.*[.]txt\z)}{};
say $newquery;
# actually, here you should escape any <"> double quotes in $workdir and $newquery.
$R->run(q#a <- read.table(file="# . $workdir . $newquery . q#")#);
$R->run(q#res <- summary(a$V2)#);
my $output_value = $R->get('res');
say "Statistical Summary = $output_value";
}
Other improvements I made:
Proper indentation is the first step to correct code.
The say function is more comfortable than print.
The substitution has a better delimiter, and now clearly shows what it does: deleting the path from the filename. Actually, we should be using one of the cross-platform modules that do this.
I used the substitute-in-copy idiom (my $copy = $orig) =~ s///. From Perl 5.14 on, you can use the /r flag instead: my $copy = $orig =~ s///r.
The /g flag for the regex is useless.
I anchored the match at the start and end of the string.
The q`` strings now have a more visible delimiter.

I don't know R, but it looks like you should concatenate your string:
$R->run(q`a <- read.table(file="~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/` . $newquery . q`")`);

R will not concatenate a string using a "$" operator. That "$query" is inside an R function argument so you cannot rely on Perl-operators. (Caveat: I'm not a Perl-user, so I'm making an assumption that your code is creating an R object within the loop named 'query') If I'm right then you may need to use paste0():
The file argument might look like:
file=paste0("~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/", query)

perl replacing serialized strings from sql dump

I'm having to replace fqdn's inside a SQL dump for website migration purposes. I've written a perl filter that's supposed to take STDIN, replace the serialized strings containing the domain name that's supposed to be replaced, replace it with whatever argument is passed into the script, and output to STDOUT.
This is what I have so far:
my $search = $ARGV[0];
my $replace = $ARGV[1];
my $offset_s = length($search);
my $offset_r = length($replace);
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
while (<STDIN>) {
my @fs = split(';', $_);
foreach (@fs) {
chomp;
if (m#$regex#g) {
my ( $len, $extra, $str ) = ( $1, $2, $3 );
my $new_len = $len - $offset_s + $offset_r;
$str =~ eval { s/$search/$replace/ };
print 's:' . $new_len . ':' . $extra . $str . '\"'."\n";
}
}
}
The filter gets passed data that may look like this (this is taken from a WordPress dump, but we're also supposed to accommodate Drupal dumps):
INSERT INTO `wp_2_options` VALUES (1,'siteurl','http://to.be.replaced.com/wordpress/','yes'),(125,'dashboard_widget_options','
a:2:{
s:25:\"dashboard_recent_comments\";a:1:{
s:5:\"items\";i:5;
}
s:24:\"dashboard_incoming_links\";a:2:{
s:4:\"home\";s:31:\"http://to.be.replaced.com/wordpress\";
s:4:\"link\";s:107:\"http://blogsearch.google.com/blogsearch?scoring=d&partner=wordpress&q=link:http://to.be.replaced.com/wordpress/\";
}
}
','yes'),(148,'theme_175','
a:1:{
s:13:\"courses_image\";s:37:\"http://to.be.replaced.com/files/image.png\";
}
','yes')
The regex works if I don't have any periods in my $search. I've tried escaping the periods, i.e. domain\.to\.be\.replaced, but that didn't work. I'm probably doing this either in a very roundabout way or missing something obvious. Any help would be greatly appreciated.

There is no need to eval your regular expression just because it includes variables. Also, to stop metacharacters in variables such as $search from having their special meaning, escape them with the quotemeta() function or put the variable between \Q and \E inside the regexp. So instead of:
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
Use:
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)(\Q$search\E.*)\\\"};
or
my $quoted_search = quotemeta $search;
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)($quoted_search.*)\\\"};
And the same advice for this line:
$str =~ eval { s/$search/$replace/ };
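To make the effect concrete, here is a small sketch comparing an unquoted and a quoted pattern, followed by the substitution written without eval. The replacement domain and the look-alike host are made-up sample values.

```perl
use strict;
use warnings;

my $search  = 'to.be.replaced.com';
my $replace = 'new.example.org';    # hypothetical target domain

# Unquoted: each "." is a wildcard, so look-alike hosts match too.
my $loose = qr/$search/;
# Quoted: \Q...\E escapes the metacharacters in $search.
my $exact = qr/\Q$search\E/;

for my $host ('to.be.replaced.com', 'toXbeXreplacedXcom') {
    printf "%-20s loose=%d exact=%d\n", $host,
        ($host =~ $loose ? 1 : 0), ($host =~ $exact ? 1 : 0);
}

# The substitution needs no eval either.
(my $str = 'to.be.replaced.com/wordpress') =~ s/\Q$search\E/$replace/;
print "$str\n";
```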

you have to double the escape char \ in your $search variable for the interpolated string to contain the escaped periods.
i.e. domain\.to\.be\.replaced -> domain.to.be.replaced (not wanted)
while domain\\.to\\.be\\.replaced -> domain\.to\.be\.replaced (correct).

I'm not sure your Perl regex would handle a serialized string that contains the old domain name several times.
I made a gist with a script using bash, sed and one big Perl regex for this same problem. You may give it a try.
The regex I use is something like this (split across lines for readability, with -7 as the known difference between the domain name lengths):
perl -n -p -i -e '1 while s#
([;|{]s:)
([0-9]+)
:\\"
(((?!\\";).)*?)
(domain\.to\.be\.replaced)
(.*?)
\\";#"$1".($2-7).":\\\"$3new.domain.tld$6\\\";"#ge;' file
It may not be the best one, but at least it seems to do the job. The g option handles lines containing several serialized strings, and the while loop repeats the whole substitution until no replacement occurs (for serialized strings containing several occurrences of the domain name). I'm not enough of a regex fan to try a recursive one.

Regex as a command line arg for filtering lines with a particular value

I want to be able to take an argument from the command line and use it as a regular expression within my script to filter lines from my file. A simple example
$ perl script.pl id_4
In script.pl:
...
my $exp = shift;
while(my $line = <$fh>){
if($line =~ /$exp/){
print $line,"\n";
}
}
...
My actual script is a bit more complicated and does other manipulations to the line to extract information and produce a different output. My problem is that I have situations where I want to filter out every line that contains "id_4" instead of only select lines containing "id_4". Normally this could be achieved by
if($line !~ /$exp/)
but, if possible, I don't want to alter my script to accept a more complex set of arguments (e.g. use !~ if second parameter is "ne", and =~ if not).
Can anyone think of a regex that I can use (beside a long "id_1|id_2|id_3|id_5...") to filter out lines containing one particular value out of many possibilities? I fear I'm asking for the daft here, and should probably just stick to the sensible and accept a further argument :/.

Why choose? Have both.
my $exp = join "|", grep !/^!/, @ARGV;
my @not = grep /^!/, @ARGV;
s/^!// for @not;
my $exp_not = join "|", @not;
...
if (( $line =~ $exp ) && ( $line !~ $exp_not )) {
# do stuff
}
Usage:
perl script.pl orange soda !light !diet

There is a way to invert regular expressions, so you can do matches like "all strings which do not contain a match for subexpr". Without the operators which express this directly (i.e. using only the basic positive-matching regex operators), it is still possible but leads to large and unwieldy regular expressions (possibly, combinatorial explosion in the regex size).
For a simple example, look at my answer to this question: how to write a regex which matches everything but the string "help". (It's quite a simplification that the match is anchored to start and end.) Match all letter/number combos but specific word?
Traditional Unix tools have hacks for situations when you want to just invert the match of the expression as a whole: grep versus grep -v. Or vi: :g/pat/ versus :v/pat/, etc. In this way, the implementors ducked out implementing the difficult regex operators that don't fit into the simple NFA construction approach.
The easiest thing is to do the same thing and have a convention for coarse-grained negation: an include pattern and an exclude pattern.
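Both routes can be sketched briefly. Perl's regex engine supports negative lookahead, so a pattern such as ^(?!.*id_4) can be passed as the single argument without changing the script (it needs quoting in the shell). The include/exclude helper keep_line and the sample lines below are hypothetical.

```perl
use strict;
use warnings;

# Sample input lines, made up for illustration.
my @lines = ("id_1 alpha", "id_4 beta", "id_5 gamma");

# Lookahead form: matches only when "id_4" occurs nowhere in the line.
my $exp = '^(?!.*id_4)';
my @kept_by_lookahead = grep { /$exp/ } @lines;

# Include/exclude convention: two patterns, either may be undefined.
sub keep_line {
    my ($line, $include, $exclude) = @_;
    return 0 if defined $include && $line !~ $include;
    return 0 if defined $exclude && $line =~ $exclude;
    return 1;
}
my @kept_by_pair = grep { keep_line($_, undef, qr/id_4/) } @lines;

print "lookahead: @kept_by_lookahead\n";
print "pair:      @kept_by_pair\n";
```

Treating an undefined pattern as "no constraint" avoids matching against an empty pattern when one of the two lists is missing.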
