Perl regex as user search input (sanitisation) - regex

I need to make sure that regex, that is passed as user input will not accidentally be terminated and turn into arbitrary Perl code, but at the same time work for basic filtering purposes.
Important! This part of the code is run in user-jailed mode, meaning that potentially, it can only be self-exploited. Apart from this, UI is only exposed to particular user, and potentially run against limited number of files, thus potential DoS risks are very minimal.
In order to reach my goal, I created custom function that would first quotemeta all, and later un-escape needed only for regex to run characters.
Example:
# Allow short range of special chars to be left unescaped
# to let regex work, while at the same time prevent possible
# command injection or premature regex termination
my $mask = $in{'mask'};
sub quotemeta_dangerous
{
my ($string) = #_;
$string = quotemeta($string);
$string =~ s/\\\\/\\/g;
$string =~ s/\\\+/+/g;
$string =~ s/\\\*/*/g;
$string =~ s/\\\$/\$/g;
$string =~ s/\\\^/\^/g;
$string =~ s/\\\(/\(/g;
$string =~ s/\\\)/\)/g;
$string =~ s/\\\{/\{/g;
$string =~ s/\\\}/\}/g;
$string =~ s/\\\[/\[/g;
$string =~ s/\\\]/\]/g;
$string =~ s/\\\?/?/g;
$string =~ s/\\\././g;
$string =~ s/\\\-/-/g;
return $string;
}
my $sanitized_mask = quotemeta_dangerous($mask);
if ($filename =~ /$sanitized_mask/) {
# matched
}
Questions:
Whether my solution above will help me to achieve my goals safely, considering mentioned, important side notes. What are the potential risks that I don't see here?
As side, but familiar question, when further running substitutions, does the replace part can be injected/exploited as well, and if it is, how to safely run substitutions in contents on matched files?
Example:
$file_contents =~ s/\Q$text_to_find\E/$text_to_replace_with/g;
Is $text_to_replace_with can be avoided here as security risk, when passed from user as it is?

I'm not sure what you mean by terminated. As for running arbitrary Perl code, you can't do that from user input (unless the program enables it explicitly with e.g. eval() or use re 'eval'). If you could just inject Perl code from user input, your function wouldn't protect against it: It lets through e.g. (?{system+qq(rm -rf ~)}) in runnable form (runnable, that is, if it were part of the code, not input data).
What you can do with a user input regex is create a DoS: Make it consume a lot of CPU and hang the program. Your function does not protect against that. For example, try:
'aaaaaaaaaa' =~ /(((\1?[a-z]*)*)*)*[b-z]/
Or with an even longer chain of a's. (There are probably ways to shorten this code; I was just throwing random bits together to see whether they finished matching quickly.)
If you want to guard against that, have a look at RE2:
RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk.
You can use it in your code by doing
{
use re::engine::RE2 -strict => 1;
# now regexes compiled in this scope will use the RE2 engine
...
}
That's easy to answer. There's no danger here; $text_to_replace_with is simply treated as a string.
(If you want to create danger, you need either
/e and eval(), or
/ee, which is the same thing.
Technically you don't need /e, but that still leaves a very visible eval() in your code. Again, you can't attack this as a user; you have to code it in.)

Related

Remove certain characters from a regex group

I have a string that looks like this (key":["value","value","value"])
"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]
and I use the following regex to select from the string. (the regex is setup in a way where it wont select a string that looks like this "key":[{"key":"value","key":"value"}] )
(?<=:\[").*?(?="])
Resulting Selection:
google.co.uk","google.com","google.com","google.com","google.co.uk
I want to remove the " in that select string, and i was wondering if there was an easy way to do this using the replace command. Desired result...
"emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]
How do I solve this problem?
If your string indeed has the form "key":["v1", "v2", ... "vN"], you can split off the part that needs to be changed, replace "," by a space in it, and re-assemble:
my #parts = split / (\["\s* | \s*\"]) /x, $string; #"
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', #parts;
The regex pattern for the separator in split is captured since in that case the separators are also in the returned list, what is helpful here for putting the string back together. Then, we need to change the third element of the array.
In this approach, we have to change a specific element in the array so if your format varies, even a little, this may not (or still may) be suitable.
This should of course be processed as JSON, using a module. If the format isn't sure, as indicated in a comment, it would be best to try to ensure that you have JSON. Picking bits and pieces like above (or below) is a road to madness once requirements slowly start evolving.
The same approach can be used in a regex, and this may in fact have an advantage to be able to scoop up and ignore everything preceding the : (with split that part may end up with multiple elements if the format isn't exactly as shown, what then affects everything)
$string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
my $e = $2;
my $n = $1 =~ s/",\s*"/ /gr;
$n.$e
}ex;
Here /e modifier makes it so that the replacement side is evaluated as code, where we do the same as with the split above. Notes on regex
Have to save away $2 first, since it gets reset in the next regex
The /r modifier†, which doesn't change its target but rather returns the changed string, is what allows us to use substitution operator on the read-only $1
If nothing gets captured for $2, and perhaps for $1, that means that there was no match and the outcome is simply that $string doesn't change, quietly. So if this substitution should always work then you may want to add handling of such unexpected data
Don't need a $n above, but can return ($1 =~ s/",\s*"/ /gr) . $e
Or, using lookarounds as attempted
$string =~ s{ (?<=:\[") (.+?) (?="\]) }{ $1 =~ s/",\s*"/ /gr }egx;
what does reduce the amount of code, but may be trickier to work with later.
While this is a direct answer to the question I think it's least maintainable.
†  This useful modifier, for "non-destructive substitution," appeared in v5.14. In earlier Perl versions we would copy the string and run regex on that, with an idiom
(my $n = $1) =~ s/",\s*"/ /g;
In the lookarounds-example we then need a little more
$string =~ s{...}{ (my $n = $1) =~ s/",\s*"/ /g; $n }gr
since s/ operator returns the number of substitutions made while we need $n to be returned from that whole piece of code in {} (the replacement side), to be used as the replacement.
You can use this \G based regex to start the match with :[" and further captures the values appropriately and replaces matched text so that only comma is retained and doublequotes are removed.
(:\[")|(?!^)\G([^"]+)"(,)"
Regex Demo
Your text is almost proper JSON, so it's really easy to go the final inch and make it so, and then process that:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say postderef/;
no warnings qw/experimental::postderef/;
use JSON::XS; # Install through your OS package manager or a CPAN client
my $str = q/"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]/;
my $json = JSON::XS->new();
my $obj = $json->decode("{$str}");
my $fixed = $json->ascii->encode({emailDomains =>
join(', ', $obj->{'emailDomains'}->#*)});
$fixed =~ s/^\{|\}$//g;
say $fixed;
Try Regex: " *, *"
Replace with: ,
Demo

Best way to deal with "Unescaped braces in regex" inside Perl regex

I recently started learning Perl to automate some mindless data tasks. I work on windows machines, but prefer to use Cygwin. Wrote a Perl script that did everything I wanted fine in Cygwin, but when I tried to run it with Strawberry Perl on Windows via CMD I got the "Unescaped left brace in regex is illegal here in regex," error.
After some reading, I am guessing my Cygwin has an earlier version of Perl and modern versions of Perl which Strawberry is using don't allow for this. I am familiar with escaping characters in regex, but I am getting this error when using a capture group from a previous regex match to do a substitution.
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
my $fileContents = do { local $/; <$fh> };
my $i = 0;
while ($fileContents =~ /(.*Part[^\}]*\})/) {
$defParts[$i] = $1;
$i = $i + 1;
$fileContents =~ s/$1//;
}
Basically I am searching through a file for matches that look like:
Part
{
Somedata
}
Then storing those matches in an array. Then purging the match from the $fileContents so I avoid repeats.
I am certain there are better and more efficient ways of doing any number of these things, but I am surprised that when using a capture group it's complaining about unescaped characters.
I can imagine storing the capture group, manually escaping the braces, then using that for the substitution, but is there a quicker or more efficient way to avoid this error without rewriting the whole block? (I'd like to avoid special packages if possible so that this script is easily portable.)
All of the answers I found related to this error were with specific cases where it was more straightforward or practical to edit the source with the curly braces.
Thank you!
I would just bypass the whole problem and at the same time simplify the code:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i = $i + 1;
}
Here we simply do the substitution first. If it succeeds, it will still set $1 and return true (just like plain /.../), so there's no need to mess around with s/$1// later.
Using $1 (or any variable) as the pattern would mean you have to escape all regex metacharacters (e.g. *, +, {, (, |, etc.) if you want it to match literally. You can do that pretty easily with quotemeta or inline (s/\Q$1//), but it's still an extra step and thus error prone.
Alternatively, you could keep your original code and not use s///. I mean, you already found the match. Why use s/// to search for it again?
while ($fileContents =~ /(.*Part[^\}]*\})/) {
...
substr($fileContents, $-[0], $+[0] - $-[0], "");
}
We already know where the match is in the string. $-[0] is the position of the start and $+[0] the position of the end of the last regex match (thus $+[0] - $-[0] is the length of the matched string). We can then use substr to replace that chunk by "".
But let's keep going with s///:
my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
$defParts[$i] = $1;
$i++;
}
$i = $i + 1; can be reduced to $i++; ("increment $i").
my #defParts;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
push #defParts, $1;
}
The only reason we need $i is to add elements to the #defParts array. We can do that by using push, so there's no need for maintaining an extra variable. This saves us another line.
Now we probably don't need to destroy $fileContents. If the substitution exists only for the benefit of this loop (so I doesn't re-match already extracted content), we can do better:
my #defParts;
while ($fileContents =~ /(.*Part[^\}]*\})/g) {
push #defParts, $1;
}
Using /g in scalar context attaches a "current position" to $fileContents, so the next match attempt starts where the previous match left off. This is probably more efficient because it doesn't have to keep rewriting $fileContents.
my #defParts = $fileContents =~ /(.*Part[^\}]*\})/g;
... Or we could just use //g in list context, where it returns a list of all captured groups of all matches, and assign that to #defParts.
my #defParts = $fileContents =~ /.*Part[^\}]*\}/g;
If there are no capture groups in the regex, //g in list context returns the list of all matched strings (as if there had been ( ) around the whole regex).
Feel free to choose any of these. :-)
As for the question of escaping, that's what quotemeta is for,
my $needs_escaping = q(some { data } here);
say quotemeta $needs_escaping;
what prints (on v5.16)
some\ \{\ data\ \}\ here
and works on $1 as well. See linked docs for details. Also see \Q in perlre (search for \Q), which is how this is used inside a regex, say s/\Q$1//;. The \E stops escaping (what you don't need).
Some comments.
Relying on deletion so that the regex keeps finding further such patterns may be a risky design. If it isn't and you do use it there is no need for indices, since we have push
my #defParts;
while ($fileContents =~ /($pattern)/) {
push #defParts, $1;
$fileContents =~ s/\Q$1//;
}
where \Q is added in the regex. Better yet, as explained in melpomene's answer the substitution can be done in the while condition itself
push #defParts, $1 while $fileContents =~ s/($pattern)//;
where I used the statement modifier form (postfix syntax) for conciseness.
With the /g modifier in scalar context, as in while (/($pattern)/g) { .. }, the search continues from the position of the previous match in each iteration, and this is a usual way to iterate over all instances of a pattern in a string. Please read up on use of /g in scalar context as there are details in its behavior that one should be aware of.
However, this is tricky here (even as it works) as the string changes underneath the regex. If efficiency is not a concern, you can capture all matches with /g in list context and then remove them
my #all_matches = $fileContents =~ /$patt/g;
$fileContents =~ s/$patt//g;
While inefficient, as it makes two passes, this is much simpler and clearer.
I expect that Somedata cannot possibly, ever, contain }, for instance as nested { ... }, correct? If it does you have a problem of balanced delimiters, which is far more rounded. One approach is to use the core Text::Balanced module. Search for SO posts with examples.

How can I use a pattern that includes regex special characters in Perl's s/// operator?

The pattern I am using in my s/// contains regex special characters. How can I quote them so that the regex engine treats them as literal characters.
Can I call a subroutine (quotemeta) in the pattern like this?
$string =~ s/quotemeta($regex)/$new/g;
How can I do it please?
You mean you want to quote the pattern in a s/// operator.
You can do that using the \Q and \E escapes:
$s =~ /\Q[a-z]\E//;
would look for the literal string [a-z] instead of a single character among a ... z.
You can find this, and other useful information in perldoc perlreref which is installed on your computer alongside perl.
Perl comes with excellent documentation. One should periodically skim/read all of it.
few ways:
my $quotedregex = quotemeta($regex);
$string =~ s/$quotedregex/$new/g;
or
$string =~ s/${\quotemeta($regex)}/$new/g;
or
$string =~ s/\Q$regex\E/$new/g;
People may well come here hoping to find a way to insert the return value of a subroutine into a regex pattern in situ. Note that this also applies to constants implemented with the constant pragma, which are implemented as inlined subroutines
There are ways to do that, the traditional ones being to take a reference to the subroutine's return value and dereference the result. This works either as a scalar reference or an anonymous array
There is also the Interpolation module, which provides tied hashes that look very much like procedure calls and so make your code neater and easier to read
Given this basic program
use strict;
use warnings 'all';
use Interpolation E => 'eval', mysub => \&mysub;
my $string = '<<<OLD>>>';
my $new = 'NEW';
sub mysub { 'OLD' }
any of the following will replace OLD with NEW in $string
Dereferencing a reference to the subroutine's return value
$string =~ s/${\mysub($regex)}/$new/;
Dereferencing an anonymous array containing the subroutine's return value
$string =~ s/#{[mysub($regex)]}/$new/;
Using Interpolation, a simple eval (the safe sort)
$string =~ s/$E{mysub($regex)}/$new/;
Using Interpolation with a tied hash %mysub to call the subroutine directly
$string =~ s/$mysub{$regex}/$new/;
This demonstrates an answer the original question using Interpolation and quotemeta
use strict;
use warnings 'all';
use Interpolation Q => \&CORE::quotemeta;
my $string = '<<<+*.$^>>>';
my $regex = '+*.$^';
my $new = 'NEW';
$string =~ s/$Q{$regex}/XXX/;
print $string, "\n";
output
<<<XXX>>>
But note that the built-in escape \Q is clearer and intended for this purpose, so you should use
$string =~ s/\Q$regex\E/XXX/
in this specific case

Perl regex substitution using external parameters

Consider the following example:
my $text = "some_strange_thing";
$text =~ s/some_(\w+)_thing/no_$1_stuff/;
print "Result: $text\n";
It prints
"Result: no_strange_stuff"
So far so good.
Now, I need to get both the match and replacement patterns from external sources (user input, config file, etc).
Naive solution appears to be like this:
my $match = "some_(\\w+)_thing";
my $repl = "no_\$1_stuff";
my $text = "some_strange_thing";
$text =~ s/$match/$repl/;
print "Result: $text\n";
However:
"Result: no_$1_stuff".
What's wrong? How can I get the same outcome with externally supplied patterns?
Solution 1: String::Substitution
Use String::Substitution package:
use String::Substitution qw(gsub_modify);
my $find = 'some_(\w+)_thing';
my $repl = 'no_$1_stuff';
my $text = "some_strange_thing";
gsub_modify($text, $find, $repl);
print $text,"\n";
The replacement string only interpolates (term used loosely) numbered match vars (like $1 or ${12}). See "interpolate_match_vars" for more information.
This module does not save or interpolate $& to avoid the "considerable performance penalty" (see perlvar).
Solution 2: Data::Munge
This is a solution mentioned by Grinnz in the comments below.
The Data::Munge can be used the following way:
use Data::Munge;
my $find = qr/some_(\w+)_thing/;
my $repl = 'no_$1_stuff';
my $text = 'some_strange_thing';
my $flags = 'g';
print replace($text, $find, $repl, $flags);
# => no_strange_stuff
Solution 3: A quick'n'dirty way (if replacement won't contain double quotes and security is not considered)
DISCLAIMER: I provide this solution as this approach can be found online, but its caveats are not explained. Do not use it in production.
With this approach, you can't have a replacement string that includes a " double quotation mark and, since this is equivalent to handing whoever is writing the configuration file direct code access, it should not be exposed to Web users (as mentioned by Daniel Martin).
You can use the following code:
#!/usr/bin/perl
my $match = qr"some_(\w+)_thing";
my $repl = '"no_$1_stuff"';
my $text = "some_strange_thing";
$text =~ s/$match/$repl/ee;
print "Result: $text\n";
See IDEONE demo
Result:
Result: no_strange_stuff
You have to
Declare the replacement in '"..."' so as $1 could be later evaluated
Use /ee to force the double evaluation of the variables in the replacement.
A modifier available specifically to search and replace is the s///e evaluation modifier. s///e treats the replacement text as Perl code, rather than a double-quoted string. The value that the code returns is substituted for the matched substring. s///e is useful if you need to do a bit of computation in the process of replacing text.
You can use qr to instantiate pattern for the regex (qr"some_(\w+)_thing").
Essentially the same approach as the accepted solution, but I kept the initial lines the same as the problem statement, since I thought that might make it easier to fit into more situations:
my $match = "some_(\\w+)_thing";
my $repl = "no_\$1_stuff";
my $qrmatch = qr($match);
my $code = $repl;
$code =~ s/([^"\\]*)(["\\])/$1\\$2/g;
$code = qq["$code"];
if (!defined($code)) {
die "Couldn't find appropriate quote marks";
}
my $text = "some_strange_thing";
$text =~ s/$qrmatch/$code/ee;
print "Result: $text\n";
Note that this works no matter what is in $repl, whereas the naive solution has issues if $repl contains a double quote character itself, or ends in a backslash.
Also, assuming that you're going to run the three lines at the end (or something like it) in a loop, do make sure that you don't skip the qr line. It will make a huge performance difference if you skip the qr and just use s/$match/$code/ee.
Also, even though it's not as trivial to get arbitrary code execution with this solution as it is with the accepted one, it wouldn't surprise me if it's still possible. In general, I'd avoid solutions based on s///ee if the $match or $repl come from untrusted users. (e.g., don't build a web service out of this)
Doing this kind of replacement securely when $match and $repl are supplied by untrusted users should be asked as a different question if your use case includes that.

Regex as a command line arg for filtering lines with a particular value

I want to be able to take an argument from the command line and use it as a regular expression within my script to filter lines from my file. A simple example
$ perl script.pl id_4
In script.pl:
...
my $exp = shift;
while(my $line = <$fh){
if($line =~ /$exp/){
print $line,"\n";
}
}
...
My actual script is a bit more complicated and does other manipulations to the line to extract information and produce a different output. My problem is that I have situations where I want to filter out every line that contains "id_4" instead of only select lines containing "id_4". Normally this could be achieved by
if($line !~ /$exp/)
but, if possible, I don't want to alter my script to accept a more complex set of arguments (e.g. use !~ if second parameter is "ne", and =~ if not).
Can anyone think of a regex that I can use (beside a long "id_1|id_2|id_3|id_5...") to filter out lines containing one particular value out of many possibilities? I fear I'm asking for the daft here, and should probably just stick to the sensible and accept a further argument :/.
Why choose? Have both.
my $exp = join "|", grep !/^!/, #ARGV;
my #not = grep /^!/, #ARGV;
s/^!// for #not;
my $exp_not = join "|", #not;
...
if (( $line =~ $exp ) && ( $line !~ $exp_not )) {
# do stuff
}
Usage:
perl script.pl orange soda !light !diet
There is a way to invert regular expressions, so you can do matches like "all strings which do not contain a match for subexpr". Without the operators which express this directly (i.e. using only the basic positive-matching regex operators), it is still possible but leads to large and unwieldy regular expressions (possibly, combinatorial explosion in the regex size).
For a simple example, look at my answer to this question: how to write a regex which matches everything but the string "help". (It's a quite a simplification that the match is anchored to start and end.) Match all letter/number combos but specific word?
Traditional Unix tools have hacks for situations when you want to just invert the match of the expression as a whole: grep versus grep -v. Or vi: :g/pat/ versus :v/pat/, etc. In this way, the implementors ducked out implementing the difficult regex operators that don't fit into the simple NFA construction approach.
The easiest thing is to do the same thing and have a convention for coarse-grained negation: an include pattern and an exclude pattern.