Perl tainting via regular expression

Short version
In the code below, $1 is tainted and I don't understand why.
Long version
I'm running Foswiki on a system with perl v5.14.2 with -T taint check mode enabled.
Debugging a problem with that setup, I managed to construct the following SSCCE. (Note that I edited this post: the first version was longer and more complicated, and comments still refer to it.)
#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");
Although the input string $var is untainted and the regular expression is fixed, the resulting capture group $1 is tainted, which I find really strange.
The perlsec manual has this to say about taint and regular expressions:
Values may be untainted by using them as keys in a hash; otherwise the
only way to bypass the tainting mechanism is by referencing
subpatterns from a regular expression match. Perl presumes that if
you reference a substring using $1, $2, etc., that you knew what you
were doing when you wrote the pattern.
I would imagine that even if the input were tainted, the output would still be untainted. To observe the reverse, tainted output from untainted input, feels like a strange bug in perl. But if one reads more of perlsec, it also points users at the SECURITY section of perllocale. There we read:
when use locale is in effect, Perl uses the tainting mechanism (see
perlsec) to mark string results that become locale-dependent, and
which may be untrustworthy in consequence. Here is a summary of the
tainting behavior of operators and functions that may be affected by
the locale:
Comparison operators (lt, le, ge, gt and cmp) […]
Case-mapping interpolation (with \l, \L, \u or \U) […]
Matching operator (m//):
Scalar true/false result never tainted.
Subpatterns, either delivered as a list-context result or as $1
etc. are tainted if use locale (but not use locale
':not_characters') is in effect, and the subpattern regular
expression contains \w (to match an alphanumeric character), \W
(non-alphanumeric character), \s (whitespace character), or \S
(non whitespace character). The matched-pattern variable, $&, $`
(pre-match), $' (post-match), and $+ (last match) are also
tainted if use locale is in effect and the regular expression contains
\w, \W, \s, or \S.
Substitution operator (s///) […]
[⋮]
This looks like it should be an exhaustive list. And I don't see how it could apply: My regex is not using any of \w, \W, \s or \S, so it should not depend on locale.
Can someone explain why this code taints the variable $1?

There is currently a discrepancy between the documentation as quoted in the question and the actual implementation as of perl 5.18.1. The problem is bracketed character classes: the documentation mentions \w, \s, \W, \S in what sounds like an exhaustive list, while the implementation taints on pretty much every use of […].
The right solution would probably be somewhere in between: character classes like [[:word:]] should taint, since they depend on the locale. My fixed list should not. Character ranges like [a-z] depend on collation, so in my personal opinion they should taint as well. \d depends on what a locale considers a digit, so it, too, should taint, even though it is neither one of the escape sequences mentioned so far nor a bracketed class.
So in my opinion, both the documentation and the implementation need fixing. Perl devs are working on this. For progress information, please look at the perl bug report I filed.
For a fixed list of characters, one viable workaround appears to be a formulation as a disjunction, i.e. (?:\.|_) instead of [._]. It is more verbose, but should work even with the current (in my opinion buggy) perl versions.

Using a backreference as key in a hashmap within a regex-substitution?

I am learning the Perl language and I stumbled upon the following question:
Is it possible to use a backreference as a key in a substitution argument, e.g. something like:
$hm{"Cat"} = "Dog";
while(<>){
s/Cat/$hm{\1}/;
print;
}
That is, I want to tell Perl to look up a key which is contained in a capture argument.
I know that this is a silly example, but I am just curious whether it is possible to use such a key lookup with a backreference in a substitution.
Use $1 instead.
While backrefs like \1 work in the replacement part of a substitution, they only work there in string context. $hm{KEY} accesses an item in a hash, and the KEY part can be a bareword or an expression. In an expression, \1 would be a “reference to a literal scalar with value 1”, which would stringify as something like SCALAR(0x55776153ecb0), not a back-reference as in a string. Instead, we can access the value of captures in the regex with variables like $1.
But that requires us to capture a part of the regex. I would write it as:
s/(Cat)/$hm{$1}/;
As a rule of thumb, only use backrefs like \1 within a regex pattern. Everywhere else, use capture variables like $1. If you use warnings, Perl will also tell you "\1 better written as $1", though it wouldn't have detected the particular issue in your case, as the \1 was still valid syntax, albeit with a different meaning.
If you are looking at really old code, you'll see people using the \1 form on the replacement side of the substitution. Sometimes you'll see it in really new code; it's a Perl 4 thing that still works, but Perl 5 added a warning. If you have warnings turned on, perl will tell you that (although I don't know when this warning started):
$ perl5.36.0 -wpe 's/cat(dog)/\1/'
\1 better written as $1 at -e line 1.
With diagnostics you get even more information about the warning:
$ perl5.36.0 -Mdiagnostics -wpe 's/cat(dog)/\1/'
\1 better written as $1 at -e line 1 (#1)
(W syntax) Outside of patterns, backreferences live on as variables.
The use of backslashes is grandfathered on the right-hand side of a
substitution, but stylistically it's better to use the variable form
because other Perl programmers will expect it, and it works better if
there are more than 9 backreferences.
There are many other warnings that Perl uses to show you better ways to do things.
A level above that is perlcritic, which is an opinionated set of policies about what some people find to be good style. It's not a terrible place to start before you develop your own ideas about what works for you or your team.
It has been explained that one wants to use a capture for that, not a backreference, like
perl -wE'$_=shift//q(hal); %h = (a => 7); s/(a)/$h{$1}/; say'
What I'd like to add is a note on what happens if there isn't in fact a key for that capture.
We often capture complex patterns, not simple literals, and it can happen that our anticipated keys don't cover every case that may come up. A way to check for that involves the modifier /e, which makes the replacement part be evaluated as code; its return value is then substituted into the string:
perl -wE'$_=shift//q(hal); %h = (a => 7); s{(h)}{$h{$1} // "default"}e; say'
Now if the pattern is matched and captured but isn't a key (h) then the string default is substituted. An often sensible choice for default is the capture itself (if not a key put it back).
The replacement part must now be syntactically correct code, so no bare literals, etc.
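For comparison, the same lookup-with-fallback idea can be sketched in Python, whose re.sub accepts a function as the replacement (roughly the role /e plays here). This is only an illustration: the helper name substitute_from_map and the sample data are made up, and count=1 is used to mimic s/// without /g.

```python
import re

def substitute_from_map(text, pattern, mapping):
    """Replace the first match with mapping[capture], or put the capture back if it is not a key."""
    def replacement(match):
        captured = match.group(1)
        # Fall back to the captured text itself when there is no such key.
        return mapping.get(captured, captured)
    # count=1 replaces only the first match, like Perl's s/// without /g.
    return re.sub(pattern, replacement, text, count=1)

h = {"a": "7"}
print(substitute_from_map("hal", r"(a)", h))  # key exists: prints h7l
print(substitute_from_map("hal", r"(h)", h))  # no key "h": prints hal
```

Putting the capture back when the key is missing leaves the input unchanged, which is usually easier to spot than a silently inserted empty string.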

How to order regular expression alternatives to get longest match?

I have a number of regular expressions regex1, regex2, ..., regexN combined into a single regex as regex1|regex2|...|regexN. I would like to reorder the component expressions so that the combined expression gives the longest possible match at the beginning of a given string.
I believe this means reordering the regular expressions such that "if regexK matches a prefix of regexL, then L < K". If this is correct, is it possible to find out, in general, whether regexK can match a prefix of regexL?
Use the right regex flavor!
In some regex flavors, the alternation providing the longest match is the one that is used ("greedy alternation"). Note that most of these regex flavors are old (yet still used today), and thus lack some modern constructs such as back references.
Perl6 is modern (and has many features), yet defaults to the POSIX-style longest alternation. (You can even switch styles, as || creates an alternator that short-circuits to first match.) Note that the :Perl5/:P5 modifier is needed in order to use the "traditional" regex style.
Also, PCRE and the newer PCRE2 have functions that do the same. In PCRE2, it's pcre2_dfa_match. (See the section "Relevant information about regex engine design" below for more information about DFAs.)
This means, you can have ANY order of statements in a pipe and the result will always be the longest.
(This is different from the "absolute longest" match, as no amount of rearranging the terms in an alternation will change the fact that all regex engines traverse the string left-to-right. With the exception of .NET, apparently, which can go right-to-left. But traversing the string backwards wouldn't guarantee the "absolute longest" match either.) If you really want to find matches at (only) the beginning of a string, you should anchor the expression: ^(regex1|regex2|...).
According to this page*:
The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely.
* Note: I do not have the ability to test every POSIX flavor. Also, some regex flavors (Perl6) have this behavior without being POSIX compliant overall.
Let me give you one specific example that I have verified on my own computer:
echo "ab c a" | sed -E 's/(a|ab)/replacement/'
The regex is (a|ab). When it runs on the string ab c a you get replacement c a, meaning that you do, in fact, get the longest match that the alternation can provide.
This regex, for a more complex example, (a|ab.*c|.{0,2}c*d) applied to abcccd, will return abcccd.
More clarification: the regex engine will not go forward (in the search string) to see if there is an even longer match once it can match something. It will only look through the current list of alternatives to see if another one will match a longer string (from the position where the initial match starts).
In other words, no matter the order of choices in an alternation, POSIX-compliant regexes use the one that matches the most characters.
Other examples of flavors with this behavior:
Tcl ARE
POSIX ERE
GNU BRE
GNU ERE
Relevant information about regex engine design
This question asks about designing an engine, but the answers may be helpful to understand how these engines work. Essentially, DFA-based algorithms determine the common overlap of different expressions, especially those within an alternation. It might be worth checking out this page, which explains how alternatives can be combined into a single path.
Note: at some point, you might just want to consider using an actual programming language. Regexes aren't everything.
Longest Match
Unfortunately, there is no distinct logic to tell a regular expression
engine to get the longest match possible.
Doing so would/could create a cascading backtracking episode gone wild.
It is, by definition a complexity too great to deal with.
All regular expressions are processed from left to right.
Anything the engine can match first it will, then bail out.
This is especially true of alternations, where this|this is|this is here
will always match 'this' first and
will NEVER ever match this is nor this is here.
Once you realize that, you can reorder the alternation into
this is here|this is|this which gives the longest match every time.
Of course this can be reduced to this(?: is(?: here)?)?
which is the clever way of getting the longest match.
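A quick way to see the ordering effect is to try the three forms in any backtracking engine. The sketch below uses Python's re module as a stand-in for Perl (both take the first alternative that lets the match succeed). The factored pattern this(?: is(?: here)?)? is written so that it accepts exactly the three alternatives.

```python
import re

text = "this is here"

# Shortest alternative listed first: the engine stops at the first one that matches.
first_wins = re.match(r"this|this is|this is here", text).group()

# Longest alternative listed first: the same engine now returns the longest match.
longest_first = re.match(r"this is here|this is|this", text).group()

# Factored form: greedy optional groups extend the match as far as possible.
factored = re.match(r"this(?: is(?: here)?)?", text).group()

print(first_wins)     # this
print(longest_first)  # this is here
print(factored)       # this is here
```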
Haven't seen any examples of the regexes you want to combine,
so this is just some general information.
If you show the regexes you're trying to combine, a better solution could be
provided.
Alternation contents do affect each other, and whatever precedes or
follows the cluster can have an effect on which alternation gets matched.
If you have more questions just ask.
Addendum:
For @Laurel. This could always be done with a Perl 5 regex (>5.10)
because Perl can run code from within regex sub-expressions.
Since it can run code, it can count and get the longest match.
The rule of leftmost first, however, will never change.
If regex were thermodynamics, this would be the first law.
Perl is a strange entity as it tries to create a synergy between regex
and code execution.
As a result, it is possible to overload its operators, to inject
customization into the language itself.
Its regex engine is no different, and can be customized the same way.
So, in theory, the regex below can be made into a regex construct,
a new Alternation construct.
I won't go into details here, but suffice it to say, it's not for the faint of heart.
If you're interested in this type of thing, see the perlre manpage under
section 'Creating Custom RE Engines'
Perl:
Note - The regex alternation form is based on @Laurel's complex example
(a|ab.*c|.{0,2}c*d) applied to abcccd.
Visually, if made into a custom regex construct, would look similar to
an alternation (?:rx1||rx2||rx3) and I'm guessing this is how a lot of
Perl6 is done in terms of integrating regex engine directly into the language.
Also, if used as is, it's possible to construct this regex dynamically as needed.
And note that all the richness of Perl regex constructs are available.
Output
Longest Match Found: abcccd
Code
use strict;
use warnings;
my ($p1,$p2,$p3) = (0,0,0);
my $targ = 'abcccd';
# Formatted using RegexFormat7 (www.regexformat.com)
if ( $targ =~
/
# The Alternation Construct
(?=
( a ) # (1)
(?{ $p1 = length($^N) })
)?
(?=
( ab .* c ) # (2)
(?{ $p2 = length($^N) })
)?
(?=
( .{0,2} c*d ) # (3)
(?{ $p3 = length($^N) })
)?
# Check At Least 1 Match
(?(1)
(?(2)
(?(3)
| (?!)
)
)
)
# Consume Longest Alternation Match
( # (4 start)
(?(?{
$p1>=$p2 && $p1>=$p3
})
\1
| (?(?{
$p2>=$p1 && $p2>=$p3
})
\2
| (?(?{
$p3>=$p1 && $p3>=$p2
})
\3
)
)
)
) # (4 end)
/x ) {
print "Longest Match Found: $4\n";
} else {
print "Did not find a match!\n";
}
Certainly a human might be able to judge, in some cases, whether one of two given regexps matches a prefix of the other. In general, as far as I know, this is an NP-complete problem. So don't try.
In the best case, combining the different regexps into a single one will give a suitable result cheaply. However, I'm not aware of any algorithm that can take two arbitrary regexps and combine them in such a way that the resulting regexp still matches what either of the two would match. That would be NP-complete as well.
You must also not rely on the ordering of alternatives. This depends on the internal execution logic of the regexp engine, which could easily reorder the alternatives internally, beyond your control. So an ordering that is valid with the current engine might give wrong results with a different engine. (It could help as long as you stay with a single regexp engine implementation.)
The best approach seems to me to simply execute all the regexps, keep track of the matched length, and then take the longest match.
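A minimal sketch of that "run them all and keep the longest" approach, written in Python for illustration. The function name longest_prefix_match is made up, and the sample patterns are the ones used in the other answers. Note that each pattern contributes the match its own engine prefers, which is not necessarily the longest string that pattern could match (lazy quantifiers, for instance).

```python
import re

def longest_prefix_match(patterns, text):
    """Try every pattern at the start of text and return the longest matched string, or None."""
    best = None
    for pattern in patterns:
        match = re.match(pattern, text)  # re.match only matches at the start of text
        if match and (best is None or len(match.group()) > len(best)):
            best = match.group()
    return best

print(longest_prefix_match([r"a", r"ab.*c", r".{0,2}c*d"], "abcccd"))  # abcccd
print(longest_prefix_match([r"Set", r"SetValue"], "SetValue"))        # SetValue
```

Because every pattern is tried independently, the order of the list does not affect which string is returned.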

greedy operator in regular expression is not working in Tcl 8.5

See this simple regexp code:
puts [ regexp -inline {^\-\-\S+?=\S+} "--tox=9.0" ]
The output is:
>--tox=9
It would seem that the second \S+ is being non-greedy! Only 1 character is being matched.
In Perl, one can see that the result is as I expected; see the one-line output:
perl -e '"--tox=9.0" =~/(^\-\-\S+?=\S+)/ ; print "${1}\n"'
--tox=9.0
How can I get the Perl behaviour in Tcl?
This is an inherent 'feature' of Tcl's regexp implementation. For instance, the below is from Henry Spencer (the one who did most if not all of Tcl's regexp work I believe)
It is very difficult to come up with an entirely satisfactory
definition of the behavior of mixed-greediness regular expressions.
Perl doesn't try: the Perl "specification" is a description of the
implementation, an inherently low-performance approach involving
trying one match at a time. This is unsatisfactory for a number of
reasons, not least being that it takes several pages of text merely to
describe it. (That implementation and its description are distant,
mutated descendants of one of my earlier regexp packages, so I share
some of the blame for this.)
When all quantifiers are greedy, the Tcl 8.2 regexp matches the
longest possible match (as specified in the POSIX standard's
regular-expression definition). When all are non-greedy, it matches
the shortest possible match. Neither of these desirable statements is
true of Perl.
The trouble is that it is very, very hard to write a generalization of
those statements which covers mixed-greediness regular expressions --
a proper, implementation-independent definition of what
mixed-greediness regular expressions should match -- and makes them
do "what people expect". I've tried. I'm still trying. No luck so
far.
The rules in the Tcl 8.2 regexp, which basically give the whole regexp
a long/short preference based on its subexpressions, are the best I've
come up with so far. The code implements them accurately. I agree
that they fall short of what's really wanted. It's trickier than it
looks.
Basically, expressions with mixed greedy and non-greedy quantifiers impact both the simplicity of the implementation and its performance. So the implementation makes it so that the first 'type' of quantifier is passed on to all other quantifiers.
In other words, if the first quantifier is greedy, all the others will be greedy. If the first is non-greedy, all the others will be non-greedy. And therefore, you cannot force a Tcl regexp to work like a Perl regexp (or maybe you can through exec and using the bash command version of perl, but I'm not familiar with this).
I would advise using negated classes and/or anchors instead of non-greedy.
Since I don't know the exact context of your question, I won't provide an alternative regexp, because that will depend on whether this is really the whole string you are trying to make a match on.
The Tcl regular expression engine is an automata-theoretic one instead of a stack-based one, so it has a very different approach to matching mixed greediness REs. In particular, for the sort of RE you're talking about, that will be interpreted as entirely non-greedy.
The simplest method of fixing this is to use a different RE. Remembering that \S is just a shorthand for [^\s], we can do this (excluding = from the first part):
puts [ regexp -inline {^--[^\s=]+=\S+} "--tox=9.0" ]
(I also changed \- to - as it's not a special character in Tcl's REs.)
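As a sanity check that the rewritten RE asks for the same thing the Perl one-liner did, both patterns can be compared in a Perl-style backtracking engine. The sketch below uses Python's re module for that (purely for convenience, it says nothing about Tcl's own engine), and the sample strings are made up.

```python
import re

LAZY = re.compile(r"^--\S+?=\S+")        # original pattern, relies on a lazy quantifier
NEGATED = re.compile(r"^--[^\s=]+=\S+")  # rewrite using a negated class instead

def matched_text(regex, text):
    """Return the matched substring, or None if the regex does not match."""
    m = regex.search(text)
    return m.group() if m else None

for sample in ["--tox=9.0", "--a=b=c", "--novalue", "--tox=9.0 extra"]:
    # On these samples both patterns give the same result.
    print(sample, "|", matched_text(LAZY, sample), "|", matched_text(NEGATED, sample))
```

The two patterns are not identical in every corner case (an option name that itself starts with = is accepted by the lazy form only), so the check is worth repeating on real inputs.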
The answer can be found here:
Unfortunately, the answer is that to get the same answer Perl gives,
you have to use Perl's exact regexp implementation.
In your case, I'd use both anchors, ^ and $:
puts [ regexp -inline {^\-\-\S+?=\S+$} "--tox=9.0" ]
The result is: --tox=9.0

boost::regex - \bb?

I have some badly commented legacy code here that makes use of boost::regex::perl. I was wondering about one particular construct before, but since the code worked (more or less), I was loath to touch it.
Now I have to touch it, for technical reasons (more precisely, current versions of Boost no longer accepting the construct), so I have to figure out what it does - or rather, was intended to do.
The relevant part of the regex:
(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))
The piece that gives me headaches is \bb. I know of \b, but I could not find mention of \bb, and looking for a literal 'b' would not make sense here. Is \bb some special underdocumented feature, or do I have to consider this a typo?
As Boost seems to be a regex engine for C++, and one of the compatibility modes is Perl compatibility, if that is a "perl-compatible" expression, then the second 'b' can only be a literal.
It's a valid expression, pretty much a special case for words beginning with 'b'.
It seems to be the deciding factor that this is a C++ library, and that it's meant to give environments that aren't Perl Perl-compatible regexes. Thus my original thought that Perl might interpret the expression (say with overload::constant) is invalid. Yet it is still worth mentioning just for clarification purposes, regardless of how inadvisable it would be to tweak an expression meaning "word beginning with 'b'".
The only caveat to that idea is that perhaps Boost out-performs Perl at its own expressions and somebody would be using the Boost engine in a Perl environment; then all bets are off as to whether that could have been meant as a special expression. This is just one stab: given a grammar where '!!!' meant something special at the beginning of words, you could piggyback on the established meaning like this (NOT RECOMMENDED!):
s/\\bb\b/(?:!!!(\\p{Alpha})|\\bb)/
This would be something dumb to do, but as we are dealing with code that seems unfit for its task, there are thousands of ways to fail at a task.
(\bb\s|\bb|^[a-z]\s|^[a-z]) matches a b if it's not preceded by another word character, or any lowercase letter if it's at the beginning of the string. In either case, the letter may be followed by a whitespace character. (It could match uppercase letters too if case-insensitive mode is set, and the ^ could also match the beginning of a line if multiline mode is set.)
But inside a lookbehind, that shouldn't even have compiled. In some flavors, a lookbehind can contain multiple alternatives with different, fixed lengths, but the alternation has to be at the top level in the lookbehind. That is, (?<=abc|xy|12345) will work, but (?<=(abc|xy|12345)) won't. So your regex wouldn't work even in those flavors, but Boost's docs just say the lookbehind expression has to be fixed-length.
If you really need to account for all four of the possibilities matched by that regex, I suggest you split the lookbehind into two:
(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)
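To see what the split lookbehinds accept and reject, here is a small sketch using Python's re module, which also insists on fixed-width lookbehinds. The trailing X is a made-up stand-in for whatever the rest of the original regex matches, and Boost may differ in details.

```python
import re

# First lookbehind: no standalone-leading "b" or line-initial letter directly before X.
# Second lookbehind: the same two cases, followed by one whitespace character.
SPLIT = re.compile(r"(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)X")

def allowed(text):
    """True if an X in text passes both negative lookbehinds."""
    return SPLIT.search(text) is not None

for sample in ["bX", "b X", "aX", "a X", "abX", "ab X", "the bX", "the b X"]:
    print(repr(sample), allowed(sample))
```

Only abX and ab X pass here: in every other sample the X is preceded by a word-initial b or a string-initial lowercase letter, optionally followed by a space.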

How do I match a pattern with optional surrounding quotes?

How would one write a regex that matches a pattern that can contain quotes, but if it does, must have matching quotes at the beginning and end?
"?(pattern)"?
Will not work because it will allow patterns that begin with a quote but don't end with one.
"(pattern)"|(pattern)
Will work, but is repetitive. Is there a better way to do that without repeating the pattern?
You can get a solution without repeating by making use of backreferences and conditionals:
/^(")?(pattern)(?(1)\1|)$/
Matches:
pattern
"pattern"
Doesn't match:
"pattern
pattern"
This pattern is somewhat complex, however. It first looks for an optional quote, and puts it into backreference 1 if one is found. Then it searches for your pattern. Then it uses conditional syntax to say "if backreference 1 is found again, match it, otherwise match nothing". The whole pattern is anchored (which means that it needs to appear by itself on a line) so that unmatched quotes won't be captured (otherwise the pattern in pattern" would match).
Note that support for conditionals varies by engine and the more verbose but repetitive expressions will be more widely supported (and likely easier to understand).
Update: A much simpler version of this regex would be /^(")?(pattern)\1$/, which does not need a conditional. When I was testing this initially, the tester I was using gave me a false negative, which led me to discount it (oops!).
I'll leave the solution with the conditional up for posterity and interest, but this is a simpler version that is more likely to work in a wider variety of engines (backreferences are the only feature being used here which might be unsupported).
This is quite simple as well: (".+"|.+). Make sure the first match is with quotes and the second without.
Depending on the language you're using, you should be able to use backreferences. Something like this, say:
(["'])(pattern)\1|^(pattern)$
That way, you're requiring that either there are no quotes, or that the SAME quote is used on both ends.
This should work with recursive regex (which needs longer to get right). In the meantime: in Perl, you can build a self-modifying regex. I'll leave that as an academic example ;-)
my @stuff = ( '"pattern"', 'pattern', 'pattern"', '"pattern' );
foreach (@stuff) {
print "$_ OK\n" if /^
(")?
\w+
(??{defined $1 ? '"' : ''})
$
/x
}
Result:
"pattern" OK
pattern OK
Generally, @Daniel Vandersluis's response would work. However, some compilers do not recognize the optional group (") if it is empty, and therefore they do not detect the backreference \1.
In order to avoid this problem a more robust solution would be:
/^("|)(pattern)\1$/
Then the compiler will always detect the first group. This expression can also be modified if there is some prefix in the expression and you want to capture it first:
/^(key)=("|)(value)\2$/
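The conditional form and the empty-alternative form from this thread can be compared side by side. The sketch below does that with Python's re module, which supports both constructs. The literal word pattern stands in for the real sub-pattern, and the helper name check is made up.

```python
import re

CONDITIONAL = re.compile(r'^(")?(pattern)(?(1)\1|)$')  # close the quote only if one was opened
EMPTY_ALT = re.compile(r'^("|)(pattern)\1$')           # group 1 always participates, maybe empty
KEY_VALUE = re.compile(r'^(key)=("|)(value)\2$')       # same trick with a captured prefix

def check(regex, candidates):
    """Map each candidate string to True/False depending on whether the regex matches it."""
    return {text: regex.match(text) is not None for text in candidates}

candidates = ['pattern', '"pattern"', '"pattern', 'pattern"']
print(check(CONDITIONAL, candidates))
print(check(EMPTY_ALT, candidates))
print(check(KEY_VALUE, ['key=value', 'key="value"', 'key="value']))
```

Both quote-handling patterns accept pattern and "pattern" and reject the two unbalanced inputs. The empty-alternative form avoids relying on how an engine treats a backreference to a group that did not take part in the match.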