What escapes are excluded when Perl regex interpolation is off? - regex

I'm curious as to what escape sequences get excluded from being matched in a Perl regular expression when interpolation is turned off, say by using an apostrophe (single-quote) as a delimiter for m'', and also why. The description of interpolation in perlop mentions that:
No interpolation is performed at this stage. Any backslashed sequences including \\ are treated at the stage to parsing regular expressions.
However, testing the escape sequences found in perlre shows that not all escape sequences are treated the same.
So, I've tested all the simple escapes listed in the "Escape Sequences" section of the perlre, and found that some are "off" while some are "on". There appears to be a correspondence between the "on" and "off" escapes and the "character escapes" and "escape modifiers" descriptions in perlrebackslash, respectively. I haven't tested all the possible escapes listed on that page, just the ones from those two groups, thus far.
Even if I test all the possible escapes, I'm not sure I understand why some still work when interpolation is off, while others do not. Can anyone enlighten me?
update: As @tchrist suggested, here are some examples. I essentially used variations on the following shell code to test these against some user input from STDIN:
perl -e "use 5.012; while(<>) { say 'YES' if m'\t';}"
The escapes \e, \f, \n, \r, and \t, when used in a non-interpolated matching construct, such as m'\t' (etc.) will still match the special characters they escape instead of their literal string representations. This is the same matching behavior I see when I use an interpolated form of matching (e.g. m/\t/), which is what I meant by still "working".
On the other hand, modifiers like \L, \U, \l, and \u do not function the same inside of m'' as inside of m//. For example m'\uthis' does not match the input: "This is a string," while m/\uthis/ does match such an input. The first form will match the input: "\uthis is a string."
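The behaviour described above can be reproduced without typing input by hand. The following is only a minimal sketch: the sample strings and variable names are made up for illustration, and the `no warnings 'regexp'` line is there on the assumption that the regex compiler would otherwise warn about the `\u` it does not recognize.

```perl
use strict;
use warnings;
no warnings 'regexp';   # assumption: silences the warning about the unrecognized \u

# Hypothetical sample inputs, not taken from the original post.
my $tabbed = "col1\tcol2";
my $title  = "This is a string";

# \t is a character escape understood by the regex engine itself,
# so it behaves the same with either delimiter.
my $tab_single = $tabbed =~ m'\t' ? 1 : 0;
my $tab_slash  = $tabbed =~ m/\t/ ? 1 : 0;

# \u is applied during the double-quote-like pass, which m'' skips.
my $u_single = $title =~ m'\uthis' ? 1 : 0;
my $u_slash  = $title =~ m/\uthis/ ? 1 : 0;

printf "%-12s %d\n", q{m'\t'},     $tab_single;
printf "%-12s %d\n", q{m/\t/},     $tab_slash;
printf "%-12s %d\n", q{m'\uthis'}, $u_single;
printf "%-12s %d\n", q{m/\uthis/}, $u_slash;
```

Only the last two lines should differ between the two delimiters, which matches the "on"/"off" split observed in the question.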

It's the difference between a single-quoted string and a double-quoted string; those rules are separate from regex patterns,
so m'$foo' is like '$foo' and not like "$foo".
use Data::Dump;
$foo = 12;
dd qr/$foo/i;
dd qr'$foo'i;
__END__
qr/12/i
qr/$foo/i
so if using interpolation, you're matching 12
and if you've disabled interpolation, you're matching $, the end of line (or string) followed by foo
More on this in http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators
update: on a side note, in addition to Data::Dump, both Data::Dumper and Data::Dump::Streamer "dump" qr'$foo'i erroneously as qr/$foo/i
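Since the dumper modules can be misleading here, matching against the compiled patterns is a more direct check. This is a small sketch under the same setup as the answer ($foo set to 12); the test strings are invented for illustration.

```perl
use strict;
use warnings;

my $foo = 12;

my $interpolated = qr/$foo/;   # pattern text is: 12
my $literal      = qr'$foo';   # pattern text is: $foo (end anchor, then "foo")

# The interpolated pattern finds the digits.
my $digits_found = "abc12" =~ $interpolated ? 1 : 0;

# The non-interpolated pattern does not look for 12 at all ...
my $digits_via_literal = "abc12" =~ $literal ? 1 : 0;

# ... and it does not match the text '$foo' either, because $ is an anchor
# and nothing can follow the end of the string.
my $dollar_text = '$foo' =~ $literal ? 1 : 0;

print "qr/\$foo/ on abc12: $digits_found\n";
print "qr'\$foo' on abc12: $digits_via_literal\n";
print "qr'\$foo' on \$foo:  $dollar_text\n";
```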

Related

Perl search and replace using variables, string contains dot

I’m using a variable to search and replace a string using Perl.
I want to replace the string 23.0 with 23.0.1, so I tried this:
my $old="23.0";
my $new="23.0.1";
$_ =~ s/$old/$new/g;
The problem is that it also replaced the string 2310, so I tried:
my $old="23\.0"
and also /ee.
But can’t get the correct syntax for it to work. Can someone show me the correct syntax?
There are two things that will help you here:
The quotemeta function, which escapes metacharacters, and the \Q and \E escapes, which apply the same quoting to the (interpolated) text between them.
print quotemeta "21.0";
Or:
my $old="23.0";
my $new="23.0.1";
my $str = "2310";
$str =~ s/\Q$old\E/$new/g;
print $str;
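To see the difference side by side, here is a short sketch that runs the unquoted and the quoted substitution on the same text. The sample text and the variable names $text, $unquoted and $quoted are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $old  = "23.0";
my $new  = "23.0.1";
my $text = "2310 and 23.0";   # hypothetical input containing both strings

# Without quoting, the dot matches any character, so "2310" is hit as well.
(my $unquoted = $text) =~ s/$old/$new/g;

# With \Q...\E the dot is escaped and only the literal "23.0" is replaced.
(my $quoted = $text) =~ s/\Q$old\E/$new/g;

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
print "quotemeta gives: ", quotemeta($old), "\n";
```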
Just use single quotes and escape the dot.
my $old='23\.0';
To complement Sobrique's excellent answer, let me note that the reason your attempt with "23\.0" didn't work is that "23\.0" and "23.0" evaluate to the same string: in a double-quoted string literal, the backslash escape sequence \. simply evaluates to ..
There are several things you could do to avoid this:
If you indeed want to match a fixed string, and don't need or want to include any special regexp metacharacters in it, you can do as Sobrique suggests and use quotemeta or \Q to escape them.
In particular, this is almost always the correct solution if the string to be matched comes from user input. If you do want to allow some limited set of non-literal metacharacters, you can unescape those after running the pattern through quotemeta. For a simple example, here's a quick-and-dirty way to turn a basic glob-like pattern (using the metacharacters ? and * for "any character" and "any string of characters" respectively) into an equivalent regexp:
my $regexp = "^\Q$glob\E\$"; # quote and anchor the pattern
$regexp =~ s/\\\?/./g; # replace "?" (escaped to "\?" by \Q) with "."
$regexp =~ s/\\\*/.*/g; # replace "*" (escaped to "\*" by \Q) with ".*"
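The three lines above can be wrapped into a helper and tried on a few names. This is only a sketch: the function name glob_to_regexp and the sample file names are made up, and, like the original snippet, it makes no attempt to handle literal backslashes in the glob.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the quick-and-dirty conversion shown above.
sub glob_to_regexp {
    my ($glob) = @_;
    my $regexp = "^\Q$glob\E\$";   # quote everything and anchor the pattern
    $regexp =~ s/\\\?/./g;         # "\?" (from \Q) becomes "."
    $regexp =~ s/\\\*/.*/g;        # "\*" (from \Q) becomes ".*"
    return qr/$regexp/;
}

my $txt_files  = glob_to_regexp('*.txt');
my $single_log = glob_to_regexp('file?.log');

for my $name ('notes.txt', 'notes_txt', 'file1.log', 'file12.log') {
    printf "%-12s txt:%d log:%d\n", $name,
        ($name =~ $txt_files  ? 1 : 0),
        ($name =~ $single_log ? 1 : 0);
}
```

Note that the dot in '*.txt' stays literal, because only the two glob metacharacters are unescaped after quoting.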
Conversely, if you want to have a literal regexp pattern in your code, without immediately matching it against something, you can use the qr// regexp-like quote operator, like this:
my $old = qr/\b23\.0(\.0)?\b/; # match 23.0 or 23.0.0 (but not 123.012!)
my $new = "23.0.1"; # just a literal string
s/$old/$new/g; # replace any string matching $old in $_ with $new
Note that qr// has other effects beyond just allowing you to use regexp syntax in a string literal: it actually pre-compiles the pattern into a special Regexp object, so that it doesn't need to be recompiled every time it's used later. In particular, as a side effect, the string representation of a qr// regexp literal will usually not exactly match the original content, although it will be equivalent as a regexp. For example, say qr/\b23\.0(\.0)?\b/ will, on my Perl version, output (?^u:\b23\.0(\.0)?\b).
You could also just use a normal double-quoted string literal, and double any backslashes in it, but that's (usually) less efficient than using qr//, and also less readable due to leaning toothpick syndrome.
Using a single-quoted string literal would be slightly better, since backslashes in a single-quoted string are only special when followed by another backslash or a single quote. Even so, readability can still suffer if you happen to need to match any literal backslashes in your regexp, not to mention that it's easy to create subtle bugs if you forget to double a backslash in those rare places where it's still needed.
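The differences between the three ways of writing the pattern can be checked directly. A minimal sketch, with variable names chosen for the illustration:

```perl
use strict;
use warnings;

my $double = "23\.0";     # the backslash is consumed here: the string is 23.0
my $single = '23\.0';     # the backslash survives: the string is 23\.0
my $object = qr/23\.0/;   # a compiled pattern with an escaped dot

print "double-quoted string: $double\n";
print "single-quoted string: $single\n";

# "2310" tells the variants apart: only the unescaped dot matches the "1".
my %hits_2310 = (
    double => ("2310" =~ /$double/ ? 1 : 0),
    single => ("2310" =~ /$single/ ? 1 : 0),
    object => ("2310" =~ $object   ? 1 : 0),
);
print "$_ matches 2310: $hits_2310{$_}\n" for sort keys %hits_2310;
```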

Why does Perl complain about unmatched bracket within a \Q..\E regex section?

I have a regex in a variable, that includes a substring inside \Q...\E containing an opening bracket. I'm expecting that [ would be interpreted like a vanilla character by the parser since it's inside a \Q...\E section.
This seems to be the case when the regex comes as a literal in the program, but the parser fails on it when it comes in a variable.
Here's a simplified example.
This works:
$r = qr/\Qa[b\E\d+/;
if ("a[b1" =~ $r) { print "match\n"; }
This fails:
$v='\Qa[b\E\d+';
$r=qr/$v/;
It dies at line 2 with
Unmatched [ in regex; marked by <-- HERE in m/\Qa[ <-- HERE b\E\d+/
Why would Perl reject this? And only when interpolated from a variable and not inline with the same regex?
I can't see anything explaining it in the FAQ's How do I match a regular expression that's in a variable? or perlop's Regexp Quote-Like Operators.
This is with Perl 5.14.2 (Ubuntu 12.04) if the version matters, with default settings.
\Q has nothing to do with regular expressions. When the regex engine sees \Q, it doesn't recognize it, spits out a warning, and treats it like a plain Q.
>perl -we"$re='\Qa'; qr/$re/
Unrecognized escape \Q passed through in regex; marked by <-- HERE in m/\Q <-- HERE a/ at -e line 1.
Like interpolation, \Q is recognized by double-quoted string literals and similar. Like interpolation, it's gotta be part of a literal (Perl code) to work.
>perl -E"$pat=q{\Q!}; say qr/$pat/"
(?^u:\Q!)
>perl -E"$pat=qq{\Q!}; say qr/$pat/"
(?^u:\!)
>perl -E"$x='!'; $pat=q{$x}; say qr/$pat/"
(?^u:$x)
>perl -E"$x='!'; $pat=qq{$x}; say qr/$pat/"
(?^u:!)
Solutions:
$v="\Qa[b\E\\d+";
$v=qr/\Qa[b\E\d+/;
$v=quotemeta('a[b').'\d+';
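As a quick check, the three solutions can be run against the string from the question, next to the failing single-quoted version. This is a sketch only; the hash and variable names are invented, and `no warnings 'regexp'` is used on the assumption that the failing compile may also warn about \Q.

```perl
use strict;
use warnings;

# The three suggested ways of building the pattern.
my %pattern_for = (
    double_quoted => "\Qa[b\E\\d+",
    qr_literal    => qr/\Qa[b\E\d+/,
    quotemeta     => quotemeta('a[b') . '\d+',
);

my %matched;
for my $name (sort keys %pattern_for) {
    my $pattern = $pattern_for{$name};
    $matched{$name} = 'a[b1' =~ /$pattern/ ? 1 : 0;
    print "$name: pattern $pattern, matches a[b1: $matched{$name}\n";
}

# The single-quoted version from the question: \Q reaches the regex
# compiler unprocessed, so the [ is still a metacharacter.
my $raw   = '\Qa[b\E\d+';
my $error = '';
{
    no warnings 'regexp';
    eval { my $re = qr/$raw/; 1 } or $error = $@;
}
print "single-quoted version failed with: $error";
```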
A Perl regular expression is first evaluated as if it was a simple double-quoted string. Any embedded variables are interpolated, and escape sequences that don't originate from interpolated variables are processed. This is the point when special operators like \L, \U and \Q...\E are acted on.
The processing stops there in double-quoted strings, but in regular expressions the string is then compiled.
In your example you have
$v = '\Qa[b\E\d+';
and because you have used single quotes, this string isn't changed at all.
You then interpolate it into a regular expression with
$r = qr/$v/;
but, because escape sequences inside interpolated variables are untouched, the string is passed as it is to the regex compiler, which reports that the expression is invalid because it contains an unmatched, unescaped open bracket. If you remove that bracket you still get an error; this time it is Unrecognized escape \Q passed through in regex, showing that the \Q...\E hasn't been processed and appears as literals.
What would work is to change your assignment to $v to use double quotes instead, like this
my $v = "\Qa[b\E\\d+";
The backslash on \d has to be doubled up, otherwise it would just vanish. Now the \Q...\E has been acted on, and $v is equal to a\[b\d+. Compiling this as a regex works fine.
The \Q and \E metacharacters are interpreted at the time the regex is parsed. They are not part of the regular expression itself. If \Q and \E appear inside a regex literal, they tell the parser to ignore characters that normally have special meaning inside regular expressions, including brackets. If \Q and \E appear in a single-quoted string as part of a variable assignment, they are treated as literal text. When this variable is then used inside a regex, the literal values become part of the regular expression. The backslashes are interpreted as escapes, so \Q matches a literal Q, and \E matches a literal E.
To see this, try compiling the regex and then printing it:
$v=qr/\Qa[b\E\d+/;
print "$v\n";
The output is:
(?-xism:a\[b\d+)
Note that the \Q and \E are gone, and the bracket has been escaped. If you instead assign a string that contains \Q and \E inside single quotes:
$v='ab\Qcd\Eef';
$r=qr/$v/;
print "$r\n";
You get:
(?-xism:ab\Qcd\Eef)
This regex actually matches "abQcdEef":
$v='ab\Qcd\Eef';
$r=qr/$v/;
if("abQcdEef" =~ /$r/) {print "matches\n"} else {print "no match\n"}
result:
matches

What does it mean "you can’t hide the terminating delimiter of a pattern inside a regex construct" in the "Programming Perl"?

Sorry, but once again I need help to understand rather complicated snippet from the "Programming Perl" book. Here it is (what is obscure to me marked as bold):
patterns are parsed like double-quoted strings, all the normal double-quote conventions will work, including variable interpolation (unless you use single quotes as the delimiter) and special characters indicated with backslash escapes. These are applied before the string is interpreted as a regular expression (This is one of the few places in the Perl language where a string undergoes more than one pass of processing). ...
Another consequence of this two-pass parsing is that the ordinary Perl tokener finds the end of the regular expression first, just as if it were looking for the terminating delimiter of an ordinary string. Only after it has found the end of the string (and done any variable interpolation) is the pattern treated as a regular expression. Among other things, this means you can’t “hide” the terminating delimiter of a pattern inside a regex construct (such as a bracketed character class or a regex comment, which we haven’t covered yet). Perl will see the delimiter wherever it is and terminate the pattern at that point.
First, why does it say "Only after it has found the end of the string" rather than the end of the regular expression, which is what it was said to be looking for just before?
Second, what does "you can’t “hide” the terminating delimiter of a pattern inside a regex construct" mean? Why can't I hide the terminating delimiter /, when I can place it wherever I want, either in the regexp directly (/A\/C/) or in an interpolated variable (even without \):
my $s = 'A/';
my $p = 'A/C';
say $p =~ /$s/;
outputs 1.
While I was writing and re-reading my question I thought that this snippet tells about using a single-quote as a regexp delimiter, then it all seems quite cohesive. Is my assumption correct?
My appreciation.
It says "end of the string" instead of "end of the regular expression" because at that point it's treating the regex as if it were just a string.
It's trying to say that this does not work:
/foo[-/_]/
Even though normal regex metacharacters are not special inside [], Perl will see the regex as /foo[-/ and complain about an unterminated class.
It's trying to say that Perl does not parse the regex as it reads it. First it finds the end of the regex in your source code as if it were a quoted string, so the only special character is \. Then it interpolates any variables. Then it parses the result as a regular expression.
You can hide the terminating delimiter with \ because that works in ordinary strings. You can hide the delimiter inside an interpolated variable, because interpolation happens after the delimiter is found. If you use a bracketing delimiter (e.g. { } or [ ]), you can nest matching pairs of delimiters inside the regex, because q{} works like that too.
But you can't hide it inside any other regex construct.
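The cases listed above can be tried out in one go. The following is a sketch with made-up variable names; the failing pattern is compiled through a string eval so that its compile-time error can be caught instead of stopping the script, and the exact wording of that error is not assumed.

```perl
use strict;
use warnings;

my $with_slash = 'A/';

# Ways of getting the delimiter into the pattern that do work:
my $escaped      = 'A/C'  =~ /A\/C/        ? 1 : 0;   # backslash, as in a string
my $interpolated = 'A/C'  =~ /$with_slash/ ? 1 : 0;   # slash arrives via a variable
my $nested       = 'aa'   =~ m{\Aa{2}\z}   ? 1 : 0;   # nested pair of bracketing delimiters
my $other_delim  = 'foo/' =~ m{foo[-/_]}   ? 1 : 0;   # or simply pick another delimiter

# Hiding the slash inside a character class does not work: the pattern
# is cut off at the first unescaped slash, before the class is closed.
my $result        = eval q{ 'foo/' =~ /foo[-/_]/ };
my $compile_error = $@;

print "escaped: $escaped, interpolated: $interpolated, ",
      "nested: $nested, other delimiter: $other_delim\n";
print "character class attempt failed: ", ($compile_error ne '' ? "yes" : "no"), "\n";
```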
Say you want to match a *. You would use
m/\*/
But what if you used * as your delimiter? The following doesn't work:
m*\**
because it's interpreted as
m/*/
as seen in the following:
$ perl -e'm*\**'
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE / at -e line 1.
Take the string literal
"a\"b"
It produces the string
a"b
Similarly, the match operator
m*a\*b*
produces the regex pattern
a*b
If you want to match a literal *, you have to use other means. In other words:
m*a\*b* === m/a*b/ matches pattern a*b
m*a\x{2A}b* === m/a\*b/ matches pattern a\*b
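The two equivalences can be confirmed with anchored patterns, so that each one accepts exactly one of the two test strings. A small sketch; the test strings and variable names are chosen for the illustration.

```perl
use strict;
use warnings;

# With * as the delimiter, \* loses its backslash and the engine sees a*b,
# where * is a quantifier again.
my $quantifier_on_aab  = 'aab' =~ m*\Aa\*b\z* ? 1 : 0;
my $quantifier_on_star = 'a*b' =~ m*\Aa\*b\z* ? 1 : 0;

# \x{2A} is passed to the engine untouched and stands for a literal star.
my $literal_on_star = 'a*b' =~ m*\Aa\x{2A}b\z* ? 1 : 0;
my $literal_on_aab  = 'aab' =~ m*\Aa\x{2A}b\z* ? 1 : 0;

print "a\\*b     on aab: $quantifier_on_aab, on a*b: $quantifier_on_star\n";
print "a\\x{2A}b on aab: $literal_on_aab, on a*b: $literal_on_star\n";
```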

What is this regular expression trying to match in Tcl?

I am new to regular expressions and am trying to understand what kind of string the following regular expression is trying to match:
set result [regexp "$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" [split $outPut \n]]
What is the regular expression above trying to match? What is the value of result?
The complication here is that the regex specification is protected from Tcl's string interpolation rules.
To detangle, you should think along these lines:
"$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" is a double-quoted string, so the usual interpolation rules apply:
Each backslash escapes the following character;
Each $variable reference is substituted for its value;
[command ...] is substituted for the string returned by the executed command.
So each occurrence of \\ is there to produce a single '\' character in the interpolated string, and \[ are meant to prevent Tcl from interpreting those [^\n] as commands (named "^\n") to be executed.
So if we suppose that the PersonName variable contains "Joe", PersonId contains DEAD and gender contains "male", Tcl will get Joe\|[^\n]*\|[^\n]*\|\s*0xDEAD\|\s*male after performing all substitutions on the source string.
Now the resulting string is passed to the RE engine, which applies its own syntactic rules when it parses the string denoting a regex, as described in the re_syntax manual page.
According to these rules, each backslash, again, escapes the following character unless it's a special "character-entry escape" so here we have:
\s denotes "any whitespace character";
\| escapes the '|', making it lose its usual meaning (introducing an alternation) so that it literally matches the character '|'.
The [^\n]* construct means "a longest series of zero or more characters not including the newline character". Read up on "character classes" in regexes for more info.
The value of result will be the number of times the regular expression matched. In the absence of the -all option, that will always be 0 or 1 (i.e., not-found/found).
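The same two-stage process (string substitution first, regex parsing second) can be mimicked in Perl, the language used elsewhere on this page. This is a rough analogue rather than a translation of the Tcl code: the variable values are the ones supposed above, while the two sample lines and their pipe-separated layout are assumptions made purely for illustration.

```perl
use strict;
use warnings;

my ($person_name, $person_id, $gender) = ('Joe', 'DEAD', 'male');

# Stage 1: a double-quoted string. Each \\ collapses to one backslash
# and the variables are substituted.
my $pattern = "$person_name\\|[^\\n]*\\|[^\\n]*\\|\\s*0x$person_id\\|\\s*$gender";
print "pattern after substitution: $pattern\n";

# Stage 2: the resulting text is compiled as a regular expression.
my $re = qr/$pattern/;

# Hypothetical data lines; the real format of the output is not known.
my $full_line  = 'Joe|Smith|NY| 0xDEAD| male';
my $short_line = 'Joe|Smith| 0xDEAD| male';     # one field short

my $full_result  = $full_line  =~ $re ? 1 : 0;
my $short_result = $short_line =~ $re ? 1 : 0;
print "full line:  $full_result\n";
print "short line: $short_result\n";
```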
Overall, that regular expression (which @kostix's answer explains well) is really ugly though. REs are a powerful tool, but you can get very confused with them very easily. Moreover, if you're splitting the output on newlines then you don't need to try to exclude them in the RE match; there will definitely be no newlines in the result of split in that case.
If we better understood what you were trying to do, we could direct you to far more effective methods of matching (e.g., using lsearch with suitable options, loading the data into an in-memory SQLite database).

Need to test for a "\\" (backslash) in this Reg Ex

Currently I use this reg ex:
"\bI([ ]{1,2})([a-zA-Z]|\d){2,13}\b"
It was just brought to my attention that the text that I use this against could contain a "\" (backslash). How do I add this to the expression?
Add |\\ inside the group, after the \d for instance.
This expression could be simplified if you're also allowing the underscore character in the second capture register, and you are willing to use metacharacters. That changes this:
([a-zA-Z]|\d){2,13}
into this ...
([\w]{2,13})
and you can also add a test for the backslash character with this ...
([\w\x5c]{2,13})
which makes the regex just a tad easier to eyeball, depending on your personal preference.
"\bI([\x20]{1,2})([\w\x5c]{2,13})\b"
See also:
WP Metacharacter
Metacharacters
Shorthand character class
Both @slavy13 and @dreftymac give you the basic solution with pointers, but...
You can use \d inside a character class to mean a digit.
You don't need to put blank into a character class to match it (except, perhaps, for clarity, though that is debatable).
You can use [:alpha:] inside a character class to mean an alpha character, [:digit:] to mean a digit, and [:alnum:] to mean an alphanumeric (specifically not including underscore, unlike \w). Note that these character classes might mean more characters than you expect; think of accented characters and non-arabic digits, especially in Unicode.
If you want to capture the whole of the information after the space, you need the repetition inside the capturing parentheses.
Contrast the behaviour of these two one-liners:
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]){2,13}\b/'
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]{2,13})\b/'
Given the input line "I a123", the first prints "3" and the second prints "a123". Obviously, if all you wanted was the last character of the second part of the string, then the original expression is fine. However, that is unlikely to be the requirement. (Obviously, if you're only interested in the whole lot, then using '$&' gives you the matched text, but it has negative efficiency implications.)
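The same contrast can be shown in a script instead of two one-liners. A minimal sketch using the input line from the text; the variable names are made up.

```perl
use strict;
use warnings;

my $line = 'I a123';

# Repetition outside the parentheses: the group is overwritten on every
# iteration, so $2 keeps only the last character matched.
my $outside = $line =~ /\bI( {1,2})([a-zA-Z\d\\]){2,13}\b/ ? $2 : undef;

# Repetition inside the parentheses: the group captures the whole run.
my $inside = $line =~ /\bI( {1,2})([a-zA-Z\d\\]{2,13})\b/ ? $2 : undef;

print "repetition outside the group: $outside\n";
print "repetition inside the group:  $inside\n";
```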
I'd probably use this regex as it seems clearest to me:
m/\bI( {1,2})([[:alnum:]\\]{2,13})\b/
Time for the obligatory plug: read Jeff Friedl's "Mastering Regular Expressions".
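As I pointed out in my comment to slavy's post, the trailing \b becomes a problem once a backslash is allowed: a backslash is not a word character, so \b will not match after one unless a word character follows. So my suggestion is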
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?:[^\w\\]|$)/
I assumed that you wanted to capture the whole 2-13 characters, not just the first one that applies, so I adjusted my RE.
You can make the last group a lookahead if the engine supports it and you don't want to consume it. That would look like:
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?=[^\w\\]|$)/
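To illustrate why the trailing \b matters, here is a sketch comparing it with the lookahead version on an input whose token ends in a backslash. The input string is invented for the example, and the first pattern uses [\w\\] as a stand-in for the earlier \b-terminated suggestions.

```perl
use strict;
use warnings;

# Hypothetical input: the token after "I " is ab\ (it ends in a backslash).
my $input = 'I ab\\ x';

# With a trailing \b the engine has to back off to "ab", because there is
# no word boundary between a backslash and a space.
my $with_boundary = $input =~ /\bI([ ]{1,2})([\w\\]{2,13})\b/ ? $2 : undef;

# With the lookahead the whole token, backslash included, is captured.
my $with_lookahead =
    $input =~ /\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?=[^\w\\]|$)/ ? $2 : undef;

print "trailing \\b captures: $with_boundary\n";
print "lookahead captures:   $with_lookahead\n";
```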