Perl search and replace using variables, string contains dot - regex

I’m using a variable to search and replace a string using Perl.
I want to replace the string 23.0 with 23.0.1, so I tried this:
my $old="23.0";
my $new="23.0.1";
$_ =~ s/$old/$new/g;
The problem is that it also replaced the string 2310, so I tried:
my $old="23\.0"
and also /ee.
But I can't get the syntax right. Can someone show me the correct syntax?

There are two things that will help you here:
The quotemeta function, which escapes regex metacharacters, and the \Q and \E escapes, which apply the same quoting to whatever is interpolated between them.
print quotemeta "21.0";
Or:
my $old="23.0";
my $new="23.0.1";
my $str = "2310";
$str =~ s/\Q$old\E/$new/g;
print $str;
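To see the difference side by side, here is a minimal sketch. The sample text and the variable names $text, $unquoted and $quoted are made up for illustration; only $old and $new come from the question.

```perl
use strict;
use warnings;

my $old  = "23.0";
my $new  = "23.0.1";
my $text = "version 23.0 and build 2310";   # hypothetical sample input

# Unquoted: the dot in $old acts as a regex wildcard, so "2310" matches too.
(my $unquoted = $text) =~ s/$old/$new/g;

# Quoted: \Q...\E escapes the dot, so only the literal "23.0" matches.
(my $quoted = $text) =~ s/\Q$old\E/$new/g;

print "$unquoted\n";
print "$quoted\n";
```

The first line printed should show both numbers rewritten, the second should leave 2310 alone.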

Just use single quotes and escape the dot.
my $old='23\.0';

To complement Sobrique's excellent answer, let me note that the reason your attempt with "23\.0" didn't work is that "23\.0" and "23.0" evaluate to the same string: in a double-quoted string literal, the backslash escape sequence \. simply evaluates to a plain dot (.).
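A quick way to check this is to compare the two literals directly. This is only a sketch; the variable names $double and $single are illustrative.

```perl
use strict;
use warnings;

my $double = "23\.0";   # the backslash is consumed by the double-quoted literal
my $single = '23\.0';   # the backslash survives in the single-quoted literal

print length($double), "\n";   # four characters: 2 3 . 0
print length($single), "\n";   # five characters: 2 3 \ . 0
print $double eq "23.0" ? "same\n" : "different\n";
```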
There are several things you could do to avoid this:
If you indeed want to match a fixed string, and don't need or want to include any special regexp metacharacters in it, you can do as Sobrique suggests and use quotemeta or \Q to escape them.
In particular, this is almost always the correct solution if the string to be matched comes from user input. If you do want to allow some limited set of non-literal metacharacters, you can unescape those after running the pattern through quotemeta. For a simple example, here's a quick-and-dirty way to turn a basic glob-like pattern (using the metacharacters ? and * for "any character" and "any string of characters" respectively) into an equivalent regexp:
my $regexp = "^\Q$glob\E\$"; # quote and anchor the pattern
$regexp =~ s/\\\?/./g; # replace "?" (escaped to "\?" by \Q) with "."
$regexp =~ s/\\\*/.*/g; # replace "*" (escaped to "\*" by \Q) with ".*"
Conversely, if you want to have a literal regexp pattern in your code, without immediately matching it against something, you can use the qr// regexp-like quote operator, like this:
my $old = qr/\b23\.0(\.0)?\b/; # match 23.0 or 23.0.0 (but not 123.012!)
my $new = "23.0.1"; # just a literal string
s/$old/$new/g; # replace any string matching $old in $_ with $new
Note that qr// has other effects beyond just allowing you to use regexp syntax in a string literal: it actually pre-compiles the pattern into a special Regexp object, so that it doesn't need to be recompiled every time it's used later. In particular, as a side effect, the string representation of a qr// regexp literal will usually not exactly match the original content, although it will be equivalent as a regexp. For example, say qr/\b23\.0(\.0)?\b/ will, on my Perl version, output (?^u:\b23\.0(\.0)?\b).
You could also just use a normal double-quoted string literal, and double any backslashes in it, but that's (usually) less efficient than using qr//, and also less readable due to leaning toothpick syndrome.
Using a single-quoted string literal would be slightly better, since backslashes in a single-quoted string are only special when followed by another backslash or a single quote. Even so, readability can still suffer if you happen to need to match any literal backslashes in your regexp, not to mention that it's easy to create subtle bugs if you forget to double a backslash in those rare places where it's still needed.
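As a rough usage sketch of the glob conversion shown earlier in this answer, the three lines can be wrapped in a small helper. The name glob_to_regex, the sample pattern and the file names are all hypothetical, and the helper inherits the quick-and-dirty limitations of the original (for example, it does not handle globs containing literal backslashes).

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the three lines from the answer.
sub glob_to_regex {
    my ($glob) = @_;
    my $regexp = "^\Q$glob\E\$";   # quote and anchor the pattern
    $regexp =~ s/\\\?/./g;         # turn the escaped "\?" into "."
    $regexp =~ s/\\\*/.*/g;        # turn the escaped "\*" into ".*"
    return qr/$regexp/;            # compile once for reuse
}

my $re      = glob_to_regex('file-?.v*.txt');   # made-up glob
my @names   = ('file-1.v23.txt', 'file-12.v1.txt', 'file-1Xv2.txt');
my @matches = grep { $_ =~ $re } @names;

print "$_\n" for @matches;
```

Only the first name should match: the dots in the glob stay literal, ? stands for exactly one character, and * for any run of characters.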

Related

Unusual substitute without delimiter

I was unable to decipher what this regex does:
$c =~ s^.*/^^g;
I don't have access to the input or the output.
Does anyone know what it does?
The default delimiter for s/// is the slash, but you can use almost any other non-whitespace character as an alternative.
So
$c =~ s^.*/^^g
is equivalent to
$c =~ s/.*\///g
Note that using the conventional delimiter requires the slash within the pattern itself to be escaped.
Some options are better than others, and in the case where you're just trying to avoid escaping slashes within the pattern, I think a pipe character | is better.
I wouldn't hope to learn too much from this programmer. As you have experienced, ^ is a poor and confusing choice. Also, the /g modifier is superfluous, and $c is a terrible choice for an identifier.
I would write
$c =~ s|.*/||
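A small sketch can confirm that the three spellings behave the same. The sample path and the variable names are made up; note that the greedy .* means everything up to the last slash is removed.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical input

(my $with_caret = $path) =~ s^.*/^^g;    # the form from the question
(my $with_slash = $path) =~ s/.*\///g;   # conventional delimiter, slash escaped
(my $with_pipe  = $path) =~ s|.*/||;     # pipe delimiter, no /g

print "$with_caret $with_slash $with_pipe\n";
```

All three variables should end up holding just the last path component.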
Here ^ is used as the delimiter.
We may use almost any printable character as a regex delimiter.
s^.*/^^g;
s/.*\///g;
Both regexes are the same.
A non-standard delimiter is mostly used to avoid the need to escape the delimiter character within the regex pattern. For example, given
$c = "this is a string with / slash";
the regex with the standard delimiter would be
$c =~ s/.*\///
          ^^
Here you have to escape the slash.
We can use whatever delimiter we want, as @simbabque mentioned in a comment.
s{foo}{bar}gs # here curly braces are the delimiters
s[some][same] # here square brackets are the delimiters
So we can pick whichever delimiter character is convenient and avoid the escaping.

How to escape a string that looks like a regular expression in Perl

I have a script that, among other things, searches a list of text files to replace a Windows path (text string) with another path.
The problem is that some of the folder names begin with a number and a dash. Perl seems to think that I am trying to invoke a regular expression here. I get the message, "Reference to nonexistent group in regex".
the string looks like this:
\\\BAGlobal\6-Engineering\3-Tech
I have quoted it like this:
my $find = "\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech"
How do I escape the 6- and 3- ?
The problem is not the dash in 6- but all the backslashes \.
It thinks that \3 and \6 are back-references to previously matched groups, like /foo(bar) foo\1/ would match the string foobar foobar.
If you use this in a pattern match you need to either include \Q and \E to add quoting, or apply the quotemeta built-in to your $find.
my $find = '\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech';
$string =~ m/\Q$find\E/;
Or with quotemeta.
my $find = quotemeta '\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech';
$string =~ m/$find/;
Also see perlre.
Note that your example code is probably wrong. The number of backslashes you have there is uneven, and double quotes "" interpolate, so each pair of backslashes \\ turn into one actual backslash in the string. But because you have 7 of them, the last one is seen as the escape for B, turning that into \B, which is not a valid escape sequence. I used single quotes '' in my code above.
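Putting this together for a search and replace, a minimal sketch might look like the following. The server name, folder names and the variables $replace, $line and $fixed are hypothetical; they are not the paths from the question.

```perl
use strict;
use warnings;

# Hypothetical UNC paths. Inside single quotes only \\ and \' are special,
# so '\\\\server' is two backslashes followed by "server".
my $find    = '\\\\server\6-Engineering\3-Tech';
my $replace = '\\\\server\Archive\3-Tech';
my $line    = 'see \\\\server\6-Engineering\3-Tech\readme.txt';

# \Q...\E stops \6 and \3 from being read as backreferences.
(my $fixed = $line) =~ s/\Q$find\E/$replace/g;

print "$fixed\n";
```

The replacement side needs no quoting here, because the contents of an interpolated variable are inserted as they are.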

Why does Perl complain about unmatched bracket within a \Q..\E regex section?

I have a regex in a variable, that includes a substring inside \Q...\E containing an opening bracket. I'm expecting that [ would be interpreted like a vanilla character by the parser since it's inside a \Q...\E section.
This seems to be the case when the regex comes as a literal in the program, but the parser fails on it when it comes in a variable.
Here's a simplified example.
This works:
$r = qr/\Qa[b\E\d+/;
if ("a[b1" =~ $r) { print "match\n"; }
This fails:
$v='\Qa[b\E\d+';
$r=qr/$v/;
It dies at line 2 with
Unmatched [ in regex; marked by <-- HERE in m/\Qa[ <-- HERE b\E\d+/
Why would Perl reject this? And only when interpolated from a variable and not inline with the same regex?
I can't see anything explaining it in the FAQ's How do I match a regular expression that's in a variable? or perlop's Regexp Quote-Like Operators.
This is with Perl 5.14.2 (Ubuntu 12.04) if the version matters, with default settings.
\Q has nothing to do with regular expressions. When the regex engine sees \Q, it doesn't recognize it, spits out a warning, and treats it like a literal Q.
>perl -we"$re='\Qa'; qr/$re/"
Unrecognized escape \Q passed through in regex; marked by <-- HERE in m/\Q <-- HERE a/ at -e line 1.
Like interpolation, \Q is recognized by double-quoted string literals and similar. Like interpolation, it's gotta be part of a literal (Perl code) to work.
>perl -E"$pat=q{\Q!}; say qr/$pat/"
(?^u:\Q!)
>perl -E"$pat=qq{\Q!}; say qr/$pat/"
(?^u:\!)
>perl -E"$x='!'; $pat=q{$x}; say qr/$pat/"
(?^u:$x)
>perl -E"$x='!'; $pat=qq{$x}; say qr/$pat/"
(?^u:!)
Solutions:
$v="\Qa[b\E\\d+";
$v=qr/\Qa[b\E\d+/;
$v=quotemeta('a[b').'\d+';
A Perl regular expression is first evaluated as if it was a simple double-quoted string. Any embedded variables are interpolated, and escape sequences that don't originate from interpolated variables are processed. This is the point when special operators like \L, \U and \Q...\E are acted on.
The processing stops there in double-quoted strings, but in regular expressions the string is then compiled.
In your example you have
$v = '\Qa[b\E\d+';
and because you have used single quotes, this string isn't changed at all.
You then interpolate it into a regular expression with
$r = qr/$v/;
but, because escape sequences inside interpolated variables are untouched, the string is passed as it is to the regex compiler, which reports that the expression is invalid because it contains an unmatched, unescaped open bracket. If you remove that bracket you still get a warning; this time it is Unrecognized escape \Q passed through in regex, showing that the \Q...\E hasn't been processed and appears as literals.
What would work is to change your assignment to $v to use double quotes instead, like this
my $v = "\Qa[b\E\\d+";
The backslash on \d has to be doubled up, otherwise it would just vanish. Now the \Q...\E has been acted on, and $v is equal to a\[b\d+. Compiling this as a regex works fine.
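The following sketch compares the working ways of building the pattern. The variable names are illustrative, and the quotemeta variant is included only as an assumption about how one might build the same text explicitly.

```perl
use strict;
use warnings;

my $from_string   = "\Qa[b\E\\d+";              # \Q handled by the string literal
my $from_function = quotemeta('a[b') . '\d+';   # same text built explicitly
my $compiled      = qr/\Qa[b\E\d+/;             # \Q handled by the regex literal

print "$from_string\n";
print "$from_function\n";
print "a[b1" =~ /$from_string/ ? "match\n" : "no match\n";
print "a[b1" =~ $compiled      ? "match\n" : "no match\n";
```

Both strings should contain the same escaped pattern text, and both regexes should match a[b1.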
The \Q and \E metacharacters are interpreted at the time the regex is parsed. They are not part of the regular expression itself. If \Q and \E appear inside a regex literal, they tell the parser to ignore characters that normally have special meaning inside regular expressions, including brackets. If \Q and \E appear in a single-quoted string as part of a variable assignment, they are treated as literal text. When this variable is then used inside a regex, the literal values become part of the regular expression. The backslashes are interpreted as escapes, so \Q matches a literal Q, and \E matches a literal E.
To see this, try compiling the regex and then printing it:
$v=qr/\Qa[b\E\d+/;
print "$v\n";
The output is:
(?-xism:a\[b\d+)
Note that the \Q and \E are gone, and the bracket has been escaped. If you instead assign a string that contains \Q and \E inside single quotes:
$v='ab\Qcd\Eef';
$r=qr/$v/;
print "$r\n";
You get:
(?-xism:ab\Qcd\Eef)
This regex actually matches "abQcdEef":
$v='ab\Qcd\Eef';
$r=qr/$v/;
if("abQcdEef" =~ /$r/) {print "matches\n"} else {print "no match\n"}
result:
matches

Substitution with \s does not work as expected

I wrote a regex to collapse runs of more than one space in a string. The code is simple:
my $string = 'A string has more than 1 space';
$string =~ s/\s+/\s/g;
But the result is something bad: 'Asstringshassmoresthans1sspace'. It replaces every space with an 's' character.
A workaround is to use ' ' instead of \s in the replacement. So the substitution becomes:
$string =~ s/\s+/ /g;
Why doesn't the regex with \s work?
\s is only a metacharacter in a regular expression (and it matches more than just a space, for example tabs, line breaks and form feed characters), not in a replacement string. Use a simple space (as you already did) if you want to replace all whitespace by a single space:
$string =~ s/\s+/ /g;
If you only want to affect actual space characters, use
$string =~ s/ {2,}/ /g;
(no need to replace single spaces with themselves).
The answer to your question is that \s is a character class, not a literal character. Just as \w represents alphanumeric characters, it cannot be used to print an alphanumeric character (except w, which it will print, but that's beside the point).
What I would do, if I wanted to preserve the type of whitespace matched, would be:
s/\s\K\s*//g
The \K (keep) escape sequence will keep the initial whitespace character from being removed, but all subsequent whitespace will be removed. If you do not care about preserving the type of whitespace, the solution already given by Tim is the way to go, i.e.:
s/\s+/ /g
\s stands for matching any whitespace. For ASCII input it is roughly equivalent to this:
[\ \t\r\n\f]
When you replace with $string =~ s/\s+/\s/g;, you are replacing one or more whitespace characters with the letter s. Here's a link for reference: http://perldoc.perl.org/perlrequick.html
Why doesn't the regex with \s work?
Your regex with \s does work. What doesn't work is your replacement string. And, of course, as others have pointed out, it shouldn't.
People get confused about the substitution operator (s/.../.../). Often I find people think of the whole operator as "a regex". But it's not, it's an operator that takes two arguments (or operands).
The first operand (between the first and second delimiters) is interpreted as a regex. The second operand (between the second and third delimiters) is interpreted as a double-quoted string (of course, the /e option changes that slightly).
So a substitution operation looks like this:
s/REGEX/REPLACEMENT STRING/
The regex recognises special characters like ^ and + and \s. The replacement string doesn't.
If people stopped misunderstanding how the substitution operator is made up, they might stop expecting regex features to work outside of regular expressions :-)
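To make the regex versus replacement distinction concrete, here is a small sketch of the variants suggested in the answers above. The input string and the variable names are made up for illustration, and \K assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $string = "A  string\t\thas   gaps";   # made-up input with runs of whitespace

(my $collapsed   = $string) =~ s/\s+/ /g;      # every whitespace run becomes one space
(my $spaces_only = $string) =~ s/ {2,}/ /g;    # only runs of plain spaces are touched
(my $kept_first  = $string) =~ s/\s\K\s*//g;   # keep the first whitespace char of each run

print "$collapsed\n";
print "$spaces_only\n";
print "$kept_first\n";
```

In each case \s appears only in the first operand, where it is regex syntax; the second operand is an ordinary string.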

Difference between Perl regular expression delimiters /.../ and #...#

Today I came across two different syntaxes for a Perl regular expression match.
# I have a date string
my $time = '2012-10-29';
# Already familiar "m//":
$time =~ m/^(\d{4}-\d\d-\d\d)$/;
# Completely new to me: m##
$time =~ m#^(\d{4}-\d\d-\d\d)#;
Now what is the difference between /expression/ and #expression#?
As everyone else said, you can use any delimiter after the m.
/ has one special feature: you can use it by itself, e.g.
$string =~ /regexp/;
is equivalent to:
$string =~ m/regexp/;
Perl allows you to use pretty much any characters to delimit strings, including regexes. This is especially useful if you need to match a pattern that contains a lot of slash characters:
$slashy =~ m/\/\//; #Bad
$slashy =~ m|//|; #Good
According to the documentation, the first of those is an example of "leaning toothpick syndrome".
Most, but not all, delimiter characters behave in the same way. There is an important exception: m?...? is a special case that only matches a single time between calls to reset().
Another exception: if single quotes are used for the delimiter, no variable interpolation is done. You still have to escape $, though, as it is a special character matching the end of the line.
Nothing except what you have to escape in the regex. You can use any pair of matched characters you like.
$string = "http://example.com/";
$string =~ m!http://!;
$string =~ m#http://#;
$string =~ m{http://};
$string =~ m/http:\/\//;
After the match or search/replace operator (the m and s, respectively) you can use any character as the delimiter, e.g. the # in your case. This also works with pairs of brackets: s{ abc (.*) def }{ DEF $1 ABC }x.
The advantage is that you don't have to escape the / (although you do have to escape whatever delimiter you chose instead, of course). It's often used for clarity, especially when dealing with things like paths or protocols.
There is no difference; the "/" and "#" characters are used as delimiters for the expression. They simply mark the "boundary" of the expression, but are not part of the expression. In theory you can use most non-alphanumeric characters as a delimiter. Here is a link to the PHP manual (it doesn't matter that it is the PHP manual, the regex syntax is largely the same; I just like it because it explains well) on Perl compatible regular expression syntax; read the part about delimiters.
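As a rough check that the delimiter does not change what is matched, the sketch below runs the same pattern under several delimiters. The URL and the @results array are illustrative.

```perl
use strict;
use warnings;

my $url = 'http://example.com/';

my @results = (
    scalar($url =~ /http:\/\//),    # bare slashes, the m is optional
    scalar($url =~ m/http:\/\//),   # slash delimiter, slashes escaped
    scalar($url =~ m!http://!),     # punctuation delimiter, nothing to escape
    scalar($url =~ m#http://#),
    scalar($url =~ m{http://}),     # bracketing pair as delimiters
);

print join(',', @results), "\n";
```

Every entry should be true, since only the amount of escaping differs between the five forms.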