Perl 6 regex not terminated - regex

I have some Perl 6 code that does the following:
if ($line ~~ /^\s*#/) { print "matches\n"; }
I'm getting this error:
===SORRY!===
Regex not terminated.
at line 2
------> <BOL>⏏<EOL>
Unable to parse regex; couldn't find final '/'
at line 2
------> <BOL>⏏<EOL>
expecting any of:
infix stopper
The line was ported from this Perl 5 code:
if ($line =~ /^\s*#/)
It used to work fine to identify lines that have an optional space and a #.
What's causing this error in Perl 6?

In Perl 6, everything from a lone¹ # to the end of the line is considered a comment, even inside regexes.
To avoid this, make it a string literal by placing it inside quotes:
if $line ~~ / ^ \s* '#' / { say "matches"; }
(Escaping with \ should also work, but Rakudo seems to have a parsing bug which makes that not work when preceded by a space.
And quoting the character as shown here is the recommended way anyway – Perl 6 specifically introduced quoted strings inside regexes and made spaces insignificant by default, in order to avoid the backslash clutter that many Perl 5 regexes suffer from.)
More generally, all non-alphanumeric characters need to be quoted or escaped inside Perl 6 regexes in order to match them literally.
This is another deliberate non-backwards-compatible change from Perl 5, where this is a bit more complicated.
In Perl 6 there is a simple rule:
alphanumeric --> matches literally only when not escaped.
(When escaped, they either have special meanings, e.g. \s etc., or are forbidden.)
non-alphanumeric --> matches literally only when escaped.
(When not escaped, they either have special meanings, e.g. ., +, #, etc., or are forbidden.)
¹ 'Lone' meaning not part of a larger token such as a quoted string or the opener of an embedded comment.

A hash # is used as a comment marker in Perl 6 regexes.
Add a backslash \ to escape it, and use the Perl 6 smartmatch operator ~~ rather than Perl 5's =~, like this:
if ( $line ~~ /^\s*\#/ )

Related

Perl regex to replace a underscore or forward slash with a dash

There are many regex examples here showing variations, but I simply want to use a regex in Perl on two different strings, one beginning with an underscore (_) and the other with a forward slash (/), and replace that leading character with a hyphen (-), keeping the rest of the string. I am escaping with a backslash, but the output is incorrect.
Input    Output
_APPLE   -APPLE
/APPLE   -APPLE
Here is my code:
$string1 =~ s/\_\/APPLE/-APPLE
$string2 =~ s/\/\/APPLE/-APPLE
The code has an extra (escaped) / and would match strings with _/ (and // in the second case). That is not in your data, which has either _ or /, not both.
Also, there is no need to escape the _, and neither the / if it is not the delimiter.
To match any one of a few characters, the cleanest and most efficient approach is a character class
$string =~ s{[_/](\w+)}{-$1};
The alternation also works here
$string =~ s{(?:_|/)(\w+)}{-$1};
but it is more suitable when possibilities to match have more characters (word|another).
There are quite a few assumptions here, given how little is specified in the question. For one, \w also matches digits and _ along with letters. If you clarify the requirements I'll edit as needed.
I assume that the missing closing delimiter, needed for the code to compile, is a typo in posting.
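As a rough illustration of the character-class approach, here is a minimal sketch. The sample inputs come from the question's table, plus one unprefixed string; the \A anchor is an added assumption that the _ or / must be the first character, which the original snippets do not require.

```perl
use strict;
use warnings;

# Sample inputs from the question, plus one string with no prefix.
my @inputs = ('_APPLE', '/APPLE', 'APPLE');

my @outputs;
for my $string (@inputs) {
    # Copy first so the original value is left untouched.
    # \A (an assumption) restricts the match to the start of the string.
    (my $copy = $string) =~ s{\A[_/](\w+)}{-$1};
    push @outputs, $copy;
    print "$string => $copy\n";
}
```

Using braces as delimiters means the / inside the character class needs no escaping.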

How to escape a string that looks like a regular expression in Perl

I have a script that, among other things, searches a list of text files to replace a Windows path (text string) with another path.
The problem is that some of the folder names begin with a number and a dash. Perl seems to think that I am trying to invoke a regular expression here. I get the message, "Reference to nonexistent group in regex".
the string looks like this:
\\\BAGlobal\6-Engineering\3-Tech
I have quoted it like this:
my $find = "\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech"
How do I escape the 6- and 3- ?
The problem is not the dash in 6- but all the backslashes \.
It thinks that \3 and \6 are back-references to previously matched groups, like /foo(bar) foo\1/ would match the string foobar foobar.
If you use this in a pattern match you need to either include \Q and \E to add quoting, or apply the quotemeta built-in to your $find.
my $find = '\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech';
$string =~ m/\Q$find\E/;
Or with quotemeta.
my $find = quotemeta '\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech';
$string =~ m/$find/;
Also see perlre.
Note that your example code is probably wrong. The number of backslashes you have there is odd, and double quotes "" interpolate, so each pair of backslashes \\ turns into one actual backslash in the string. But because you have 7 of them, the last one is seen as the escape for B, turning that into \B, which is not a valid escape sequence in a double-quoted string. I used single quotes '' in my code above.
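To make the effect of \Q...\E concrete, here is a small sketch of a path replacement. The server name, the replacement path, and the sample line are all hypothetical; only the 6-Engineering and 3-Tech folder names come from the question.

```perl
use strict;
use warnings;

# Hypothetical paths. In single quotes, \\ yields one backslash,
# so $find holds: \\server\6-Engineering\3-Tech
my $find    = '\\\\server\\6-Engineering\\3-Tech';
my $replace = 'D:\\Projects';    # holds: D:\Projects

# Hypothetical line of text containing the old path.
my $line = 'root=\\\\server\\6-Engineering\\3-Tech\\docs';

# \Q...\E escapes the backslashes in $find, so \6 and \3 are no
# longer parsed as back-references to capture groups.
(my $fixed = $line) =~ s/\Q$find\E/$replace/g;

print "$fixed\n";    # root=D:\Projects\docs
```

The replacement variable is interpolated once and not processed further, so its backslash is inserted as-is.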

Perl search and replace using variables, string contains dot

I’m using a variable to search and replace a string using Perl.
I want to replace the string 23.0 with 23.0.1, so I tried this:
my $old="23.0";
my $new="23.0.1";
$_ =~ s/$old/$new/g;
The problem is that it also replaced the string 2310, so I tried:
my $old="23\.0"
and also /ee.
But can’t get the correct syntax for it to work. Can someone show me the correct syntax?
There are two things that will help you here:
The quotemeta function, which escapes metacharacters, and the \Q and \E escape sequences, which apply the same quoting to interpolated text inside a regex.
print quotemeta "21.0";
Or:
my $old="23.0";
my $new="23.0.1";
my $str = "2310";
$str =~ s/\Q$old\E/$new/g;
print $str;
Just use single quotes and escape the dot.
my $old='23\.0';
To complement Sobrique's excellent answer, let me note that the reason your attempt with "23\.0" didn't work is that "23\.0" and "23.0" evaluate to the same string: in a double-quoted string literal, the backslash escape sequence \. simply evaluates to a plain dot (.).
There are several things you could do to avoid this:
If you indeed want to match a fixed string, and don't need or want to include any special regexp metacharacters in it, you can do as Sobrique suggests and use quotemeta or \Q to escape them.
In particular, this is almost always the correct solution if the string to be matched comes from user input. If you do want to allow some limited set of non-literal metacharacters, you can unescape those after running the pattern through quotemeta. For a simple example, here's a quick-and-dirty way to turn a basic glob-like pattern (using the metacharacters ? and * for "any character" and "any string of characters" respectively) into an equivalent regexp:
my $regexp = "^\Q$glob\E\$"; # quote and anchor the pattern
$regexp =~ s/\\\?/./g; # replace "?" (escaped to "\?" by \Q) with "."
$regexp =~ s/\\\*/.*/g; # replace "*" (escaped to "\*" by \Q) with ".*"
Conversely, if you want to have a literal regexp pattern in your code, without immediately matching it against something, you can use the qr// regexp-like quote operator, like this:
my $old = qr/\b23\.0(\.0)?\b/; # match 23.0 or 23.0.0 (but not 123.012!)
my $new = "23.0.1"; # just a literal string
s/$old/$new/g; # replace any string matching $old in $_ with $new
Note that qr// has other effects beyond just allowing you to use regexp syntax in a string literal: it actually pre-compiles the pattern into a special Regexp object, so that it doesn't need to be recompiled every time it's used later. In particular, as a side effect, the string representation of a qr// regexp literal will usually not exactly match the original content, although it will be equivalent as a regexp. For example, say qr/\b23\.0(\.0)?\b/ will, on my Perl version, output (?^u:\b23\.0(\.0)?\b).
You could also just use a normal double-quoted string literal, and double any backslashes in it, but that's (usually) less efficient than using qr//, and also less readable due to leaning toothpick syndrome.
Using a single-quoted string literal would be slightly better, since backslashes in a single-quoted string are only special when followed by another backslash or a single quote. Even so, readability can still suffer if you happen to need to match any literal backslashes in your regexp, not to mention that it's easy to create subtle bugs if you forget to double a backslash in those rare places where it's still needed.
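The glob conversion described earlier in this answer can be combined with qr// in a small helper. This is only a sketch: the subroutine name glob_to_regexp and the sample version strings are made up for illustration, and \A and \z are used here in place of ^ and $ so that a trailing newline is not accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a glob that uses only ? and * into a regexp.
sub glob_to_regexp {
    my ($glob) = @_;
    my $regexp = quotemeta $glob;    # escape every metacharacter
    $regexp =~ s/\\\?/./g;           # "\?" becomes "any character"
    $regexp =~ s/\\\*/.*/g;          # "\*" becomes "any string"
    return qr/\A$regexp\z/;          # anchor and pre-compile
}

my $re = glob_to_regexp('23.?.*');
for my $version ('23.0.1', '2310.1', '23.10') {
    printf "%-7s %s\n", $version, ($version =~ $re ? 'matches' : 'no match');
}
```

Because the dots in the glob stay escaped, 2310.1 is rejected while 23.0.1 is accepted.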

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;
When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.
The binding operator =~ applies a regex to a string. This is documented in perldoc perlop.
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class bracket [ ... ] defines a set of characters, any one of which it will match. The quantifier * that follows it means the class may match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) denote anchors, beginning and end of string respectively. The modifier i at the end means the match is case insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).
The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and #). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or # trigger interpolation, you'll need to use something like /\Quser\E\#\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Square brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen at the end is escaped to emphasize that it matches a literal hyphen rather than specifying part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}
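As a sketch of how the first pattern behaves, the following assumes the delimiters are ( and ), held in the hypothetical variables $open and $close rather than $1 and $2. A capture group around the character class is added here, which the original pattern does not have, so that there is something to read back after a successful match.

```perl
use strict;
use warnings;

# Hypothetical delimiters; in the original they come from $1 and $2.
my ($open, $close) = ('(', ')');

# \Q...\E makes the parentheses match literally instead of grouping.
# The capture group around the class is an addition for illustration.
my $pattern = qr{^\Q$open\E([a-zA-Z0-9_\-]*)\Q$close\E$}i;

my @results;
for my $identity ('(abcABC_012-)', '()', '(a b)', 'abc)') {
    if ($identity =~ $pattern) {
        # $1 is only read inside the successful branch.
        push @results, "$identity: inner=<$1>";
    }
    else {
        push @results, "$identity: no match";
    }
}
print "$_\n" for @results;
```

Reading $1 only inside the if block avoids picking up a stale capture from an earlier match.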

Space character in regex is not recognised

I'm writing a simple program - please see below for my code with comments. Does anyone know why the space character is not recognised in line 10? When I run the code, it finds the :: but does not replace it with a space.
1 #!/usr/bin/perl
2
3 # This program replaces :: with a space
4 # but ignores a single :
5
6 $string = 'this::is::a:string';
7
8 print "Current: $string\n";
9
10 $string =~ s/::/\s/g;
11 print "New: $string\n";
Try s/::/ /g instead of s/::/\s/g.
The \s is actually a character class representing all whitespace characters, so it only makes sense to have it in the regular expression (the first part) rather than in the replacement string.
Use s/::/ /g. \s only denotes whitespace on the matching side, on the replacement side it becomes s.
Replace the \s with a real space.
The \s is shorthand for a whitespace matching pattern. It isn't used when specifying the replacement string.
The replacement string should be a literal space, i.e.:
$string =~ s/::/ /g;
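The difference between the two sides of s/// can be seen in a short round trip. This is a sketch using the string from the question; the variable names $spaced and $back are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'this::is::a:string';

# Replacement side: an ordinary string, so a literal space is used.
(my $spaced = $string) =~ s/::/ /g;

# Pattern side: here \s is meaningful and matches the space again.
(my $back = $spaced) =~ s/\s/::/g;

print "Current: $string\n";
print "New:     $spaced\n";
print "Back:    $back\n";
```

The single : is left alone in both directions because the pattern only ever looks for the two-character sequence or for whitespace.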