Why does adding the Perl /x switch stop my regex from matching?

I'm trying to match:
JOB: fruit 342 apples to get
The code matches:
$line =~ /^JOB: fruit (\d+) apples to get/
But, when I add the /x switch in:
$line =~ /^JOB: fruit (\d+) apples to get/x
It does not match.
I looked into the /x switch, and it says it just lets you do comments. I don't know why adding /x stops my regex from matching.

The /x modifier tells Perl to ignore most whitespace that isn't escaped in the regex.
For example, let's just focus on apples to get. You could match it with:
$line =~ /apples to get/
But if you try:
$line =~ /apples to get/x
then Perl will ignore the spaces. So it would be like trying to match applestoget.
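A quick way to see this for yourself (a minimal sketch; the variable names are just for illustration):

```perl
use strict;
use warnings;

my $line = 'JOB: fruit 342 apples to get';

# Without /x, the spaces in the pattern are literal and must be present.
my $plain = $line =~ /apples to get/ ? 1 : 0;

# With /x, unescaped spaces are dropped, so this is really /applestoget/.
my $extended = $line =~ /apples to get/x ? 1 : 0;

# The /x pattern does match a string that has no spaces at all.
my $squashed = 'applestoget' =~ /apples to get/x ? 1 : 0;

print "plain=$plain extended=$extended squashed=$squashed\n";
# prints: plain=1 extended=0 squashed=1
```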
You can read more about it in perlre. They have this nice example of how you can use the modifier to make the code more readable.
# Delete (most) C comments.
$program =~ s {
/\* # Match the opening delimiter.
.*? # Match a minimal number of characters.
\*/ # Match the closing delimiter.
} []gsx;
They also mention how to match whitespace or # again while using the /x modifier.
Use of /x means that if you want real whitespace or # characters in
the pattern (outside a bracketed character class, which is unaffected
by /x), then you'll either have to escape them (using backslashes or
\Q...\E ) or encode them using octal, hex, or \N{} escapes.
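Applied to the pattern from the question, that could look like the sketch below. It mixes three of the options (an escaped space, a bracketed class, and a hex escape) only to show them side by side; in real code you would probably pick one.

```perl
use strict;
use warnings;

my $line = 'JOB: fruit 342 apples to get';

# Unescaped spaces are layout only; "\ ", "[ ]" and "\x20" each match one
# real space character.
my ($count) = $line =~ /^JOB: \ fruit [ ] (\d+) \x20 apples\ to\ get/x;

print "count=$count\n";   # prints: count=342
```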

Part of allowing comments is also ignoring literal white space. Use \s or [ ] for spaces you wish to match.
For example
$line =~ /^ #beginning of string
JOB:[ ]fruit[ ] #some literal text
(\d+) #capture digits to $1
[ ]apples[ ]to[ ]get #more literal text
/x
Notice all those spaces before the beginning of the comments. It would stink if they counted....

Related

How can I match start of the line or a character in Perl?

For example, this is ok:
my $str = 'I am $name. \$escape';
$str =~ s/[^\\]\K\$([a-z]+)/Bob/g;
print $str; # 'I am Bob. \$escape';
But the result below is not what I expected.
my $str = '$name';
$str =~ s/[^\\]\K\$([a-z]+)/Bob/g;
print $str; # '$name';
How can I correct this?
The circumflex inside a character class loses the meaning of the start-of-string anchor. Instead of a character class, you need to use a non-capturing group:
$str =~ s/(?:^|[^\\])\K\$([a-z]+)/Bob/g;
          ^^^^^^^^^^^
This (?:^|[^\\]) will either assert the position at the string start or match any character other than \.
For those who understand the question as match only if the $ symbol is not escaped, the solution will be
$str =~ s/(?<!\\)(?:\\\\)*\K\$([a-z]+)/Bob/g;
Here, (?<!\\) is a negative lookbehind, a zero-width assertion that fails the match if the current position is preceded by a \ symbol. (?:\\\\)* then consumes any escaped backslashes (pairs of \) before the $, and the \K match-reset operator drops those backslashes from the match value so they are not replaced.
If your goal is to match $ that are not escaped by a backslash, you can change your pattern to:
(?<!\\)(?:\\{2})*\K\$([a-z]+)
This way you don't have to use an alternation since the negative lookbehind matches a position not preceded by a backslash (that includes the start of the string).
In addition, (?:\\{2})* avoids missing cases where the backslash before a $ is itself escaped by another backslash. For example: \\$name
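Here is a small check of that pattern against the cases discussed in this thread. It is only a sketch: the fill_in helper and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $word placeholders unless the dollar sign
# is escaped by a backslash.
sub fill_in {
    my ($text) = @_;
    $text =~ s/(?<!\\)(?:\\{2})*\K\$([a-z]+)/Bob/g;
    return $text;
}

my @results = map { fill_in($_) } (
    '$name',            # placeholder at the very start of the string
    'I am $name.',      # placeholder after ordinary text
    'cost: \$escape',   # escaped dollar sign, must be left alone
    '\\\\$name',        # two backslashes (an escaped backslash), then $name
);

print "$_\n" for @results;
```

The first two strings become Bob and I am Bob., the third is unchanged, and the last keeps its two backslashes in front of Bob.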

How to get this perl extended regex to work?

I have the following code
my @txt = ("Line 1. [foo] bar",
"Line 2. foo bar",
"Line 3. foo [bar]"
);
my $regex = qr/^
Line # Bare word
(\d+)\. # line number
\[ # Open brace
(\w+) # Text in braces
] # close brace
.* # slurp
$
/x;
my $nregex = qr/^\s*Line\s*(\d+)\.\s*\[\s*(\w+)\s*].*$/;
foreach (@txt) {
if ($_ =~ $regex) {
print "Lnum $1 => $2\n";
}
if ($_ =~ $nregex) {
print "N Lnum $1 => $2\n";
}
}
Output
N Lnum 1 => foo
I am expecting both regexes to be equivalent and to capture only the first line of the array. However, only $nregex works!
How can $regex be fixed so that it also works identically (with the x option)?
Edit
Based on the response, updated the regex and it works.
my $regex = qr/^ \s*
Line \s* # Bare word
(\d+)\. \s* # line number
\[ \s* # Open brace
(\w+) \s* # Text in braces
] \s* # close brace
.* # slurp
$
/x;
Your two expressions are NOT the same. You need to have the \s* bits in the first one. The /x allows you to write neatly formatted expressions - with comments as you've noticed. As such, the spaces in the /x version are not considered significant, and will not contribute to any matching activity.
In other words, your /x version is the equivalent of
qr/^Line(\d+)\.\[(\w+)].*$/x
By the way, just having a plain space instead of \s* or \s+ would also fail many times; your sample data contains TWO spaces next to each other in a few places. These two places will not match a single space.
Final tip: when you MUST have at least one space in a certain position, you should use \s+ to enforce at least one space. You can surely figure out where that might be useful in your patterns once you know it is possible.
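As a rough illustration of that tip, here is one way the /x pattern could be written with \s+ where a gap is required and \s* where it is optional. The sample data is modelled on the question, and the choice of which gaps are required is my own assumption.

```perl
use strict;
use warnings;

my @txt = (
    "Line 1.  [foo] bar",
    "Line 2. foo bar",
    "Line 3. foo [bar]",
);

# Under the x flag the layout whitespace is ignored, so every gap that
# should match has to be spelled out.
my $regex = qr/^
    Line \s+      # bare word, then at least one space
    (\d+) \.      # line number and a literal dot
    \s* \[        # optional spaces, then the open bracket
    (\w+)         # text in brackets
    \]            # close bracket
    .*            # rest of the line
    $
/x;

my @hits;
for my $text (@txt) {
    push @hits, "Lnum $1 => $2" if $text =~ $regex;
}
print "$_\n" for @hits;
# prints: Lnum 1 => foo
```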

Match everything except for characters within quotes

After some playing around I came up with a way to capture characters within single/double quotes:
['"](?:[^'"]*?(?:\\")*)*["']
Not sure if this is entirely correct. In any event, I am now trying to match everything BUT these.
Example:
'stringA' '\"stringB\"' variableA variableB
The above regex matches: 'stringA' '\"stringB\"'
I would like to match variableA variableB
Is there a way I can achieve this with Perl? I was trying to use negative/positive lookahead/behinds but I encountered issues as my lookbehind had \s* which was not allowed.
Thanks for your help.
Use the backtracking control verbs (*SKIP)(*F), which Perl supports (5.10 and later) as well as PCRE:
['"](?:[^'"]*?(?:\\")*)*["'](*SKIP)(*F)|\S+
['"](?:[^'"]*?(?:\\")*)*["'] Matches strings within double or single quotes.
(*SKIP)(*F) Forces the match to fail once a quoted string has been consumed and stops the engine from retrying inside it, so the search resumes just past the quoted string and the alternative on the right of | is tried there.
\S+ Matches one or more non-space characters, which at that point can only be text outside the quoted strings.
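To see the technique in action, here is a minimal sketch. Note that it swaps in a simpler quoted-string sub-pattern (one branch per quote type, with \\. for escapes) rather than the exact one above, so treat it as an illustration of (*SKIP)(*F) and not as a drop-in replacement.

```perl
use strict;
use warnings;

my $input = q{'stringA' '\"stringB\"' variableA variableB};

# Left branch: consume a whole quoted string, then (*SKIP)(*F) throws it
# away and keeps the engine from retrying inside it.
# Right branch: keep any remaining run of non-space characters.
my @bare = $input =~ m{
    (?: ' (?: [^'\\] | \\. )* '
      | " (?: [^"\\] | \\. )* "
    ) (*SKIP)(*F)
    |
    \S+
}xg;

print "$_\n" for @bare;
# prints:
# variableA
# variableB
```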
You could use a long complicated regex like the following:
my @words = split m{
' (?: [^'\\]* | \\. )* ' (*SKIP)(*F)
|
" (?: [^"\\]* | \\. )* " (*SKIP)(*F)
|
\s+
}x, $_;
However, I would recommend using Text::ParseWords:
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::ParseWords;
while (<DATA>) {
my @words = parse_line(qr{\s+}, 0, $_);
print "$_\n" for @words;
}
__DATA__
'stringA' '\"stringB\"' variableA variableB
Outputs:
stringA
\"stringB\"
variableA
variableB

~m, s and () in perl regexp

I am trying to get hold of regular expressions in Perl. Can anyone please provide any examples of what matches and what doesn't for the below regular expression?
$sentence =~m/.+\/(.+)/s
=~ is the binding operator; it makes the regex match be performed on $sentence instead of the default $_. m is the match operator; it is optional (e.g. $foo =~ /bar/) when the regex is delimited by / characters but required if you want to use a different delimiter.
s is a regex flag that makes . match any character, including a newline; by default . does not match newlines.
The actual regex is .+\/(.+); this will match one or more characters, then a literal / character, then one or more other characters. Because the initial .+ consumes as much as possible while still allowing the regex to succeed, it will match up to the last / in the string that has at least one character after it; then the (.+) will capture the characters that follow that / and make them available as $1.
So it is essentially capturing the final component of a filepath. Of foo/bar it will capture the bar, of foo/bar/ it will capture the bar/. Strings with only one component, like /foo or bar/ or baz will not match.
It matches any string, including a multi-line string, that contains a slash character somewhere in the middle of the string.
Matches:
foo/bar
asdf\nwrqwer/wrqwerqw # /s modifier allows '.' to match newlines
Doesn't match:
asdfasfdasf # no slash character
/asdfasdf # no characters before the slash
asdfasf/ # no characters after the slash
In addition, the entire substring that follows the last slash in the string will be captured and assigned to the variable $1.
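If you want to try these cases yourself, something like the following sketch should do. The last_part helper is a made-up name that just wraps the regex from the question.

```perl
use strict;
use warnings;

# Returns what (.+) captured, or undef when the pattern does not match.
sub last_part {
    my ($sentence) = @_;
    return $sentence =~ m/.+\/(.+)/s ? $1 : undef;
}

my @inputs = (
    'foo/bar', 'foo/bar/', "asdf\nwrqwer/wrqwerqw",   # expected to match
    'asdfasfdasf', '/asdfasdf', 'asdfasf/',           # expected not to match
);
my @captures = map { last_part($_) } @inputs;

for my $i (0 .. $#inputs) {
    (my $shown = $inputs[$i]) =~ s/\n/\\n/g;   # show the newline as \n
    printf "%-22s => %s\n", $shown, $captures[$i] // '(no match)';
}
```

The first three inputs give bar, bar/ and wrqwerqw; the last three report no match.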
Breakdown:
$sentence =~ — match $sentence with
m/ — the pattern consisting of
. — any character
+ — one or more times
\/ — then a forward-slash
( — and, saving in the $1 capture group,
.+ — any character one or more times
)
/s — allowing . to match newlines
See perldoc perlop for information about operators such as =~ and quote-like operators such as m//, and perldoc perlre about regular expressions and their options such as /s.

How can I highlight consecutive duplicate words with a Perl regular expression?

I want a Perl regular expression that will match duplicated words in a string.
Given the following input:
$str = "Thus joyful Troy Troy maintained the the watch of night..."
I would like the following output:
Thus joyful [Troy Troy] maintained [the the] watch of night...
This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication:
$str = 'This is Goethe the the the their sentence';
$str =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;
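To show why the "one or more" quantifier matters, this sketch compares it with a pattern that only looks for a pair of words. The variable names are mine.

```perl
use strict;
use warnings;

my $str = 'This is Goethe the the the their sentence';

# Pair-only pattern: brackets two copies and leaves the third one outside.
(my $pairs = $str) =~ s/\b((\w+)\s+\2)\b/[$1]/g;

# "One or more" on the repetition: brackets the whole run.
(my $runs = $str) =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;

print "$pairs\n";   # This is Goethe [the the] the their sentence
print "$runs\n";    # This is Goethe [the the the] their sentence
```

The \b anchors are what keep the "the" inside "Goethe" and "their" from being counted as duplicates.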
The features I'm about to use are described in either perlre, when they apply at a pattern, or perlop when they affect how the substitution operator does its work.
If you like the /x flag to add insignificant whitespace and comments:
$str =~ s/
\b
(
(\w+)
(?:
\s+
\2
\b
)+
)
/[$1]/xg;
I don't like that \2 though because I hate counting absolute group numbers. I can use the relative backreferences in Perl 5.10. The \g{-1} refers to the immediately preceding capture group:
use 5.010;
$str =~ s/
\b
(
(\w+)
(?:
\s+
\g{-1}
\b
)+
)
/[$1]/xg;
Counting isn't all that great either, so I can use labeled matches:
use 5.010;
$str =~ s/
\b
(
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[$1]/xg;
I can label the first capture ($1) and access its value in %+ later:
use 5.010;
$str =~ s/
\b
(?<dups>
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[$+{dups}]/xg;
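For a look at what ends up in %+, here is a small sketch using a plain match instead of a substitution, with the same labels as above:

```perl
use strict;
use warnings;
use 5.010;

my $str = 'Thus joyful Troy Troy maintained the the watch of night...';

my ($dups, $word);
if ($str =~ /\b(?<dups>(?<word>\w+)(?:\s+\k<word>\b)+)/) {
    # %+ maps each capture label to the text it matched.
    ($dups, $word) = ($+{dups}, $+{word});
    say "dups=[$dups] word=[$word]";   # dups=[Troy Troy] word=[Troy]
}
```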
I shouldn't really need that first capture though since it's really just there to refer to everything that matched. Sadly, it looks like ${^MATCH} isn't set early enough for me to use it in the replacement side. I think that's a bug. This should work but doesn't:
$str =~ s/
\b
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
/[${^MATCH}]/pgx; # DOESN'T WORK
I'm checking this on blead, but that's going to take a little while to compile on my tiny machine.
This works:
$str =~ s/\b((\w+)\s+\2)\b/[$1]/g;
You can try:
$str = "Thus joyful Troy Troy maintained the the watch of night...";
$str =~s{\b(\w+)\s+\1\b}{[$1 $1]}g;
print "$str"; # prints Thus joyful [Troy Troy] maintained [the the] watch of night...
Regex used: \b(\w+)\s+\1\b
Explanation:
\b: word boundary
\w+: a word
(): to remember the above word
\s+: whitespace
\1: the remembered word
It effectively finds two full words separated by whitespace and places [ ] around them.
EDIT:
If you want to preserve the amount of whitespace between the words you can use:
$str =~s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
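The difference only shows up when the gap is not a single space, so here is a sketch with a made-up input that has three spaces in one gap and a tab in the other:

```perl
use strict;
use warnings;

my $str = "Thus joyful Troy   Troy maintained the\tthe watch";

# First version: the gap is always rewritten as one space.
(my $fixed = $str) =~ s{\b(\w+)\s+\1\b}{[$1 $1]}g;

# Second version: the original gap is captured in $2 and put back.
(my $kept = $str) =~ s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;

print "$fixed\n";   # gaps collapsed to a single space
print "$kept\n";    # three spaces and the tab are preserved
```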
Try the following (note that it replaces each run of duplicates with a single bracketed copy, such as [Troy]):
$str =~ s/\b(\S+)\b(\s+\1\b)+/[$1]/g;