Perl regular expression (regex) fails when I make it optional [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I am running the following snippet of code on Perl 5.22:
DB<41> x "up 34 days, 22:04 and more" =~ m/.*?(?:(\d+) days).*$/
0 34
The above code works as expected and pulls out the 34 from "34 days".
My question comes in when I make the capture group optional by adding a ? at the end of it like this:
DB<4> x "up 34 days, 22:04 and more" =~ m/.*?(?:(\d+) days)?.*$/
0 undef
Why does it no longer match the 34? I have searched the web, but couldn't find any questions that matched mine (if you do have a link that explains it, that would be fantastic).
Thanks, in advance, for your time.

Regexes always work from left to right, and a quantifier first tries to match as much as it can, or as little as it can when made non-greedy (like .*?). Only when the engine reaches a state where the rest of the pattern cannot match does it back up and try something else (backtracking). The key to writing regexes is knowing what the regex engine will try first.
.*? will first try to match the empty string at the beginning of the string, since that's the least it can match. In the case of the first regex, that does not result in a successful overall match, so the engine keeps backtracking until .*? matches "up " and the following group can match "34 days". But if you make the following group optional, the first thing the engine tries already succeeds: .*? matches the empty string, (?:(\d+) days)? matches the empty string (it cannot match digits followed by " days" at that position, but it can match nothing), .* matches the rest of the string, and $ matches the end of the string. That is a successful match, so the engine stops there and the capture group is never set.
Regexp::Debugger can be nice to visualize the behavior, as well as https://regex101.com/ (just beware that PCRE is not exactly the same as Perl regex).
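The order of attempts described above can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and the last pattern (which moves the lazy prefix inside the optional group) is one possible rewrite rather than something proposed in the question. It assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

my $uptime = "up 34 days, 22:04 and more";

# Required group: .*? is forced to grow until "(\d+) days" can match.
my ($required) = $uptime =~ /.*?(?:(\d+) days).*$/;

# Optional group: the all-empty attempt at position 0 already succeeds,
# so the group is skipped and $1 stays undefined.
my ($optional) = $uptime =~ /.*?(?:(\d+) days)?.*$/;

# Assumed rewrite: with the lazy prefix inside the optional group, the
# engine tries the whole group first and only skips it if it cannot match.
my ($fixed)  = $uptime =~ /^(?:.*?(\d+) days)?/;
my ($absent) = "up 22:04 and more" =~ /^(?:.*?(\d+) days)?/;

printf "required: %s\n", $required // 'undef';   # 34
printf "optional: %s\n", $optional // 'undef';   # undef
printf "fixed:    %s\n", $fixed    // 'undef';   # 34
printf "absent:   %s\n", $absent   // 'undef';   # undef
```

The rewritten pattern still succeeds when there is no "N days" part in the string; the capture is simply left undefined in that case.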

Both .*? and (?:(\d+) days)? can match the empty string, and .*$ then matches whatever remains, which here is the whole input string.
If you run the following
use strict;
use warnings;
my $s = "up 34 days, 22:04 and more";
if ($s =~ m/.*?(?:(\d+) days)(.*)$/) {
    print("first:\n \$1=\"$1\"\n \$2=\"$2\"\n");
}
if ($s =~ m/.*?(?:(\d+) days)?(.*)$/) {
    print("second:\n \$1=\"$1\"\n \$2=\"$2\"\n");
}
you'll get
first:
 $1="34"
 $2=", 22:04 and more"
second:
 $1=""
 $2="up 34 days, 22:04 and more"
as output (plus a warning about $1 being uninitialized, which you can ignore here), which illustrates that.

Related

regex negative lookbehind matching when expected not to

Can someone help me understand why the following regex matches when I would expect it not to?
String to check against
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
regex
(?<!Transfer\/)\w*PINPUK.*(?:csv|txt)$
I was expecting this to not match as the string Transfer/ appears before 0 or more word chars followed by the string PINPUK. If I change the pattern from \w* to \w{6} to explicitly match 6 word chars this correctly returns no match.
Can someone help me understand why using the zero-or-more quantifier on my "word" character results in a match?

Your regex pattern (?<!Transfer/)\w*PINPUK.*(?:csv|txt)$ is looking for \w*PINPUK not immediately preceded by Transfer/
Given the string
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
the regex engine first tries to match \w*PINPUK against CBD99_PINPUK.
But that is preceded by Transfer/, so the look-behind fails there; the engine then moves its starting point forward one character and tries BD99_PINPUK.
That is preceded by C, not by Transfer/, so the match succeeds.
As for a fix, just put the slash outside the look-behind
(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$
That forces the \w* to begin right after the slash, and the pattern now correctly fails
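To see which text the original pattern actually matches, it can be wrapped in a capture group. The following is a small sketch with made-up variable names; it uses m{...} delimiters so the slash needs no escaping.

```perl
use strict;
use warnings;

my $path = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/'
         . 'CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';

# Original pattern, wrapped in a capture group to show what it matched.
my ($original) = $path =~ m{((?<!Transfer/)\w*PINPUK.*(?:csv|txt))$};

# Corrected pattern: the slash sits outside the look-behind.
my $fixed_matches = $path =~ m{(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$} ? 1 : 0;

print "original matched: $original\n";
# original matched: BD99_PINPUK_14934_09_02_2017_12_07_36.txt
print "fixed pattern matches: ", ($fixed_matches ? "yes" : "no"), "\n";
# fixed pattern matches: no
```

The captured text starts at B rather than C, which shows that the match began one character after the position where the look-behind rejected it.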

Borodin has given an excellent explanation of why this doesn't work and a solution for this case (move a /). Sometimes something simple like that isn't possible, though, so here I'll explain an alternative workaround that might be useful.
Things will match as you expect if you move the \w* inside the negative look-behind, like so:
(?<!Transfer\/\w*)PINPUK.*(?:csv|txt)$
Unfortunately Perl doesn't allow this: a look-behind cannot contain an unbounded quantifier such as \w* (and before Perl 5.30 look-behinds had to be fixed width). There is still a way to perform this kind of match, though: match in reverse.
^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)
This uses a variable-length negative look-ahead, something Perl does allow. Putting all this together in a script, we get
use strict;
use warnings;
use feature 'say';
my $string_matches = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_matches";
if ( reverse($string_matches) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
    say 'It matched';
} else {
    say 'No match';
}
say '';
my $string_doesnt_match = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_doesnt_match";
if ( reverse($string_doesnt_match) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
    say 'It matched';
} else {
    say 'No match';
}
Which outputs
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
No match
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
It matched

Lookbehind does not work as expected

I am trying to understand the lookbehind.
This example I am trying doesn't work as I expected. I wanted to try to form a regex that would match John but not John.
The following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*/) {
print "Matches!\n";
}
'
Matches!
matches up to and including . of course. The problem is the following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*(?<![.])/) {
print "Matches!\n";
}
'
Matches!
For the latter I expected that the regex would match John. consuming >.< (the period)
Then at the next position it would look behind and realize that it consumed a period (.) and would reject the match.
Is my understanding wrong? What am I messing up here?
Update:
Same result also for my $var = "John. ";
Update 2:
My question is not about how to match only John and not John.
But to understand how lookbehind works and if it is not supposed to work in this case why.

The * is a quantification operator, not a placeholder. So A* means zero or more A characters. Without any further context, this always matches, e.g. "foo" =~ /J*/ is true.
What you intended to write was /J.*/ which does what you've actually described.
Now let's look what happens when we do "John." =~ /(J.*(?<![.]))/:
The regex engine sees J, which matches.
The next pattern is .*, which matches ohn..
Next the assertion (?<![.]) is tested, which fails.
The regex engine therefore backtracks.
We try .* again, but this time only match ohn.
Next the assertion (?<![.]) is tested, which succeeds.
In the above regex, I enclosed the pattern in a capture group, which we can now read out:
$ perl -E'"John." =~ /(J.*(?<![.]))/ and say "<$1>" or say "No match"'
<John>
It is often more efficient to use a character class instead of assertions and .* quantifications, so that we can avoid backtracking:
/J[^.]*/
However, this is not strictly equivalent to the above regexes.
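The difference between the two approaches shows up as soon as more text follows the period. A minimal sketch, using a hypothetical helper name (compare_patterns) and a second input chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns what each pattern matches for a given input.
sub compare_patterns {
    my ($input) = @_;
    my ($with_assertion) = $input =~ /(J.*(?<![.]))/;
    my ($with_class)     = $input =~ /(J[^.]*)/;
    return ($with_assertion, $with_class);
}

for my $input ("John.", "John. Smith") {
    my ($with_assertion, $with_class) = compare_patterns($input);
    print "<$input>: assertion matched <$with_assertion>, class matched <$with_class>\n";
}
# <John.>: assertion matched <John>, class matched <John>
# <John. Smith>: assertion matched <John. Smith>, class matched <John>
```

With the assertion, .* is free to run past the period as long as the match does not end on one; the character class stops at the first period.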

This regexp:
/John(?![.])/
will match John but not John. It uses a negative look-ahead assertion (rather than look-behind).
If you want to match full names other than 'John', you'll need to be a bit more specific about what you do and don't want to allow in the match, as putting J* will match zero or more J's.
Edit: Obviously I misread the * per @amon's post. Look-ahead vs. look-behind still applies.

How does pattern matching work in Perl?

I want to know how pattern matching works in Perl.
My code is:
my $var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos";
if($var =~ m/(.*)\,(.*)/sgi)
{
print "$1\n$2";
}
I learnt that the first occurrence of the comma should be matched, but here the last occurrence is being matched. The output I got is:
VP KDC T. 20, pgcet. 5, Ch. 415
Refs %50 Annos
Can someone please explain to me how this matching works?

From docs:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match
So the first (.*) will take as much as possible.
A simple workaround is to use the non-greedy quantifier *?. Alternatively, match everything except a comma instead of every character: ([^,]*).
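The three variants can be compared side by side on the string from the question. This is a sketch with made-up variable names; the split call at the end is an additional option not mentioned above, included on the assumption that only the part before the first comma is needed.

```perl
use strict;
use warnings;

my $var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos";

my ($greedy_head)  = $var =~ /(.*),/;      # backs off only to the LAST comma
my ($lazy_head)    = $var =~ /(.*?),/;     # stops at the FIRST comma
my ($negated_head) = $var =~ /([^,]*),/;   # cannot cross a comma at all

# Alternative without a capture regex: split into at most two fields.
my ($head, $rest) = split /,/, $var, 2;

print "greedy:  $greedy_head\n";    # VP KDC T. 20, pgcet. 5, Ch. 415
print "lazy:    $lazy_head\n";      # VP KDC T. 20
print "negated: $negated_head\n";   # VP KDC T. 20
print "split:   $head\n";           # VP KDC T. 20
```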

Greedy and Ungreedy Matching
Perl regular expressions normally match the longest string possible.
For instance:
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";
Run the preceding code, and here's what you get:
ississ
It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
Clearly, the use of the question mark makes the match ungreedy. But there's another problem in that regular expressions always try to match as early as possible.
Source: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

Use a question mark in your regex:
if($var =~ m/(.*?)\,(.*)/sgi)
{
print "$1\n$2";
}
So:
(.*)\, means: "match as many characters as you can as long as there will be a comma after them"
(.*?)\, means: "match any characters until you stumble upon a comma"

(.*)\, - you might expect that it will match up to the first comma.
But it is greedy, so it matches all the characters it comes across up to the last comma instead of the first.
So it matches up to the last comma, and the second group matches the rest of the line.
To avoid the greedy match, add a ? after the *.

Regular expression for number search

I need a regular expression that will find numbers that are not inside parentheses.
Example: abcd 1 (35) (df)
It would only see the 1.
Is this very complex? I've tried and had no luck.
Thanks for any help

An easy solution is to first remove the unwanted values:
my $string = "abcd 12 (35) (df) 2311,22";
$string =~ s/\(\d+\)//g; # remove numbers within parens
my @numbers = $string =~ /\d+/g; # extract the numbers

This is quite hard but something like this will probably do:
^(?:\()(\d+)(?:[^)])|(?:[^(0-9]|^)(\d+)(?:[^)0-9]|^)|(?:[^(])(\d+)(?:\))$
The problem is to match (123, 123) and also to not match the string 123 as the number 2 between the non-parentheses characters 1 and 3. Also there are probably some edge cases for start of and end of string.
My suggestion is not to use a single regex for this. Instead, use a regex that matches numbers and then use the capture information to check that the surrounding characters are not parentheses.

The regular expression would be:
^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$
The result is the first (and only) matching group of the regex.
You may want to remove the ^ and $ if the regex should also match when the text is not the entire content of a line. You can also use [a-zA-Z] or [[:alpha:]]; which is appropriate depends on the regular expression engine you use and, of course, on the content you want to match.
Example perl code:
if (m/^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$/) {
print("$1\n");
}
Please note that your question does not contain enough information to make a good answer possible (you did not say anything about the general format of your expression, for example whether you want to match integers or floating-point numbers).

How about
/(?:^|[^\d(])(\d+)(?:[^\d)]|$)/
? This matches a string of digits (\d+) that is
preceded by the beginning of the string, or by a character that is not a digit or an open parenthesis ((?:^|[^\d(])), and
followed by the end of the string, or by a character that is not a digit or a close parenthesis ((?:[^\d)]|$)).
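One thing to watch when this pattern is used with /g: the character after each number is consumed by the match, so a number that follows immediately after a single separator can be skipped. The sketch below compares it with a variant that uses look-arounds instead of consuming the neighbours. The look-around pattern and the test string are assumptions added for illustration, not part of the answer above.

```perl
use strict;
use warnings;

my $text = "abcd 1 (35) (df) 7 8";

# Pattern from the answer: the neighbouring characters are part of the match.
my @consuming = $text =~ /(?:^|[^\d(])(\d+)(?:[^\d)]|$)/g;

# Assumed variant: look-arounds test the neighbours without consuming them.
my @lookaround = $text =~ /(?<![\d(])(\d+)(?![\d)])/g;

print "consuming:  @consuming\n";    # consuming:  1 7
print "lookaround: @lookaround\n";   # lookaround: 1 7 8
```

In the first pattern the space after 7 is consumed, so 8 no longer has a preceding character left to match against; the look-around version does not have that problem.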

How can I get my Perl regex not to use special characters from an interpolated variable? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
How can I escape meta-characters when I interpolate a variable in Perl's match operator?
I am using the following regex to search for a string $word in the bigger string $referenceLine as follows :
$wordRefMatchCount =()= $referenceLine =~ /(?=\b$word\b)/g
The problem happens when my $word substring contains characters such as (, because Perl treats them as part of the regex rather than as part of the string to match, and gives the following error:
Unmatched ( in regex; marked by <-- HERE in
m/( <-- HERE ?=\b( darsheel safary\b)/
at ./bleu.pl line 119, <REFERENCE> line 1.
Can someone please tell me a solution to this? I think if I could somehow get Perl to understand that we want to look for the whole $word as it is without evaluating it, it might work out.

Use
$wordRefMatchCount =()= $referenceLine =~ /(?=\b\Q$word\E\b)/g
to tell the regex engine to treat every character in $word as a literal character.
\Q marks the start, \E marks the end of a literal string in Perl regex.
Alternatively, you could do
$quote_word = quotemeta($word);
and then use
$wordRefMatchCount =()= $referenceLine =~ /(?=\b$quote_word\b)/g
One more thing (taken up here from the comments, where it's harder to find):
Your regex fails in your example case because of the word boundary anchor \b. This anchor matches between a word character and a non-word character. It only makes sense if placed around actual words, i.e. \bbar\b to ensure that only bar is matched, not foobar or barbaric. If you put it around non-words (as in \b( darsheel safary\b), it will cause the match to fail (unless there is a letter, digit or underscore right before the ().
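Both points can be seen in a short script. This is a sketch only: the reference sentence is invented for illustration, and the look-around replacement for \b is one possible workaround, assuming the goal is "not directly adjacent to a word character".

```perl
use strict;
use warnings;

# Invented sample data; only $word comes from the error message in the question.
my $reference = "he met ( darsheel safary ) today";
my $word      = "( darsheel safary";

# quotemeta (and \Q...\E) backslash-escape every non-word character.
print quotemeta($word), "\n";                      # \(\ darsheel\ safary

# The pattern now compiles, but \b cannot match between a space and "(".
my $with_b = () = $reference =~ /(?=\b\Q$word\E\b)/g;

# Assumed workaround: require that no word character touches either end.
my $with_lookaround = () = $reference =~ /(?=(?<!\w)\Q$word\E(?!\w))/g;

print "with \\b:          $with_b\n";              # 0
print "with look-arounds: $with_lookaround\n";     # 1
```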
