regex negative lookbehind matching when expected not to

regex negative lookbehind matching when expected not to - regex

Can someone help me understand why the following regex is matching when i would expect it not to match.
String to check against
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
regex
(?<!Transfer\/)\w*PINPUK.*(?:csv|txt)$
I was expecting this to not match as the string Transfer/ appears before 0 or more word chars followed by the string PINPUK. If I change the pattern from \w* to \w{6} to explicitly match 6 word chars this correctly returns no match.
Can someone help me understand why with the 0 or more quantifier on my "word" character results in the regex giving a match?

Your regex pattern (?<!Transfer/)\w*PINPUK.*(?:csv|txt)$ is looking for \w*PINPUK not immediately preceded by Transfer/
Given the string
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
the regex engine will start by matching \w*PINPUK with CBD99_PINPUK
But that is preceded by Transfer/ so the engine backtracks and finds BD99_PINPUK
That is preceded by C, which isn't Transfer/, so the match is successful
As for a fix, just put the slash outside the look-behind
(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$
That forces the \w* to begin right after the slash, and the pattern now correctly fails

Borodin has given an excellent explanation of why this doesn't work and a solution for this case (move a /). Sometimes something simple like that isn't possible though so here I'll explain an alternate work around that might be useful
Things will match as you expect if you move the \w* inside the negative look-behind. Like so:
(?<!Transfer\/\w*)PINPUK.*(?:csv|txt)$
Unfortunately Perl doesn't allow this, negative look-behinds must be fixed width. But still, there is a way to perform one match: match in reverse
^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)
This uses a variable length negative look-ahead, something Perl does allow. Putting all this together in a script we get
use strict;
use warnings;
use feature 'say';
my $string_matches = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_matches";
if ( reverse($string_matches) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
say 'It matched';
} else {
say 'No match';
}
say '';
my $string_doesnt_match = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_doesnt_match";
if ( reverse($string_doesnt_match) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
say 'It matched';
} else {
say 'No match';
}
Which outputs
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
No match
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
It matched

Related

Lookbehind does not work as expected

I am trying to understand the lookbehind.
This example I am trying doesn't work as I expected. I wanted to try to form a regex that would match John but not John.
The following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*/) {
print "Matches!\n";
}
'
Matches!
matches up to and including . of course. The problem is the following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*(?<![.])/) {
print "Matches!\n";
}
'
Matches!
For the latter I expected that the regex would match John. consuming >.< (the period)
Then at the next position it would look behind and realize that it consumed a period (.) and would reject the match.
Is my understanding wrong? What am I messing up here?
Update:
Same result also for my $var = "John. ";
Update 2:
My question is not about how to match only John and not John.
But to understand how lookbehind works and if it is not supposed to work in this case why.

The * is a quantification operator, not a placeholder. So A* means zero or more A characters. Without any further context, this always matches, e.g. "foo" =~ /J*/ is true.
What you intended to write was /J.*/ which does what you've actually described.
Now let's look what happens when we do "John." =~ /(J.*(?<![.]))/:
The regex engine sees J, which matches.
The next pattern is .*, which matches ohn..
Next the assertion (?<![.]) is tested, which fails.
The regex engine therefore backtracks.
We try .* again, but this time only match ohn.
Next the assertion (?<![.]) is tested, which suceeds.
In the above regex, I enclosed the pattern in a capture group, which we can now read out:
$ perl -E'"John." =~ /(J.*(?<![.]))/ and say "<$1>" or say "No match"'
<John>
It is often more efficient to use a character class instead of assertions and .* quantifications, so that we can avoid backtracking:
/J[^.]*/
However, this is not strictly equivalent to the above regexes.

This regexp:
/John(?![.])/
will match John but not John. It uses a negative look-ahead assertion (rather than look-behind).
If you want to match full names other than 'John', you'll need to be a bit more specific about what you do and don't want to allow in the match, as putting J* will match zero or more J's.
Edit: Obviously I misread the * per #amon's post. Look-ahead vs. look-behind still applies.

PERL Regular Expression - Return result not including conditional statement

I'm new to regex and I have a scenario where regex will be useful.
My requirement is quite simple, I to want detect if the word NET is present in a string, and extract the digits that follow it without including the word NET or the spaces that follow it.
In my particular case following the word NET are several white space characters, and the number of these can vary as they're used as padding.
My Input string is as follows
NET 4.800 g
The reg ex I have concocted is as follows
(?<=NET)\s*(\d{0,4}\.\d{1,3})
This produces a result close to what I'm attempting to do.
It performs a positive look-ahead on the characters NET and then matches as many white space characters that follow. Finally I select up to four digits, a period and up to three more digits.
The problem lies in that I'm grabbing the indeterminate number of padding spaces before the number. All I actually want is the number it self.
I did attempt putting \s* into the lookahead, but this failed. Does anyone have any suggestions as to where I'm going wrong here?

I suspect that you are using $& to capture your string, and not $1. The variable $& contains the entire matching string, which then includes your spaces, but not your lookbehind assertion. This sounds like your problem description: That you need to exclude a variable amount of spaces, but you get the error about "variable length lookbehind assertions are not supported".
This would be quite an easy question to answer if you had included your code. You should always do that: Always show.
So... I assume you have something like:
if (/your_regex/) {
$match = $&;
}
Then you should change it to
if (/your_regex/) {
$match = $1;
}
This way, only the string inside the parenthesis will be captured, and \s* outside it will be discarded.
With this proper way of matching, which can also be made in a simpler way, you can simplify your regex. Showing a strict and a flexible version:
use strict;
use warnings;
use Data::Dumper;
my $str = "NET 4.800 g";
my ($number) = $str =~ /^NET\s*(\d{0,4}\.\d{0,3})\sg$/; # strict match
print Dumper $number; # $VAR1 = '4.800';
my ($simple) = $str =~ /NET\s*([\d.]+)/; # flexible match
print Dumper $simple; # $VAR1 = '4.800';
In the strict match, we use anchors at beginning ^ and end $. We make sure that the string starts with NET and ends with g, and account for the exact numbers and spaces we expect to find between.
The flexible match simply looks for NET and captures the number that comes after it. This can take place anywhere in the string, and even match partially.

Need help in matching regexp

I am having a string say
my $str = "FILLER-1-1,EQPT:MN,EQPT_MISSING,NSA,04-30,15-07-13,NEND,NA";
I want to match a pattern say
my $pattern = "FILLER-1-1";
I am using the below regexp
$reg = $str =~ /$pattern/;
This is working fine
Now the problem is it is also matching if our string is
FILLER-1-10/FILLER-1-11/FILLER-1-12 so on ...
I dont want to match this. Also I don't want my regexp to be like
$reg = $str =~ /$pattern\W+/;
This one is working against the above mentioned issue but \W may come or not come. In some strings it can come while in other it may not come. So i need the regexp to match only FILLER-1-1 without using \W+ and it should match specifically FILLER-1-10
Note: If somebody is doing -(minus) rating to my question, please let me know what's wrong in the code. It will be appreciable if the person write the comment too

As \w matches [a-zA-Z0-9], you can use the zero-width assumption \b, which denotes a change in \w state (called a "word boundary", hence the "b" shortcut):
/FILLER-1-1\b/
This means that there needs to be a character that differs from the previous word state - a word state change.
It will match
FILLER-1-1.
FILLER-1-1&
FILLER-1-1,
It will not match
FILLER-1-1a
FILLER-1-16
Read more about it here.

If you want to match FILLER at the start of the input (line) followed by two numbers, this simple regex should work:
/~FILLER-\d+-\d+/
~ matches the beginning of the input
\d matches any digit ([0-9])
+ matches at least one, but can match any number

use ? quantifier like so:
/FILLER-\d-\d\W?/
The \W? means not a word zero or one time

Problems with perl regex

I need a perl regex to match A.CC3 on a line begining with something followed by anything then, my 'A.CC3 " and then anything...
I am surprised this (text =~ /^\W+\CC.*\A\.CC\[3].*/) is not working
Thanks

\A is an escape sequence that denotes beginning of line, or ^ like in the beginning of your regex. Remove the backslash to make it match a literal A.
Edit: You also seem to have \C in there. You should only use backslash to escape meta characters such as period ., or to create escape sequences, such as \Q .. \E.
At its simplest, a regex to match A.CC3 would be
$text =~ /A\.CC3/
That's all you need. This will match any string with A.CC3 in it. In the comments you mention the string you are matching is this:
my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";
You might want to avoid partial matches, in which case you can use word boundary \b
$text =~ /\bA\.CC3\b/
You might require that a line begins with //%
$text =~ m#^//%.*\bA\.CC3\b#
Of course, only you know which parts of the string should be matched and in what way. "Something followed by anything followed by A.CC3 followed by anything" really just needs the first simple regex.

It doesn't seem like you're trying to capture anything. If that's the case, and all you need to do is find lines that contain A.CC3 then you can simply do
if ( index( $str, 'A.CC3' ) >= 0 ) # Found it...
No need for a regex.

Try to give this a shot:
^.*?A\.CC.*$
That will match anything until it reaches A, then a literal ., followed by CC, then anything until end of string.

It depends what you want to match. If you want to pull back the whole line in which the A.CC3 pattern occurs then something like this should work:
^.*A\.CC3.*$

Ungreedy regexp in Perl

I'm trying to capture a string that is like this:
document.all._NameOfTag_ != null ;
How can I capture the substring:
document.all._NameOfTag_
and the tag name:
_NameOfTag_
My attempt so far:
if($_line_ =~ m/document\.all\.(.*?).*/)
{
}
but it's always greedy and captures _NameOfTag_ != null

The lazy (.*?) will always match nothing, because the following greedy .* will always match everything.
You need to be more specific:
if($_line_ =~ m/document\.all\.(\w+)/)
will only match alphanumeric characters after document.all.

Your problem is the lazy quantifier. A lazy quantifier will always first try and rescind matching to the next component in the regex and will consume the text only if said next component does not match.
But here your next component is .*, and .* matches everything until the end of your input.
Use this instead:
if ($_line_ =~ m/document\.all\.(\w+)/)
And also note that it is NOT required that all the text be matched. A regex needs only to match what it has to match, and nothing else.

Try the following instead, personal I find it much clearer:
document\.all\.([^ ]*)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

regex negative lookbehind matching when expected not to - regex

Related

Lookbehind does not work as expected

PERL Regular Expression - Return result not including conditional statement

Need help in matching regexp

Problems with perl regex

Ungreedy regexp in Perl

Categories

Resources