Ungreedy regexp in Perl

I'm trying to capture a string that is like this:
document.all._NameOfTag_ != null ;
How can I capture the substring:
document.all._NameOfTag_
and the tag name:
_NameOfTag_
My attempt so far:
if($_line_ =~ m/document\.all\.(.*?).*/)
{
}
but it's always greedy and captures _NameOfTag_ != null

The lazy (.*?) will always match nothing, because the following greedy .* will always match everything.
You need to be more specific:
if($_line_ =~ m/document\.all\.(\w+)/)
will only match word characters (letters, digits, and underscore) after document.all.
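As a rough illustration of this answer, the same two patterns can be tried in Python, whose re module handles lazy and greedy quantifiers the same way as Perl for these cases. The variable names are invented for the example.

```python
import re

line = "document.all._NameOfTag_ != null ;"

# Lazy group followed by a greedy .* : the group is allowed to match nothing,
# because the trailing .* is happy to consume the rest of the line.
lazy = re.search(r"document\.all\.(.*?).*", line)

# A specific character class stops at the first non-word character.
specific = re.search(r"document\.all\.(\w+)", line)

print(repr(lazy.group(1)))   # ''
print(specific.group(0))     # document.all._NameOfTag_
print(specific.group(1))     # _NameOfTag_
```

Group 0 gives the whole substring and group 1 the tag name, which covers both things the question asks for.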

Your problem is the lazy quantifier. A lazy quantifier always first tries to hand matching over to the next component in the regex, and consumes a character only if that next component does not match.
But here your next component is .*, and .* matches everything until the end of your input.
Use this instead:
if ($_line_ =~ m/document\.all\.(\w+)/)
And also note that it is NOT required that all the text be matched. A regex needs only to match what it has to match, and nothing else.
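A small Python sketch of both points, assuming the tag is always followed by the literal text " !=" as in the sample line (that delimiter is an assumption made for the example, not something the question guarantees):

```python
import re

line = "document.all._NameOfTag_ != null ;"

# With a concrete component after it, the lazy group grows one character
# at a time until " !=" can finally match.
anchored = re.search(r"document\.all\.(.*?) !=", line)
print(anchored.group(1))          # _NameOfTag_

# The regex does not have to consume the whole line to succeed.
partial = re.search(r"document\.all\.(\w+)", line)
print(partial.group(0) == line)   # False
```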

Try the following instead; personally I find it much clearer:
document\.all\.([^ ]*)

Related

regex negative lookbehind matching when expected not to

Can someone help me understand why the following regex matches when I would expect it not to?
String to check against
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
regex
(?<!Transfer\/)\w*PINPUK.*(?:csv|txt)$
I was expecting this to not match as the string Transfer/ appears before 0 or more word chars followed by the string PINPUK. If I change the pattern from \w* to \w{6} to explicitly match 6 word chars this correctly returns no match.
Can someone help me understand why using the 0 or more quantifier on my "word" character results in the regex giving a match?

Your regex pattern (?<!Transfer/)\w*PINPUK.*(?:csv|txt)$ is looking for \w*PINPUK not immediately preceded by Transfer/
Given the string
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
the regex engine will first try to match \w*PINPUK at the start of CBD99_PINPUK
But that position is preceded by Transfer/, so the look-behind fails and the engine moves one character forward, where \w*PINPUK matches BD99_PINPUK
That is preceded by C, which isn't Transfer/, so the match is successful
As for a fix, just put the slash outside the look-behind
(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$
That forces the \w* to begin right after the slash, and the pattern now correctly fails
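The behaviour described above can be reproduced with Python's re module, which also supports fixed-width negative look-behind. This is only a sketch; the variable names are made up for the example.

```python
import re

path = ("/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/"
        "CBD99_PINPUK_14934_09_02_2017_12_07_36.txt")

# Original pattern: the look-behind only guards the position where \w* starts,
# so the match simply begins one character later, at "B".
original = re.search(r"(?<!Transfer/)\w*PINPUK.*(?:csv|txt)$", path)
print(original.group(0))   # BD99_PINPUK_14934_09_02_2017_12_07_36.txt

# Slash moved outside the look-behind: \w* must now start right after a "/".
fixed = re.search(r"(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$", path)
print(fixed)               # None
```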

Borodin has given an excellent explanation of why this doesn't work and a solution for this case (move a /). Sometimes something that simple isn't possible, though, so here I'll explain an alternative workaround that might be useful.
Things will match as you expect if you move the \w* inside the negative look-behind. Like so:
(?<!Transfer\/\w*)PINPUK.*(?:csv|txt)$
Unfortunately Perl doesn't allow this: negative look-behinds must be fixed width. Still, there is a way to do it in one match: match in reverse
^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)
This uses a variable length negative look-ahead, something Perl does allow. Putting all this together in a script we get
use strict;
use warnings;
use feature 'say';
my $string_matches = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_matches";
if ( reverse($string_matches) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
say 'It matched';
} else {
say 'No match';
}
say '';
my $string_doesnt_match = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $string_doesnt_match";
if ( reverse($string_doesnt_match) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
say 'It matched';
} else {
say 'No match';
}
Which outputs
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
No match
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
It matched
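The same reverse-and-look-ahead idea carries over to other engines with the same restriction. Here is a minimal Python sketch; Python's re also rejects variable-width look-behind, and the helper name matches_reversed is hypothetical.

```python
import re

# Reversed form of the pattern: the look-behind becomes a look-ahead,
# which is allowed to be variable width.
reversed_pattern = re.compile(r"^(?:vsc|txt).*KUPNIP(?!\w*/refsnarT)")

def matches_reversed(path):
    # Reverse the input so the pattern can be applied back to front.
    return reversed_pattern.search(path[::-1]) is not None

with_transfer = ("/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/"
                 "CBD99_PINPUK_14934_09_02_2017_12_07_36.txt")
without_transfer = ("/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/"
                    "CBD99_PINPUK_14934_09_02_2017_12_07_36.txt")

print(matches_reversed(with_transfer))     # False
print(matches_reversed(without_transfer))  # True
```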

PCRE Regular expression : only one matching

I want to catch strings which respond to a pattern in a subject string.
Patterns examples: ##name##, ##address##, ##bankAccount##, ...
Subject example: This is the template with patterns : ##name##Your bank account is : ##bankAccount##Your address is : ##address##
With the following regex: .*(#{2}[a-zA-Z]*#{2}).*, only the last pattern is matched.
How to capture all the patterns, not just the last or first ?

Now that I've formatted your regex properly, the problem shows. A * in your regex was hidden since markdown took it to make the text italics.
Your opening .* matches greedily as much as it can, only backing up enough to let (#{2}[a-zA-Z]*#{2}) match. This matches the last pattern found in the line, everything before it having been matched by the .*.

You need to remove .* as I mentioned in my comment, and use preg_match_all:
$re = '~#{2}[a-zA-Z]*#{2}~';
preg_match_all($re, "##name##, ##address##, ##bankAccount##", $m);
print_r($m);
The .*#{2}[a-zA-Z]*#{2}.* matched 0 or more characters other than a newline at first, grabbing the whole line, and then backtracked until the last occurrence of #{2}[a-zA-Z]*#{2} pattern, and the last .* only grabbed the rest of the line. Removing the .* and using preg_match_all, all substrings matching the #{2}[a-zA-Z]*#{2} pattern can be extracted.
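For comparison, here is a rough Python equivalent, with re.findall playing the role of preg_match_all. It is a sketch using the subject string from the question, not part of the original answer.

```python
import re

subject = ("This is the template with patterns : ##name##"
           "Your bank account is : ##bankAccount##"
           "Your address is : ##address##")

# With the surrounding .* the leading greedy part swallows everything
# up to the last placeholder, so only that one is captured.
wrapped = re.search(r".*(#{2}[a-zA-Z]*#{2}).*", subject)
print(wrapped.group(1))   # ##address##

# Without the .* and with a find-all call, every placeholder is returned.
placeholders = re.findall(r"#{2}[a-zA-Z]*#{2}", subject)
print(placeholders)       # ['##name##', '##bankAccount##', '##address##']
```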

Go Regexp to Match Characters Between

I have content I am trying to remove from a string
s:=`Hello! <something>My friend</something>this is some <b>content</b>.`
I want to be able to replace <b>content</b> and <something>My friend</something> so that the string is then
`Hello! this is some .`
So basically, I want to be able to remove anything between <.*>
But the problem is that the regex matches <something>My friend</something> this is some <b>content</b> because golang is matching the first < to the very last >

* is a greedy operator meaning it will match as much as it can and still allow the remainder of the regular expression to match. In this case, I would suggest using negated character classes since backreferences are not supported.
s := "Hello! <something>My friend</something>this is some <b>content</b>."
re := regexp.MustCompile("<[^/]*/[^>]*>")
fmt.Println(re.ReplaceAllString(s, ""))

Go's regexp package doesn't support backreferences, so you can't use <(.*?)>.*?</\1> like you would in Perl.
However if you don't care if the closing tag matches you can use:
<.*?/.*?>
Just saw your update: .* is a greedy operator, so it will match everything in between. You have to use non-greedy matching (aka .*?).
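The three patterns discussed in these answers can be compared side by side. The sketch below uses Python's re.sub rather than Go, purely to show what each pattern removes from the sample string.

```python
import re

s = "Hello! <something>My friend</something>this is some <b>content</b>."

# Greedy: runs from the first "<" to the very last ">".
greedy = re.sub(r"<.*>", "", s)
print(greedy)    # Hello! .

# Negated character classes: stop at the first "/" and then the first ">".
negated = re.sub(r"<[^/]*/[^>]*>", "", s)
print(negated)   # Hello! this is some .

# Lazy quantifiers: same result for this input.
lazy = re.sub(r"<.*?/.*?>", "", s)
print(lazy)      # Hello! this is some .
```

Neither of the last two patterns checks that the closing tag matches the opening one; they only rely on the first "/" after a "<".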

Regex to get all character to the right of first space?

I am trying to craft a regular expression that will match all characters after (but not including) the first space in a string.
Input text:
foo bar bacon
Desired match:
bar bacon
The closest thing I've found so far is:
\s(.*)
However, this matches the first space in addition to "bar bacon", which is undesirable. Any help is appreciated.

You can use a positive lookbehind:
(?<=\s).*
Although it looks like you've already put a capturing group around .* in your current regex, so you could just try grabbing that.
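Both options from this answer, sketched in Python for the sample input (the variable names are illustrative only):

```python
import re

text = "foo bar bacon"

# Look-behind: the space is required but is not part of the match.
lookbehind = re.search(r"(?<=\s).*", text).group(0)
print(lookbehind)   # bar bacon

# Capturing group: the space is matched, but group 1 excludes it.
captured = re.search(r"\s(.*)", text).group(1)
print(captured)     # bar bacon
```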

I'd prefer to use [[:blank:]] for it, as it doesn't match newlines, just in case we're targeting multi-line input. It also works in flavors that don't support \s.
(?<=[[:blank:]]).*

You don't need look behind.
my $str = 'now is the time';
# Non-greedily match up to the first space, and then get everything after in a group.
$str =~ /^.*? +(.+)/;
my $right_of_space = $1; # Keep what is in the group in parens
print "[$right_of_space]\n";

You can also try this
(?s)(?<=\S*\s+).*
or
(?s)\S*\s+(.*)  // group 1 has your match
With (?s) . would also match newlines

How does pattern matching work in Perl?

I want to know how pattern matching works in Perl.
My code is:
my $var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos";
if($var =~ m/(.*)\,(.*)/sgi)
{
print "$1\n$2";
}
I learnt that the first occurrence of the comma should be matched, but here the last occurrence is being matched. The output I got is:
VP KDC T. 20, pgcet. 5, Ch. 415
Refs %50 Annos
Can someone please explain to me how this matching works?

From docs:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match
So, first (.*) will take as much as possible.
A simple workaround is to use the non-greedy quantifier *?. Alternatively, match everything except a comma rather than every character: ([^,]*).
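To make the difference concrete, here is a short Python sketch that runs the greedy pattern and both workarounds on the string from the question. Python's re module follows the same greedy-by-default rule quoted from the Perl docs.

```python
import re

var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos"

greedy = re.search(r"(.*),(.*)", var)       # first group runs to the last comma
lazy = re.search(r"(.*?),(.*)", var)        # first group stops at the first comma
negated = re.search(r"([^,]*),(.*)", var)   # first group cannot contain a comma

print(greedy.group(1))    # VP KDC T. 20, pgcet. 5, Ch. 415
print(lazy.group(1))      # VP KDC T. 20
print(negated.group(1))   # VP KDC T. 20
```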

Greedy and Ungreedy Matching
Perl regular expressions normally match the longest string possible.
For instance:
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";
Run the preceding code, and here's what you get:
ississ
It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
Clearly, the use of the question mark makes the match ungreedy. But there's another problem in that regular expressions always try to match as early as possible.
Source: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
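The "as early as possible" remark is easy to miss, so here is a hedged Python sketch of it. The third pattern, s.*?p, is not from the quoted article; it is chosen here only to show that a lazy match still starts at the leftmost possible position.

```python
import re

text = "mississippi"

greedy = re.search(r"i.*s", text).group(0)
lazy = re.search(r"i.*?s", text).group(0)
print(greedy)     # ississ
print(lazy)       # is

# Lazy does not mean shortest overall: the match starts at the first "s",
# even though a shorter candidate, "sip", exists later in the string.
leftmost = re.search(r"s.*?p", text).group(0)
print(leftmost)   # ssissip
```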

Use question mark in your regex:
if($var =~ m/(.*?)\,(.*)/sgi)
{
print "$1\n$2";
}
So:
(.*)\, means: "match as many characters as you can as long as there will be a comma after them"
(.*?)\, means: "match any characters until you stumble upon a comma"

With (.*)\, you might expect it to match up to the first comma.
But it is greedy, so it matches all the characters it comes across up to the last comma instead of the first.
So the first group matches up to the last comma,
and the second group is the rest of the line.
To avoid the greedy match, add a ? after the *.

We Keep Coding
