How does pattern matching work in Perl?

I want to know how pattern matching works in Perl.
My code is:
my $var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos";
if($var =~ m/(.*)\,(.*)/sgi)
{
print "$1\n$2";
}
I learnt that the first occurrence of the comma should be matched, but here the last occurrence is being matched. The output I got is:
VP KDC T. 20, pgcet. 5, Ch. 415
Refs %50 Annos
Can someone please explain to me how this matching works?

From docs:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match
So, first (.*) will take as much as possible.
A simple workaround is to use the non-greedy quantifier: *?. Alternatively, instead of matching every character, match everything except a comma: ([^,]*).
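A minimal sketch comparing the three forms on the string from the question. The variable names ($greedy_head and so on) are illustrative and not from the original post:

```perl
use strict;
use warnings;

my $var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos";

# Greedy: the first group grabs everything up to the LAST comma.
my ($greedy_head) = $var =~ /(.*),(.*)/s;

# Non-greedy: the first group stops at the FIRST comma.
my ($lazy_head) = $var =~ /(.*?),(.*)/s;

# Negated class: the first group cannot cross a comma at all.
my ($class_head) = $var =~ /([^,]*),(.*)/s;

print "greedy: $greedy_head\n";
print "lazy:   $lazy_head\n";
print "class:  $class_head\n";
```

The negated class is usually the more robust choice, because it cannot be made to cross a comma by backtracking.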

Greedy and Ungreedy Matching
Perl regular expressions normally match the longest string possible.
For instance:
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";
Run the preceding code, and here's what you get:
ississ
It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
Clearly, the use of the question mark makes the match ungreedy. But there's another problem in that regular expressions always try to match as early as possible.
Source: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
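To illustrate the "as early as possible" point, here is a small sketch (the patterns and variable names are my own, not from the quoted page). A lazy quantifier shortens the match from the right, but the match still starts at the leftmost position where it can succeed:

```perl
use strict;
use warnings;

my $text = "mississippi";

# Lazy quantifier, yet the match still begins at the FIRST "i",
# because the engine tries start positions from left to right.
my ($leftmost) = $text =~ /(i.*?p)/;

# Forbidding another "i" inside the match forces the start
# to the "i" closest to the "p".
my ($closest) = $text =~ /(i[^i]*?p)/;

print "leftmost: $leftmost\n";
print "closest:  $closest\n";
```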

Use question mark in your regex:
if($var =~ m/(.*?)\,(.*)/sgi)
{
print "$1\n$2";
}
So:
(.*)\, means: "match as many characters as you can as long as there will be a comma after them"
(.*?)\, means: "match any characters until you stumble upon a comma"

(.*)\, - you might expect that it will match up to the first comma.
But it is greedy, so it matches all the characters it comes across up to the last comma instead of the first comma.
So the first group matches up to the last comma,
and the second group is the rest of the line.
To avoid the greedy match, add a ? after the *.

Related

Lookbehind does not work as expected

I am trying to understand the lookbehind.
This example I am trying doesn't work as I expected. I wanted to try to form a regex that would match John but not John.
The following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*/) {
print "Matches!\n";
}
'
Matches!
matches up to and including . of course. The problem is the following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*(?<![.])/) {
print "Matches!\n";
}
'
Matches!
For the latter I expected that the regex would match John. consuming >.< (the period)
Then at the next position it would look behind and realize that it consumed a period (.) and would reject the match.
Is my understanding wrong? What am I messing up here?
Update:
Same result also for my $var = "John. ";
Update 2:
My question is not about how to match only John and not John.
But to understand how lookbehind works and if it is not supposed to work in this case why.
The * is a quantification operator, not a placeholder. So A* means zero or more A characters. Without any further context, this always matches, e.g. "foo" =~ /J*/ is true.
What you intended to write was /J.*/ which does what you've actually described.
Now let's look what happens when we do "John." =~ /(J.*(?<![.]))/:
The regex engine sees J, which matches.
The next pattern is .*, which matches ohn..
Next the assertion (?<![.]) is tested, which fails.
The regex engine therefore backtracks.
We try .* again, but this time only match ohn.
Next the assertion (?<![.]) is tested, which succeeds.
In the above regex, I enclosed the pattern in a capture group, which we can now read out:
$ perl -E'"John." =~ /(J.*(?<![.]))/ and say "<$1>" or say "No match"'
<John>
It is often more efficient to use a character class instead of assertions and .* quantifications, so that we can avoid backtracking:
/J[^.]*/
However, this is not strictly equivalent to the above regexes.
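A short sketch of where the two differ, using an input of my own choosing ("John.Smith" does not appear in the question):

```perl
use strict;
use warnings;

my $name = "John.Smith";

# .* may cross the period; the look-behind only requires that the
# character just before the END of the match is not a period.
my ($with_assertion) = $name =~ /(J.*(?<![.]))/;

# The negated class stops at the first period.
my ($with_class) = $name =~ /(J[^.]*)/;

print "assertion: $with_assertion\n";
print "class:     $with_class\n";
```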
This regexp:
/John(?![.])/
will match John but not John. It uses a negative look-ahead assertion (rather than look-behind).
If you want to match full names other than 'John', you'll need to be a bit more specific about what you do and don't want to allow in the match, as putting J* will match zero or more J's.
Edit: Obviously I misread the * per @amon's post. Look-ahead vs. look-behind still applies.
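A quick check of the look-ahead version against a few inputs. The test strings are illustrative assumptions, not taken from the question:

```perl
use strict;
use warnings;

my %matched;
for my $input ("John", "John.", "John Smith", "John.Smith") {
    # (?![.]) succeeds only if the next character is not a period
    # (end of string also counts as "not a period").
    $matched{$input} = $input =~ /John(?![.])/ ? 1 : 0;
}

printf "%-12s %s\n", $_, $matched{$_} ? "matches" : "no match"
    for sort keys %matched;
```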

Need help in matching regexp

I have a string, say
my $str = "FILLER-1-1,EQPT:MN,EQPT_MISSING,NSA,04-30,15-07-13,NEND,NA";
I want to match a pattern say
my $pattern = "FILLER-1-1";
I am using the below regexp
$reg = $str =~ /$pattern/;
This is working fine
Now the problem is it is also matching if our string is
FILLER-1-10/FILLER-1-11/FILLER-1-12 so on ...
I don't want to match this. Also, I don't want my regexp to be like
$reg = $str =~ /$pattern\W+/;
This one works for the issue mentioned above, but the \W may or may not be present: in some strings it is there, in others it is not. So I need the regexp to match only FILLER-1-1 without using \W+, and to match FILLER-1-10 only when that is specifically the pattern.
Note: If somebody is doing -(minus) rating to my question, please let me know what's wrong in the code. It will be appreciable if the person write the comment too
As \w matches [a-zA-Z0-9_], you can use the zero-width assertion \b, which marks a change in \w state (called a "word boundary", hence the "b" shortcut):
/FILLER-1-1\b/
This means that the final 1 must be followed by a non-word character or by the end of the string, that is, a change in word state.
It will match
FILLER-1-1.
FILLER-1-1&
FILLER-1-1,
It will not match
FILLER-1-1a
FILLER-1-16
Read more about it here.
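A minimal sketch of the \b approach with the pattern held in a variable, as in the question. The \Q...\E quoting and the sample strings are my additions; quoting is a precaution in case the pattern ever contains regex metacharacters:

```perl
use strict;
use warnings;

my $pattern = "FILLER-1-1";

# \Q...\E quotes metacharacters in the interpolated pattern;
# \b requires a word boundary right after the final "1".
my $re = qr/\Q$pattern\E\b/;

my @candidates = (
    "FILLER-1-1,EQPT:MN",    # followed by a comma
    "FILLER-1-1",            # end of string
    "FILLER-1-10,EQPT:MN",   # followed by a digit
    "FILLER-1-1a",           # followed by a letter
);

my @hits = grep { $_ =~ $re } @candidates;
print "$_\n" for @hits;
```

Note that \b only checks the character after the pattern, so a string such as "FILLER-1-1-2" would still match; tighten the pattern further if that matters for your data.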
If you want to match FILLER at the start of the input (line) followed by two numbers, this simple regex should work:
/^FILLER-\d+-\d+/
^ matches the beginning of the input
\d matches any digit ([0-9])
+ matches at least one, but can match any number
Use the ? quantifier like so:
/FILLER-\d-\d\W?/
The \W? means a non-word character, zero or one time.

Why is Perl lazy when regex matching with * against a group?

In perl, the * is usually greedy, unless you add a ? after it. When * is used against a group, however, the situation seems different. My question is "why". Consider this example:
my $text = 'f fjfj ff';
my (@matches) = $text =~ m/((?:fj)*)/;
print "@matches\n";
# --> ""
@matches = $text =~ m/((?:fj)+)/;
print "@matches\n";
# --> "fjfj"
In the first match, perl lazily prints out nothing, though it could have matched something, as is demonstrated in the second match. Oddly, the behavior of * is greedy as expected when the contents of the group is just . instead of actual characters:
@matches = $text =~ m/((?:..)*)/;
print "@matches\n";
# --> 'f fjfj f'
Note: The above was tested on perl 5.12.
Note: It doesn't matter whether I use capturing or non-capturing parentheses for inside group.
This isn't a matter of greedy or lazy repetition. (?:fj)* is greedily matching as many repetitions of "fj" as it can, but it will successfully match zero repetitions. When you try to match it against the string "f fjfj ff", it will first attempt to match at position zero (before the first "f"). The maximum number of times you can successfully match "fj" at position zero is zero, so the pattern successfully matches the empty string. Since the pattern successfully matched at position zero, we're done, and the engine has no reason to try a match at a later position.
The moral of the story is: don't write a pattern that can match nothing, unless you want it to match nothing.
Perl will match as early as possible in the string (left-most). It can do that with your first pattern by matching zero occurrences of fj at the start of the string.
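One way to see this for yourself is to print where each match starts, using the built-in @- array (the variable names below are illustrative):

```perl
use strict;
use warnings;

my $text = 'f fjfj ff';

# With *, zero repetitions succeed immediately at offset 0.
$text =~ /((?:fj)*)/ or die "unexpected: no match";
my ($star_match, $star_start) = ($1, $-[0]);

# With +, offsets 0 and 1 fail, so the engine moves on to offset 2.
$text =~ /((?:fj)+)/ or die "unexpected: no match";
my ($plus_match, $plus_start) = ($1, $-[0]);

print "star: '$star_match' at offset $star_start\n";
print "plus: '$plus_match' at offset $plus_start\n";
```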

Regex detect if a matched comma(,) does not lie in a regex

I am trying to figure out a way to determine if my matched comma (,) does not lie inside a regex. Basically, I do not want to match my character if it lies in a regex.
The regex i have come up with is ,(?<!.+\/)(?!.+\/) but its not quite working.
Any ideas?
I want to skip /some,regex/ but match any other commas.
Edit:
Live example: http://rubular.com/r/WjrwSnmzyP
Here is the regex that will work for you:
,(?!\s)(?=(?:(?:[^/]*\/){2})*[^/]*$)
Live Demo: http://rubular.com/r/37buDdg1tW
Explanation: It means match a comma that is followed by an EVEN number of forward slashes (/). Hence a comma (,) between 2 slash (/) characters will NOT be matched, and commas outside will be matched (since those are followed by an even number of / characters).
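The same pattern can be tried in Perl. This is only a sketch: the sample string is made up, and it assumes the slashes in the input are balanced and never escaped:

```perl
use strict;
use warnings;

# A comma not followed by whitespace, with an even number of
# slashes between it and the end of the string.
my $comma_outside = qr{,(?!\s)(?=(?:(?:[^/]*\/){2})*[^/]*$)};

my $in = "a,b,/c,d/,e";

# Replace the matched commas with ";" to make them visible.
(my $out = $in) =~ s/$comma_outside/;/g;

print "$out\n";
```

The comma inside /c,d/ is followed by exactly one slash, so the look-ahead fails there and that comma is left alone.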
A curious thing about regular expressions is that if you want to use them to ignore "something" that is within "something else", you need to match that "something else", prefer matches of it, and then either silently discard or reproduce those matches.
For example, in order to remove all commas from a string unless they are in a regular expression literal—
In Perl:
my $s = "/foo,bar/,baz";
$s =~ s{(/(?:[^/\\]|\\.)+/)|,}{$1}g;  # $1 is preferred over \1 on the replacement side
In ECMAScript:
var s = "/foo,bar/,baz";
s = s.replace(/(\/([^\/\\]|\\.)+\/)|,/g, "$1");
or
s = s.replace(new RegExp("(/([^/\\\\]|\\\\.)+/)|,", "g"), "$1");
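Note that I am capturing the match for the regular expression literal in the string value, and reproducing it (\1 or $1) if it matched. (If the other part of the alternation, the standalone comma, matched, the capture group did not take part in the match. Perl interpolates the resulting undefined value as the empty string, with a warning under use warnings, and ECMAScript substitutes the empty string as well, so this simple approach suffices here.)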
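If you run with warnings enabled, a variant along these lines avoids the "uninitialized value" warning for the standalone-comma case. It is a sketch using the /e modifier and an explicit defined check, not the only way to write it:

```perl
use strict;
use warnings;

my $s = "/foo,bar/,baz";

# Either match a whole /.../ literal (captured in $1) or a bare comma.
# With /e the replacement is code: keep the literal, drop the comma.
(my $cleaned = $s) =~ s{(/(?:[^/\\]|\\.)+/)|,}{ defined $1 ? $1 : '' }ge;

print "$cleaned\n";
```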
For further reading I recommend “Mastering Regular Expressions” by Jeffrey E. F. Friedl. Two rather enlightening example chapters, each from a different edition, are available for free online.

Perl regex substitute last occurrence

I have this input:
AB2.HYNN.KABCDSEG.L000.G0001V00
AB2.HYNN.GABCDSEG.L000.G0005V00
I would like to remove the trailing part of the form GXXXXVXX from each string.
When i use this code:
$result =~ s/\.G.*V.*$//g;
print "$result \n";
The result is :
AB2.HYNN.KABCDSEG.L000
AB2.HYNN
It seems that each time the regex finds ".G" it removes everything from there on.
I don't understand.
I would like to have this:
AB2.HYNN.KABCDSEG.L000
AB2.HYNN.GABCDSEG.L000
How can I do this with a regex?
Update:
After talking in the comments, the final solution was:
s/\.G\w+V\w+$//;
In your regex:
s/\.G.*V.*$//g;
those .* are greedy and will match as much as possible. The only requirement you have is that there must be a V after the .G somewhere, so it will truncate the string from the first .G it finds, as long as it is followed by a V. There is no need for the /g modifier here, because any match that occurs will delete the rest of the string (unless the string contains newlines, because . does not match newlines without the /s modifier).
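A small sketch running both substitutions over the two input lines from the question makes the difference visible (the array names are illustrative):

```perl
use strict;
use warnings;

my @inputs = (
    "AB2.HYNN.KABCDSEG.L000.G0001V00",
    "AB2.HYNN.GABCDSEG.L000.G0005V00",
);

my (@greedy, @restricted);
for my $line (@inputs) {
    # .* can cross the dots, so the match starts at the FIRST ".G".
    (my $g = $line) =~ s/\.G.*V.*$//;

    # \w+ cannot cross a dot, so only the final ".G....V.." segment matches.
    (my $r = $line) =~ s/\.G\w+V\w+$//;

    push @greedy,     $g;
    push @restricted, $r;
}

print "greedy:     $_\n" for @greedy;
print "restricted: $_\n" for @restricted;
```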
$result =~ s/\.G\d+V\d+//g;
Works on given input.