Why isn't greedy matching working in a Perl regex group - regex

I am trying to grab only what's BETWEEN the body tags in HTML with a Perl regex (I don't want to include the body tags themselves, so I'm using lookaround groups to exclude them).
Here are some short test subjects:
<body>test1</body>
<body style="bob">test2</body>
So first, simple version I tried was:
(?<=<body>).*(?=</body>)
which returns test1 and an empty string
So then I tried:
(?<=<body).*(?=</body>)
Which now gives a result for both tests, but of course has garbage: ">test1" and " style="bob">test2"
I've now tried every variation of greedy matching in the first version, e.g.: (?<=<body.*>).*(?=</body>)
But it simply will not work! Any time I put the * in there I get errors. Anybody able to help out?

I am trying to grab only what's BETWEEN the body tags
In that case:
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
    if ($line =~ m{ <body [^>]*> (.+) </body> }xs) {
        print "[$1]\n";
    }
}
__DATA__
<body>test1</body>
<body style="bob">test2</body>
<!-- <body class="one"> --><body>This is why you should use an HTML parser</body>
Output:
[test1]
[test2]
[ --><body>This is why you should use an HTML parser]
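The lookbehind attempts in the question fail because a lookbehind generally has to be fixed-width, and <body.*> is not. As an illustrative aside only, Python's re module enforces the same kind of restriction, so the sketch below (Python rather than Perl; the helper names compiles and body_text are invented for this example, and it uses a lazy capture rather than the greedy one above) shows the failure and a capture-group alternative.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Capture group instead of lookbehind: [^>]* absorbs any attributes.
BODY = re.compile(r'<body[^>]*>(.*?)</body>', re.S)

def body_text(html):
    """Return the text between the first <body ...> and the next </body>."""
    m = BODY.search(html)
    return m.group(1) if m else None

print(compiles(r'(?<=<body>).*(?=</body>)'))    # fixed-width lookbehind
print(compiles(r'(?<=<body.*>).*(?=</body>)'))  # variable-width lookbehind
for sample in ('<body>test1</body>', '<body style="bob">test2</body>'):
    print(body_text(sample))
```

The capture group sidesteps the width restriction entirely, which is why the answers in this thread lean on $1 rather than on lookbehind.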

You're looking for
while ($html =~ m{ <body[^>]*> ( (?: (?! </body\b ). )* ) }sxg) {
    say $1;
}

I don't think using $& is efficient. Personally, I'd use capture groups,
but this works pretty well.
/<(body)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>\K[\S\s]*?(?=<\/\1\s*>)/
https://regex101.com/r/EkPkLb/1
Expanded
<
( body ) # (1)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
\K
[\S\s]*?
(?= </ \1 \s* > )
Note that to really find a particular tag, you have to consume all the
previous tags via a (*SKIP)(?!), else your tag could be embedded inside
script literals, comments or invisible content.
I wouldn't worry too much about it.
If you're interested I could post a fairly large proper regex,
but I doubt you'd be interested.

Choosing the best pattern for your data depends on what kind of characters will be contained in your body tags. An additional consideration is whether you want to aim for efficiency or minimal memory.
These are some suitable (or not) patterns for your case:
93 steps   ~<body[^>]*>\K.*(?=</body>)~        # no capture group, no newline matches
105 steps  ~<body[^>]*>\K[\S\s]*?(?=</body>)~  # no capture group, newline matches
87 steps   ~<body[^>]*>(.*)</body>~            # capture group, no newline matches
96 steps   ~<body[^>]*>([\S\s]*?)</body>~      # capture group, newline matches
Here is a Pattern Demo with three samples to show the impact of newline characters in your body text.
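To make the newline distinction concrete, here is a small sketch in Python (not Perl, and not reproducing the step counts above). The sample strings and the capture helper are assumptions made for this illustration; the two patterns are the capture-group variants from the list.

```python
import re

# "." does not match a newline unless DOTALL is set, so this one is single-line only.
GREEDY_DOT = re.compile(r'<body[^>]*>(.*)</body>')
# [\S\s] matches any character, including newlines; *? keeps the match short.
LAZY_ANY = re.compile(r'<body[^>]*>([\S\s]*?)</body>')

def capture(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

one_line = '<body style="bob">test2</body>'
multi_line = '<body>line one\nline two</body>'

print(capture(GREEDY_DOT, one_line))
print(capture(GREEDY_DOT, multi_line))
print(capture(LAZY_ANY, multi_line))
```

If the body can span lines, only the [\S\s] variants (or a dot with the single-line flag) will find it.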

Related

Regex: Capturing first occurrence before lookahead

I'm trying to capture the urls before a particular word. The only trouble is that the word could also be part of the domain.
Examples (I'm trying to capture everything before dinner):
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
I am able to use:
^(.*://.*/)(?=dinner/?)
The trouble I am having is that the lookahead doesn't appear to be lazy enough.
So the following is failing:
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/
as it captures:
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/
I'm both failing to understand why and how to fix my regex.
Perhaps I'm on the wrong track but how can I capture all my examples?
You can use some laziness:
^(.*?:\/\/).*?/(?=dinner/?)
Live demo
By using a .* in the middle of your regex you ate everything until the last colon, where it found a match.
.* in the middle of a regex, by the way, is very bad practice. It can cause horrendous backtracking performance degradation in long strings. .*? is better, since it is reluctant rather than greedy.
Laziness doesn't apply to the lookahead itself: the lookahead is only a check, in your case against a quasi-fixed string.
What you need to make lazy is the subpattern before the lookahead.
^https?:\/\/(?:[^\/]+\/)*?(?=dinner(?:\/|$))
Note: (?:/|$) is like a boundary that ensures the word "dinner" is followed by a slash or the end of the string.
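As a rough check of that point, the following Python sketch runs the question's greedy pattern and the lazy pattern above against the four URLs from the question. The prefix helper and the variable names are assumptions for this example only.

```python
import re

# The question's pattern: both .* are greedy.
GREEDY = re.compile(r'^(.*://.*/)(?=dinner/?)')
# The pattern above: path segments are consumed lazily, one at a time.
LAZY = re.compile(r'^https?://(?:[^/]+/)*?(?=dinner(?:/|$))')

urls = [
    'https://breakfast.example.com/lunch/dinner/',
    'https://breakfast.example.brunch.com:8080/lunch/dinner',
    'http://dinnerdemo.example.com/dinner/',
    'https://breakfast.example.com/lunch/dinner/login.html'
    '?returnURL=https://breakfast.example.com/lunch/dinner/',
]

def prefix(pattern, url):
    """Return the matched prefix (the lookahead itself consumes nothing)."""
    m = pattern.search(url)
    return m.group(0) if m else None

for url in urls:
    print(prefix(LAZY, url))
```

On the last URL the greedy pattern runs on to the second "dinner", while the lazy one stops at the first.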
Your primary flaw is using greedy matching (.*) instead of non-greedy matching (.*?).
The following performs the matching that you desire using perl, but the regex could easily be applied in any language. Note the use of word boundaries around dinner, which might or might not be what you want:
use strict;
use warnings;
while (<DATA>) {
    if (m{^(.*?://.*?/.*?)(?=\bdinner\b)}) {
        print $1, "\n";
    }
}
__DATA__
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
Outputs:
https://breakfast.example.com/lunch/
https://breakfast.example.brunch.com:8080/lunch/
http://dinnerdemo.example.com/
Another way as well.
# Multi-line optional
# ^(?:(?!://).)*://[^?/\r\n]+/(?:(?!dinner)[^?/\r\n]+/)*(?=dinner)
^ # BOL
(?:
(?! :// )
.
)*
://
[^?/\r\n]+ # Domain
/
(?:
(?! dinner ) # Dirs ?
[^?/\r\n]+
/
)*
(?= dinner )
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/
Using python 3.7
import re
s = '''
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
'''
pat = re.compile(r'.+(?=dinner)', re.M)  # .+ rather than .* avoids extra empty matches
mo = pat.findall(s)
for line in mo:
    print(line)  # one URL prefix per line
Print Output:
https://breakfast.example.com/lunch/
https://breakfast.example.brunch.com:8080/lunch/
http://dinnerdemo.example.com/

RegEx: Don't match a certain character if it's inside quotes

Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.
Say I have this string:
some text <tag link="fo>o"> other text
I want to match the whole tag but if I use <[^>]+> it only matches <tag link="fo>.
How can I make sure that a > inside quotes is ignored?
I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.
Regular Expression:
<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>
Online demo:
http://regex101.com/r/yX5xS8
Full Explanation:
I know this regex might be a headache to look at, so here is my explanation:
< # Open HTML tags
[^>]*? # Lazy Negated character class for closing HTML tag
(?: # Open Outside Non-Capture group
(?: # Open Inside Non-Capture group
('|") # Capture group for quotes, backreference group 1
[^'"]*? # Lazy Negated character class for quotes
\1 # Backreference 1
) # Close Inside Non-Capture group
[^>]*? # Lazy Negated character class for closing HTML tag
)* # Close Outside Non-Capture group
> # Close HTML tags
This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.
Regular expression
<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>
Demo
http://regex101.com/r/jO1oQ1
Explanation
< # start of HTML tag
[^'">]* # any non-single, non-double quote or greater than
( # outer group
( # inner group
"[^"]*" # "..."
| # or
'[^']*' # '...'
) #
[^'">]* # any non-single, non-double quote or greater than
)* # zero or more of outer group
> # end of HTML tag
This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.
It is slightly worse than Vasili's solution in that the groups are capturing. If you do not want that, replace ( with (?: in all places. (Just using ( makes the regex shorter and a little more readable.)
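For anyone who wants to try this outside regex101, here is a minimal Python sketch of the non-capturing variant next to the naive pattern from the question. The first_tag helper and the extra test strings are assumptions for illustration.

```python
import re

# Naive pattern from the question: stops at the first ">", even inside quotes.
NAIVE = re.compile(r'<[^>]+>')
# Quote-aware pattern from this answer, with (?: ... ) instead of capturing groups.
QUOTE_AWARE = re.compile(r'''<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>''')

def first_tag(pattern, text):
    """Return the first tag-like match in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

text = 'some text <tag link="fo>o"> other text'
print(first_tag(NAIVE, text))
print(first_tag(QUOTE_AWARE, text))
print(first_tag(QUOTE_AWARE, "<a href='>"))
```

The unterminated quote in the last input means no match at all, which is the behaviour described above.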
(<.+?>[^<]+>)|(<.+?>)
You can write two regexes and then put them together using '|'.
In this case:
(<.+?>[^<]+>) # will match some text <tag link="fo>o"> other text
(<.+?>) # will match some text <tag link="foo"> other text
If the first alternative matches, the second is not tried, so make sure you put the special case first.
If you want this to work with escaped double quotes, try:
/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g
For example:
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
const nextGtMatch = () => ((exec) => {
return exec ? exec.index : -1;
})(gtExp.exec(xml));
And if you're parsing through a bunch of XML, you'll want to set .lastIndex.
gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes

managing and documenting a multiline substitution in Perl

I have recently been learning about the /x modifier from Perl Best Practices, which lets you do cool things like multi-line indentation and documentation:
$txt =~ m/^                       # anchor at beginning of line
    The\ quick\ (\w+)\ fox        # fox adjective
    \ (\w+)\ over                 # fox action verb
    \ the\ (\w+)\ dog             # dog adjective
    (?:                           # whitespace-trimmed comment:
        \s* \# \s*                # whitespace and comment token
        (.*?)                     # captured comment text; non-greedy!
        \s*                       # any trailing whitespace
    )?                            # this is all optional
    $                             # end of line anchor
/x;                               # allow whitespace
However, I was unable to do the equivalent for find/replace substitutions. Is there a similar best practice for managing complex substitutions more effectively?
Edit Take this for an example:
$test =~ s/(src\s*=\s*['"]?)(.*?\.(jpg|gif|png))/${1}something$2/sig;
Is there a similar way that this could be documented using multi-line/whitespace for better readability?
Many thanks
Since you've chosen not to provide an example of something that doesn't work, I'll offer a few guesses at what you might be doing wrong:
Note that the delimiter (in your case /) cannot appear inside any comment within the regex, because it would be taken as the end of the regex. For example, this:
s/foo # this is interesting and/or cool
/bar/x
will not work, because the regex is terminated by the slash between and and or.
Note that /x does not work on the replacement-string, only on the regex itself. For example this:
s/foo/bar # I love the word bar/x
will replace foo with bar # I love the word bar.
If you really want to be able to put comments in the replacement-string, then I suppose you could use a replacement-expression instead, using the /e flag. That would let you use the full syntax of Perl. For example:
s/foo/'bar' # I love the word bar/e
Here is an example that does work:
$test =~
s/
# the regex to replace:
(src\s*=\s*['"]?) # src=' or src=" (plus optional whitespace)
(.*?\.(jpg|gif|png)) # the URI of the JPEG or GIF or PNG image
/
# the string to replace it with:
$1 . # src=' or src=" (unchanged)
'something' . # insert 'something' at the start of the URI
$2 # the original URI
/sigxe;
If we just add the /x, we can break up the regular expression portion easily, including allowing comments.
my $test = '<img src = "http://www.somewhere.com/im/alright/jack/keep/your/hands/off/of/my/stack.gif" />';
$test =~ s/
( src \s* = \s* ['"]? ) # a src attribute ...
( .*?
\. (jpg|gif|png) # to an image file type, either jpeg, gif or png
)
/$1something$2/sigx # put 'something' in front of it
;
You have to use the evaluation switch (/e) if you want to break up the replacement, but splitting the match portion across multiple lines works fine.
Notice that I did not have to separate $1, because $1something is not a valid identifier anyway, so my version of Perl, at least, does not get confused.
For most of my evaluated replacements, I prefer the bracket style of substitution delimiter:
$test =~ s{
( src \s* = \s* ['"]? ) # a src attribute ... '
( .*?
\. (jpg|gif|png) # to an image file type, either jpeg, gif or png
)
}{
$1 . 'something' . $2
}sigxe
;
just to make it look more code-like.
Well
$test =~ s/(src\s*=\s*['"]?) # first group
(.*?\.(jpg|gif|png)) # second group
/${1}something$2/sigx;
should and does work indeed. Of course, you can't do this in the replacement part unless you use something like:
$test =~ s/(src\s*=\s*['"]?) # first group
(.*?\.(jpg|gif|png)) # second group
/
$1 # Get 1st group
. "something" # Append ...
. $2 # Get 2d group
/sigxe;
s/foo/bar/
could be written as
s/
foo # foo
/
"bar" # bar
/xe
/x to allow whitespace in the pattern
/e to allow code in the replacement expression

Match from last occurrence using regex in perl

I have a text like this:
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT rest string
The text is multilined and I need to extract from last occurrence of "*/" until "////RESULT". In this case, the result should be:
select this part on
ly
How to achieve this in perl?
I have attempted \\\*/(.|\n)*////RESULT but that will start from first "*/"
A useful trick in cases like this is to prefix the regexp with the greedy pattern .*, which will try to match as many characters as possible before the rest of the pattern matches. So:
my ($match) = ($string =~ m!^.*\*/(.*?)////RESULT!s);
Let's break this pattern into its components:
^.* starts at the beginning of the string and matches as many characters as it can. (The s modifier allows . to match even newlines.) The beginning-of-string anchor ^ is not strictly necessary, but it ensures that the regexp engine won't waste too much time backtracking if the match fails.
\*/ just matches the literal string */.
(.*?) matches and captures any number of characters; the ? makes it ungreedy, so it prefers to match as few characters as possible in case there's more than one position where the rest of the regexp can match.
Finally, ////RESULT just matches itself.
Since the pattern contains a lot of slashes, and since I wanted to avoid leaning toothpick syndrome, I decided to use alternative regexp delimiters. Exclamation points (!) are a popular choice, since they don't collide with any normal regexp syntax.
Edit: Per discussion with ikegami below, I guess I should note that, if you want to use this regexp as a sub-pattern in a longer regexp, and if you want to guarantee that the string matched by (.*?) will never contain ////RESULT, then you should wrap those parts of the regexp in an independent (?>) subexpression, like this:
my $regexp = qr!\*/(?>(.*?)////RESULT)!s;
...
my ($match) = ($string =~ /^.*$regexp$some_other_regexp/s);
The (?>) causes the pattern inside it to fail rather than accepting a suboptimal match (i.e. one that extends beyond the first substring matching ////RESULT) even if that means that the rest of the regexp will fail to match.
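The greedy-prefix trick is not specific to Perl. Below is a small Python sketch of the same pattern applied to the question's text; the TEXT constant and the segment helper are names made up for this example, and re.S plays the role of the s modifier.

```python
import re

TEXT = """hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT rest string
"""

# Greedy ^.* pushes the match of "*/" as far right as it can go;
# the lazy (.*?) then stops at the first ////RESULT after it.
LAST_COMMENT_TO_RESULT = re.compile(r'^.*\*/(.*?)////RESULT', re.S)

def segment(text):
    """Return the text between the last "*/" and the following ////RESULT."""
    m = LAST_COMMENT_TO_RESULT.search(text)
    return m.group(1) if m else None

print(repr(segment(TEXT)))
```

Note that the captured text keeps its leading space and trailing newline; trim it if you only want the words.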
(?:(?!STRING).)*
matches any number of characters that don't contain STRING. It's like [^a]*, but for strings instead of characters.
You can take shortcuts if you know certain inputs won't be encountered (like Kenosis and Ilmari Karonen did), but this is what matches what you specified:
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ ). )* )
////RESULT
(?: (?! \*/ ). )*
\z
}xs;
If you don't care if */ appears after ////RESULT, the following is the safest:
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ ). )* )
////RESULT
}xs;
You didn't specify what should happen if there are two ////RESULT that follow the last */. The above matches until the last one. If you wanted to match until the first one, you'd use
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ | ////RESULT ). )* )
////RESULT
}xs;
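The difference between the last two patterns only shows up when ////RESULT occurs twice. Here is a hedged Python sketch with a made-up input string (text) and a made-up helper (grab) that contrasts them.

```python
import re

# Stops the capture at the last ////RESULT (no "*/" may occur inside the capture).
UNTIL_LAST = re.compile(r'\*/((?:(?!\*/).)*)////RESULT', re.S)
# Also forbids ////RESULT inside the capture, so it stops at the first one.
UNTIL_FIRST = re.compile(r'\*/((?:(?!\*/|////RESULT).)*)////RESULT', re.S)

def grab(pattern, text):
    """Return group 1 of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

text = 'x */ y */ keep ////RESULT more ////RESULT end'
print(repr(grab(UNTIL_LAST, text)))
print(repr(grab(UNTIL_FIRST, text)))
```

In both cases the first "*/" is skipped, because the capture is not allowed to run across the second one.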
Here's one option:
use strict;
use warnings;
my $string = <<'END';
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT
END
my ($segment) = $string =~ m!\*/([^/]+)////RESULT$!s;
print $segment;
Output:
select this part on
ly

Shortest match issues

I know the ? operator enables "non-greedy" mode, but I am running into a problem I can't seem to get around. Consider a string like this:
my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';
where there are opening and closing tags <a> and </a>, there are keys ABC, DEF and GHI but are surrounded by some other random text. I want to replace the <a>klashsdjDEFasl;jjf</a> with <b>TEST</b> for example. However, if I have something like this:
$str =~ s/<a>.*?DEF.*?<\/a>/<b>TEST<\/b>/;
Even with the non-greedy operators .*?, this does not do what I want. I know why: the first <a> matches the first occurrence in the string, the match then runs all the way up to DEF, and then to the nearest closing </a>. What I want, however, is a way to match the opening <a> and closing </a> closest to "DEF". Currently, I get this as the result:
<b>TEST</b><a>askldhsfGHIasfklhss</a>
Whereas I am looking for something that gives this result:
<a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>
By the way, I am not trying to parse HTML here, I know there are modules to do this, I am simply asking how this could be done.
Thanks,
Eric Seifert
$str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;
The problem is that even with non-greedy matching, Perl is still trying to find the match that starts at the leftmost possible point in the string. Since .*? can match <a> or </a>, that means it will always find the first <a> on the line.
Adding a greedy (.*) at the beginning causes it to find the last possible matching <a> on the line (because .* first grabs the whole line, and then backtracks until a match is found).
One caveat: Because it finds the rightmost match first, you can't use this technique with the /g modifier. Any additional matches would be inside $1, and /g resumes the search where the previous match ended, so it won't find them. Instead, you'd have to use a loop like:
1 while $str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;
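The same loop can be sketched in Python, where subn reports how many replacements were made. This is only an illustration: replace_all_def is a hypothetical helper name, and the second input string in the usage lines is invented to show more than one DEF element.

```python
import re

# Greedy (.*) makes the engine settle on the rightmost <a> that still leads to DEF.
LAST_DEF = re.compile(r'(.*)<a>.*?DEF.*?</a>')

def replace_all_def(s):
    """Replace every <a>...DEF...</a> element, rightmost first, one per pass."""
    while True:
        s, n = LAST_DEF.subn(r'\g<1><b>TEST</b>', s, count=1)
        if n == 0:  # nothing left to replace
            return s

question = ('<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a>'
            '<a>askldhsfGHIasfklhss</a>')
print(replace_all_def(question))
print(replace_all_def('<a>xDEFy</a><a>ABC</a><a>pDEFq</a>'))
```

The loop terminates because each pass removes one DEF and the replacement text does not contain another.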
Instead of a dot which says: "match any character", use what you really need which says: "match any char that is not the start of </a>". This translates into something like this:
$str =~ s/<a>(?:(?!<\/a>).)*DEF(?:(?!<\/a>).)*<\/a>/<b>TEST<\/b>/;
#!/usr/bin/perl
use warnings;
use strict;
my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';
my @collections = $str =~ /<a>.*?(ABC|DEF|GHI).*?<\/a>/g;
print join ", ", @collections;
s{
<a>
(?: (?! </a> ) . )*
DEF
(?: (?! </a> ) . )*
</a>
}{<b>TEST</b>}x;
Basically,
(?: (?! PAT ) . )
is the equivalent of
[^CHARS]
for regex patterns instead of characters.
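To see the effect side by side, here is a short Python sketch (Python rather than Perl, purely for illustration) that applies the question's lazy pattern and the (?: (?! PAT ) . )* form to the question's string. The constant names are assumptions for this example.

```python
import re

s = ('<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a>'
     '<a>askldhsfGHIasfklhss</a>')

# Lazy dots may still run across </a>, so the match starts at the first <a>.
NAIVE = re.compile(r'<a>.*?DEF.*?</a>')
# Each "." is guarded by (?!</a>), so the match cannot cross a closing tag.
TEMPERED = re.compile(r'<a>(?:(?!</a>).)*DEF(?:(?!</a>).)*</a>')

naive_result = NAIVE.sub('<b>TEST</b>', s, count=1)
tempered_result = TEMPERED.sub('<b>TEST</b>', s, count=1)

print(naive_result)
print(tempered_result)
```

Because the guarded form cannot span elements, it should also be safe to use with a global replace, unlike the greedy-prefix approach.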
Based on my understanding, this is what you are looking for.
Using the lazy quantifier ? without the global flag is the answer.
E.g.,
if you had the global flag /g, it would have matched all of the shortest matches, as below.