Regex: Capturing first occurrence before lookahead - regex

I'm trying to capture the urls before a particular word. The only trouble is that the word could also be part of the domain.
Examples (I'm trying to capture everything before dinner):
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
I am able to use:
^(.*://.*/)(?=dinner/?)
The trouble I am having is that the lookahead doesn't appear to be lazy enough, so the following fails:
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/
as it captures:
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/
I'm both failing to understand why and how to fix my regex.
Perhaps I'm on the wrong track but how can I capture all my examples?

You can use some laziness:
^(.*?:\/\/).*?/(?=dinner/?)
Live demo
By using a greedy .* in the middle of your regex, you consumed everything up to the last "://" in the string, and only there did the engine find a match.
.* in the middle of a regex, by the way, is very bad practice. It can cause horrendous backtracking performance degradation in long strings. .*? is better, since it is reluctant rather than greedy.

Laziness does not apply to the lookahead itself: a lookahead is only a check, and in your case it tests a quasi-fixed string.
What you need to make lazy is the subpattern before the lookahead.
^https?:\/\/(?:[^\/]+\/)*?(?=dinner(?:\/|$))
Note: (?:/|$) is like a boundary that ensures the word "dinner" is followed by a slash or the end of the string.
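To see the difference concretely, here is a minimal Python sketch that runs the greedy pattern from the question and the lazy pattern above against the four sample URLs. The variable names are illustrative only, and Python's re module is assumed to behave like the original engine for these constructs.

```python
import re

urls = [
    "https://breakfast.example.com/lunch/dinner/",
    "https://breakfast.example.brunch.com:8080/lunch/dinner",
    "http://dinnerdemo.example.com/dinner/",
    "https://breakfast.example.com/lunch/dinner/login.html"
    "?returnURL=https://breakfast.example.com/lunch/dinner/",
]

# Pattern from the question: both .* are greedy.
greedy = re.compile(r"^(.*://.*/)(?=dinner/?)")
# Pattern from the answer above: the repeated path segment is lazy (*?).
lazy = re.compile(r"^https?://(?:[^/]+/)*?(?=dinner(?:/|$))")

greedy_results = [greedy.match(u).group(1) for u in urls]
lazy_results = [lazy.match(u).group(0) for u in urls]

for greedy_prefix, lazy_prefix in zip(greedy_results, lazy_results):
    print("greedy:", greedy_prefix)
    print("lazy:  ", lazy_prefix)
```

Both patterns agree on the first three URLs. Only the last one differs: the greedy version runs on to the second "dinner" inside the returnURL parameter.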

Your primary flaw is using greedy matching .* versus non-greedy .*?.
The following performs the matching that you desire using perl, but the regex could easily be applied in any language. Note the use of word boundaries around dinner, which might or might not be what you want:
use strict;
use warnings;
while (<DATA>) {
    if (m{^(.*?://.*?/.*?)(?=\bdinner\b)}) {
        print $1, "\n";
    }
}
__DATA__
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
Outputs:
https://breakfast.example.com/lunch/
https://breakfast.example.brunch.com:8080/lunch/
http://dinnerdemo.example.com/

Another way as well.
# Multi-line optional
# ^(?:(?!://).)*://[^?/\r\n]+/(?:(?!dinner)[^?/\r\n]+/)*(?=dinner)
^ # BOL
(?:
(?! :// )
.
)*
://
[^?/\r\n]+ # Domain
/
(?:
(?! dinner ) # Dirs ?
[^?/\r\n]+
/
)*
(?= dinner )
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/
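The same expanded pattern can be tried in Python with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is only a sketch: the comments are my reading of each part, and the text variable simply holds the four sample lines.

```python
import re

pattern = re.compile(
    r"""
    ^                      # beginning of line
    (?: (?!://) . )*       # scheme: any characters up to the first "://"
    ://
    [^?/\r\n]+ /           # domain (and optional port), then "/"
    (?:
        (?!dinner)         # stop as soon as a segment starts with "dinner"
        [^?/\r\n]+ /       # one directory segment
    )*
    (?=dinner)             # the next thing must be "dinner"
    """,
    re.VERBOSE | re.MULTILINE,
)

text = "\n".join([
    "https://breakfast.example.com/lunch/dinner/",
    "https://breakfast.example.brunch.com:8080/lunch/dinner",
    "http://dinnerdemo.example.com/dinner/",
    "https://breakfast.example.com/lunch/dinner/login.html"
    "?returnURL=https://breakfast.example.com/lunch/dinner/",
])

prefixes = [m.group(0) for m in pattern.finditer(text)]
for prefix in prefixes:
    print(prefix)
```

Because nothing in this pattern can cross a "/" or "?" without being checked against "dinner", it does not depend on lazy quantifiers at all.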

Using python 3.7
import re
s = '''
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
'''
# .+ rather than .* so that findall() does not also return empty matches
pat = re.compile(r'.+(?=dinner)', re.M)
mo = re.findall(pat, s)
for line in mo:
    print(line)
Print Output:
https://breakfast.example.com/lunch/
https://breakfast.example.brunch.com:8080/lunch/
http://dinnerdemo.example.com/

Related

Non-Capturing and Capturing Groups - The right way

I'm trying to match an array of elements preceded by a specific string in a line of text. For example, match all pets in the text below:
fruits:apple,banana;pets:cat,dog,bird;colors:green,blue
/(?:pets:)(\w+[,|;])+/g
Using the given regex I only could match the last word "bird"
Can anybody help me to understand the right way of using Non-Capturing and Capturing Groups?
Thanks!
First, let's talk about capturing and non-capturing groups:
(?:...) is the non-capturing version: the text must be present for the match, but you don't need to keep it.
(...) is the capturing version: this is the text you want to extract.
So:
(?:pets:) searches for "pets:" without capturing it; after that point, you want to capture (if I've understood correctly).
So try (?:pets:)([a-zA-Z,]+); which searches for "pets:" (without capturing it), captures the list, and stops at the first ";" (which is not captured either).
Result:
Match 1 : cat,dog,bird
A better solution exists with 1 match == 1 pet.
Since you want to have each pet in a separate match and you are using PCRE, \G is, as suggested by Wiktor, a decent option:
(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)
Explanation:
1st Alternative (?:pets:) to find the start of the pattern
2nd Alternative \G(?!^)(\w+)(?:[,;]|$)
\G asserts position at the end of the previous match or the start of the string for the first match
Negative Lookahead (?!^) to assert that the Regex does not match at the start of the string
(\w+) to match each pet
Non-capturing group (?:[,;]|$) used as a delimiter (matches a single character in the list ,; or, with $, asserts position at the end of the string)
Perl Code Sample:
use strict;
use Data::Dumper;
my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';
my $regex = qr/(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)/mp;
my @result = ();
while ( $str =~ /$regex/g ) {
    # $1 is undefined when only the "pets:" alternative matched
    if (defined $1 && $1 ne '') {
        #print "$1\n";
        push @result, $1;
    }
}
print Dumper(\@result);
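For engines without \G (Python's built-in re module is one), a two-step approach gives the same one-item-per-pet result. This is a swapped-in technique, not the \G solution above; the helper name list_items is hypothetical. The sketch also shows why the original pattern yields only the last pet: a repeated capturing group keeps just its final iteration.

```python
import re

text = "fruits:apple,banana;pets:cat,dog,bird;colors:green,blue"

# The pattern from the question: group 1 is overwritten on every repetition,
# so only the last iteration survives.
last_only = re.search(r"(?:pets:)(\w+[,|;])+", text).group(1)

def list_items(line, key):
    """Return the comma-separated items that follow 'key:' in line."""
    # Step 1: isolate the segment after "key:" up to the next ";" or the end.
    segment = re.search(r"(?:^|;)" + re.escape(key) + r":([^;]*)", line)
    if segment is None:
        return []
    # Step 2: one match per item inside that segment.
    return re.findall(r"\w+", segment.group(1))

pets = list_items(text, "pets")
print(last_only)
print(pets)
```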

Why isnt greedy matching working in perl regex group

I am trying to grab only what's BETWEEN the body tags in HTML with a Perl regex (so I don't want to include the actual body tags, hence the lookaround groups to leave the tags out of the match).
Here are some short test subjects:
<body>test1</body>
<body style="bob">test2</body>
So first, simple version I tried was:
(?<=<body>).*(?=</body>)
which returns "test1" and an empty string.
So then I tried:
(?<=<body).*(?=</body>)
Which now gives a result for both tests, but of course has garbage: ">test1" and " style="bob">test2"
I've tried every variation of greedy match now in the first version, e.g.: (?<=<body.*>).*(?=</body>)
But it simply will not work! Any time I put the * in there I get errors. Anybody able to help out?
I am trying to grab only whats BETWEEN the body tags
In that case:
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
    if ($line =~ m{ <body [^>]*> (.+) </body> }xs) {
        print "[$1]\n";
    }
}
__DATA__
<body>test1</body>
<body style="bob">test2</body>
<!-- <body class="one"> --><body>This is why you should use an HTML parser</body>
Output:
[test1]
[test2]
[ --><body>This is why you should use an HTML parser]
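As the third test line suggests, a parser copes with the commented-out tag where the regex does not. Below is a minimal sketch using Python's standard html.parser module; the class name BodyText and the helper body_text are hypothetical, and the sketch collects only text nodes, not nested markup.

```python
from html.parser import HTMLParser

class BodyText(HTMLParser):
    """Collects the text found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Comments never reach this method, so a <body> inside
        # <!-- ... --> is ignored.
        if self.in_body:
            self.chunks.append(data)

def body_text(html):
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print(body_text('<!-- <body class="one"> --><body>This is why you should use an HTML parser</body>'))
```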
You're looking for
while ($html =~ / <body[^>]*> ( (?: (?! </body\b ). )* ) /sxg) {
    say $1;
}
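The same tempered pattern carries over to Python almost unchanged. In this sketch re.DOTALL plays the role of Perl's /s flag, and the third sample (a body spanning two lines) is my own addition.

```python
import re

# Capture everything after the opening tag, one character at a time,
# as long as the next characters do not start a closing </body tag.
body_re = re.compile(r"<body[^>]*>((?:(?!</body\b).)*)", re.DOTALL)

samples = [
    "<body>test1</body>",
    '<body style="bob">test2</body>',
    "<body>line one\nline two</body>",
]

contents = [body_re.search(s).group(1) for s in samples]
for content in contents:
    print(repr(content))
```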
I don't think using $& is efficient. Personally, I'd use capture groups, but this works pretty well.
/<(body)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>\K[\S\s]*?(?=<\/\1\s*>)/
https://regex101.com/r/EkPkLb/1
Expanded
<
( body ) # (1)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
\K
[\S\s]*?
(?= </ \1 \s* > )
Note that to really find a particular tag, you have to consume all the
previous tags via a (*SKIP)(?!), else your tag could be embedded inside
script literals, comments or invisible content.
I wouldn't worry too much about it.
If you're interested I could post a fairly large proper regex,
but I doubt you'd be interested.
Choosing the best pattern for your data depends on what kind of characters will be contained in your body tags. An additional consideration is whether you want to aim for efficiency or minimal memory.
These are some suitable (or not) patterns for your case:
93 steps   ~<body[^>]*>\K.*(?=</body>)~        # no capture group, no newline matches
105 steps  ~<body[^>]*>\K[\S\s]*?(?=</body>)~  # no capture group, newline matches
87 steps   ~<body[^>]*>(.*)</body>~            # capture group, no newline matches
96 steps   ~<body[^>]*>([\S\s]*?)</body>~      # capture group, newline matches
Here is a Pattern Demo with three samples to show the impact of newline characters in your body text.

How to change this regex without lookbehind check

It should match the substring between zero or more leading and trailing spaces. C++11 does not have lookbehind. Is it possible to rewrite this regex? Or do I need to install Boost to get "full" regex power?
The regex: ^\s*(.*(?<! ))\s*$
UPDATE: match in backreference!
You can make the inner * lazy by using .*? instead, which makes it match as few characters as possible while still giving you a match. This allows the last \s* to consume all the spaces:
>>> re.match(r'^\s*(.*?)\s*$', ' asdf asdf ').group(1)
'asdf asdf'

Perl : Decoding Regex

I would highly appreciate if somebody could help me understand the following.
=~/(?<![\w.])($val)(?![\w.])/gi)
This is what I picked up, but I don't understand it.
Lookaround: (?=a) for a lookahead, ?! for negative lookahead, or ?<= and ?<! for lookbehinds (positive and negative, respectively).
The regex seems to search for $val (i.e. string that matches the contents of the variable $val) not surrounded by word characters or dots.
Putting $val into parentheses remembers the corresponding matched part in $1.
See perlre for details.
Note that =~ is not part of the regex, it's the "binding operator".
Similarly, gi) is part of something bigger. g means the matching happens globally, which has different effects based on the context the matching occurs in, and i makes the match case insensitive (which could only influence $val here). The whole expression was in parentheses, probably, but we can't see the opening one.
Read (?<!PAT) as "not immediately preceded by text matching PAT".
Read (?!PAT) as "not immediately followed by text matching PAT".
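A small Python sketch of the same idea may help. The function name find_standalone is hypothetical, and unlike the Perl original (which interpolates $val as a regex) this version escapes the value with re.escape, so it is treated as literal text.

```python
import re

def find_standalone(val, text):
    """Find val where it is not touching a word character or a dot."""
    pattern = re.compile(
        r"(?<![\w.])(" + re.escape(val) + r")(?![\w.])",
        re.IGNORECASE,  # the /i flag in the Perl version
    )
    # finditer walks the whole string, like /g in Perl.
    return [m.group(1) for m in pattern.finditer(text)]

hits = find_standalone("dog", "dog hotdog dog.house my.dog Dog")
print(hits)
```

Only the first and last words qualify: "hotdog" has a word character before the match, "dog.house" has a dot after it, and "my.dog" has a dot before it.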
I use these sites to help with testing and learning and decoding regex:
https://regex101.com/: This one dissects and explains the expression the best IMO.
http://www.regexr.com/
Define $val, then watch the regex engine work with rxrx, a command-line REPL and wrapper for Regexp::Debugger.
It shows output like this, but in color:
Matched
|
VVV
/(?<![\w.])(dog)(?![\w.])/
|
V
'The quick brown fox jumps over the lazy dog'
^^^
[Visual of regex at 'rxrx' line 0] [step: 189]
It also gives descriptions like this
(?<! # Match negative lookbehind
[\w.] # Match any of the listed characters
) # The end of negative lookbehind
( # The start of a capturing block ($1)
dog # Match a literal sequence ("dog")
) # The end of $1
(?! # Match negative lookahead
[\w.] # Match any of the listed characters
) # The end of negative lookahead

Why does this regular expression match?

I have this regex:
(?<!Sub ).*\(.*\)
And I'd like it to match this:
MsgBox ("The total run time to fix AREA and TD fields is: " & =imeElapsed & " minutes.")
But not this:
Sub ChangeAreaTD()
But somehow I still match the one that starts with Sub... does anyone have any idea why? I thought I'd be excluding "Sub " by doing
(?<!Sub )
Any help is appreciated!
Thanks.
Do this:
^MsgBox .*\(.*\)
The problem is that a negative lookbehind does not guarantee the beginning of a string. It will match anywhere.
However, adding a ^ character at the beginning of the regex does guarantee the beginning of the string. Then, change Sub to MsgBox so it only matches strings that begin with MsgBox
Your regex (?<!Sub ).*\(.*\), taken apart:
(?<! # negative look-behind
Sub # the string "Sub " must not occur before the current position
) # end negative look-behind
.* # anything ~ matches up to the end of the string!
\( # a literal "(" ~ causes the regex to backtrack to the last "("
.* # anything ~ matches up to the end of the string again!
\) # a literal ")" ~ causes the regex to backtrack to the last ")"
So, with your test string:
Sub ChangeAreaTD()
The look-behind is fulfilled immediately (right at position 0).
The .* travels to the end of the string after that.
Because of this .*, the look-behind never really makes a difference.
You were probably thinking of
(?<!Sub .*)\(.*\)
but it is very unlikely that variable-length look-behind is supported by your regex engine.
So what I would do is this (since variable-length look-ahead is widely supported):
^(?!.*\bSub\b)[^(]+\(([^)]+)\)
which translates as:
^ # At the start of the string,
(?! # do a negative look-ahead:
.* # anything
\b # a word boundary
Sub # the string "Sub"
\b # another word boundary
) # end negative look-ahead. If not found,
[^(]+ # match anything except an opening paren ~ to prevent backtracking
\( # match a literal "("
( # match group 1
[^)]+ # match anything up to a closing paren ~ to prevent backtracking
) # end match group 1
\) # match a literal ")".
and then go for the contents of match group 1.
However, regex generally is hideously ill-suited for parsing code. This is true for HTML the same way it is true for VB code. You will get wrong matches even with the improved regex. For example here, because of the nested parens:
MsgBox ("The total run time to fix all fields (AREA, TD) is: ...")
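The behaviour described above can be checked with a short Python sketch. The helper call_args is hypothetical and the first test line is a shortened stand-in for the MsgBox call; the other two lines come from this thread.

```python
import re

call_re = re.compile(r"^(?!.*\bSub\b)[^(]+\(([^)]+)\)")

def call_args(line):
    """Return the contents of match group 1, or None if the line is rejected."""
    m = call_re.search(line)
    return m.group(1) if m else None

simple = call_args('MsgBox ("done")')
declaration = call_args("Sub ChangeAreaTD()")
nested = call_args('MsgBox ("The total run time to fix all fields (AREA, TD) is: ...")')

print(simple)
print(declaration)
print(nested)
```

The last result stops at the first closing paren, which is exactly the kind of wrong match that nested parentheses cause.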
You have a backtracking problem here. The first .* in (?<!Sub ).*\(.*\) can match ChangeAreaTD or hangeAreaTD. In the latter case, the previous 4 characters are ub C, which does not match Sub. As the lookbehind is negated, this counts as a match!
Just adding a ^ to the beginning of your regex will not help you, as a look-behind is a zero-width assertion. ^(?<!MsgBox ) would be looking for the start of a line that is not immediately preceded by "MsgBox ", which says nothing about the line itself. What you need to do instead is ^(?!Sub )(.*\(.*\)). This can be interpreted as "Starting at the beginning of a string, make sure it does not start with Sub. Then, capture everything in the string if it looks like a method call".
A good explanation of how regex engines parse lookaround can be found here.
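A quick way to confirm the difference is to run both patterns side by side. This is a minimal Python sketch; the MsgBox line is a simplified stand-in for the one in the question.

```python
import re

lookbehind = re.compile(r"(?<!Sub ).*\(.*\)")    # pattern from the question
lookahead = re.compile(r"^(?!Sub )(.*\(.*\))")   # anchored negative lookahead

declaration = "Sub ChangeAreaTD()"
call = 'MsgBox ("The total run time is: " & elapsed & " minutes.")'

# The look-behind succeeds at position 0, where nothing precedes the match,
# so the declaration is still matched in full.
behind_match = lookbehind.search(declaration)
# The lookahead inspects the text that follows ^ and rejects the declaration.
ahead_declaration = lookahead.search(declaration)
ahead_call = lookahead.search(call)

print(behind_match.group(0))
print(ahead_declaration)
print(ahead_call.group(1))
```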
If you want to match just the function call, not the declaration, then the part before the bracket should not match just any characters (.*), but rather identifier characters followed by optional spaces. Thus:
(?<!Sub )[a-zA-Z][a-zA-Z0-9_]* *\(.*\)
The identifier may need more tokens depending on the language you're matching.