Excluding the positive lookahead from the capture group - regex

I have the following text
<root>
<path>/my/data</path>
<paths>/global/data</paths>
</root>
and I'm trying to get a regex capture group for /my/data/ and /global/data only. I tried this:
^\s*(?=<path>|<paths>)(.*)$
but I don't understand why the (.*) groups are:
<path>/my/data</path>
<paths>/global/data</paths>
Is there any way to exclude the positive lookahead from the capture group?

The .* consumes the <path> and <paths> that are checked for with your lookahead. Look, (?=<path>|<paths>)(.*) in your regex first checks if there is <path> or <paths> immediately to the right of the current location and if there is, (.*) readily consumes (=adds the matched text to the overall match value and advances the regex index to the end of the current subpattern match) the <path> or <paths> since .* matches zero or more chars other than line break chars, as many as possible.
Make the lookahead pattern consuming:
^\s*(?:<path>|<paths>)(.*)$
See the regex demo.
Or, remove the alternation and contract the pattern to:
^\s*<paths?>(.*)$
See this regex demo. Here, <paths?> matches <path, then an optional s char and then a >.

(?:(?<=<path>)|(?<=<paths>))([^<]*)
(?< means Lookbehind and works in PCRE, Javascript, Java, Python, ...
Regex101 Test

Related

RegEx: Excluding a pattern from the match

I know some basics of the RegEx but not a pro in it. And I am learning it. Currently, I am using the following very very simple regex to match any digit in the given sentence.
/d
Now, I want that, all the digits except some patterns like e074663 OR e123444 OR e7736 should be excluded from the match. So for the following input,
Edit 398e997979 the Expression 9798729889 & T900980980098ext to see e081815 matches. Roll over matches or e081815 the expression e081815 for details.e081815 PCRE & JavaScript flavors of RegEx are e081815 supported. Validate your expression with Tests mode e081815.
Only bold digits should be matched and not any e081815. I tried the following without the success.
(^[e\d])(\d)
Also, going forward, some more patterns needs to be added for exclusion. For e.g. cg636553 OR cg(any digits). Any help in this regards will be much appreciated. Thanks!
Try this:
(?<!\be)(?<!\d)\d+
Test it live on regex101.com.
Explanation:
(?<!\be) # make sure we're not right after a word boundary and "e"
(?<!\d) # make sure we're not right after a digit
\d+ # match one or more digits
If you want to match individual digits, you can achieve that using the \G anchor that matches at the position after a successful match:
(?:(?<!\be)(?<=\D)|\G)\d
Test it here
Another option is to use a capturing group with lookarounds
(?:\b(?!e|cg)|(?<=\d)\D)[A-Za-z]?(\d+)
(?: Non capture group
\b(?!e|cg) Word boundary, assert what is directly to the right is not e or cg
| Or
(?<=\d)\D Match any char except a digit, asserting what is directly on the left is a digit
) Close group
[A-Za-z]? Match an optional char a-zA-Z
(\d+) Capture 1 or more digits in group 1
Regex demo

Regular expression to search for specific Referer in HTTP Header

I need to create a regular expression to match everything except a specific URL for a given Referer. I currently have it to match but can't reverse it and create the negative for it.
What I currently have:
Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?
In the list below:
Referer:http://www.test.online/
Referer:https://www.test.online/
Referer:https://www.test.tv/
Referer:https://www.blah.com/
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
It will match:
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
However, I would like it to match everything except for those.
This is for our WAF so unfortunately are restricted on the usage which can only be fulfilled searching for the HTTP Header being passed back.
Try this regex:
^(?!.*Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?).*$
A good way to negate your regex is to use negative lookahead.
Explanation:
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Working example: https://regex101.com/r/QJfeBB/1
You could use an anchor ^ to assert the start of the string and use a negative lookahead to assert what is on the right is not what you want to match.
Note that you have to escape the dot to match it literally and you could omit the last part (\/.*)?.
If you don't use the capturing groups for later use you might also turn those into non capturing groups (?:) instead.
^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$
regex101 demo
About the pattern
^ Start of the string
(?! Negative lookahead to assert what is on the right does not match
Referer:(https?(:\/\/))?(www\.)?test\.com Match your pattern
) Close negative lookahead
.+ Match any char except a newline 1+ times
$ Assert end of the string

Negative Lookahead: trying to match one word and negate following words

I have a regex like
^.*\bfrost.*(?!flakes|snowman).*$
I am testing it against the following lines:
frosted flakes
frosty snowman
frost, jack
See this Regex.101 demo.
I only want the third expression to match, but all three are matching.
You should move the second .* into the lookahead, e.g.
^.*\bfrost(?!.*(?:flakes|snowman)).*$
Or
^.*\bfrost(?!.*flakes|.*snowman).*$
See the regex demo
In the original regex, the lookahead is located after a .* and whenever the lookahead returns false, the regex engine can backtrack and still match the string in another way, a location that is not immediately followed with snowman or flakes. When you put .* into the lookahead these two words do not have to appear immediately to the right of the current location.

I need to exclude word from regular expression

I have this regexp:
^[a-z0-9]+([.\-][a-z0-9]+)*$
I need exclude from match only one word "www".
I tried the negative lookahead but without a success.
Use a negative lookahead like this:
^(?!www$)[a-z0-9]+([.-][a-z0-9]+)*$
^^^^^^^^
This will not match a string equal to www.
See the regex demo
If you want to fail a match with strings that contain -www- or .www., use
^(?!.*\bwww\b)[a-z0-9]+([.-][a-z0-9]+)*$
See another regex demo. This pattern contains a (?!.*\bwww\b) lookahead that fails the whole match if there is a www somewhere inside the string and it has no digits or letters round it due to \b word boundaries.

Negative lookahead with capturing groups

I'm attempting this challenge:
https://regex.alf.nu/4
I want to match all strings that don't contain an ABBA pattern.
Match:
aesthophysiology
amphimictical
baruria
calomorphic
Don't Match
anallagmatic
bassarisk
chorioallantois
coccomyces
abba
Firstly, I have a regex to determine the ABBA pattern.
(\w)(\w)\2\1
Next I want to match strings that don't contain that pattern:
^((?!(\w)(\w)\2\1).)*$
However this matches everything.
If I simplify this by specifying a literal for the negative lookahead:
^((?!agm).)*$
The the regex does not match the string "anallagmatic", which is the desired behaviour.
So it looks like the issue is with me using capturing groups and back-references within the negative lookahead.
^(?!.*(.)(.)\2\1).+$
^^
You can use a lookahead here.See demo.The lookahead you created was correct but you need add .* so that it cannot appear anywhere in the string.
https://regex101.com/r/vV1wW6/39
Your approach will also work if you make the first group non capturing.
^(?:(?!(\w)(\w)\2\1).)*$
^^
See demo.It was not working because \2 \1 were different than what you intended.In your regex they should have been \3 and \2.
https://regex101.com/r/vV1wW6/40