Negative lookahead with capturing groups - regex

I'm attempting this challenge:
https://regex.alf.nu/4
I want to match all strings that don't contain an ABBA pattern.
Match:
aesthophysiology
amphimictical
baruria
calomorphic
Don't Match
anallagmatic
bassarisk
chorioallantois
coccomyces
abba
Firstly, I have a regex to determine the ABBA pattern.
(\w)(\w)\2\1
Next I want to match strings that don't contain that pattern:
^((?!(\w)(\w)\2\1).)*$
However this matches everything.
If I simplify this by specifying a literal for the negative lookahead:
^((?!agm).)*$
Then the regex does not match the string "anallagmatic", which is the desired behaviour.
So it looks like the issue is with me using capturing groups and back-references within the negative lookahead.

^(?!.*(.)(.)\2\1).+$
    ^^
You can use a lookahead here. See demo. The lookahead you created was correct, but you need to add .* so that the pattern cannot appear anywhere in the string.
https://regex101.com/r/vV1wW6/39
Your approach will also work if you make the first group non capturing.
^(?:(?!(\w)(\w)\2\1).)*$
  ^^
See demo. It was not working because \2 and \1 referred to different groups than you intended. With the outer capturing group in your regex, they should have been \3 and \2.
https://regex101.com/r/vV1wW6/40
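
As a quick check, the two corrected patterns can be run against the word lists from the question. This is a minimal sketch using Python's re module; the helper name accepted is made up for illustration. The original pattern is left out because Python refuses to compile a backreference to a group that is still open.

```python
import re

# The two corrected patterns from the answer above.
LOOKAHEAD_ONCE = re.compile(r"^(?!.*(.)(.)\2\1).+$")
TEMPERED = re.compile(r"^(?:(?!(\w)(\w)\2\1).)*$")

should_match = ["aesthophysiology", "amphimictical", "baruria", "calomorphic"]
should_not_match = ["anallagmatic", "bassarisk", "chorioallantois",
                    "coccomyces", "abba"]


def accepted(pattern, words):
    """Return the words that the pattern matches (hypothetical helper)."""
    return [w for w in words if pattern.match(w)]


for pattern in (LOOKAHEAD_ONCE, TEMPERED):
    # Both patterns should keep only the words without an ABBA sequence.
    print(pattern.pattern, accepted(pattern, should_match + should_not_match))
```

Both patterns accept the same words here; the first one runs the lookahead once at the start, while the second repeats it at every position.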

Related

Excluding the positive lookahead from the capture group

I have the following text
<root>
<path>/my/data</path>
<paths>/global/data</paths>
</root>
and I'm trying to get a regex capture group for /my/data and /global/data only. I tried this:
^\s*(?=<path>|<paths>)(.*)$
but I don't understand why the (.*) groups are:
<path>/my/data</path>
<paths>/global/data</paths>
Is there any way to exclude the positive lookahead from the capture group?
The .* consumes the <path> and <paths> that your lookahead checks for. A lookahead is zero-width: (?=<path>|<paths>) only checks whether <path> or <paths> appears immediately to the right of the current location, without advancing the regex index. The (.*) that follows starts at that same location and consumes the <path> or <paths> (that is, it adds the matched text to the overall match value and advances the regex index to the end of the current subpattern match), since .* matches zero or more chars other than line break chars, as many as possible.
Make the lookahead pattern consuming:
^\s*(?:<path>|<paths>)(.*)$
See the regex demo.
Or, remove the alternation and contract the pattern to:
^\s*<paths?>(.*)$
See this regex demo. Here, <paths?> matches <path, then an optional s char and then a >.
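
The difference between checking and consuming can be seen in a small Python sketch (re.MULTILINE is assumed so that ^ and $ work per line). Note that with the consuming pattern the opening tag is gone from the group, but (.*) still runs to the end of the line, so the closing tag remains.

```python
import re

text = """<root>
<path>/my/data</path>
<paths>/global/data</paths>
</root>"""

# Lookahead only checks for the tag, so (.*) starts at the tag itself.
with_lookahead = re.findall(r"^\s*(?=<path>|<paths>)(.*)$", text, re.MULTILINE)

# Consuming the tag moves the start of group 1 past it.
consuming = re.findall(r"^\s*<paths?>(.*)$", text, re.MULTILINE)

print(with_lookahead)
print(consuming)
```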
(?:(?<=<path>)|(?<=<paths>))([^<]*)
(?<= means lookbehind and works in PCRE, JavaScript, Java, Python, ...
Regex101 Test
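
A minimal Python sketch of the lookbehind variant follows. Python requires each lookbehind to have a fixed width, which is presumably why the pattern uses two separate lookbehinds instead of one with an alternation inside. Because [^<]* stops at the next <, the closing tags are excluded as well.

```python
import re

text = """<root>
<path>/my/data</path>
<paths>/global/data</paths>
</root>"""

# Two fixed-width lookbehinds, one per tag name.
LOOKBEHIND = re.compile(r"(?:(?<=<path>)|(?<=<paths>))([^<]*)")

paths = LOOKBEHIND.findall(text)
print(paths)
```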

Regex - How to prevent any string that starts with "de" but cannot use lookahead or lookbehind?

I have a regex
[a-zA-Z][a-z]
I have to change this regex so that it does not accept strings that start with "de", "DE", "dE" or "De". I cannot use lookbehind or lookahead because my system does not support them.
There's a solution without a lookahead or lookbehind, but you need to be able to use groups.
The idea there is to create a sort of "honeypot" that will match your negative results and keep only the results that do interest you.
In your case, that would write:
[dD][eE].*|(<your-regex>)
If the proposition is de<anything> (case insensitive here), it will match, but group(1) will be null.
On the other hand, diZ for instance would not match what is before the | and would therefore fall into group(1).
Finally, if the proposition doesn't start with de and doesn't match your regex, well, there will be no groups to get at all.
If you need to be sure that your proposition will match the whole provided string, you can update the regex thus:
^(?:[dD][eE].*|(<your-regex>))$
Note that ?: is not a lookahead of any kind, it serves to mark the group as non-capturing, so that <your-regex> will still be captured by group(1) (would become group(2) otherwise and the capture of a group is not always a transparent operation, performance-wise).
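
The honeypot idea can be sketched in Python as below, with [a-zA-Z][a-z] from the question standing in for <your-regex>. The function name accept is hypothetical, and Python reports an unmatched group as None rather than null.

```python
import re

# Honeypot branch first, the wanted regex second and captured as group 1.
HONEYPOT = re.compile(r"^(?:[dD][eE].*|([a-zA-Z][a-z]))$")


def accept(candidate):
    """Return the wanted text, or None if the candidate is rejected."""
    m = HONEYPOT.match(candidate)
    if m is None:
        return None        # neither branch matched
    return m.group(1)      # None when only the honeypot branch matched


for word in ["ab", "di", "de", "Dexter", "12"]:
    print(word, accept(word))
```

The caller only needs to test the return value; no lookaround is involved anywhere in the pattern.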
Simply ignore those characters:
[a-ce-z][a-df-z][a-gi-kwxyzWZXZ]
Make sure the flag is set to case insensitive. Also, [a-gi-kwxyzWZXZ] can then be modified to [a-gi-kwxyz].
EDIT:
As pointed out in this comment, the regex here won't support other words that start with d but are not followed by e. In this case, negative lookahead is a possible solution:
^(?!de)[a-z]+
This matches anything not starting with "DE" (case insensitive, without look arounds, allowing leading whitespace):
^ *+(?:[^Dd].|.[^Ee])<your regex for rest of input>
See live demo.
The possessive quantifier *+ used for whitespace prevents [^Dd] from being allowed to match a space via backtracking, making this regex hardened against leading spaces.
You can use an alternation excluding matching the d and D from the first character, or exclude matching the e as the second character.
Note that the pattern [a-zA-Z][a-z] matches at least 2 characters, so will the following pattern:
^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*
^ Start of string
(?: Non capture group
[abce-zABCE-Z][a-z] Match a char a-zA-Z without d and D followed by a lowercase char a-z
| or
[a-zA-Z][a-df-z] Match a char a-zA-Z followed by a lowercase char a-z without e
) Close non capture group
.* Match 0+ times any char except a newline
Regex demo
Another option is to use word boundaries \b instead of an anchor ^
\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b
Regex demo
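
The anchored alternation can be tried out with a short Python sketch. The function name starts_ok and the sample words are chosen for illustration only.

```python
import re

# Either the first char is not d/D, or the second char is not e.
NO_DE_PREFIX = re.compile(r"^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*")


def starts_ok(s):
    """True if s starts with two letters allowed by the pattern."""
    return NO_DE_PREFIX.match(s) is not None


for word in ["dog", "each", "Apple", "delta", "De", "DE", "dE"]:
    print(word, starts_ok(word))
```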

Regular expression to search for specific Referer in HTTP Header

I need to create a regular expression to match everything except a specific URL for a given Referer. I currently have it to match but can't reverse it and create the negative for it.
What I currently have:
Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?
In the list below:
Referer:http://www.test.online/
Referer:https://www.test.online/
Referer:https://www.test.tv/
Referer:https://www.blah.com/
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
It will match:
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
However, I would like it to match everything except for those.
This is for our WAF, so unfortunately we are restricted in what we can use: the rule can only be fulfilled by searching the HTTP header being passed back.
Try this regex:
^(?!.*Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?).*$
A good way to negate your regex is to use negative lookahead.
Explanation:
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Working example: https://regex101.com/r/QJfeBB/1
You could use an anchor ^ to assert the start of the string and use a negative lookahead to assert what is on the right is not what you want to match.
Note that you have to escape the dot to match it literally and you could omit the last part (\/.*)?.
If you don't use the capturing groups for later use you might also turn those into non capturing groups (?:) instead.
^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$
regex101 demo
About the pattern
^ Start of the string
(?! Negative lookahead to assert what is on the right does not match
Referer:(https?(:\/\/))?(www\.)?test\.com Match your pattern
) Close negative lookahead
.+ Match any char except a newline 1+ times
$ Assert end of the string
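
The sketch below runs the escaped pattern over the header list from the question in Python. The last check uses a made-up header, Referer:http://testxcom/, to illustrate why the dot should be escaped: the unescaped version treats it as a test.com referer.

```python
import re

NOT_TEST_COM = re.compile(r"^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$")
# The earlier pattern with the unescaped dot, for comparison.
UNESCAPED = re.compile(
    r"^(?!.*Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?).*$")

headers = [
    "Referer:http://www.test.online/",
    "Referer:https://www.test.online/",
    "Referer:https://www.test.tv/",
    "Referer:https://www.blah.com/",
    "Referer:https://www.test.com/",
    "Referer:http://www.test.com/",
    "Referer:http://test.com/",
    "Referer:https://test.com/",
]

# Headers that are NOT a test.com referer.
allowed = [h for h in headers if NOT_TEST_COM.match(h)]
print(allowed)

# Hypothetical header: "." in test.com also matches the "x" here.
lookalike = "Referer:http://testxcom/"
print(bool(UNESCAPED.match(lookalike)), bool(NOT_TEST_COM.match(lookalike)))
```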

I need to exclude word from regular expression

I have this regexp:
^[a-z0-9]+([.\-][a-z0-9]+)*$
I need to exclude only one word, "www", from the match.
I tried a negative lookahead, but without success.
Use a negative lookahead like this:
^(?!www$)[a-z0-9]+([.-][a-z0-9]+)*$
 ^^^^^^^^
This will not match a string equal to www.
See the regex demo
If you want to fail a match with strings that contain -www- or .www., use
^(?!.*\bwww\b)[a-z0-9]+([.-][a-z0-9]+)*$
See another regex demo. This pattern contains a (?!.*\bwww\b) lookahead that fails the whole match if there is a www somewhere inside the string and it has no digits or letters around it due to \b word boundaries.
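
The two lookaheads behave differently on strings that merely contain www, which a short Python sketch makes visible. The sample strings are made up for illustration.

```python
import re

NOT_EXACTLY_WWW = re.compile(r"^(?!www$)[a-z0-9]+([.-][a-z0-9]+)*$")
NO_WWW_LABEL = re.compile(r"^(?!.*\bwww\b)[a-z0-9]+([.-][a-z0-9]+)*$")

samples = ["www", "wwwx", "example", "www.example", "a-www-b"]

# Rejects only the string that is exactly "www".
exact = [s for s in samples if NOT_EXACTLY_WWW.match(s)]
# Rejects any string with "www" as a standalone part.
label = [s for s in samples if NO_WWW_LABEL.match(s)]

print(exact)
print(label)
```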

Trouble with non-capturing groups in regular expression

I'm attempting to capture the 6 digit number in the following:
ObjectID: !nrtdms:0:!session:slonwswtest1:!database:TEST:!folder:ordinary,486150:
I tried the following regex:
\d+(?::$)
attempting to use a non-capturing group to strip the colon out of the returned match, but it returns the colon as in:
486150:
Any ideas what I'm doing wrong?
You want a positive lookahead:
\d+(?=:$)
A non-capturing group is simply a group that cannot be accessed via a backreference; they still are part of the match, nonetheless.
Alternatively, you can use
(\d+):$
and obtain the 1st match group.
You should use a positive lookahead rather than a non-capturing group
\d+(?=:$)
Non-capturing groups are groups that will not create a capture (to be used in backreferences or extracted from the match result). Nonetheless they will match the expression.
What you're looking for is lookahead - to test the expression but exclude it from the match:
\d+(?=:$)
Probably your regex tool is returning the complete match since you don't have any capture group there. Try to enclose the \d+ in a capture group, and find the way to get capture group 1 in your regex tool.
Alternatively, you can also use positive look-ahead:
\d+(?=:$)
And given that you want to capture 6 digits, you can make that explicit:
\d{6}
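
The variants suggested in the answers can be compared side by side in Python, using the line from the question. This is only a sketch; the variable names are arbitrary.

```python
import re

line = ("ObjectID: !nrtdms:0:!session:slonwswtest1:"
        "!database:TEST:!folder:ordinary,486150:")

# Non-capturing group: the colon is still part of the overall match.
non_capturing = re.search(r"\d+(?::$)", line).group(0)

# Lookahead: the colon is checked but not consumed.
lookahead = re.search(r"\d+(?=:$)", line).group(0)

# Capturing group: the colon is matched, but group 1 holds only the digits.
captured = re.search(r"(\d+):$", line).group(1)

# Explicit length: the first run of six digits anywhere in the line.
six_digits = re.search(r"\d{6}", line).group(0)

print(non_capturing, lookahead, captured, six_digits)
```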