Using OR in negative lookahead

Given an input like #1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B2#3=B3, I want to capture what comes after #2= when #3=B3, and also verify that any subsequent #2= contains either the same value that was captured OR the value "ABC".
The patterns that should match are:
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=B2#3=B3
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=ABC#3=B3
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3
The patterns that should not match are:
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=B10#3=B3
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=B1#2=B10#3=B3>>#1=B1#2=B2#3=B3
I am able to do the part where the later value must match the captured string by using a negative lookaround, but I am not able to do the OR part, i.e. also allow #2=ABC when the value does not match the captured one.
https://regex101.com/r/eCYCtg/1

Note that your current regex also matches when a repeated #2= value merely starts with the previously captured value. To prevent that, add # to the negative lookahead: (?!\1#).
To get the behavior you need, add ABC# as an alternative in this lookahead: (?!\1#|ABC#). The inner negative lookahead check now fails (and thus allows the overall match to occur) when the entire #2 value is either ABC or the same value that was captured into Group 1.
You may use
^(?:(?!#2=[^#]*#3=B3(?:[#>]|$)).)*#2=([^#]*)#3=B3(?:[#>]|$)(?!.*#2=(?!\1#|ABC#)[^#]*#3=B3(?:[#>]|$))
See the regex demo.
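To check the pattern outside of a regex tester, here is a minimal Python sketch that runs it against the five inputs from the question. The names PATTERN, SHOULD_MATCH, SHOULD_NOT_MATCH and captured_value are made up for this illustration; only the regex itself comes from the answer.

```python
import re

# The regex from the answer, split across lines only for readability.
PATTERN = re.compile(
    r"^(?:(?!#2=[^#]*#3=B3(?:[#>]|$)).)*"         # skip to the first #2= whose #3 is B3
    r"#2=([^#]*)#3=B3(?:[#>]|$)"                  # capture its value into group 1
    r"(?!.*#2=(?!\1#|ABC#)[^#]*#3=B3(?:[#>]|$))"  # no later value other than \1 or ABC
)

# Inputs copied from the question.
SHOULD_MATCH = [
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=B2#3=B3",
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=ABC#3=B3",
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3",
]
SHOULD_NOT_MATCH = [
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=B10#3=B3",
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=B1#2=B10#3=B3>>#1=B1#2=B2#3=B3",
]


def captured_value(text):
    """Return the group 1 capture, or None when the input is rejected."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


for text in SHOULD_MATCH + SHOULD_NOT_MATCH:
    print(captured_value(text), text)
```

The trailing # in \1# and ABC# is what turns the check into a whole-value comparison: the value always ends right before the next #, so a longer value that only starts with the captured text is still treated as different.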

Related

Why do substrings prevent match with negative lookahead?

Consider the following test data:
x.foo,x.bar
y.foo,y.bar
yy.foo,yy.bar
x.foo,y.bar
y.foo,x.bar
yy.foo,x.bar
x.foo,yy.bar
yy.foo,y.bar
y.foo,yy.bar
I'm attempting to write a regular expression where the string before .foo and the string before .bar are different from each other. The first three items should not match. The other six should.
This mostly works:
^(.+?)\.foo,(?!\1)(.+?)\.bar$
However, it misses on the last one, because y is in match group 1, and thus yy is not matched in match group 2.
Interactive: https://regex101.com/r/Pv5062/1
How can I modify the negative lookahead pattern such that the last item matches as well?

Inline backreferences do not store any context information; they only keep the captured text. You need to specify the context yourself.
You may add a dot after \1 (note the \. inside the lookahead):
^(.+?)\.foo,(?!\1\.)(.+?)\.bar$
Or, even repeat the part after the second (.+?):
^(.+?)\.foo,(?!\1\.bar$)(.+?)\.bar$
Or, if the bar part cannot contain ., you may make it more "generic":
^(.+?)\.foo,(?!\1\.[^.]+$)(.+?)\.bar$
See the regex demo and another regex demo.
The point is: your (?!\1) is not "anchored", so it fails the match whenever the text stored in Group 1 appears immediately to the right of the current location, regardless of the context. To solve this, you need to provide that context. Since the value that can be matched with .+? can contain virtually anything, all you can rely on is the "hardcoded" bits after the lookahead.
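The difference is easy to see by running both patterns over the nine test lines from the question. This is only a sketch in Python; LINES, UNANCHORED, ANCHORED and matching are names chosen for the example.

```python
import re

# Test data from the question: the first three lines must be rejected.
LINES = [
    "x.foo,x.bar",
    "y.foo,y.bar",
    "yy.foo,yy.bar",
    "x.foo,y.bar",
    "y.foo,x.bar",
    "yy.foo,x.bar",
    "x.foo,yy.bar",
    "yy.foo,y.bar",
    "y.foo,yy.bar",
]

# Original attempt: the lookahead only checks what the second name starts with.
UNANCHORED = re.compile(r"^(.+?)\.foo,(?!\1)(.+?)\.bar$")
# Fixed version: the dot after \1 forces a comparison of the whole name.
ANCHORED = re.compile(r"^(.+?)\.foo,(?!\1\.)(.+?)\.bar$")


def matching(pattern):
    """Return the test lines accepted by the given compiled pattern."""
    return [line for line in LINES if pattern.search(line)]


print("unanchored:", matching(UNANCHORED))
print("anchored:  ", matching(ANCHORED))
```

With (?!\1) the line y.foo,yy.bar is dropped because yy begins with the captured y; with (?!\1\.) the lookahead only rejects a second name that is exactly the first one.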

Regex validate only the result of a captured group

I have this regex to detect an email address:
(?=.*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)
The requirement: The part before # needs to contain at least one letter and be at least 8 characters long.
I'm using a positive lookahead to check that it contains a letter, but the lookahead actually applies to the entire line (the part after # will usually contain letters), so this will pass:
123456789#gmail.com
So the question is, how can I validate only the result of the first capturing group (in this case 123456789) to see whether it has a letter or not?

The consuming part of the pattern before #, [a-zA-Z0-9_.+-]{8,}, cannot match #, so the lookahead only needs to check for a letter after 0 or more chars other than #.
Using
(?=[^#]*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)
will fix the issue. See the regex demo.
You may further optimize the lookahead pattern by making [^#] more precise. E.g. since you only allow 0-9_.+- apart from letters, you may replace [^#]* with [0-9_.+-]* and write the regex as
(?=[0-9_.+-]*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)
See this regex demo.
Or, you may follow the principle of contrast (suggested in comments), and use [^#a-zA-Z]* instead of [^#]*.
Depending on where you are using the regex, you might want to wrap it with ^ and $ anchors to ensure a full string match.
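A short Python sketch of the difference between the two lookaheads, using the address from the question plus one made-up address (user1234#gmail.com) that does contain a letter before #. The names UNSCOPED, SCOPED and local_part are only for this example.

```python
import re

# Lookahead from the question: .* lets it find a letter anywhere on the line.
UNSCOPED = re.compile(r"(?=.*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)")
# Lookahead from the answer: [^#]* keeps the check in front of the # separator.
SCOPED = re.compile(r"(?=[^#]*[a-zA-Z])([a-zA-Z0-9_.+-]{8,})#(\S+\.\S+)")


def local_part(pattern, text):
    """Return group 1 (the part before #), or None when there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


for address in ["123456789#gmail.com", "user1234#gmail.com"]:
    print(address, local_part(UNSCOPED, address), local_part(SCOPED, address))
```

The digits-only address is accepted by the unscoped version because the letter it finds belongs to gmail.com; the scoped version cannot look past #, so it rejects that input.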

regex with match in GREL/OpenRefine

I'm using OpenRefine to parse a column with string values.
I want to find the cells that contain either: offer or discount.
The string value is usually a sentence.
My code below, which uses the match function, is not working.
Using value.contains() is limited to searching for one word only.
value.match(/.*(offer)|(discount)/)

According to the documentation, the .match function "attempts to match the string s in its entirety against the regex pattern p and returns an array of capture groups", so the pattern has to cover the whole cell value.
To match either one of the words but not both, you might use negative lookaheads, if they are supported, inside an alternation: each alternative makes sure that one of the words is there and the other one is not.
(?:(?!.*\bdiscount\b).*\boffer\b.*|(?!.*\boffer).*\bdiscount\b.*)
Regex demo
Explanation
(?: Non-capturing group
(?!.*\bdiscount\b).*\boffer\b.* Assert that discount does not occur to the right, then match any chars, the word offer, and the rest of the string
| Or
(?!.*\boffer).*\bdiscount\b.* Assert the opposite: no offer to the right, then match the word discount
) Close non-capturing group
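The whole-string behavior can be tried outside OpenRefine with a rough Python sketch. This is not GREL: re.fullmatch is used here as a stand-in for a match that must cover the entire value, and the sample sentences are made up for the example.

```python
import re

# The answer's pattern: each alternative forbids one word and requires the other.
EITHER_NOT_BOTH = re.compile(
    r"(?:(?!.*\bdiscount\b).*\boffer\b.*|(?!.*\boffer).*\bdiscount\b.*)"
)
# The attempt from the question, kept for comparison.
ORIGINAL_ATTEMPT = re.compile(r".*(offer)|(discount)")


def whole_match(pattern, sentence):
    """True when the pattern matches the sentence from start to end."""
    return pattern.fullmatch(sentence) is not None


# Hypothetical cell values, not taken from the question.
SENTENCES = [
    "Special offer today",
    "10% discount on shoes",
    "An offer with a discount",
    "Nothing to see here",
]
for sentence in SENTENCES:
    print(
        sentence,
        whole_match(EITHER_NOT_BOTH, sentence),
        whole_match(ORIGINAL_ATTEMPT, sentence),
    )
```

Under a whole-string match, the original .*(offer)|(discount) can only succeed when the value ends with offer or is exactly discount, which is why the trailing .* in each alternative matters.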

Regex for not matching if the pattern captured in capturing group changes

Given an input like #1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B2#3=B3. I want to capture what is after #2= when #3=B3 and also verify that when #3=B3, then #2= should contain the same value which was captured.
The patterns that should match are:
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B2#3=B3
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3
The patterns that should not match are:
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B10#3=B3
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B10#3=B3>>#1=B1#2=B2#3=B3
The way I do this currently is in two passes, first by getting all invalid patterns by using regex #2=((?:\w|-|'|""|,|\.)+?)#3=B3.+#2=(?!\1#)((?:\w|-|'|""|,|\.)+?)#3=B3 and then removing these patterns from all the available inputs.

You can use the following regex:
^(?:(?!#2=[^#]*#3=B3(?:[#>]|$)).)*#2=([^#]*)#3=B3(?:[#>]|$)(?!.*#2=(?!\1)[^#]*#3=B3(?:[#>]|$))
Online demo.
How does it work?
First it skips all the text up until the first #2= followed by #3=B3 using a tempered greedy token:
^(?:(?!#2=[^#]*#3=B3(?:[#>]|$)).)*
Then it captures the value of the #2=:
#2=([^#]*)#3=B3(?:[#>]|$)
And finally it uses a negative lookahead assertion to make sure that no other #2= followed by a #3=B3 has a different value than the captured one:
(?!.*#2=(?!\1)[^#]*#3=B3(?:[#>]|$))
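The three pieces can be assembled and tried in Python against the inputs from the question. This is only a sketch: SKIP, CAPTURE, NO_OTHER_VALUE, PATTERN, STRICT and captured are names invented here, and the B20 input is a hypothetical addition that is not in the question.

```python
import re

SKIP = r"^(?:(?!#2=[^#]*#3=B3(?:[#>]|$)).)*"                # tempered greedy token
CAPTURE = r"#2=([^#]*)#3=B3(?:[#>]|$)"                      # group 1 = first #2= value
NO_OTHER_VALUE = r"(?!.*#2=(?!\1)[^#]*#3=B3(?:[#>]|$))"     # as written in this answer
NO_OTHER_VALUE_STRICT = r"(?!.*#2=(?!\1#)[^#]*#3=B3(?:[#>]|$))"  # with the added #

PATTERN = re.compile(SKIP + CAPTURE + NO_OTHER_VALUE)
STRICT = re.compile(SKIP + CAPTURE + NO_OTHER_VALUE_STRICT)


def captured(pattern, text):
    """Return group 1 on a match, or None when the input is rejected."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Inputs from the question.
VALID = [
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B2#3=B3",
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3",
]
INVALID = [
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B10#3=B3",
    "#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B10#3=B3>>#1=B1#2=B2#3=B3",
]
# Hypothetical input: the later value B20 starts with the captured value B2.
PREFIX_CASE = "#1=B1#2=B2#3=B3>>#1=B1#2=B20#3=B3"

for text in VALID + INVALID + [PREFIX_CASE]:
    print(captured(PATTERN, text), captured(STRICT, text), text)
```

The last input shows the prefix behavior noted in the first answer on this page: (?!\1) only looks at how the later value starts, so B20 passes as "the same" as B2, while (?!\1#) compares the value up to the next #.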

Regex to match all permutations of {1,2,3,4} without repetition

I am implementing the following problem in Ruby.
Here's the pattern that I want:
1234, 1324, 1432, 1423, 2341 and so on
i.e. each digit of the four-digit number should be in [1-4], and no digit should repeat.
To explain it more simply, take a two-digit pattern, for which the solution should be:
12, 21
i.e. the digits should be either 1 or 2 and should not repeat.
To make sure that they do not repeat, I want to use $1 in the condition for my second digit, but it's not working.
Please help me out and thanks in advance.

You can use this (see on rubular.com):
^(?=[1-4]{4}$)(?!.*(.).*\1).*$
The first assertion ensures that the string is ^[1-4]{4}$; the second assertion is a negative lookahead that ensures that you can't match .*(.).*\1, i.e. a repeated character. The first assertion is "cheaper", so you want to do that first.
References
regular-expressions.info/Lookarounds and Backreferences
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
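The question is about Ruby, but the two assertions behave the same way in Python's re module, so a quick sketch there is enough to see them at work. PERMUTATION and is_permutation are names chosen for this example, and the candidate strings are made up.

```python
import re

# Two lookaheads at the start of the string: "exactly four digits from 1-4"
# and "no character occurs twice"; the trailing .*$ then consumes the string.
PERMUTATION = re.compile(r"^(?=[1-4]{4}$)(?!.*(.).*\1).*$")


def is_permutation(text):
    """True when text is an ordering of the digits 1, 2, 3 and 4."""
    return PERMUTATION.search(text) is not None


for candidate in ["1234", "4312", "1123", "1235", "123"]:
    print(candidate, is_permutation(candidate))
```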

Just for a giggle, here's another option:
^(?:1()|2()|3()|4()){4}\1\2\3\4$
As each unique character is consumed, the capturing group following it captures an empty string. The backreferences also try to match empty strings, so if one of them doesn't succeed, it can only mean the associated group didn't participate in the match. And that will only happen if the string contains at least one duplicate.
This behavior of empty capturing groups and backreferences is not officially supported in any regex flavor, so caveat emptor. But it works in most of them, including Ruby.
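Python's re module appears to be one of the flavors where this works: a backreference to a group that never participated makes the match fail. A small sketch to see the group states, with EMPTY_GROUPS and group_states as names made up for the example:

```python
import re

# Each alternative consumes one digit and "ticks" its own empty capturing group.
EMPTY_GROUPS = re.compile(r"^(?:1()|2()|3()|4()){4}\1\2\3\4$")


def group_states(text):
    """Return the four captures on a match, or None when the text is rejected."""
    match = EMPTY_GROUPS.search(text)
    return match.groups() if match else None


# On a match, every group holds an empty string, i.e. each digit was seen once.
print(group_states("3142"))
# With a duplicate, at least one group never participates and the match fails.
print(group_states("1123"))
```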

I think this solution is a bit simpler
^(?:([1-4])(?!.*\1)){4}$
See it here on Rubular
^ # matches the start of the string
(?: # opens a non-capturing group
([1-4]) # the allowed characters; the matched char is captured in group 1
(?!.*\1) # the char is accepted only if it does not occur again later in the string
){4} # repeats the group exactly four times
$ # matches the end of the string
(?!.*\1) is a lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
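One way to gain confidence in this pattern is to compare it with a brute-force list of permutations. The sketch below does that in Python rather than Ruby; NO_REPEAT, expected, candidates and accepted are names invented here, and the digits 0-5 are an arbitrary test alphabet.

```python
import re
from itertools import permutations, product

NO_REPEAT = re.compile(r"^(?:([1-4])(?!.*\1)){4}$")

# Ground truth: the 24 orderings of the digits 1-4.
expected = {"".join(p) for p in permutations("1234")}

# Candidates: every four-character string over the digits 0-5.
candidates = ("".join(c) for c in product("012345", repeat=4))
accepted = {text for text in candidates if NO_REPEAT.search(text)}

print(len(accepted), accepted == expected)
```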

While the previous answers solve the problem, they aren't as generic as they could be and don't allow for repeated symbols in the initial set, for example {a,a,b,b,c,c}. After asking a similar question on Perl Monks, the following solution was given by Eily:
^(?:(?!\1)a()|(?!\2)a()|(?!\3)b()|(?!\4)b()|(?!\5)c()|(?!\6)c()){6}$
Similarly, this works for longer "symbols" in a string, and for variable length symbols too.
