Why do substrings prevent match with negative lookahead? - regex

Consider the following test data:
x.foo,x.bar
y.foo,y.bar
yy.foo,yy.bar
x.foo,y.bar
y.foo,x.bar
yy.foo,x.bar
x.foo,yy.bar
yy.foo,y.bar
y.foo,yy.bar
I'm attempting to write a regular expression where the string before .foo and the string before .bar are different from each other. The first three items should not match. The other six should.
This mostly works:
^(.+?)\.foo,(?!\1)(.+?)\.bar$
However, it misses on the last one, because y is in match group 1, and thus yy is not matched in match group 2.
Interactive: https://regex101.com/r/Pv5062/1
How can I modify the negative lookahead pattern such that the last item matches as well?

Inline backreferences do not store the context information, they only keep the text captured. You need to specify the context yourself.
You may add a dot after \1:
^(.+?)\.foo,(?!\1\.)(.+?)\.bar$
^^
Or, even repeat the part after the second (.+?):
^(.+?)\.foo,(?!\1\.bar$)(.+?)\.bar$
Or, if the bar part cannot contain ., you may make it more "generic":
^(.+?)\.foo,(?!\1\.[^.]+$)(.+?)\.bar$
See the regex demo and another regex demo.
The point is: your (?!\1) is not "anchored" and will fail the match in case the text stored in Group 1 appears immediately to the right of the current location regardless of the context. To solve this, you need to provide this context. As the value that can be matched with .+? can contain virtually anything all you can rely on is the "hardcoded" bits after the lookahead.

Related

Regex conditional lookahead issue

I have a conditional lookahead regex that tests to see if there is a number substring at the end of a string, and if so match for the numbers, and if not, match for another substring
The string in question: "H2K 101"
If just the lookahead is used, i.e. (?=\d{1,8}$)(\d{1,8}$), the lookahead succeeds, and "101" is found in capture group 1
When the lookahead is placed into a conditional, i.e. (?(?=\d{1,8}\z)(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)), the lookahead now fails, and the second pattern is used, matching "H2K", and a "2" is found in capture group 2.
If the test string has the "2" swapped for a letter, i.e. "HKK 101"
then the lookahead conditional works as expected, and the number "101" is once again found in capture group 1.
I've tested this in Regex101 and other PCRE engines, and all work the same, so clearly I'm missing something obvious about conditionals or the condition regex I'm using. Any insight greatly appreciated.
Thanks.
The look ahead starts at the current position, so initially it fails, and the alternative is used -- where it finds a match at the current position.
If you want the look ahead to succeed when still at the initial position, you need to allow for the intermediate characters to occur. Also, when the alternative kicks in, realise that there can follow a second match that still uses the look ahead, but now at a position where the look ahead is successful.
From what I understand, you are interested in one match only, not two consecutive matches (or more). So that means you should attempt to match the whole string, and capture the part of interest in a capture group. Also, the look ahead should be made to succeed when still at the initial position. This all means you need to inject several .*. There is no need for a conditional.
(?=.*\d{1,8}\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
Note also that (?=.*\d{1,8}\z) succeeds if and only when (?=.*\d\z) succeeds, so you can simplify that:
(?=.*\d\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
There are two capture groups. It there is a match, exactly one of the capture groups will have a non-empty matching content, which is the content you need.
You want to match a number of specific length at the end of the string, and if there is none, match something else.
There is no need for a conditional here. Conditional patterns are necessary to examine what to match next at the given position inside the string based either on a specific group match or a lookaround test. They are not useful when you want to give priority to a specific pattern.
Here, you can use a PCRE pattern based on the \K operator like
.*?\K\d{1,8}\z|[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+
Or, using capturing groups
(?|.*?(\d{1,8})\z|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+))
See the regex demo #1 and regex demo #2.
Details:
.*?\K\d{1,8}$ - any zero or more chars other than line break chars, as few as possible, then the match reset operator that discards the text matched so far, then one to eight digits at the end of string
| - or
[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+ - one or more letters, 1-8 digits, underscores or hyphens, and then one or more letters.
And
(?| - start of the branch reset group:
.*? - any zero or more chars other than line break chars, as few as possible
(\d{1,8}) - Group 1: one to eight digits
\z - end of string
| - or
( - Group 1 start:
[a-zA-Z]+ - one or more ASCII letters
[\d_-]{1,8} - one to eight digits, underscores, hyphens
[a-zA-Z]+ - one or more ASCII letters
) - Group 1 end
) - end of the group.

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ catch these ~match-this~, but when I tried the negation of it (?!(~match-this)), it literally catches all nulls.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I tried to add asterisk wild card on any negation of tilde, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wild card on both, it catches all match-this including those which enclosed by tildes ~.
Is it possible to achieve this with only one regex test? Or Does it need more??
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by *match-this~
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print (res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's a difference between telling the engine to "not match" a character and to tell the engine to "look out" for something without actually consuming characters and advancing the current position in the text. Also not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (f.e. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as expected by me.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 ok using (?<=.art\:\s)([0-9A-Z]*). This gave me a one group match allowing me to extract it as a column value
.
I want to extract 6726376 the same way. Have only one group appear because I am parsing that to a column value.
Not sure why is (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) giving me the entire line after S.
Helping me get it right with an explanation will go along way.
Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we make a positive lookback that makes sure we're after "art: ", then we match two numbers, seperated by non-numbers.
There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
You match the whole part after it because you use .* which will match until the end of the line.
Note that this part [0-9]* at the end of the pattern does not match because it is optional and the preceding .* already matches until the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPi regex module
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo

Using OR in negative lookahead

Given an input like #1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=C2#3=C3>>#1=B1#2=B2#3=B3. I want to capture what is after #2= when #3=B3 and also verify that subsequently, #2= should contain the same value which was captured OR the value should be "ABC"
The patterns that should match are:
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=B2#3=B3 #1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=ABC#3=B3
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3
The pattern that should not match #1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=C1#2=B2#3=C3>>#1=B1#2=B10#3=B3
#1=A1#2=A2#3=A3>>#1=B1#2=B2#3=B3>>#1=B1#2=B10#3=B3>>#1=B1#2=B2#3=B3
I am able to do the part when it should match the entire string using negative lookaround. But I am not able to the OR part i.e. #2=ABC if the string does not match.
https://regex101.com/r/eCYCtg/1
Note your current regex matches when the repeating #2= has the value starting with the captured value before. You need to add # in the negative lookahead, (?!\1#).
To fix the pattern as you need you need to add ABC# as an alternative to this lookahead: (?!\1#|ABC#). It will now fail the negative lookahead check (and thus will allow the match to occur) if the entire #2 value is ABC or the same value as captured before into Group 1.
You may use
^(?:(?!#2=[^#]*#3=B3(?:[#>]|$)).)*#2=([^#]*)#3=B3(?:[#>]|$)(?!.*#2=(?!\1#|ABC#)[^#]*#3=B3(?:[#>]|$))
See the regex demo.

Regex to match all permutations of {1,2,3,4} without repetition

I am implementing the following problem in ruby.
Here's the pattern that I want :
1234, 1324, 1432, 1423, 2341 and so on
i.e. the digits in the four digit number should be between [1-4] and should also be non-repetitive.
to make you understand in a simple manner I take a two digit pattern
and the solution should be :
12, 21
i.e. the digits should be either 1 or 2 and should be non-repetitive.
To make sure that they are non-repetitive I want to use $1 for the condition for my second digit but its not working.
Please help me out and thanks in advance.
You can use this (see on rubular.com):
^(?=[1-4]{4}$)(?!.*(.).*\1).*$
The first assertion ensures that it's ^[1-4]{4}$, the second assertion is a negative lookahead that ensures that you can't match .*(.).*\1, i.e. a repeated character. The first assertion is "cheaper", so you want to do that first.
References
regular-expressions.info/Lookarounds and Backreferences
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
Just for a giggle, here's another option:
^(?:1()|2()|3()|4()){4}\1\2\3\4$
As each unique character is consumed, the capturing group following it captures an empty string. The backreferences also try to match empty strings, so if one of them doesn't succeed, it can only mean the associated group didn't participate in the match. And that will only happen if string contains at least one duplicate.
This behavior of empty capturing groups and backreferences is not officially supported in any regex flavor, so caveat emptor. But it works in most of them, including Ruby.
I think this solution is a bit simpler
^(?:([1-4])(?!.*\1)){4}$
See it here on Rubular
^ # matches the start of the string
(?: # open a non capturing group
([1-4]) # The characters that are allowed the found char is captured in group 1
(?!.*\1) # That character is matched only if it does not occur once more
){4} # Defines the amount of characters
$
(?!.*\1) is a lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
While the previous answers solve the problem, they aren't as generic as they could be, and don't allow for repetitions in the initial string. For example, {a,a,b,b,c,c}. After asking a similar question on Perl Monks, the following solution was given by Eily:
^(?:(?!\1)a()|(?!\2)a()|(?!\3)b()|(?!\4)b()|(?!\5)c()|(?!\6)c()){6}$
Similarly, this works for longer "symbols" in a string, and for variable length symbols too.