Regex conditional lookahead issue - regex

I have a conditional lookahead regex that tests to see if there is a number substring at the end of a string, and if so match for the numbers, and if not, match for another substring
The string in question: "H2K 101"
If just the lookahead is used, i.e. (?=\d{1,8}$)(\d{1,8}$), the lookahead succeeds, and "101" is found in capture group 1
When the lookahead is placed into a conditional, i.e. (?(?=\d{1,8}\z)(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)), the lookahead now fails, and the second pattern is used, matching "H2K", and a "2" is found in capture group 2.
If the test string has the "2" swapped for a letter, i.e. "HKK 101"
then the lookahead conditional works as expected, and the number "101" is once again found in capture group 1.
I've tested this in Regex101 and other PCRE engines, and all work the same, so clearly I'm missing something obvious about conditionals or the condition regex I'm using. Any insight greatly appreciated.
Thanks.

The look ahead starts at the current position, so initially it fails, and the alternative is used -- where it finds a match at the current position.
If you want the look ahead to succeed when still at the initial position, you need to allow for the intermediate characters to occur. Also, when the alternative kicks in, realise that there can follow a second match that still uses the look ahead, but now at a position where the look ahead is successful.
From what I understand, you are interested in one match only, not two consecutive matches (or more). So that means you should attempt to match the whole string, and capture the part of interest in a capture group. Also, the look ahead should be made to succeed when still at the initial position. This all means you need to inject several .*. There is no need for a conditional.
(?=.*\d{1,8}\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
Note also that (?=.*\d{1,8}\z) succeeds if and only when (?=.*\d\z) succeeds, so you can simplify that:
(?=.*\d\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
There are two capture groups. It there is a match, exactly one of the capture groups will have a non-empty matching content, which is the content you need.

You want to match a number of specific length at the end of the string, and if there is none, match something else.
There is no need for a conditional here. Conditional patterns are necessary to examine what to match next at the given position inside the string based either on a specific group match or a lookaround test. They are not useful when you want to give priority to a specific pattern.
Here, you can use a PCRE pattern based on the \K operator like
.*?\K\d{1,8}\z|[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+
Or, using capturing groups
(?|.*?(\d{1,8})\z|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+))
See the regex demo #1 and regex demo #2.
Details:
.*?\K\d{1,8}$ - any zero or more chars other than line break chars, as few as possible, then the match reset operator that discards the text matched so far, then one to eight digits at the end of string
| - or
[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+ - one or more letters, 1-8 digits, underscores or hyphens, and then one or more letters.
And
(?| - start of the branch reset group:
.*? - any zero or more chars other than line break chars, as few as possible
(\d{1,8}) - Group 1: one to eight digits
\z - end of string
| - or
( - Group 1 start:
[a-zA-Z]+ - one or more ASCII letters
[\d_-]{1,8} - one to eight digits, underscores, hyphens
[a-zA-Z]+ - one or more ASCII letters
) - Group 1 end
) - end of the group.

Related

Nesting capture groups

I have the following strings:
'TwoOrMoreDimensions'
'LookLikeVectors'
'RecentVersions'
'= getColSums'
'=getColSums'
I would like to capture all occurrences of an uppercase letter that is preceded by a lowercase letter in all strings but the last two.
I can use ([a-z]+)([A-Z]) to capture all such occurrences but I don't know how to exclude matches from the last two strings.
The last two strings can be excluded using the negative lookahead ^(?!>\s|\=) - is it possible to combine this with the expression above?
I tried ^(?!>\s|\=)(([a-z]+)([A-Z])) but it doesn't yield any matches. I'm not sure why because ^(?!>\s|\=)(.+) captures all characters after the start of the matching string as a group. So why can't this capture group be further divided into group 2 ([a-z]+) and group 3 ([A-Z])?
Link to tester
The issue with your current regex is that the ^ anchors it to the start of string, so it can only match a sequence of lower case letters followed by an upper case letter at the start of the string, and none of your strings have that.
One way to do what you want is to use the \G anchor, which forces the current match to start where the previous one ended. That can be used in an alternation with ^(?!=) which will match any string which doesn't start with an = sign, and then a negated character class ([^a-z]) to skip any non-lower case characters:
(?:^(?!=)|\G)[^a-z]*(([a-z]+)([A-Z]))
This will give the same capture groups as your original regex.
Demo on regex101
Another solution (may not be the most efficient but meets the task) would be (?:^=\s*\w*)|([a-z]+)([A-Z])
This essentially forces the regex to greedily consume everything (in a non-capturing group, although is considered for full match) if it begins with =, leaving nothing for the next capture groups.
Regex101 Demo Link

How to replace last digit in a line with the same digit and a fullstop?

I am looking for a RegExp to find and replace all instances of last digits in a line with the same digit and a full stop in Memsource, which seems to be not working properly.
Example:
Pic. 12-1
Pic. 12-2
Pic. 12-3
To:
Pic. 12-1.
Pic. 12-2.
Pic. 12-3.
I've chosen them by \d$ but when I try to replace them with a \., .$ etc. it seems not to be working properly. Any advice would be much appreciated. Thanks!
As #WiktorStribiżew stated in the comments, you can use (\d)$ as the pattern to match and \1. as the string to replace it with. A quick breakdown of how this works is:
(\d) Matches any digit and captures it in group 1
$ Matches the end of the line
\1. Replaces the the matched string with the first capture group followed by a period
Resulting in (\d)$ -> \1.
However, is it necessary to even match the digit? Would the following substitution suffice $ -> .? This would simply add the . to the end of each line. The only issue would be that it would not discriminate whether or not the line ends with a digit.
If it must end with a digit to receive a period, you can also avoid using capture groups by using a positive look-behind. In this case the pattern to match would be (?<=\d)$ and the replacement pattern would be ..
(?<=\d) Is a positive look-behind that checks if the there is a digit before the current character without consuming any characters.
(?<=\d)$ Checks to make sure the character at the end of the line is a digit with consuming that character (i.e. that character will not be replaced).
The resulting replacement would then be (?<=\d)$ -> . which would add a period to each line that ends with a digit without the need for capture groups.
Further reading:
Grouping and Capturing
Lookahead and Lookbehind

Positive look ahead case [duplicate]

I found these things in my regex body but I haven't got a clue what I can use them for.
Does somebody have examples so I can try to understand how they work?
(?!) - negative lookahead
(?=) - positive lookahead
(?<=) - positive lookbehind
(?<!) - negative lookbehind
(?>) - atomic group
Examples
Given the string foobarbarfoo:
bar(?=bar) finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)
You can also combine them:
(?<=foo)bar(?=bar) finds the 1st bar ("bar" with "foo" before it and "bar" after it)
Definitions
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look ahead negative (?!)
Find expression A where expression B does not follow:
A(?!B)
Look behind positive (?<=)
Find expression A where expression B precedes:
(?<=B)A
Look behind negative (?<!)
Find expression A where expression B does not precede:
(?<!B)A
Atomic groups (?>)
An atomic group exits a group and throws away alternative patterns after the first matched pattern inside the group (backtracking is disabled).
(?>foo|foot)s applied to foots will match its 1st alternative foo, then fail as s does not immediately follow, and stop as backtracking is disabled
A non-atomic group will allow backtracking; if subsequent matching ahead fails, it will backtrack and use alternative patterns until a match for the entire expression is found or all possibilities are exhausted.
(foo|foot)s applied to foots will:
match its 1st alternative foo, then fail as s does not immediately follow in foots, and backtrack to its 2nd alternative;
match its 2nd alternative foot, then succeed as s immediately follows in foots, and stop.
Some resources
http://www.regular-expressions.info/lookaround.html
http://www.rexegg.com/regex-lookarounds.html
Online testers
https://regex101.com
Lookarounds are zero width assertions. They check for a regex (towards right or left of the current position - based on ahead or behind), succeeds or fails when a match is found (based on if it is positive or negative) and discards the matched portion. They don't consume any character - the matching for regex following them (if any), will start at the same cursor position.
Read regular-expression.info for more details.
Positive lookahead:
Syntax:
(?=REGEX_1)REGEX_2
Match only if REGEX_1 matches; after matching REGEX_1, the match is discarded and searching for REGEX_2 starts at the same position.
example:
(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}
REGEX_1 is [a-z0-9]{4}$ which matches four alphanumeric chars followed by end of line.
REGEX_2 is [a-z]{1,2}[0-9]{2,3} which matches one or two letters followed by two or three digits.
REGEX_1 makes sure that the length of string is indeed 4, but doesn't consume any characters so that search for REGEX_2 starts at the same location. Now REGEX_2 makes sure that the string matches some other rules. Without look-ahead it would match strings of length three or five.
Negative lookahead
Syntax:
(?!REGEX_1)REGEX_2
Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.
example:
(?!.*\bFWORD\b)\w{10,30}$
The look-ahead part checks for the FWORD in the string and fails if it finds it. If it doesn't find FWORD, the look-ahead succeeds and the following part verifies that the string's length is between 10 and 30 and that it contains only word characters a-zA-Z0-9_
Look-behind is similar to look-ahead: it just looks behind the current cursor position. Some regex flavors like javascript doesn't support look-behind assertions. And most flavors that support it (PHP, Python etc) require that look-behind portion to have a fixed length.
Atomic groups basically discards/forgets the subsequent tokens in the group once a token matches. Check this page for examples of atomic groups
Grokking lookaround rapidly.
How to distinguish lookahead and lookbehind?
Take 2 minutes tour with me:
(?=) - positive lookahead
(?<=) - positive lookbehind
Suppose
A B C #in a line
Now, we ask B, Where are you?
B has two solutions to declare it location:
One, B has A ahead and has C bebind
Two, B is ahead(lookahead) of C and behind (lookhehind) A.
As we can see, the behind and ahead are opposite in the two solutions.
Regex is solution Two.
Why - Suppose you are playing wordle, and you've entered "ant". (Yes three-letter word, it's only an example - chill)
The answer comes back as blank, yellow, green, and you have a list of three letter words you wish to use a regex to search for? How would you do it?
To start off with you could start with the presence of the t in the third position:
[a-z]{2}t
We could improve by noting that we don't have an a
[b-z]{2}t
We could further improve by saying that the search had to have an n in it.
(?=.*n)[b-z]{2}t
or to break it down;
(?=.*n) - Look ahead, and check the match has an n in it, it may have zero or more characters before that n
[b-z]{2} - Two letters other than an 'a' in the first two positions;
t - literally a 't' in the third position
I used look behind to find the schema and look ahead negative to find tables missing with(nolock)
expression="(?<=DB\.dbo\.)\w+\s+\w+\s+(?!with\(nolock\))"
matches=re.findall(expression,sql)
for match in matches:
print(match)

Negative lookahead's behavior [duplicate]

I found these things in my regex body but I haven't got a clue what I can use them for.
Does somebody have examples so I can try to understand how they work?
(?!) - negative lookahead
(?=) - positive lookahead
(?<=) - positive lookbehind
(?<!) - negative lookbehind
(?>) - atomic group
Examples
Given the string foobarbarfoo:
bar(?=bar) finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)
You can also combine them:
(?<=foo)bar(?=bar) finds the 1st bar ("bar" with "foo" before it and "bar" after it)
Definitions
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look ahead negative (?!)
Find expression A where expression B does not follow:
A(?!B)
Look behind positive (?<=)
Find expression A where expression B precedes:
(?<=B)A
Look behind negative (?<!)
Find expression A where expression B does not precede:
(?<!B)A
Atomic groups (?>)
An atomic group exits a group and throws away alternative patterns after the first matched pattern inside the group (backtracking is disabled).
(?>foo|foot)s applied to foots will match its 1st alternative foo, then fail as s does not immediately follow, and stop as backtracking is disabled
A non-atomic group will allow backtracking; if subsequent matching ahead fails, it will backtrack and use alternative patterns until a match for the entire expression is found or all possibilities are exhausted.
(foo|foot)s applied to foots will:
match its 1st alternative foo, then fail as s does not immediately follow in foots, and backtrack to its 2nd alternative;
match its 2nd alternative foot, then succeed as s immediately follows in foots, and stop.
Some resources
http://www.regular-expressions.info/lookaround.html
http://www.rexegg.com/regex-lookarounds.html
Online testers
https://regex101.com
Lookarounds are zero width assertions. They check for a regex (towards right or left of the current position - based on ahead or behind), succeeds or fails when a match is found (based on if it is positive or negative) and discards the matched portion. They don't consume any character - the matching for regex following them (if any), will start at the same cursor position.
Read regular-expression.info for more details.
Positive lookahead:
Syntax:
(?=REGEX_1)REGEX_2
Match only if REGEX_1 matches; after matching REGEX_1, the match is discarded and searching for REGEX_2 starts at the same position.
example:
(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}
REGEX_1 is [a-z0-9]{4}$ which matches four alphanumeric chars followed by end of line.
REGEX_2 is [a-z]{1,2}[0-9]{2,3} which matches one or two letters followed by two or three digits.
REGEX_1 makes sure that the length of string is indeed 4, but doesn't consume any characters so that search for REGEX_2 starts at the same location. Now REGEX_2 makes sure that the string matches some other rules. Without look-ahead it would match strings of length three or five.
Negative lookahead
Syntax:
(?!REGEX_1)REGEX_2
Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.
example:
(?!.*\bFWORD\b)\w{10,30}$
The look-ahead part checks for the FWORD in the string and fails if it finds it. If it doesn't find FWORD, the look-ahead succeeds and the following part verifies that the string's length is between 10 and 30 and that it contains only word characters a-zA-Z0-9_
Look-behind is similar to look-ahead: it just looks behind the current cursor position. Some regex flavors like javascript doesn't support look-behind assertions. And most flavors that support it (PHP, Python etc) require that look-behind portion to have a fixed length.
Atomic groups basically discards/forgets the subsequent tokens in the group once a token matches. Check this page for examples of atomic groups
Grokking lookaround rapidly.
How to distinguish lookahead and lookbehind?
Take 2 minutes tour with me:
(?=) - positive lookahead
(?<=) - positive lookbehind
Suppose
A B C #in a line
Now, we ask B, Where are you?
B has two solutions to declare it location:
One, B has A ahead and has C bebind
Two, B is ahead(lookahead) of C and behind (lookhehind) A.
As we can see, the behind and ahead are opposite in the two solutions.
Regex is solution Two.
Why - Suppose you are playing wordle, and you've entered "ant". (Yes three-letter word, it's only an example - chill)
The answer comes back as blank, yellow, green, and you have a list of three letter words you wish to use a regex to search for? How would you do it?
To start off with you could start with the presence of the t in the third position:
[a-z]{2}t
We could improve by noting that we don't have an a
[b-z]{2}t
We could further improve by saying that the search had to have an n in it.
(?=.*n)[b-z]{2}t
or to break it down;
(?=.*n) - Look ahead, and check the match has an n in it, it may have zero or more characters before that n
[b-z]{2} - Two letters other than an 'a' in the first two positions;
t - literally a 't' in the third position
I used look behind to find the schema and look ahead negative to find tables missing with(nolock)
expression="(?<=DB\.dbo\.)\w+\s+\w+\s+(?!with\(nolock\))"
matches=re.findall(expression,sql)
for match in matches:
print(match)

Regular expression to match non-integer values in a string

I want to match the following rules:
One dash is allowed at the start of a number.
Only values between 0 and 9 should be allowed.
I currently have the following regex pattern, I'm matching the inverse so that I can thrown an exception upon finding a match that doesn't follow the rules:
[^-0-9]
The downside to this pattern is that it works for all cases except a hyphen in the middle of the String will still pass. For example:
"-2304923" is allowed correctly but "9234-342" is also allowed and shouldn't be.
Please let me know what I can do to specify the first character as [^-0-9] and the rest as [^0-9]. Thanks!
This regex will work for you:
^-?\d+$
Explanation: start the string ^, then - but optional (?), the digit \d repeated few times (+), and string must finish here $.
You can do this:
(?:^|\s)(-?\d+)(?:["'\s]|$)
^^^^^ non capturing group for start of line or space
^^^^^ capture number
^^^^^^^^^ non capturing group for end of line, space or quote
See it work
This will capture all strings of numbers in a line with an optional hyphen in front.
-2304923" "9234-342" 1234 -1234
++++++++ captured
^^^^^^^^ NOT captured
++++ captured
+++++ captured
I don't understand how your pattern - [^-0-9] is matching those strings you are talking about. That pattern is just the opposite of what you want. You have simply negated the character class by using caret(^) at the beginning. So, this pattern would match anything except the hyphen and the digits.
Anyways, for your requirement, first you need to match one hyphen at the beginning. So, just keep it outside the character class. And then to match any number of digits later on, you can use [0-9]+ or \d+.
So, your pattern to match the required format should be:
-[0-9]+ // or -\d+
The above regex is used to find the pattern in some large string. If you want the entire string to match this pattern, then you can add anchors at the ends of the regex: -
^-[0-9]+$
For a regular expression like this, it's sometimes helpful to think of it in terms of two cases.
Is the first character messed up somehow?
If not, are any of the other characters messed up somehow?
Combine these with |
(^[^-0-9]|^.+?[^0-9])