Suppose one would like to match an arbitrary number that is either preceded or followed by a certain string. However, whenever possible, the pattern should prefer the match in which the number is followed by the string.
For example:
the string 1234 foo should match.
the string foo 1234 should match.
the string 1234 foo 1234 should produce the match 1234 foo.
(most important) the string foo 1234 foo should produce the match 1234 foo and not foo 1234.
To avoid having the same expression twice, only inverted and separated by an or, I've been trying to implement it using a conditional pattern. Using the pattern below works for the cases 1-3, but fails for case 4 as it matches foo 1234.
(\d+)?(?:(?(1) ?)(foo) ??)(?(1)|(?:(\d+)))
Adding a negative lookahead to the conditional like below also doesn't help. The string foo 1234 foo now produces the two matches foo 123 and 4 foo.
(\d+)?(?:(?(1) ?)(foo) ??)(?(1)|(?:(\d+)(?! ?foo)))
Since the regex is greedy it always matches the non-preferred substring in case 4 first. Is there a way to make the regex expression prefer the later match with a conditional expression?
Okay, if you don't want to reuse foo unless it's in a Lookahead, I believe the following pattern would meet all your conditions:
(\d+)?(?(1) )(foo)(?(1)| (\d+)\b(?! foo))
Breakdown:
(\d+)? # An optional capturing group matching one or more digits.
(?(1) ) # If the previous group exists, match a space character.
(foo) # A second capturing group to capture "foo".
(? # If...
(1) # ..the first group exists, match nothing.
| # Else...
[ ] # Match a space character.
(\d+) # A third capturing group matching one or more digits.
\b # Assert a word boundary to make sure no more digits following.
(?! foo) # A negative Lookahead to make sure " foo" is not following.
) # End If.
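To check the breakdown above against the four cases from the question, here is a minimal sketch using Python's `re` module, which accepts the same `(?(1)...)` conditional syntax. The helper name `first_match` is made up for this illustration.

```python
import re

# Pattern from the answer above, unchanged.
PREFER_TRAILING = re.compile(r"(\d+)?(?(1) )(foo)(?(1)| (\d+)\b(?! foo))")


def first_match(text):
    """Return the first match as a string, or None if nothing matches."""
    m = PREFER_TRAILING.search(text)
    return m.group(0) if m else None


# The four cases listed in the question.
for sample in ("1234 foo", "foo 1234", "1234 foo 1234", "foo 1234 foo"):
    print(f"{sample!r:18} -> {first_match(sample)!r}")
```

For the last case the attempt at position 0 fails because of the negative Lookahead, so the engine moves on and finds 1234 foo starting at the digits.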
An alternative approach for that last part if the digits don't have to end with a word boundary would be to get rid of \b and add \d* in the negative Lookahead:
(\d+)?(?(1) )(foo)(?(1)| (\d+)(?!\d* foo))
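A small Python sketch of how the two variants differ. The input `foo 1234bar` is a made-up example (it is not from the question) where a word character follows the digits directly, and `search_text` is just a hypothetical helper.

```python
import re

WITH_BOUNDARY = re.compile(r"(\d+)?(?(1) )(foo)(?(1)| (\d+)\b(?! foo))")
WITH_DIGIT_LOOKAHEAD = re.compile(r"(\d+)?(?(1) )(foo)(?(1)| (\d+)(?!\d* foo))")


def search_text(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Both variants agree on the important case from the question.
print(search_text(WITH_BOUNDARY, "foo 1234 foo"))
print(search_text(WITH_DIGIT_LOOKAHEAD, "foo 1234 foo"))

# With no word boundary after the digits, only the second variant matches.
print(search_text(WITH_BOUNDARY, "foo 1234bar"))
print(search_text(WITH_DIGIT_LOOKAHEAD, "foo 1234bar"))
```

The `\d*` inside the Lookahead is what stops the engine from giving back digits (123, 12, 1) until the Lookahead happens to pass.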
Another solution is to wrap the third digit capturing group in an atomic group, to prevent backtracking into it when the following negative lookahead fails. This solves an issue with the first solution, in that there no longer has to be a word boundary between the numerical substring and foo. It also solves an issue with the second solution, in that it allows the numerical substring to be more complex (for example, including decimals: \d+\.?\d*).
(Credits go to the OP, Ferdinand Schlatt)
(\d+)?(?(1) )(foo)(?(1)| (?>(\d+))(?! foo))
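Python's `re` module only accepts `(?>...)` from version 3.11 on, so the sketch below does not use the pattern above verbatim. Instead it emulates the atomic group with a different, well-known trick: capture inside a lookahead and then consume the capture with a backreference, `(?=(...))\3`. The decimal sub-pattern `\d+\.?\d*` and the names are assumptions made for this illustration.

```python
import re

NUM = r"\d+\.?\d*"  # assumed "more complex" number with an optional decimal part

# Plain group: the engine may give back part of the number to satisfy (?! foo).
BACKTRACKING = re.compile(rf"({NUM})?(?(1) )(foo)(?(1)| ({NUM})(?! foo))")

# Atomic-like: group 3 is captured inside a lookahead, then consumed via \3,
# so the number cannot be shortened once it has been matched.
ATOMIC_LIKE = re.compile(rf"({NUM})?(?(1) )(foo)(?(1)| (?=({NUM}))\3(?! foo))")


def whole_match(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


text = "foo 12.5 foo"
print(whole_match(BACKTRACKING, text))  # stops short inside the number
print(whole_match(ATOMIC_LIKE, text))   # prefers the trailing "foo"
```

In an engine that has real atomic groups (PCRE, or Python 3.11+), `(?>(...))` should give the same effect more directly.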
I want to expect some characters only if a prior regex matched. If not, no characters (empty string) is expected.
For instance, if after the first four characters appears a string out of the group (A10, B32, C56, D65) (kind of enumeration) then a "_" followed by a 3-digit number like 123 is expected. If no element of the mentioned group appears, no other string is expected.
My first attempt was this but the ELSE branch does not work:
^XXX_(?<DT>A12|B43|D14)(?(DT)(_\d{1,3})|)\.ZZZ$
XXX_A12_123.ZZZ --> match
XXX_A11.ZZZ --> match
XXX_A12_abc.ZZZ --> no match
XXX_A23_123.ZZZ --> no match
These are examples of filenames. If the filename contains a string of the mentioned group like A12 or C56, then I expect this element to be followed by an underscore and 1 to 3 digits. If the filename does not contain a string of that group (no character or a character sequence different from the strings in the group) then I don't want to see the underscore followed by 1 to 3 digits.
For instance, I could extend the regex to
^XXX_(?<DT>A12|B43|D14)_\d{5}(?(DT)(_\d{1,3})|)_someMoreChars\.ZZZ$
...and then I want these filenames to be valid:
XXX_A12_12345_123_wellDone.ZZZ
XXX_Q21_00000_wellDone.ZZZ
XXX_Q21_00000_456_wellDone.ZZZ
...but this is invalid:
XXX_A12_12345_wellDone.ZZZ
How can I make the ELSE branch of the conditional statement work?
In the end I intend to have two groups like
Group A: (A11, B32, D76, R33)
Group B: (A23, C56, H78, T99)
If an element of group A occurs in the filename then I expect to find _\d{1,3} in the filename.
If an element of group B occurs in the filename then the _\d{1,3} shall be optional (it may or may not occur in the filename).
I ended up with this regex:
^XXX_(?:(?<DT>A12|B43|D14))?(?(DT)(_\d{5}_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).*\.ZZZ$
^XXX_(?:(?<DT>A12|B43|D14))?_\d{5}(?(DT)(_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).+\.ZZZ$
Since I have to use this regex in the OpenApi @Pattern annotation, I get the error:
Conditionals are not supported in this regex dialect.
As @The fourth bird suggested, alternation seems to do the trick:
XXX_((((A12|B43|D14)_\d{5}_\d{1,3}))|((?:(A10|B10|C20)((?:_\d{5}_\d{3})|(?:_\d{3}))))).*\.ZZZ$
The else branch is the part after the |, but if you also want to match the 2nd example, the if clause would not work as you have already matched one of A12|B43|D14
The named capture group is not optional, so the if clause will always be true.
What you can do instead is use an alternation to match either the numeration part followed by an underscore and 3 digits, or match an uppercase char and 2 digits.
^XXX_(?:(?<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
If you want to make use of the if/else clause, you can make the named capture group optional, and then check if group 1 exists.
^XXX_(?<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
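Here is a rough Python check of both variants against the four filenames from the question. Python spells named groups `(?P<DT>...)` rather than `(?<DT>...)`, so the patterns are adapted accordingly; the helper `accepted` is hypothetical.

```python
import re

# Same patterns as above, with Python's (?P<name>...) group syntax.
ALTERNATION = re.compile(r"^XXX_(?:(?P<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$")
CONDITIONAL = re.compile(r"^XXX_(?P<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$")

SAMPLES = [
    "XXX_A12_123.ZZZ",  # expected: match
    "XXX_A11.ZZZ",      # expected: match
    "XXX_A12_abc.ZZZ",  # expected: no match
    "XXX_A23_123.ZZZ",  # expected: no match
]


def accepted(pattern, names):
    """Return the filenames that the pattern accepts."""
    return [name for name in names if pattern.search(name)]


print(accepted(ALTERNATION, SAMPLES))
print(accepted(CONDITIONAL, SAMPLES))

# Caveat: [A-Z]\d{2} also covers A12, so the alternation accepts this name too.
print(bool(ALTERNATION.search("XXX_A12.ZZZ")))
```

If a bare XXX_A12.ZZZ must be rejected, the second branch needs a guard that excludes the enumerated values, which is what the negative lookahead in the pattern for the updated question is for.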
For the updated question:
^XXX_(?<DT>A12|B43|D14)?(?(DT)(?:_\d{5})?_\d{3}(?!\d)|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d))).*\.ZZZ$
The pattern matches:
^ Start of string
XXX_ Match literally
(?<DT>A12|B43|D14)? Optional named group DT, matching one of A12, B43 or D14
(?(DT) If we have group DT
(?:_\d{5})? Optionally match _ and 5 digits
_\d{3}(?!\d) Match _ and 3 digits
| Or
(?! Negative lookahead, assert not to the right
A12|B43|D14| Match one of the alternatives, or
[A-Z]\d{2}_\d{3}(?!\d) Match 1 char A-Z, 2 digits _ 3 digits not followed by a digit
) Close lookahead
) Close if clause
.* Match the rest of the line
\.ZZZ Match . and ZZZ
$ End of string
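The breakdown above can be tried against the filenames from the updated question with a short Python sketch. Again the named group is written `(?P<DT>...)` for Python; the pattern is otherwise the one shown, only split over several string literals for readability.

```python
import re

UPDATED = re.compile(
    r"^XXX_(?P<DT>A12|B43|D14)?"                  # optional named group DT
    r"(?(DT)(?:_\d{5})?_\d{3}(?!\d)"              # if DT: optional _ddddd, then _ddd
    r"|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d)))"   # else: guard against those forms
    r".*\.ZZZ$"                                   # rest of the name and extension
)

VALID = [
    "XXX_A12_12345_123_wellDone.ZZZ",
    "XXX_Q21_00000_wellDone.ZZZ",
    "XXX_Q21_00000_456_wellDone.ZZZ",
]
INVALID = ["XXX_A12_12345_wellDone.ZZZ"]

for name in VALID + INVALID:
    print(name, "->", "match" if UPDATED.search(name) else "no match")
```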
I'm trying to find a regex to check for the validity of options that are supplied with a command.
Say that -a, -b and -c are valid options. They may be combined, for example as -ac or -abc. Order doesn't matter, so -ba is also valid.
I thought this regex would do the trick:
^-[abc]{1,3}$
But it has a downside. This regex also accepts duplicates, i.e. -abb.
How do I modify this regex to disallow duplicates?
You may use this regex with a capture group and a negative lookahead:
^-((?!.*\1)[abc]){1,3}$
RegEx Details:
^: Start
-: Match a -
(: Start capture group #1
(?!.*\1): Negative lookahead to make sure we don't have repeat of what we have in capture group #1 anywhere in the input
[abc]: Match a or b or c
){1,3}: End capture group #1. Repeat this group 1 to 3 times
$: End
You could list all the alternatives, but if it is a long character class, you can check that on the right side there is no char that is already captured using a capture group and a backreference.
^-(?![abc]*?([abc])[abc]*?\1)[abc]{1,3}$
^ Start of string
- Match a hyphen
(?! Negative lookahead, assert that at the right is not
[abc]*?([abc])[abc]*?\1 Lazily match optional chars a, b or c and then capture 1 char in group 1. Then lazily match more of those chars and check whether the captured char occurs again at the right side
) Close lookahead
[abc]{1,3} Match 1-3 times a b or c
$ End of string
Or a short version using only non whitespace chars, as the character class can only match 3 chars.
^-(?!\S*(\S)\S*\1)[abc]{1,3}$
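A quick Python sketch of the two lookahead-based patterns from this answer. It deliberately leaves out the `((?!.*\1)[abc]){1,3}` form, since as far as I know Python's `re` refuses a backreference to a group that is still open. The sample options other than -ac, -abc, -ba and -abb are invented for the test.

```python
import re

NO_DUPES = re.compile(r"^-(?![abc]*?([abc])[abc]*?\1)[abc]{1,3}$")
NO_DUPES_SHORT = re.compile(r"^-(?!\S*(\S)\S*\1)[abc]{1,3}$")


def is_valid(option, pattern=NO_DUPES):
    """True if option is '-' plus 1 to 3 distinct letters out of a, b, c."""
    return pattern.search(option) is not None


for option in ("-a", "-ac", "-abc", "-ba", "-abb", "-aa", "-abcd", "-d"):
    print(option, is_valid(option), is_valid(option, NO_DUPES_SHORT))
```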
A straight in poker is five cards in a row, for example 23456 or 89TJQ. With a "sorted" hand, the regex could be written as:
^(A2345|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)$
It's a bit verbose but straightforward enough. However, would it be possible to generate a (sensible) regex if the hand was unordered? For example, if the hand was 52634 or JQ89T?
One possible way would be to use a ?=.*<item> lookahead (which would essentially be "unsorted"), for example:
^(?:
(?=.*A)(?=.*2)(?=.*3)(?=.*4)(?=.*5)
|(?=.*2)(?=.*3)(?=.*4)(?=.*5)(?=.*6)
|(?=.*3)(?=.*4)(?=.*5)(?=.*6)(?=.*7)
|(?=.*4)(?=.*5)(?=.*6)(?=.*7)(?=.*8)
|(?=.*5)(?=.*6)(?=.*7)(?=.*8)(?=.*9)
|(?=.*6)(?=.*7)(?=.*8)(?=.*9)(?=.*T)
|(?=.*7)(?=.*8)(?=.*9)(?=.*T)(?=.*J)
|(?=.*8)(?=.*9)(?=.*T)(?=.*J)(?=.*Q)
|(?=.*9)(?=.*T)(?=.*J)(?=.*Q)(?=.*K)
|(?=.*T)(?=.*J)(?=.*Q)(?=.*K)(?=.*A)
)
.{5}$
Are there other / better approaches to finding if a straight exists using regex only?
You can use the following regex:
(?!.*(.).*\1)(?:[A2345]{5}|[23456]{5}|[34567]{5}|[45678]{5}|[56789]{5}|[6789T]{5}|[789TJ]{5}|[89TJQ]{5}|[9TJQK]{5}|[TJQKA]{5})
This works by first using a negative lookahead to ensure that the string doesn't contain any duplicates (?!.*(.).*\1). Then it matches 5 characters from any of the straight possibilities.
(?!.*(.).*\1)
#^^^        ^ negative lookahead ensuring what follows doesn't match
#  ^^         match any character any number of times
#    ^^^      capture a character into capture group #1
#       ^^    match any character any number of times
#         ^^  match the same text as most recently matched by the 1st capture group
Against JQQ89, it works as follows:
- .* matches J
- (.) captures Q
- .* matches nothing
- \1 tries to match Q (and succeeds)
- Negative lookahead has a match, so fail the match.
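As a rough illustration, the same regex can be generated instead of typed out. The Python sketch below builds one character class per five-card window and relies on `fullmatch` to anchor the pattern; `RANKS`, `windows` and `is_straight` are names invented here.

```python
import re

# The ace appears twice so that both A2345 and TJQKA come out as windows.
RANKS = "A23456789TJQKA"

# One window per possible straight: A2345, 23456, ..., TJQKA.
windows = [RANKS[i:i + 5] for i in range(len(RANKS) - 4)]

# No character may repeat, then five cards all from a single window.
STRAIGHT = re.compile(
    r"(?!.*(.).*\1)(?:" + "|".join(f"[{w}]{{5}}" for w in windows) + ")"
)


def is_straight(hand):
    """True if the (possibly unsorted) five-card hand is a straight."""
    return STRAIGHT.fullmatch(hand) is not None


for hand in ("52634", "JQ89T", "JQQ89", "KA234"):
    print(hand, is_straight(hand))
```

Note that a wrap-around hand such as KA234 is rejected, because no single window contains both K and 2.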
For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is: is it possible to match as long a string as possible while using as few groups as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (as long as it is not followed by a single a and then a different character or the end of the string) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
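A short Python sketch of the non-capturing version applied to the three strings from the question. With no capturing group in the pattern, `findall` returns the whole matches directly.

```python
import re

SPLIT_A = re.compile(r"(?:a{3}(?!a(?:[^a]|$))|a{2})")

# The three inputs from the question.
for text in ("aaaa", "aaaaa", "aaaaaa"):
    print(text, SPLIT_A.findall(text))
```

The three-a branch is only refused when taking it would strand a single trailing a, which is why aaaa is split as aa + aa.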
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
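One thing worth seeing in practice: because this pattern is anchored at both ends, it answers whether a whole line can be split into groups of two or three a's, rather than listing the groups (a repeated capturing group keeps only its last repetition). A minimal Python sketch, with `can_split` as a made-up helper name:

```python
import re

WHOLE_LINE = re.compile(r"^(?:(a{3})*|(a{2,3})*)$")


def can_split(text):
    """True if the whole text is made of groups of two or three a's."""
    return WHOLE_LINE.search(text) is not None


for text in ("a", "aa", "aaaa", "aaaaa", "aaaaaaa", "aab"):
    print(repr(text), can_split(text))
```

Since both alternatives use `*`, the empty string is accepted as well.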
Try this short regex:
a{2,3}(?!a([^a]|$))
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases, such as aaaaa and aaaaaa.
However, it failed in cases like aaaa, where it matches aaa and leaves a single a behind.
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))
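The steps described in this answer can be replayed in Python. The sketch compares the starting regex `a{2}a?` with the final one; `pieces` is a hypothetical helper. It uses `finditer` with `group(0)` because the final pattern contains a capturing group inside the lookahead, and `findall` would return that group instead of the whole match.

```python
import re

FIRST_TRY = re.compile(r"a{2}a?")
FINAL = re.compile(r"a{2,3}(?!a([^a]|$))")


def pieces(pattern, text):
    """Return every whole match of pattern in text, left to right."""
    return [m.group(0) for m in pattern.finditer(text)]


for text in ("aaaa", "aaaaa", "aaaaaa"):
    print(text, pieces(FIRST_TRY, text), pieces(FINAL, text))
```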
I have two possible patterns:
1.2 hello
1.2.3 hello
I would like to match 1, 2 and 3 if the latter exists.
Optional items seem to be the way to go, but my pattern (\d)\.(\d)?(\.(\d)).hello matches only 1.2.3 hello (almost perfectly: I get four groups but the first, second and fourth contain what I want) - the first test string is not matched at all.
What would be the right match pattern?
Your pattern contains a (\d)\.(\d)?(\.(\d)) part that matches a digit, then a ., then an optional digit (zero or one occurrence), and then a mandatory . followed by a digit. Thus, it can match 1..2 hello, but not 1.2 hello.
You may make the third group non-capturing and make it optional:
(\d)\.(\d)(?:\.(\d))?\s*hello
          ^^^      ^^
If your regex engine does not allow non-capturing groups, use a capturing one; you will then have to grab the third digit from Group 4:
(\d)\.(\d)(\.(\d))?\s*hello
Note that I replaced . before hello with \s* to match zero or more whitespaces.
Note also that if you need to match these numbers at the start of a line, you might consider pre-pending the pattern with ^ (and depending on your regex engine/tool, the m modifier).
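For what it's worth, here is a small Python sketch comparing the pattern from the question with the corrected one on both test strings. When the optional part is absent, the third group simply comes back as `None`.

```python
import re

ORIGINAL = re.compile(r"(\d)\.(\d)?(\.(\d)).hello")
FIXED = re.compile(r"(\d)\.(\d)(?:\.(\d))?\s*hello")

for text in ("1.2 hello", "1.2.3 hello"):
    before = ORIGINAL.search(text)
    after = FIXED.search(text)
    print(text)
    print("  original:", before.groups() if before else None)
    print("  fixed:   ", after.groups() if after else None)
```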