How to optionally match a group? - regex

I have two possible patterns:
1.2 hello
1.2.3 hello
I would like to match 1, 2 and 3 if the latter exists.
Optional items seem to be the way to go, but my pattern (\d)\.(\d)?(\.(\d)).hello matches only 1.2.3 hello (almost perfectly: I get four groups but the first, second and fourth contain what I want) - the first test sting is not matched at all.
What would be the right match pattern?

Your pattern contains (\d)\.(\d)?(\.(\d)) part that matches a digit, then a ., then an optional digit (it may be 1 or 0) and then a . + a digit. Thus, it can match 1..2 hello, but not 1.2 hello.
You may make the third group non-capturing and make it optional:
(\d)\.(\d)(?:\.(\d))?\s*hello
^^^ ^^
See the regex demo
If your regex engine does not allow non-capturing groups, use a capturing one, just you will have to grab the value from Group 4:
(\d)\.(\d)(\.(\d))?\s*hello
See this regex.
Note that I replaced . before hello with \s* to match zero or more whitespaces.
Note also that if you need to match these numbers at the start of a line, you might consider pre-pending the pattern with ^ (and depending on your regex engine/tool, the m modifier).

Related

Refer to same branch of previous alternate group

I need to build a regex with capture groups that would result in the following:
12-34 # match: (1) (2) (3) (4)
1a-2b # match: (1) (a) (2) (b)
12-3b # nomatch
In a nutshell, if the first part has two digits, then the second part must also have two digits. And if it has a letter, then the second part must also have a letter.
In PCRE flavor, (\d)(\d|[abc])-(\d)(\d|[abc]) matches the third line, so it is too permissive.
Using named groups, (\d)(?<named>\d|[abc])-(\d)(?P=named) matches no line at all, for it requires the second characters to be exactly the same. It is too restrictive.
Is there a way I can require that my second alternate group (\d|[abc]) takes the same branch as the first (\d|[abc])?
Or do I need to fall back on the full (?:(\d)(\d)-(\d)(\d)|(\d)([abc])-(\d)([abc])) which duplicates parts of my regex?
In PCRE you may use this regex:
^(?:(?<num>\d{2})-(?&num)|(?<alnum>\d\pL)-(?&alnum))$
RegEx Demo 1
RegEx Details:
(?<num>\d{2}): named group num for matching 2 digits
(?<alnum>\d\pL): named group alnum for matching 1 digit followed by a letter
(?&num): Match same sub-pattern as in named group num
(?&alnum): Match same sub-pattern as in named group alnum
Another option is to use conditional sub-patterns in PCRE as:
^(?:(?<num>\d{2})|\d\pL)-(?(num)\d{2}|\d\pL)$
RegEx Demo 2

Using regex to determine straight (unordered hand)

A straight in poker is five cards in a row, for example 23456 or 89TJQ. With a "sorted" hand, the regex could be written as:
^(A2345|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)$
It's a bit verbose but straightforward enough. However, would it be possible to generate a (sensible) regex if the hand was unordered? For example, if the hand was 52634 or JQ89T??
One possible way would be to use a ?=.*<item> lookahead (which would essentially be "unsorted"), for example:
^(?:
(?=.*A)(?=.*2)(?=.*3)(?=.*4)(?=.*5)
|(?=.*2)(?=.*3)(?=.*4)(?=.*5)(?=.*6)
|(?=.*3)(?=.*4)(?=.*5)(?=.*6)(?=.*7)
|(?=.*4)(?=.*5)(?=.*6)(?=.*7)(?=.*8)
|(?=.*5)(?=.*6)(?=.*7)(?=.*8)(?=.*9)
|(?=.*6)(?=.*7)(?=.*8)(?=.*9)(?=.*T)
|(?=.*7)(?=.*8)(?=.*9)(?=.*T)(?=.*J)
|(?=.*8)(?=.*9)(?=.*T)(?=.*J)(?=.*Q)
|(?=.*9)(?=.*T)(?=.*J)(?=.*Q)(?=.*K)
|(?=.*T)(?=.*J)(?=.*Q)(?=.*K)(?=.*A)
)
.{5}$
Are there other / better approaches to finding if a straight exists using regex only?
You can use the following regex:
See regex in use here
(?!.*(.).*\1)(?:[A2345]{5}|[23456]{5}|[34567]{5}|[45678]{5}|[56789]{5}|[6789T]{5}|[789TJ]{5}|[89TJQ]{5}|[9TJQK]{5}|[TJQKA]{5})
This works by first using a negative lookahead to ensure that the string doesn't contain any duplicates (?!.*(.).*\1). Then it matches 5 characters from any of the straight possibilities.
(?!.*(.).*\1)
#^^^ ^ negative lookahead ensuring what follows doesn't match
# ^^ match any character any number of times
# ^^^ capture a character into capture group #1
# ^^ match any character any number of times
# ^^ match the same text as most recently matched by the 1st capture group
Against JQQ89, it works as follows:
- .* matches J
- (.) captures Q
- .* matches nothing
- \1 tries to match Q (and succeeds)
- Negative lookahead has a match, so fail the match.

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is that is it possible to use as few groups as possible to match as long string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and a different character) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
-$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases:
However, it failed in cases like:
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parenthesis around the regex, like:
(a{2,3}(?!a([^a]|$)))

Regex to check only if the group is present

I have String which may have values like below.
854METHYLDOPA
041ALDOMET /00000101/
133IODETO DE SODIO [I 131]
In this i need to get the text starting from index 4 till we find any one these patterns /00000101/ or [I 131]
Expected Output:
METHYLDOPA
ALDOMET
IODETO DE SODIO
I have tried the below RegEx for the same
(?:^.{3})(.*)(?:[[/][A-Z0-9\s]+[]/\s+])
But this RegEx works if the string contains [/ but it doesn't work for the case1 where these patterns doesn't exist.
I have tried adding ? at the end but it works fore case 1 but doesn't work for case 2 and 3.
Could anyone please help me on getting the regx work?
Your logic is difficult to phrase. My interpretation is that you always want to capture from the 4th character onwards. What else gets captured depends on the remainder of the input. Should either /00000101/ or [I 131] occur, then you want to capture up until that point. Otherwise, you want to capture the entire string. Putting this all together yields this regex:
^.{3}(?:(.*)(?=/00000101/|\[I 131\])|(.*))
Demo
You may try this:
^.{3}(.*?)($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])).*$
and replace by this to get the exact output you want.
\1
Regex Demo
Explanation:
^ --> start of a the string
.{3} --> followed by 3 characters
(.*?) --> followed by anything where ? means lazy it will fetch until it finds the following and won't go beyond that. It also captures it as
group 1 --> \1
($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])) ---------->
$ --> ends with $ which means there is there is not such pattern that
you have mentioned
| or
(?:\s*\/00000101\/) -->another pattern of yours improvised with \s* to cover zero or more blank space.
| or
(?:\s*\[I\s+131\]) --> another pattern of yours with improvised \s+
which means 1 or more spaces. ?: indicates that we will not capture
it.
.*$ --> .* is just to match anything that follows and $
declares the end of string.
so we end up only capturing group 1 and nothing else which ensures to
replace everything by group1 which is your target output.
You could get the values you are looking for in group 1:
^.{3}(.+?)(?=$| ?\[I 131\]| ?\/00000101\/)
Explanation
From the beginning of the string ^
Match the first 3 characters .{3}
Match in a capturing group (where your values will be) any character one or more times non greedy (.+?)
A positive lookahead (?=
To assert what follow is either the end of the string $
or |
an optional space ? followed by [I 131] \[I 131\]
or |
an optional space ? followed by /00000101/ \/00000101\/
If your regex engine supports \K, you could try it like this and the values you are looking for are not in a group but the full match:
^.{3}\K.+?(?=$| ?\[I 131\]| ?\/00000101\/)

Regular expression to capture & fix a version number

I'm trying to create a regular expression to extract version numbers. Since the source who provide those version strings is mostly unreliable I need to clean those values.
A version is a number or a group of numbers separated by only one dot. As soon as the chain is broken I stop capturing and I keep what was captured so far.
Test cases:
Foo 1.2.3.4.5 bar --> Should capture 1.2.3.4.5
Foo 111111.2..3.4.5 bar --> Should capture 111111.2
Foo 10.. bar --> Should capture 10
1.2.3 aaa --> Should capture 1.2.3
aaa 1.2.3 --> Should capture 1.2.3
1.23 --> Should capture 1.23
I found out some examples but none matches my edge cases (see 3rd case outlined above).
So far I have:
/(\d+(?:\.\d+)+)/i
But it does not cover all my cases... I use that with PHP (pcre).
I would go with the following :
\d+(?:\.\d+)*
This matches a number which can be followed by any amount of [ dot and number ].
The difference with your regex is the use of a * which enables capturing versions composed of a single number.
I've also removed the outer grouping parenthesis which likely serve no purpose.
Solution 1:
Regex demo
Regex: ^[^\d]+\s*\K(\d+(?:\.\d+)*)
1. ^ start of string.
2. [^\d]+\s* match all except digit then spaces
3. \K will reset the match.
4. (\d+(?:\.\d+)*) this will match digits and then existence of zero or more patterns of . and digits
Solution 2:
Regex demo
Regex: ^Foo\s*\K(\d+(?:\.\d+)*)
1. ^ start of string.
2. Foo\s* this will match Foo and spaces
3. \K will reset the match.
4. (\d+(?:\.\d+)*) this will match digits and then existence of zero or more patterns of . and digits