Can I make two groups of regex match in same quantity?

I want a regex that matches the following pattern:
b
abc
aabcc
aaabccc
But does NOT match any of:
ab
bc
aabc
abcc
Basically, /(a)*b(c){_Q_}/, where _Q_ is the number of times that group 1 matched. I know how to match group 1 content later in the string, but how can I match group 1 count?

Use this recursive regex:
^(a(?:(?1)|b)c)$|^(b)$
Demo on regex101
The regex can be further reduced to:
^(a(?1)c|b)$
Demo on regex101
The alternation consists of:
The base case b
The recursive case a(?1)c, which matches a, then recurses into group 1, then matches c. Group 1 is the alternation itself, so it can contain more pairs of a and c, or the recursion ends at the base case b.
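Recursion such as (?1) is a PCRE-style feature, and Python's built-in re module, for one, does not support it. A minimal sketch of the same check under that constraint: the regex only validates the shape, and the count comparison happens in the host language. The name is_balanced is made up for illustration.

```python
import re

# Capture the run of a's and the run of c's around the single b.
RUNS = re.compile(r"(a*)b(c*)")

def is_balanced(text: str) -> bool:
    """Return True when text is n a's, one b, then n c's (n >= 0)."""
    m = RUNS.fullmatch(text)
    # The regex checks the shape only; the counts are compared in Python.
    return m is not None and len(m.group(1)) == len(m.group(2))

for s in ["b", "abc", "aabcc", "aaabccc", "ab", "bc", "aabc", "abcc"]:
    print(s, is_balanced(s))
```

This trades the single-pattern solution for two steps, which may be acceptable when the engine cannot recurse.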

Related

Is it possible to prefer later matches using conditional regex?

Suppose one would like to match an arbitrary number, that is either preceded or followed by a certain string. However, if possible, the pattern should always prefer to match the substring where the number is followed by the string.
For example:
the string 1234 foo should match.
the string foo 1234 should match.
the string 1234 foo 1234 should produce the match 1234 foo.
(most important) the string foo 1234 foo should produce the match 1234 foo and not foo 1234.
To avoid having the same expression twice, only inverted and separated by an or, I've been trying to implement it using a conditional pattern. Using the pattern below works for the cases 1-3, but fails for case 4 as it matches foo 1234.
(\d+)?(?:(?(1) ?)(foo) ??)(?(1)|(?:(\d+)))
Adding a negative lookahead to the conditional like below also doesn't help. The string foo 1234 foo now produces the two matches foo 123 and 4 foo.
(\d+)?(?:(?(1) ?)(foo) ??)(?(1)|(?:(\d+)(?! ?foo)))
Since the regex is greedy it always matches the non-preferred substring in case 4 first. Is there a way to make the regex expression prefer the later match with a conditional expression?
Okay, if you don't want to reuse foo unless it's in a Lookahead, I believe the following pattern would meet all your conditions:
(\d+)?(?(1) )(foo)(?(1)| (\d+)\b(?! foo))
Demo.
Breakdown:
(\d+)? # An optional capturing group matching one or more digits.
(?(1) ) # If the previous group exists, match a space character.
(foo) # A second capturing group to capture "foo".
(? # If...
(1) # ..the first group exists, match nothing.
| # Else...
[ ] # Match a space character.
(\d+) # A third capturing group matching one or more digits.
\b # Assert a word boundary to make sure that no more digits follow.
(?! foo) # A negative Lookahead to make sure " foo" is not following.
) # End If.
An alternative approach for that last part if the digits don't have to end with a word boundary would be to get rid of \b and add \d* in the negative Lookahead:
(\d+)?(?(1) )(foo)(?(1)| (\d+)(?!\d* foo))
Demo.
Another solution is to wrap the third digit-capturing group in an atomic group, which prevents backtracking into the digits when the following negative lookahead fails. Unlike the first solution, this does not require a word boundary between the numerical substring and foo. Unlike the second solution, it allows the numerical substring to be more complex (for example, including decimals: \d+\.?\d*).
(Credits go to the OP, Ferdinand Schlatt)
(\d+)?(?(1) )(foo)(?(1)| (?>(\d+))(?! foo))
Demo
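Python's re module supports the (?(1)...) conditional syntax, so the first two patterns can be checked against the four cases from the question. This is only a sketch: the names WORD_BOUNDARY, DIGIT_LOOKAHEAD and first_match are invented here, and the atomic-group variant is left out because (?>...) needs Python 3.11 or later.

```python
import re

# First solution: \b stops the digits from being shortened by backtracking.
WORD_BOUNDARY = re.compile(r"(\d+)?(?(1) )(foo)(?(1)| (\d+)\b(?! foo))")
# Second solution: \d* inside the lookahead covers the remaining digits.
DIGIT_LOOKAHEAD = re.compile(r"(\d+)?(?(1) )(foo)(?(1)| (\d+)(?!\d* foo))")

def first_match(pattern, text):
    """Return the text of the leftmost match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

# The four cases from the question, with the match each should produce.
CASES = {
    "1234 foo": "1234 foo",
    "foo 1234": "foo 1234",
    "1234 foo 1234": "1234 foo",
    "foo 1234 foo": "1234 foo",
}

for text, expected in CASES.items():
    for pattern in (WORD_BOUNDARY, DIGIT_LOOKAHEAD):
        print(repr(text), "->", repr(first_match(pattern, text)), "expected", repr(expected))
```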

REGEX Capturing differing sets of repeating groups

This is a two-part question, but I feel the answers will be related.
I have this regex pattern:
(\d+)(aa|bb) which I use to capture this string: 1bb2aa3aa4bb5bb6aa7bb8cc9cc
See demo: example 1
The way it captures the random series of aa and bb (both preceded by a digit) is exactly what I want, and is good as far as it goes.
So we get this match on regex101:
Match 1
Full match 0-3 `1bb`
Group 1. 0-1 `1`
Group 2. 1-3 `bb`
Match 2
Full match 3-6 `2aa`
Group 1. 3-4 `2`
Group 2. 4-6 `aa`
Match 3
Full match 6-9 `3aa`
Group 1. 6-7 `3`
Group 2. 7-9 `aa`
Match 4
Full match 9-12 `4bb`
Group 1. 9-10 `4`
Group 2. 10-12 `bb`
Match 5
Full match 12-15 `5bb`
Group 1. 12-13 `5`
Group 2. 13-15 `bb`
Match 6
Full match 15-18 `6aa`
Group 1. 15-16 `6`
Group 2. 16-18 `aa`
Match 7
Full match 18-21 `7bb`
Group 1. 18-19 `7`
Group 2. 19-21 `bb`
As expected, the 8cc9cc bit at the end is ignored. I would like to capture this as well, in the same way I have captured the first repeating groups, in the same expression. So in the final output, I'd get something like this added to the end of the output. This should work for any number of matches on either side. This text is just one example.
Match 8
Full match 21-24 `8cc`
Group 1. 21-22 `8`
Group 2. 22-24 `cc`
Match 9
Full match 24-27 `9cc`
Group 1. 24-25 `9`
Group 2. 25-27 `cc`
Also, I'd like to do similar but flipping the 'or' group to the end i.e. this:
1cc2cc3cc4cc5cc6cc7ccb8aa9bb
My current regex pattern (\\d+)(cc) only matches the repeating 'cc' groups.
See demo: example 2
I would like a similar full capture, with any amount of permissible entries of each group.
Any thoughts?
You may use
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+))(\d+)(aa|bb|cc)
See the regex demo
The regex will only match a string that meets the pattern in the (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) lookahead, and will then consecutively match and capture digits followed by aa, bb or cc; digits + aa or bb are only matched if digits + cc do not immediately precede them.
Details
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+)) - either of the two alternatives:
\G(?!^) - end of the previous successful match
(?(?=\d+(?:aa|bb))(?<!\dcc)) - conditional construct: if there are 1+ digits followed by aa or bb immediately to the right of the current location ((?=\d+(?:aa|bb))), then only continue matching if there is no digit followed by cc immediately to the left of the current location ((?<!\dcc))
| - or
^ - start of string
(?=(?:\d+(?:aa|bb))+(?:\d+cc)+) - a positive lookahead that, immediately to the right of the current location, searches for the following (and returns true if it finds the patterns, or false if it does not):
(?:\d+(?:aa|bb))+ - one or more occurrences of 1+ digits followed with aa or bb
(?:\d+cc)+ - one or more occurrences of 1+ digits followed with cc
(\d+) - Group 1: one or more digits
(aa|bb|cc) - aa, bb or cc.
For the second pattern, swap cc and (?:aa|bb):
(?:\G(?!^)(?(?=\d+cc)(?<!\d(?:aa|bb)))|(?=(?:\d+cc)+(?:\d+(?:aa|bb))+))(\d+)(aa|bb|cc)
I'm no expert with perl, so I'll give a bit of pseudo code here. Feel free to suggest an edit.
You can start by matching any number of xaa or xbb combos, followed by one or more xcc combos, using this pattern: ^(?:\d+(?:aa|bb))+(?:\d+cc)+$
Once you have that, you can use this pattern to capture the appropriate groups: (\d+)(aa|bb|cc)
Demo 1
Demo 2
Something like:
if(ismatch("^(?:\d+(?:aa|bb))+(?:\d+cc)+$", inputString))
{
match = match("(\d+)(aa|bb|cc)", inputString);
}
From here you can extract the information using the groups.
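The pseudocode above could look roughly like this in Python. It is a sketch under a couple of assumptions: the names SHAPE, PAIR and capture_pairs are invented, and the cc part is written as \d+cc so that multi-digit numbers are accepted there too.

```python
import re

# Step 1: the whole string must be (digits + aa|bb)+ followed by (digits + cc)+.
SHAPE = re.compile(r"(?:\d+(?:aa|bb))+(?:\d+cc)+")
# Step 2: pull out every (digits, suffix) pair.
PAIR = re.compile(r"(\d+)(aa|bb|cc)")

def capture_pairs(text):
    """Return all (digits, suffix) pairs, or an empty list if the shape is wrong."""
    if SHAPE.fullmatch(text) is None:
        return []
    return PAIR.findall(text)

print(capture_pairs("1bb2aa3aa4bb5bb6aa7bb8cc9cc"))
print(capture_pairs("1cc2aa"))  # cc before aa: rejected by the shape check
```

Splitting validation from extraction keeps both patterns simple, at the cost of scanning the string twice.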

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question: is it possible to use as few groups as possible to match as long a string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and then a different character or the end of the string) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
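A quick way to check this pattern against the three strings from the question, sketched in Python (the names CHUNKS and split_runs are made up; the non-capturing form is used so that findall returns whole matches):

```python
import re

# Take three a's unless that would strand a single trailing a; otherwise take two.
CHUNKS = re.compile(r"a{3}(?!a(?:[^a]|$))|a{2}")

def split_runs(text):
    """Return the successive matches found in text."""
    return CHUNKS.findall(text)

for s in ["aaaa", "aaaaa", "aaaaaa"]:
    print(s, split_runs(s))
```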
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases:
However, it failed in cases like:
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))

How to optionally match a group?

I have two possible patterns:
1.2 hello
1.2.3 hello
I would like to match 1, 2 and 3 if the latter exists.
Optional items seem to be the way to go, but my pattern (\d)\.(\d)?(\.(\d)).hello matches only 1.2.3 hello (almost perfectly: I get four groups but the first, second and fourth contain what I want) - the first test string is not matched at all.
What would be the right match pattern?
Your pattern contains the (\d)\.(\d)?(\.(\d)) part, which matches a digit, then a ., then an optional digit (zero or one occurrence), and then a . followed by a digit. Thus, it can match 1..2 hello, but not 1.2 hello.
You may make the third group non-capturing and make it optional:
(\d)\.(\d)(?:\.(\d))?\s*hello
          ^^^      ^^
See the regex demo
If your regex engine does not allow non-capturing groups, use a capturing one; you will just have to grab the value from Group 4:
(\d)\.(\d)(\.(\d))?\s*hello
See this regex.
Note that I replaced . before hello with \s* to match zero or more whitespaces.
Note also that if you need to match these numbers at the start of a line, you might consider pre-pending the pattern with ^ (and depending on your regex engine/tool, the m modifier).
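To see how the optional non-capturing group behaves on both inputs, here is a small Python sketch (VERSION and version_parts are hypothetical names). When the optional part is absent, the third group is simply reported as None.

```python
import re

# Two mandatory digits, an optional ".digit", optional whitespace, then "hello".
VERSION = re.compile(r"(\d)\.(\d)(?:\.(\d))?\s*hello")

def version_parts(text):
    """Return the three captured digits as a tuple, or None if nothing matches."""
    m = VERSION.search(text)
    return m.groups() if m else None

print(version_parts("1.2 hello"))
print(version_parts("1.2.3 hello"))
```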

Conditional Regexp: return only one group

Two types of URLs I want to match:
(1) www.test.de/type1/12345/this-is-a-title.html
(2) www.test.de/category/another-title-oh-yes.html
In the first type, I want to match "12345".
In the second type I want to match "category/another-title-oh-yes".
Here is what I came up with:
(?:(?:\.de\/type1\/([\d]*)\/)|\.de\/([\S]+)\.html)
This returns the following:
For type (1):
Match group 1: 12345
Match group 2:
For type (2):
Match group 1:
Match group 2: category/another-title-oh-yes
As you can see, it is working pretty well already.
For various reasons I need the regex to return only one match-group, though. Is there a way to achieve that?
Java/PHP/Python
Get the match in group 1 for both URL types by using a negative lookahead and positive lookbehinds.
((?<=\.de\/type1\/)\d+|(?<=\.de\/)(?!type1)[^\.]+)
There are two regex patterns that are ORed.
The first pattern looks for 12345.
The second pattern looks for category/another-title-oh-yes.
Note:
Each alternative must produce exactly one match per URL.
Wrap the whole pattern in a single pair of parentheses (...|...) and remove the parentheses around [^\.]+ and \d+, where:
[^\.]+ matches everything up to the next dot
\d+ matches one or more digits
Here is online demo on regex101
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. [18-23] `12345`
MATCH 2
1. [57-86] `category/another-title-oh-yes`
JavaScript
Try this one; the wanted value lands in group 2 for type (1) and in group 3 for type (2), because groups are numbered by their opening parenthesis.
((?:\.de\/type1\/)(\d+)|(?:\.de\/)(?!type1)([^\.]+))
Here is online demo on regex101.
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. `.de/type1/12345`
2. `12345`
MATCH 2
1. `.de/category/another-title-oh-yes`
3. `category/another-title-oh-yes`
Maybe this:
^www\.test\.de/(type1/(.*)|(.*))\.html$
Debuggex Demo
Then for example:
var str = "www.test.de/type1/12345/this-is-a-title.html";
var regex = /^www\.test\.de\/(type1\/(.*)|(.*))\.html$/;
console.log(str.match(regex));
This outputs an array: the first element is the whole matched string, the second is whatever follows the website address, the third is what matched for type1, and the fourth is the rest.
You can do something like var matches = str.match(regex); return matches[2] || matches[3];
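The same "take whichever group participated" idea can be applied to the pattern from the question without rewriting it. A sketch in Python, assuming a small wrapper function is acceptable in place of a single-group regex (TWO_GROUPS and wanted_part are made-up names):

```python
import re

# The two-group pattern from the question, unchanged.
TWO_GROUPS = re.compile(r"(?:(?:\.de\/type1\/([\d]*)\/)|\.de\/([\S]+)\.html)")

def wanted_part(url):
    """Return the one group that took part in the match, or None."""
    m = TWO_GROUPS.search(url)
    if m is None:
        return None
    # Only one alternative can match, so at most one group is not None.
    return m.group(1) if m.group(1) is not None else m.group(2)

print(wanted_part("www.test.de/type1/12345/this-is-a-title.html"))
print(wanted_part("www.test.de/category/another-title-oh-yes.html"))
```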