Capturing groups in regex


I have the string a/b/c/ and I want to get 3 groups (a/, b/, c/) with a regex.
So, I can do this
^([^\/]+\/)([^\/]+\/)([^\/]+\/)$
but it is not very elegant.
I want to do something like this
^([^\/]+\/){3}$
but I get a warning:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
So, I'm interested in the data, but I don't understand what I should change in the regex to get valid result.
Test on regex101
Small example for context (nginx config):
location ~* ^/([^/]+/)([^/]+/)([^/]+/)$ {
rewrite (?i)^/([^/]+/)([^/]+/)([^/]+/)$ /$3$2$1 break;
}
In this case I rewrite the URL from /a/b/c/ to /c/b/a/.

There is really not much you can do to reduce the duplication in:
^([^\/]+\/)([^\/]+\/)([^\/]+\/)$
The warning is telling you that a repeated group such as ([^\/]+\/){3} will only capture the last repeat. You might think that ([^\/]+\/){3} is 3 groups, but it's only one group, because there is only one pair of parentheses. That group is going to contain the last thing the quantifier matches, in this case c/.
So to have 3 groups, you must have 3 pairs of parentheses.
If you really want to make the regex shorter, you can try:
[^\/]+\/
This will create 3 matches instead of groups, but you would have to check, using code, that:
there are exactly three matches
the end of each match is the start of the next match
the first match starts at the start of the string
the last match ends at the end of the string
in order to achieve the same effect as your original regex.
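The four checks above can be sketched in code. This is a minimal Python sketch, not part of the original answer; the function name split_three_segments is made up for illustration.

```python
import re

SEGMENT = re.compile(r"[^/]+/")

def split_three_segments(s):
    """Return the three 'x/' segments of s, or None if s is not exactly three segments."""
    matches = list(SEGMENT.finditer(s))
    # there are exactly three matches
    if len(matches) != 3:
        return None
    # the first match starts at the start of the string,
    # and the last match ends at the end of the string
    if matches[0].start() != 0 or matches[-1].end() != len(s):
        return None
    # the end of each match is the start of the next match
    for prev, nxt in zip(matches, matches[1:]):
        if prev.end() != nxt.start():
            return None
    return [m.group() for m in matches]

print(split_three_segments("a/b/c/"))   # three segments
print(split_three_segments("a//b/c/"))  # rejected: the matches are not contiguous
```

Whether this is actually shorter than spelling out three groups is debatable; it mainly shows how much work the anchors and the three explicit groups do in the original pattern.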

The pattern ^([^\/]+\/){3}$ repeats the group 3 times, but group 1 will only contain the value of the last iteration. The page "The Returned Value for a Given Group is the Last One Captured" may be helpful.
If you want group 1, 2 and 3 you have to use 3 capturing groups in the pattern.
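To see the difference in practice, here is a small Python sketch (Python's re module is used only as an example engine; the variable names are illustrative). The repeated group yields a single group holding the last iteration, while three explicit groups yield three values.

```python
import re

s = "a/b/c/"

# One pair of parentheses, repeated three times: still only one group.
repeated = re.fullmatch(r"([^/]+/){3}", s)
print(repeated.groups())   # ('c/',)

# Three pairs of parentheses: three groups.
separate = re.fullmatch(r"([^/]+/)([^/]+/)([^/]+/)", s)
print(separate.groups())   # ('a/', 'b/', 'c/')
```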
Not sure if this qualifies as more elegant, but another option is to get 3 separate matches, using \G to match iteratively and a positive lookahead (?= to assert that the whole string consists of exactly 3 repetitions of one or more non-slash characters followed by a /:
(?:(?=^(?:[^/]+/){3}$)|\G(?!^))[^/]+/
(?: Non capturing group
(?= Positive lookahead, assert what is on the right is
^(?:[^/]+/){3}$ From start to end of string, match 3 times: 1+ chars other than a forward slash, then a /
) Close positive lookahead
| Or
\G(?!^) Assert position at the end of the previous match, not at the start
) Close non capturing group
[^/]+/ Match 1+ chars other than a forward slash, then /
See a regex demo

Related

using regex to verify if character before dot is even or odd number

I'm trying to figure out if my problem is solvable using regex.
I have computer name in format computer01.domain.com.
I'd like to check if number before first dot is odd or even number.
I managed to build a regex that locates the first character before the dot, ^([^.])+(?=\.), but now I can't figure out how to check whether it's odd or even.

If you want to know whether there is an even or odd number, you might use 2 capture groups with an alternation.
If the first capture group exists, it is an odd number, if the second one exists then it is an even number.
If you only want to capture the single digit, you can also match the following dot instead of asserting it.
^[^.]+(?:([13579])|([02468]))\.
The pattern matches:
^ Start of string.
[^.]+ Match 1+ times any char other than a dot
(?: Non capture group
([13579]) Capture an odd number in group 1
| Or
([02468]) Capture an even number in group 2
) Close the non capture group
\. Match the dot
Regex demo
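As a rough illustration of how the two groups could be checked from code, here is a minimal Python sketch. The function name parity is hypothetical; the pattern is the one from this answer.

```python
import re

PARITY = re.compile(r"^[^.]+(?:([13579])|([02468]))\.")

def parity(hostname):
    """Return 'odd' or 'even' for the digit right before the first dot, else None."""
    m = PARITY.match(hostname)
    if m is None:
        return None  # no digit directly before the first dot
    # Group 1 is set for an odd digit, group 2 for an even digit.
    return "odd" if m.group(1) is not None else "even"

print(parity("computer01.domain.com"))  # odd
print(parity("computer02.domain.com"))  # even
```

Only the last digit before the dot decides the parity, which is why a single-digit capture is enough.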

How to capture nested named groups when referencing outer group by name?

In the list of integer numbers separated by comma, I need to capture (via a PCRE regex) the first occurrence of 12* (if any) and the first occurrence of 45* (if any). How do I do that?
I tried the following but it can only capture inside the first number in the sequence :(
(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+)(?:,(?P>number))*
Here's a sample string to test: 11,222,123,444,456,7. I expect to capture n12=123 and n45=456 here.
UPD
As a workaround, my own solution is to declare the delimiter optional (which it isn't), like this:
(?:,?(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+))*
- this works for me, but not in all cases (e.g. ,1234, 123,4, 1234 and ,123,4 are parsed identically), which I'd like to avoid if possible.
UPD2
N.B. C'mon, this is not the real task I'm faced with - it is just a simplified example. Here's another one so that you can get my idea better:
(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+)(?:,(?P>animal))*
pussy,mouse,dog,bird - has to capture: cat=pussy, dog=dog

Without named groups, you could capture either 12 or 45 in group 1. For the second capture group, recurse the first subpattern using (?1), and before it assert with a negative lookahead and a backreference, (?!\1), that what follows is not the same as what was already captured in group 1.
^(?:\d+,)*?(12|45)(?:\d*(?:,\d+)*?,(?!\1)((?1)))?
Explanation
^ Start of string
(?:\d+,)*? Match as few as possible optional repetitions of 1+ digits and ,
(12|45) Capture either 12 or 45 in group 1
(?: Non capture group
\d* Match optional digits (the rest of the first number)
(?:,\d+)*?, Match as few as possible optional repetitions of , and 1+ digits, then match ,
(?!\1) Negative lookahead, assert not what was captured in group 1
((?1)) Capture group 2, recurse the first subpattern
)? Close non capture group and make it optional to also allow matching 1 capture group
Regex demo
If you want named capture groups for a single or 2 group values, you can use an alternation with the J flag to allow duplicate subpattern names.
The pattern matches either the first occurrence of 12 followed by 45, the first occurrence of 45 followed by 12, only 12, or only 45.
^(?:(?:\d+,)*?(?P<n12>12)\d*(?:,\d+)*?,(?P<n45>45)|(?:\d+,)*?(?P<n45>45)\d*(?:,\d+)*?,(?P<n12>12)|(?:\d+,)*?(?P<n12>12)|(?:\d+,)*?(?P<n45>45))
Regex demo

It looks like PCRE doesn't allow capturing named subpatterns nested inside a named pattern that is called by reference. So the exact answer to the question as asked is "There's no way. Sorry".
But there's a workaround for this specific case: instead of referencing the subpattern:
(?P<animal>...)(?:,(?P>animal))*
- you may avoid referencing it:
(?:,(?P<animal>...))*
- but this would require the subject to start with a delimiter, which it doesn't.
A bad workaround for this is to mark the delimiter as optional:
(?:,?(?P<animal>...))*
- but it allows strange sequences to match.
A better solution is to mark the delimiter conditionally required: if the subpattern has already matched at least once, then the delimiter is required, otherwise it must be omitted:
(?:(?(animal),)(?P<animal>...))*
i.e.
(?:(?(animal),)(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+))*
N.B. This will only capture the last match for each subpattern (if any).
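If the first occurrence of each subpattern is needed rather than the last, one option is to leave the single-regex approach and do the bookkeeping in code. The following Python sketch is a swapped-in technique, not the PCRE solution above: it validates the list with one regex, then classifies each item with the inner alternation. The function name first_captures is hypothetical.

```python
import re

# Whole subject: items separated by single commas, no leading/trailing delimiter.
LIST = re.compile(r"\w+(?:,\w+)*")
# One item: the inner alternation from the pattern above.
ITEM = re.compile(r"(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+")

def first_captures(subject):
    """Return {group name: first matching item}, or None if the list is malformed."""
    if LIST.fullmatch(subject) is None:
        return None
    found = {}
    for item in subject.split(","):
        m = ITEM.fullmatch(item)
        for name, value in m.groupdict().items():
            # Keep only the first occurrence for each named group.
            if value is not None and name not in found:
                found[name] = value
    return found

print(first_captures("pussy,mouse,dog,bird"))
```

For the sample subject this yields cat=pussy and dog=dog, and a subject such as ,dog is rejected because the delimiter is not optional.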

Regex for two of any digit then three of another then four of another?

Regex is great, but I can't for the life of me figure out how I'd express the following constraint, without spelling out the whole permutation:
2 of any digit [0-9]
3 of any other digit [0-9] excluding the above
4 of any third digit [0-9] excluding the above
I've got this monster, which is clearly not a good way of doing this, as it grows exponentially with each additional set of digits:
^(001112222|001113333|001114444|001115555|001116666|001117777|001118888|001119999|002220000|...)$
OR
^(0{2}1{3}2{4}|0{2}1{3}3{4}|0{2}1{3}4{4}|0{2}1{3}5{4}|0{2}1{3}6{4}|0{2}1{3}7{4}|0{2}1{3}8{4}|...)$

Looks like the following will work:
^((\d)\2(?!.+\2)){2}\2(\d)\3{3}$
It may look a bit tricky, using nested capture groups, backreferences and a negative lookahead, but it may look more intimidating than it really is. See the online demo.
^ - Start string anchor.
( - Open 1st capture group:
(\d) - A 2nd capture group that captures a single digit ranging from 0-9.
\2 - A backreference to what is captured in this 2nd group.
(?!.+\2) - Negative lookahead to prevent 1+ characters followed by a backreference to the 2nd group's match.
){2} - Close the 1st capture group and match this two times.
\2 - A backreference to what is most recently captured in the 2nd capture group.
(\d) - A 3rd capture group holding a single digit.
\3{3} - Exactly three backreferences to the 3rd capture group's match.
$ - End string anchor.
EDIT:
Looking at your alternations it looks like you are also allowing numbers like "002220000" as long as the digits in each sequence are different to the previous sequence of digits. If that is the case you can simplify the above to:
^((\d)\2(?!.\2)){2}\2(\d)\3{3}$
The main difference is that the "+" quantifier has been taken out of the lookahead, which allows the same digit to be used again further on.
See the demo
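The difference between the two variants can be checked with a few inputs. This is a small Python sketch (Python's re module supports the backreferences and lookaheads used here); the names STRICT and RELAXED are only labels for this example.

```python
import re

# All three digits must differ from each other.
STRICT = re.compile(r"^((\d)\2(?!.+\2)){2}\2(\d)\3{3}$")
# Each run only has to differ from the run directly before it.
RELAXED = re.compile(r"^((\d)\2(?!.\2)){2}\2(\d)\3{3}$")

for s in ("001112222", "002220000", "001111111"):
    print(s, bool(STRICT.match(s)), bool(RELAXED.match(s)))
```

Here 001112222 is accepted by both, 002220000 only by the relaxed variant, and 001111111 by neither.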

Depending on whether your target environment/framework/language supports lookaheads, you could do something like:
^(\d)\1(?!\1)(\d)\2\2(?!\1|\2)(\d)\3\3\3$
The first capture group ((\d)) lets us enforce the "two identical consecutive digits" rule by referencing the captured value (\1) as the next match, after which the negative lookahead ensures the next sequence doesn't start with the previous digit. Then we just repeat this pattern twice.
Note: If you want to exclude only the digit used in the immediately preceding sequence, change (?!\1|\2) to just (?!\2)
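A quick way to try both forms of this pattern is sketched below in Python; the constant names are illustrative and not from the answer.

```python
import re

# Third digit must differ from both earlier digits.
ALL_DIFFERENT = re.compile(r"^(\d)\1(?!\1)(\d)\2\2(?!\1|\2)(\d)\3\3\3$")
# Third digit only has to differ from the second one.
ADJACENT_ONLY = re.compile(r"^(\d)\1(?!\1)(\d)\2\2(?!\2)(\d)\3\3\3$")

for s in ("001112222", "002220000", "000112222"):
    print(s, bool(ALL_DIFFERENT.match(s)), bool(ADJACENT_ONLY.match(s)))
```

Because the run lengths are spelled out explicitly, adding a fourth run means adding one more group and one more lookahead rather than another alternation branch.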

Regex - optional capture group after wildcard

Say I have the following list:
No 1 And Your Bird Can Sing (4)
No 2 Baby, You're a Rich Man (5)
No 3 Blue Jay Way S
No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)
And I want to extract the number, the title and the number of weeks in the parenthesis if it exists.
Works, but the last group is not optional (regstorm):
No (?<no>\d{1,3}) (?<title>.*?) \((?<weeks>\d)\)
Last group optional, only matches number (regstorm):
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?
Combining one pattern with week capture with a pattern without week capture works, but there gotta be a better way:
(No (?<no>\d{1,3}) (?<title>.*) \((?<weeks>\d)\))|(No (?<no>\d{1,3}) (?<title>.*))
I use C# and javascript but I guess this is a general regex question.

Your regex is almost there!
First and most importantly, you should add a $ at the end. This forces (?<title>.*?) to keep expanding until the rest of the pattern can match up to the end of the string. Currently, (?<title>.*?) matches an empty string and then stops, because it realises that it has reached a point where the rest of the regex matches. Why does the rest of the regex match? Because the optional group can match an empty string. By adding the $, you make the rest of the regex "harder" to match.
Secondly, you forgot to match an open parenthesis \(.
This is what your regex should look like:
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$
Demo
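The effect of the $ can be reproduced in a few lines. The sketch below uses Python, which writes named groups as (?P<name>...) instead of (?<name>...); the function name parse_line is made up for this example.

```python
import re

UNANCHORED = re.compile(r"No (?P<no>\d{1,3}) (?P<title>.*?)( \((?P<weeks>\d)\))?")
ANCHORED = re.compile(r"No (?P<no>\d{1,3}) (?P<title>.*?)( \((?P<weeks>\d)\))?$")

def parse_line(line, pattern=ANCHORED):
    """Return (no, title, weeks); weeks is None when the line has no '(n)' suffix."""
    m = pattern.match(line)
    if m is None:
        return None
    return m.group("no"), m.group("title"), m.group("weeks")

print(parse_line("No 2 Baby, You're a Rich Man (5)"))
print(parse_line("No 3 Blue Jay Way S"))
# Without the $, the lazy title stops immediately and comes back empty.
print(parse_line("No 2 Baby, You're a Rich Man (5)", UNANCHORED))
```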

You may use this regex with an optional last part:
^No (?<no>\d{1,3}) (?<title>.*?\S)(?: \((?<weeks>\d)\))?$
RegEx Demo

Another option could be for the title to match either not ( or when it does encounter a ( it should not be followed by a digit and a closing parenthesis.
^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?
In parts
^No
(?<no>\d{1,3}) Group no and space
(?<title>
(?: Non capturing group
[^(\r\n]+ Match any char except ( or newline
| Or
\((?!\d\)) Match ( if not directly followed by a digit and )
)+ Close group and repeat 1+ times
) Close group title
(?: Non capturing group
\((?<weeks>\d)\) Group weeks between parentheses
)? Close group and make it optional
Regex demo
If you don't want to have to trim the trailing space from the title, you could exclude it from the title match and match it before the weeks instead.
Regex demo

Regex optional group

I am using this regex:
((?:[a-z][a-z]+))_(\d+)_((?:[a-z][a-z]+)\d+)_(\d{13})
to match strings like this:
SH_6208069141055_BC000388_20110412101855
separating into 4 groups:
SH
6208069141055
BC000388
20110412101855
Question: How do I make the first group optional, so that the resulting group is an empty string?
I want to get 4 groups in every case, when possible.
Input string for this case: (no first group and no underscore after it)
6208069141055_BC000388_20110412101855

To make a non-capturing group optional (matching zero or one time), append ? to it.
(?: ..... )?
^          ^____ optional
|____ group

You can easily simplify your regex to be this:
(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$
^              ^^
|--------------||
| first group  ||- quantifier for 0 or 1 time (essentially making it optional)
I'm not sure whether the input string without the first group will have the underscore or not, but you can use the above regex if it's the whole string.
regex101 demo
As you can see, the matched group 1 in the second match is empty and starts at matched group 2.
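How an unmatched optional group is reported depends on the engine. As a hedged illustration, in Python an optional group that did not participate is returned as None rather than an empty string, so a default has to be requested to always get 4 strings. The function name split_fields is hypothetical, and re.IGNORECASE is an assumption made here because the sample input uses uppercase letters while the pattern uses [a-z].

```python
import re

PATTERN = re.compile(
    r"(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$",
    re.IGNORECASE,  # assumption: the sample data is uppercase
)

def split_fields(s):
    """Return the 4 groups as strings, using '' for a missing first group."""
    m = PATTERN.search(s)
    if m is None:
        return None
    # groups(default="") replaces unmatched groups (None) with an empty string.
    return m.groups(default="")

print(split_fields("SH_6208069141055_BC000388_20110412101855"))
print(split_fields("6208069141055_BC000388_20110412101855"))
```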
