How to capture nested named groups when referencing outer group by name? - regex

In a comma-separated list of integers, I need to capture (via a PCRE regex) the first occurrence of 12* (if any) and the first occurrence of 45* (if any). How do I do that?
I tried the following but it can only capture inside the first number in the sequence :(
(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+)(?:,(?P>number))*
Here's a sample string to test: 11,222,123,444,456,7. I expect to capture n12=123 and n45=456 here.
UPD
As a workaround, my own solution is to declare the delimiter optional (which it isn't), like this:
(?:,?(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+))*
This works for me, but not in all cases (e.g. ",1234", "123,4", "1234" and ",123,4" are all parsed identically), which I'd like to avoid if possible.
UPD2
N.B. C'mon, this is not the real task I'm faced with - it is just a simplified example. Here's another one so that you can get my idea better:
(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+)(?:,(?P>animal))*
pussy,mouse,dog,bird - has to capture: cat=pussy, dog=dog

Without named groups, you could capture either 12 or 45 in group 1. For the second capture group, recurse the first subpattern using (?1), and before it assert with a negative lookahead and a backreference, (?!\1), that the text is not the same as what is already captured in group 1.
^(?:\d+,)*?(12|45)(?:\d*(?:,\d+)*?,(?!\1)((?1)))?
Explanation
^ Start of string
(?:\d+,)*? Match as few as possible optional repetitions of 1+ digits and ,
(12|45) Capture either 12 or 45 in group 1
(?: Non capture group
\d* Match the remaining digits of that number, if any
(?:,\d+)*?, Match as few as possible optional repetitions of , and 1+ digits, then match ,
(?!\1) Negative lookahead, assert not what was captured in group 1
((?1)) Capture group 2, repeat the first subpattern
)? Close non capture group and make it optional to also allow matching 1 capture group
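The pattern can also be tried outside PCRE. Below is a minimal Python sketch under one assumption: Python's built-in re module has no (?1) subroutine call, so the alternation (12|45) is written out a second time. The first sample string comes from the question; the other inputs and the name first_prefixes are made up for illustration. As in the original pattern, the groups hold only the 12 or 45 prefix, not the whole number.

```python
import re

# (?1) is replaced by a literal copy of (12|45), because Python's re
# module does not support subroutine calls.
PATTERN = re.compile(r"^(?:\d+,)*?(12|45)(?:\d*(?:,\d+)*?,(?!\1)(12|45))?")

def first_prefixes(subject):
    """Return (group 1, group 2), or None when neither prefix occurs."""
    match = PATTERN.match(subject)
    return match.groups() if match else None

print(first_prefixes("11,222,123,444,456,7"))  # ('12', '45')
print(first_prefixes("45,12"))                 # ('45', '12')
print(first_prefixes("11,45"))                 # ('45', None)
print(first_prefixes("1,2,3"))                 # None
```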
If you want named capture groups that hold either one or both of the values, you can use an alternation together with the J flag, which allows duplicate subpattern names.
The pattern matches the first occurrence of 12 followed by 45, or of 45 followed by 12, or only 12, or only 45.
^(?:(?:\d+,)*?(?P<n12>12)\d*(?:,\d+)*?,(?P<n45>45)|(?:\d+,)*?(?P<n45>45)\d*(?:,\d+)*?,(?P<n12>12)|(?:\d+,)*?(?P<n12>12)|(?:\d+,)*?(?P<n45>45))

It looks like PCRE doesn't allow capturing named subpatterns that are nested inside a named pattern called by reference. So the exact answer to the question as asked is "There's no way. Sorry".
But there's a workaround for this specific case: instead of referencing the subpattern:
(?P<animal>...)(?:,(?P>animal))*
- you may avoid referencing it:
(?:,(?P<animal>...))*
- but this would require the subject to have a leading delimiter in the beginning, which it doesn't have.
A bad workaround for this is to mark the delimiter as optional:
(?:,?(?P<animal>...))*
- but it allows strange sequences to match.
A better solution is to mark the delimiter conditionally required: if the subpattern has already matched at least once, then the delimiter is required, otherwise it must be omitted:
(?:(?(animal),)(?P<animal>...))*
i.e.
(?:(?(animal),)(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+))*
N.B. This will only capture the last match for each subpattern (if any).
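The N.B. above matters for the original question, which asked for the first occurrence of each kind. One way around that limitation is to leave the regex engine and scan the items in code. The following is a minimal Python sketch, not PCRE: first_matches and CATEGORIES are hypothetical names, and the category patterns are copied from the example in the question.

```python
import re

# Hypothetical category table; the patterns come from the question's example.
CATEGORIES = {
    "cat": re.compile(r"pussy|cat"),
    "dog": re.compile(r"doge|dog"),
}

def first_matches(subject, delimiter=","):
    """Return the first item matching each category, or None if absent."""
    found = {name: None for name in CATEGORIES}
    for item in subject.split(delimiter):
        for name, pattern in CATEGORIES.items():
            # Keep only the first hit; later items never overwrite it.
            if found[name] is None and pattern.fullmatch(item):
                found[name] = item
    return found

print(first_matches("pussy,mouse,dog,bird"))  # {'cat': 'pussy', 'dog': 'dog'}
```

Because each item is tested with fullmatch, an item such as "dogs" does not count as a dog here; whether that is wanted depends on the real task.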

Related

Regex for two of any digit then three of another then four of another?

Regex is great, but I can't for the life of me figure out how I'd express the following constraint, without spelling out the whole permutation:
2 of any digit [0-9]
3 of any other digit [0-9] excluding the above
4 of any third digit [0-9] excluding the above
I've got this monster, which is clearly not a good way of doing this, as it grows exponentially with each additional set of digits:
^(001112222|001113333|001114444|001115555|001116666|001117777|001118888|001119999|002220000|...)$
OR
^(0{2}1{3}2{4}|0{2}1{3}3{4}|0{2}1{3}4{4}|0{2}1{3}5{4}|0{2}1{3}6{4}|0{2}1{3}7{4}|0{2}1{3}8{4}|...)$
Looks like the following will work:
^((\d)\2(?!.+\2)){2}\2(\d)\3{3}$
It may look a bit tricky because of the backreferences and the negative lookahead, but it looks more intimidating than it really is. See the online demo.
^ - Start string anchor.
( - Open 1st capture group:
(\d) - A 2nd capture group that does capture a single digit ranging from 0-9.
\2 - A backreference to what is captured in this 2nd group.
(?!.+\2) - Negative lookahead to prevent 1+ characters followed by a backreference to the 2nd group's match.
){2} - Close the 1st capture group and match this two times.
\2 - A backreference to what is most recently captured in the 2nd capture group.
(\d) - A 3rd capture group holding a single digit.
\3{3} - Exactly three backreferences to the 3rd capture group's match.
$ - End string anchor.
EDIT:
Looking at your alternations it looks like you are also allowing numbers like "002220000" as long as the digits in each sequence are different to the previous sequence of digits. If that is the case you can simplify the above to:
^((\d)\2(?!.\2)){2}\2(\d)\3{3}$
The main difference is that the "+" quantifier has been taken out of the lookahead, which allows the same digit to be used again further on.
See the demo
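The difference between the two variants is easiest to see side by side. Here is a small Python sketch (both patterns use only backreferences and lookaheads, which Python's re module supports); the names STRICT and RELAXED and the third sample string are illustrative assumptions.

```python
import re

# Strict: a digit may not reappear anywhere later in the string.
STRICT = re.compile(r"^((\d)\2(?!.+\2)){2}\2(\d)\3{3}$")
# Relaxed: a run only has to differ from the run right before it.
RELAXED = re.compile(r"^((\d)\2(?!.\2)){2}\2(\d)\3{3}$")

for sample in ("001112222", "002220000", "001111111"):
    print(sample, bool(STRICT.match(sample)), bool(RELAXED.match(sample)))
# 001112222 True True
# 002220000 False True
# 001111111 False False
```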
Depending on whether your target environment/framework/language supports lookaheads, you could do something like:
^(\d)\1(?!\1)(\d)\2\2(?!\1|\2)(\d)\3\3\3$
The first capture group, (\d), lets us enforce "two identical consecutive digits" by referencing the captured value (\1) as the next match. The negative lookahead that follows ensures the next sequence doesn't start with the previous digit. Then we just repeat this construct twice more, with one extra backreference per sequence.
Note: If you want to exclude only the digit used in the immediately preceding sequence, change (?!\1|\2) to just (?!\2)
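Since the question worried about the pattern growing with each additional set of digits, the lookahead version can also be generated. The following Python sketch assumes a hypothetical helper, build_pattern, that takes the run lengths and emits the same construct as above, written with counted backreferences such as \2{2} instead of \2\2.

```python
import re

def build_pattern(run_lengths):
    """Build a regex for consecutive runs of mutually distinct digits."""
    parts = []
    for index, length in enumerate(run_lengths, start=1):
        if index > 1:
            # Forbid every digit captured by an earlier group.
            earlier = "|".join(rf"\{n}" for n in range(1, index))
            parts.append(f"(?!{earlier})")
        # Capture one digit, then repeat it via a backreference.
        parts.append(rf"(\d)\{index}{{{length - 1}}}")
    return re.compile("^" + "".join(parts) + "$")

pattern = build_pattern([2, 3, 4])
print(pattern.pattern)
print(pattern.match("001112222").groups())  # ('0', '1', '2')
```

The generated pattern grows linearly with the number of runs, and the lookahead list is the only part that mentions earlier groups.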

Regex - optional capture group after wildcard

Say I have the following list:
No 1 And Your Bird Can Sing (4)
No 2 Baby, You're a Rich Man (5)
No 3 Blue Jay Way S
No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)
And I want to extract the number, the title and the number of weeks in the parentheses if it exists.
Works, but the last group is not optional (regstorm):
No (?<no>\d{1,3}) (?<title>.*?) \((?<weeks>\d)\)
Last group optional, only matches number (regstorm):
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?
Combining a pattern with the week capture and a pattern without it works, but there has got to be a better way:
(No (?<no>\d{1,3}) (?<title>.*) \((?<weeks>\d)\))|(No (?<no>\d{1,3}) (?<title>.*))
I use C# and JavaScript but I guess this is a general regex question.
Your regex is almost there!
First and most importantly, you should add a $ at the end. This forces (?<title>.*?) to extend all the way to the end of the string. Currently, (?<title>.*?) matches an empty string and then stops, because it realises that it has reached a point where the rest of the regex matches. Why does the rest of the regex match? Because the optional group can match an empty string. By adding the $, you make the rest of the regex "harder" to match.
Secondly, you forgot to match an open parenthesis \(.
This is what your regex should look like:
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$
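The effect of the trailing $ can be checked with a short Python sketch. One adaptation is assumed: Python's re module writes named groups as (?P<name>...) rather than (?<name>...), so the patterns are translated accordingly. The names WITHOUT_ANCHOR, WITH_ANCHOR and parse are illustrative.

```python
import re

# Python's re module spells named groups (?P<name>...).
WITHOUT_ANCHOR = re.compile(
    r"No (?P<no>\d{1,3}) (?P<title>.*?)( \((?P<weeks>\d)\))?")
WITH_ANCHOR = re.compile(
    r"No (?P<no>\d{1,3}) (?P<title>.*?)( \((?P<weeks>\d)\))?$")

def parse(pattern, line):
    """Return the named groups of the first match as a dict."""
    return pattern.search(line).groupdict()

line = "No 1 And Your Bird Can Sing (4)"
# Without $, the lazy title stops immediately and the weeks are skipped.
print(parse(WITHOUT_ANCHOR, line))
# {'no': '1', 'title': '', 'weeks': None}
print(parse(WITH_ANCHOR, line))
# {'no': '1', 'title': 'And Your Bird Can Sing', 'weeks': '4'}
print(parse(WITH_ANCHOR, "No 3 Blue Jay Way S"))
# {'no': '3', 'title': 'Blue Jay Way S', 'weeks': None}
```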
You may use this regex with an optional last part:
^No (?<no>\d{1,3}) (?<title>.*?\S)(?: \((?<weeks>\d)\))?$
Another option is for the title to match either any character other than (, or a ( that is not directly followed by a digit and a closing parenthesis.
^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?
In parts
^No
(?<no>\d{1,3}) Group no and space
(?<title>
(?: Non capturing group
[^(\r\n]+ Match any char except ( or newline
| Or
\((?!\d\)) Match ( if not directly followed by a digit and )
)+ Close group and repeat 1+ times
) Close group title
(?: Non capturing group
\((?<weeks>\d)\) Group weeks between parentheses
)? Close group and make it optional
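The breakdown above can be exercised with a Python sketch. The pattern is translated to Python's (?P<name>...) group syntax, and the sample line containing "(Live)" is a made-up input to show that a parenthesis inside the title is tolerated. Note that the title keeps its trailing space in this variant, so the sketch strips it in code.

```python
import re

# Title: runs of non-"(" characters, or a "(" that does not open "(digit)".
PATTERN = re.compile(
    r"^No (?P<no>\d{1,3}) "
    r"(?P<title>(?:[^(\r\n]+|\((?!\d\)))+)"
    r"(?:\((?P<weeks>\d)\))?")

def parse_line(line):
    """Return (no, title, weeks); weeks is None when absent."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # The title group includes the space before "(weeks)", so trim it.
    return match["no"], match["title"].rstrip(), match["weeks"]

print(parse_line("No 1 And Your Bird Can Sing (4)"))
# ('1', 'And Your Bird Can Sing', '4')
print(parse_line("No 5 Song (Live) (2)"))
# ('5', 'Song (Live)', '2')
```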
If you don't want to trim the last space of the title you could exclude it from matching before the weeks.

Capturing groups in regex

I have string a/b/c/ and I want to get 3 groups (a/, b/, c/) by regex.
So, I can do this
^([^\/]+\/)([^\/]+\/)([^\/]+\/)$
but it is not very elegant.
I want to do something like this
^([^\/]+\/){3}$
but I get warning:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
So, I'm interested in the data, but I don't understand what I should change in the regex to get valid result.
Test on regex101
Small example for context (nginx config):
location ~* ^/([^/]+/)([^/]+/)([^/]+/)$ {
rewrite (?i)^/([^/]+/)([^/]+/)([^/]+/)$ /$3$2$1 break;
}
in this case I rewrite url from /a/b/c/ to /c/b/a/.
There is really not much you can do to reduce the duplication in:
^([^\/]+\/)([^\/]+\/)([^\/]+\/)$
The warning is telling you that a repeated group such as ([^\/]+\/){3} will only capture the last repeat. You might think that ([^\/]+\/){3} is 3 groups, but it's only one group, because there is only one pair of parentheses. That group is going to contain the last thing the quantifier matches, in this case c/.
So to have 3 groups, you must have 3 pairs of parentheses.
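The same behaviour can be observed in Python's re module, shown here only as an illustration (the question itself concerns nginx, which uses PCRE):

```python
import re

subject = "a/b/c/"

# One pair of parentheses: one group, holding the last iteration only.
repeated = re.match(r"^([^/]+/){3}$", subject)
# Three pairs of parentheses: three groups.
explicit = re.match(r"^([^/]+/)([^/]+/)([^/]+/)$", subject)

print(repeated.groups())  # ('c/',)
print(explicit.groups())  # ('a/', 'b/', 'c/')
```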
If you really want to make the regex shorter, you can try:
[^\/]+\/
This will create 3 matches instead of groups, but you would have to check, using code, that:
there are exactly three matches
the end of each match is the start of the next match
the first match starts at the start of the string
the last match ends at the end of the string
in order to achieve the same effect as your original regex.
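The four checks listed above could look roughly like this in code. This is a Python sketch with a hypothetical helper name, split_segments; it is not something nginx can run, only an illustration of the checks.

```python
import re

SEGMENT = re.compile(r"[^/]+/")

def split_segments(subject, expected=3):
    """Return the segments, or None if any of the four checks fails."""
    matches = list(SEGMENT.finditer(subject))
    # Check 1: exactly the expected number of matches.
    if len(matches) != expected:
        return None
    # Checks 3 and 4: first match at the start, last match at the end.
    if matches[0].start() != 0 or matches[-1].end() != len(subject):
        return None
    # Check 2: each match ends where the next one starts.
    for previous, current in zip(matches, matches[1:]):
        if previous.end() != current.start():
            return None
    return [m.group() for m in matches]

print(split_segments("a/b/c/"))   # ['a/', 'b/', 'c/']
print(split_segments("a//b/c/"))  # None
```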
The pattern ^([^\/]+\/){3}$ repeats the group 3 times, but group 1 will only contain the value of the last iteration. The page "The Returned Value for a Given Group is the Last One Captured" may be helpful.
If you want group 1, 2 and 3 you have to use 3 capturing groups in the pattern.
Not sure if this qualifies as more elegant, but another option is to get 3 separate matches, using \G for iterative matches and a positive lookahead (?= to assert that the pattern of 1+ non-slash characters followed by a / occurs exactly 3 times:
(?:(?=^(?:[^/]+/){3}$)|\G(?!^))[^/]+/
(?: Non capturing group
(?= Positive lookahead, assert what is on the right is
^(?:[^/]+/){3}$ Match 3 times: 1+ chars other than a forward slash, then a /
) Close positive lookahead
| Or
\G(?!^) Assert position at the end of the previous match, not at the start
) Close non capturing group
[^/]+/ Match 1+ chars other than a forward slash, then /
See a regex demo

Regular Expression to Extract Text Bounded by '/'

I need a regular expression to extract names from a GEDCOM file. The format is:
Fred Joseph /Smith/
Where the text bounded by the / is the surname and the Fred Joseph are the forenames. The complication is that the surname could be at any place in the text or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.
This is as far as I have got and I have tried making groups optional with the ? qualifier but to no avail:
As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.
Any help would be much appreciated.
For your last line, I'm not sure there is a way to join group 1 with group 3 into a single group.
Here is my proposed solution. It doesn't capture spaces around forenames.
^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$
To match the names correctly, take care to use the case-insensitive flag, and if you test all lines at once, use the multiline flag.
See the demo
Explanation
^ start of the line
(?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
\h* 0 or more horizontal spaces
([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
\h* matches the possible remaining spaces without capturing
(?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
(?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
$ end of the line
For your requirements
([A-z a-z /])+\w*
Hope this helps
(.*?)\/(.*?)\/(.*)
Try this: ^([^/\n]*)(/[^/\n]+/)?([^/\n]*)$
This matches the following:
^ start of string (or with multiline modifier start of line)
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
(/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
$ end of string (or with multiline modifier end of line)
You can see in action with your example text here: https://regex101.com/r/9kmKpy/1
To not capture the slashes you can add a non capturing group by adding ?: to the second set of brackets, and then adding another pair between the slashes:
^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$
https://regex101.com/r/9kmKpy/2
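The question asked for two results (forenames and surname) rather than three groups. Since the regex itself yields the text before and after the surname separately, one option is to join them in code. Here is a minimal Python sketch using the pattern above; split_name is a hypothetical helper name, and the inputs other than the question's sample are made up.

```python
import re

# Group 1: text before the surname, group 2: surname without slashes,
# group 3: text after the surname.
NAME = re.compile(r"^([^/\n]*)(?:/([^/\n]+)/)?([^/\n]*)$")

def split_name(line):
    """Return (forenames, surname); surname is None when absent."""
    match = NAME.match(line)
    if match is None:
        return None
    before, surname, after = match.groups()
    # Join both sides and normalise the whitespace in one step.
    forenames = " ".join((before + " " + after).split())
    return forenames, surname

print(split_name("Fred Joseph /Smith/"))  # ('Fred Joseph', 'Smith')
print(split_name("Fred /Smith/ Joseph"))  # ('Fred Joseph', 'Smith')
print(split_name("Fred Joseph"))          # ('Fred Joseph', None)
```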
I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':
(.*)(\/?.*\/?)(.*)
Note that this does not give you a group for EACH name, as some matches will have multiple names in a single group.
Edit:
Extending Niitaku's solution, and looking at having each individual name in its own group, you could use:
^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$
As explained though, if using a language like Ruby, it would simply be:
ruby -pe '$_ = $_.scan(/\w+/)' file

Regex optional group

I am using this regex:
((?:[a-z][a-z]+))_(\d+)_((?:[a-z][a-z]+)\d+)_(\d{13})
to match strings like this:
SH_6208069141055_BC000388_20110412101855
separating into 4 groups:
SH
6208069141055
BC000388
20110412101855
Question: How do I make the first group optional, so that the resulting group is an empty string?
I want to get 4 groups in every case, when possible.
Input string for this case: (no underline after the first group)
6208069141055_BC000388_20110412101855
To make a non-capturing group optional (matching zero or one time), you must append ?.
(?: ..... )?
^          ^____ optional
|____ group
You can easily simplify your regex to be this:
(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$
^              ^^
|--------------||
| first group  ||- quantifier for 0 or 1 time (essentially making it optional)
I'm not sure whether the input string without the first group will have the underscore or not, but you can use the above regex if it's the whole string.
regex101 demo
As you can see, the matched group 1 in the second match is empty and starts at matched group 2.
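For a quick check outside regex101, here is a minimal Python sketch of the simplified pattern. It assumes the case-insensitive flag, since the pattern uses [a-z] while the sample input is upper case, and parse is a hypothetical helper name. In Python a non-participating group is reported as None, so groups(default="") is used to get the empty string the question asks for.

```python
import re

PATTERN = re.compile(
    r"(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$", re.IGNORECASE)

def parse(subject):
    """Return all 4 groups; a missing first group becomes ''."""
    match = PATTERN.search(subject)
    # default="" replaces None for groups that did not participate.
    return match.groups(default="") if match else None

print(parse("SH_6208069141055_BC000388_20110412101855"))
# ('SH', '6208069141055', 'BC000388', '20110412101855')
print(parse("6208069141055_BC000388_20110412101855"))
# ('', '6208069141055', 'BC000388', '20110412101855')
```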