Regex optional group - regex

I am using this regex:
((?:[a-z][a-z]+))_(\d+)_((?:[a-z][a-z]+)\d+)_(\d{13})
to match strings like this:
SH_6208069141055_BC000388_20110412101855
separating into 4 groups:
SH
6208069141055
BC000388
20110412101855
Question: How do I make the first group optional, so that the resulting group is an empty string?
I want to get 4 groups in every case, when possible.
Input string for this case (no underscore after the first group):
6208069141055_BC000388_20110412101855

To make a non-capturing group optional (matching zero or one time), append ? to it.
(?: ..... )?
^          ^____ optional
|____ group

You can easily simplify your regex to be this:
(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$
^              ^^
|--------------||
| first group  ||- quantifier for 0 or 1 time (essentially making it optional)
I'm not sure whether the input string without the first group will have the underscore or not, but you can use the above regex if it's the whole string.
regex101 demo
As you can see, the matched group 1 in the second match is empty and starts at matched group 2.
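A minimal sketch of the same pattern in Python, assuming it is applied case-insensitively (the sample strings are uppercase while the character classes are lowercase). Python reports an unmatched optional group as None, so groups(default="") is used here to get the empty string the question asks for. The name split_id is only illustrative.

```python
import re

# Pattern from the answer above; IGNORECASE is an assumption based on the uppercase samples.
PATTERN = re.compile(r"(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$", re.IGNORECASE)

def split_id(text):
    """Return the 4 groups as a tuple, with "" for a missing first group, or None if no match."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # default="" replaces None for groups that did not participate in the match
    return m.groups(default="")

print(split_id("SH_6208069141055_BC000388_20110412101855"))
print(split_id("6208069141055_BC000388_20110412101855"))
```

Both calls return a 4-tuple, so the calling code can always unpack four values.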

Related

How to capture nested named groups when referencing outer group by name?

In the list of integer numbers separated by comma, I need to capture (via a PCRE regex) the first occurrence of 12* (if any) and the first occurrence of 45* (if any). How do I do that?
I tried the following but it can only capture inside the first number in the sequence :(
(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+)(?:,(?P>number))*
Here's a sample string to test: 11,222,123,444,456,7. I expect to capture n12=123 and n45=456 here.
UPD
As a workaround, my own solution is to declare the delimiter optional (which it isn't), like this:
(?:,?(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+))*
- this works for me, but not in all cases (e.g. ,1234, 123,4, 1234 and ,123,4 are parsed identically), which I'd like to avoid if possible.
UPD2
N.B. C'mon, this is not the real task I'm faced with - it is just a simplified example. Here's another one so that you can get my idea better:
(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+)(?:,(?P>animal))*
pussy,mouse,dog,bird - has to capture: cat=pussy, dog=dog
Without named groups, you could capture either 12 or 45 in group 1. For the second capture group, recurse the first subpattern using (?1), and before it use a negative lookahead with a backreference, (?!\1), to assert that it is not the same as what is already captured in group 1.
^(?:\d+,)*?(12|45)(?:\d*(?:,\d+)*?,(?!\1)((?1)))?
Explanation
^ Start of string
(?:\d+,)*? Match as few as possible optional repetitions of 1+ digits and ,
(12|45)\d* Capture either 12 or 45 in group 1
(?: Non capture group
(?:,\d+)*?, Match as few as possible optional repetitions of , and 1+ digits, then match ,
(?!\1) Negative lookahead, assert not what was captured in group 1
((?1)) Capture group 2, repeat the first subpattern
)? Close non capture group and make it optional to also allow matching 1 capture group
Regex demo
If you want named capture groups for a single or 2 group values, you can use an alternation with the J flag to allow duplicate subpattern names.
The pattern matches either first occurrence of 12 and then 45, or only 12 or only 45.
^(?:(?:\d+,)*?(?P<n12>12)\d*(?:,\d+)*?,(?P<n45>45)|(?:\d+,)*?(?P<n45>45)\d*(?:,\d+)*?,(?P<n12>12)|(?:\d+,)*?(?P<n12>12)|(?:\d+,)*?(?P<n45>45))
Regex demo
It looks like PCRE doesn't allow capturing named subpatterns nested inside a named pattern that is called by reference. So the exact answer to the question as asked is "There's no way. Sorry".
But there's a workaround for this specific case: instead of referencing the subpattern:
(?P<animal>...)(?:,(?P>animal))*
- you may avoid referencing it:
(?:,(?P<animal>...))*
- but this would require the subject to have a leading delimiter in the beginning, which it doesn't have.
A bad workaround for this is to mark the delimiter as optional:
(?:,?(?P<animal>...))*
- but it allows strange sequences to match.
A better solution is to mark the delimiter conditionally required: if the subpattern has already matched at least once, then the delimiter is required, otherwise it must be omitted:
(?:(?(animal),)(?P<animal>...))*
i.e.
(?:(?(animal),)(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+))*
N.B. This will only capture the last match for each subpattern (if any).
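If a host language is available, a different technique (not the single-regex approach discussed above) is to match one item at a time and record the first item seen for each named alternative. The sketch below is in Python and assumes "12*" means 12 followed by any digits. TOKEN and first_occurrences are hypothetical names.

```python
import re

# One list item per match; the named alternatives mirror the question's n12/n45 groups.
TOKEN = re.compile(r"(?P<n12>12\d*)|(?P<n45>45\d*)|\d+")

def first_occurrences(csv):
    """Map each named alternative to the first list item that matched it."""
    found = {}
    for item in csv.split(","):
        m = TOKEN.fullmatch(item)
        if m is None:
            raise ValueError(f"not an integer: {item!r}")
        name = m.lastgroup  # None when the unnamed \d+ alternative matched
        if name is not None and name not in found:
            found[name] = m.group(name)
    return found

print(first_occurrences("11,222,123,444,456,7"))
```

Because each item is matched separately, the strict comma delimiter is enforced by split(",") rather than by an optional or conditional delimiter in the pattern.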

Comma separated prefix list with commas inside

I'm trying to match a comma-separated list of prefixed values, where the values can themselves contain commas.
I finally got it to match all occurrences that don't contain a ,.
Sample String (With NL for visualization - original string doesn't have NL):
field01=Value 1,
field02=Value 2,
field03=<xml value>,
field04=127.0.0.1,
field05=User-Agent: curl/7.28.0\r\nHost: example.org\r\nAccept: */*,
field06=Location, Resource,
field07={Item 1},{Item 2}
My current regex is this unoptimized piece:
(?'fields'(field[0-9]{2,3})=?([\s\w\d_<>.:="*?\-\/\\(){}<>'#]+))([^,](?&fields))*
Does anyone have a clue how to solve this?
EDIT:
The first pattern is near to my expected result.
This is an anonymized full example of the string:
asm01=Predictable Resource Location,Information Leakage,asm02=N/A,asm04=Uncategorized,asm08=2021-02-15 09:18:16,asm09=127.0.0.1,asm10=443,asm11=N/A,asm15=,asm16=DE,asm17=User-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n,asm18=/Common/_www.example.com_live_v1,asm20=127.0.0.1,asm22=,asm27=HEAD,asm34=/Common/_www.example.com_live_v1,asm35=HTTPS,asm39=blocked,asm41=0,asm42=3,asm43=0,asm44=Error,asm46=200000028,200100015,asm47=Unix hidden (dot-file) access,.htaccess access,asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},asm50=40622,asm52=200000028,asm53=Unix hidden (dot-file) access,asm54={Unix/Linux Signatures},asm55=,asm61=,asm62=,asm63=8985143867830069446,asm64=example-waf.example.com,asm65=/.htaccess,asm67=Attack signature detected,asm68=<?xml version='1.0' encoding='UTF-8'?><BAD_MSG><violation_masks><block>13020008202d8a-f803000000000000</block><alarm>417020008202f8a-f803000000000000</alarm><learn>13000008202f8a-f800000000000000</learn><staging>200000-0</staging></violation_masks><request-violations><violation><viol_index>42</viol_index><viol_name>VIOL_ATTACK_SIGNATURE</viol_name><context>request</context><sig_data><sig_id>200000028</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>2</length></kw_data></sig_data><sig_data><sig_id>200000028</sig_id><blocking_mask>4</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>3</length></kw_data></sig_data><sig_data><sig_id>200100015</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>1</offset><length>9</length></kw_data></sig_data></violation></request-violations></BAD_MSG>,asm69=5,asm71=/Common/_dev.example.com_SSL,asm75=127.0.0.1,asm100=,asm101=HEAD /.htaccess HTTP/1.1\r\nUser-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n#015
The pattern does not work because the fields group matches the literal string field.
You are trying to repeat the named group fields, but the example strings do not contain the string field.
Note that [^,] matches any char except a comma. You can omit the capture group inside the named group fields, as it is already a group, and \d is redundant in the character class because \w also matches digits.
With 2 capture groups:
\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)
\b A word boundary
(asm[0-9]+) Capture group 1, match asm and 1+ digits
= Match literally
(.*?) Capture group 2, match any char, as few as possible
(?= Positive lookahead, assert what is at the right is
,asm[0-9]+= Match ,asm followed by 1+ digits and =
| Or
$ Assert the end of the string
) Close lookahead
Regex demo
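A short sketch of how the two-group pattern above could be used from Python to turn one record into a dictionary. It assumes the record is a single line (. does not match newlines by default) and uses a shortened sample built from the question's data. parse_fields is an illustrative name.

```python
import re

# Key in group 1, value in group 2; the lookahead stops the lazy value
# at the next ",asmNN=" or at the end of the string.
FIELD = re.compile(r"\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)")

def parse_fields(record):
    """Return a dict mapping each asmNN key to its (possibly empty) value."""
    return dict(FIELD.findall(record))

sample = ("asm01=Predictable Resource Location,Information Leakage,"
          "asm02=N/A,asm15=,asm46=200000028,200100015")
for key, value in parse_fields(sample).items():
    print(key, repr(value))
```

Commas inside a value are kept, because a comma only ends a value when it is followed by another asmNN= key.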
A simple solution would be (see regexr.com/5mg1b):
/((asm\d{2,3})=(.*?))(?=,asm|$)/g
Match groupings will be:
group #1 - asm01=Predictable Resource Location,Information Leakage
group #2 - asm01
group #3 - Predictable Resource Location,Information Leakage
Conditions:
This will match everything including empty values
The key here is to make sure that each match is delimited by either a comma and your field descriptor, or an end of string. A look ahead will be handy here: (?=,asm|$).

Regex - optional capture group after wildcard

Say I have the following list:
No 1 And Your Bird Can Sing (4)
No 2 Baby, You're a Rich Man (5)
No 3 Blue Jay Way S
No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)
And I want to extract the number, the title and the number of weeks in the parentheses, if it exists.
Works, but the last group is not optional (regstorm):
No (?<no>\d{1,3}) (?<title>.*?) \((?<weeks>\d)\)
Last group optional, only matches number (regstorm):
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?
Combining one pattern with week capture with a pattern without week capture works, but there gotta be a better way:
(No (?<no>\d{1,3}) (?<title>.*) \((?<weeks>\d)\))|(No (?<no>\d{1,3}) (?<title>.*))
I use C# and javascript but I guess this is a general regex question.
Your regex is almost there!
First and most importantly, you should add a $ at the end. This makes (?<title>.*?) match all the way towards the end of the string. Currently, (?<title>.*?) matches an empty string and then stops, because it has reached a point where the rest of the regex matches. Why does the rest of the regex match? Because the optional group can match an empty string. By adding the $, you are making the rest of the regex "harder" to match.
Secondly, you forgot to match an open parenthesis \(.
This is what your regex should look like:
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$
Demo
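To see the effect of the $ anchor, here is a small Python sketch. Python spells named groups (?P<name>...), so the pattern is translated accordingly. The extra group is also made non-capturing and a leading ^ is added, both of which are assumptions rather than part of the answer above.

```python
import re

# Same idea as the answer's regex, translated to Python's named-group syntax.
WITH_ANCHOR = re.compile(r"^No (?P<no>\d{1,3}) (?P<title>.*?)(?: \((?P<weeks>\d)\))?$")
# Without $, the lazy title can stop immediately and the optional group matches nothing.
WITHOUT_ANCHOR = re.compile(r"No (?P<no>\d{1,3}) (?P<title>.*?)(?: \((?P<weeks>\d)\))?")

lines = [
    "No 1 And Your Bird Can Sing (4)",
    "No 2 Baby, You're a Rich Man (5)",
    "No 3 Blue Jay Way S",
    "No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)",
]

for line in lines:
    m = WITH_ANCHOR.match(line)
    print(m.group("no"), repr(m.group("title")), m.group("weeks"))

# The unanchored variant matches, but captures an empty title.
print(repr(WITHOUT_ANCHOR.match(lines[0]).group("title")))
```

For the line without a week count, the weeks group is reported as None.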
You may use this regex with an optional last part:
^No (?<no>\d{1,3}) (?<title>.*?\S)(?: \((?<weeks>\d)\))?$
RegEx Demo
Another option could be for the title to match either not ( or when it does encounter a ( it should not be followed by a digit and a closing parenthesis.
^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?
In parts
^No
(?<no>\d{1,3}) Group no and space
(?<title>
(?: Non capturing group
[^(\r\n]+ Match any char except ( or newline
| Or
\((?!\d\)) Match ( if not directly followed by a digit and )
)+ Close group and repeat 1+ times
) Close group title
(?: Non capturing group
\((?<weeks>\d)\) Group weeks between parentheses
)? Close group and make it optional
Regex demo
If you don't want to trim the last space of the title you could exclude it from matching before the weeks.
Regex demo

Capturing groups in regex

I have string a/b/c/ and I want to get 3 groups (a/, b/, c/) by regex.
So, I can do this
^([^\/]+\/)([^\/]+\/)([^\/]+\/)$
but it is not very elegant.
I want to do something like this
^([^\/]+\/){3}$
but I get warning:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
So, I'm interested in the data, but I don't understand what I should change in the regex to get valid result.
Test on regex101
Small example for context (nginx config):
location ~* ^/([^/]+/)([^/]+/)([^/]+/)$ {
rewrite (?i)^/([^/]+/)([^/]+/)([^/]+/)$ /$3$2$1 break;
}
in this case I rewrite url from /a/b/c/ to /c/b/a/.
There is really not much you can do to reduce the duplication in:
^([^\/]+\/)([^\/]+\/)([^\/]+\/)$
The warning is telling you that a repeated group such as ([^\/]+\/){3} will only capture the last repeat. You might think that ([^\/]+\/){3} is 3 groups, but it's only one group, because there is only one pair of parentheses. That group is going to contain the last thing the quantifier matches, in this case c/.
So to have 3 groups, you must have 3 pairs of parentheses.
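The difference can be checked quickly in Python (used here only as a convenient way to run the two patterns; nginx uses PCRE, but this behavior is the same):

```python
import re

# One pair of parentheses repeated 3 times: still a single group.
repeated = re.match(r"^([^/]+/){3}$", "a/b/c/")
# Three pairs of parentheses: three groups.
separate = re.match(r"^([^/]+/)([^/]+/)([^/]+/)$", "a/b/c/")

print(repeated.groups())  # ('c/',)
print(separate.groups())  # ('a/', 'b/', 'c/')
```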
If you really want to make the regex shorter, you can try:
[^\/]+\/
This will create 3 matches instead of groups, but you would have to check, using code, that:
there are exactly three matches
the end of each match is the start of the next match
the first match starts at the start of the string
the last match ends at the end of the string
in order to achieve the same effect as your original regex.
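The four checks listed above could be sketched like this in Python. split_segments and its expected parameter are hypothetical names, and this is one possible way to write the checks, not the only one.

```python
import re

SEGMENT = re.compile(r"[^/]+/")

def split_segments(path, expected=3):
    """Return the segments if path consists of exactly `expected` of them, else None."""
    matches = list(SEGMENT.finditer(path))
    if len(matches) != expected:          # exactly three matches
        return None
    pos = 0                               # first match must start at the start of the string
    for m in matches:
        if m.start() != pos:              # each match must start where the previous one ended
            return None
        pos = m.end()
    if pos != len(path):                  # last match must end at the end of the string
        return None
    return [m.group() for m in matches]

print(split_segments("a/b/c/"))   # ['a/', 'b/', 'c/']
print(split_segments("a//b/c/"))  # None
```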
The pattern ^([^\/]+\/){3}$ repeats the group 3 times, but group 1 will only contain the value of the last iteration. Perhaps the page "The Returned Value for a Given Group is the Last One Captured" can be helpful.
If you want group 1, 2 and 3 you have to use 3 capturing groups in the pattern.
Not sure if this qualifies as more elegant, but another option is to get 3 separate matches, using \G for iterative matches and a positive lookahead (?= to assert that the pattern (1+ chars other than a forward slash, followed by a /) occurs exactly 3 times:
(?:(?=^(?:[^/]+/){3}$)|\G(?!^))[^/]+/
(?: Non capturing group
(?= Positive lookahead, assert what is on the right is
^(?:[^/]+/){3}$ Match 3 times: 1+ chars other than a forward slash, then a /
) Close positive lookahead
| Or
\G(?!^) Assert position at the end of the previous match, not at the start
) Close non capturing group
[^/]+/ Match not a forward slash, then /
See a regex demo

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question: is it possible to use as few groups as possible to match as long a string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and a different character) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
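A quick way to check the non-capturing version against the three strings from the question, sketched in Python (chunks is an illustrative name):

```python
import re

# Prefer three a's unless that would leave exactly one a behind; otherwise take two.
PATTERN = re.compile(r"a{3}(?!a(?:[^a]|$))|a{2}")

def chunks(text):
    """Return the successive matches found in text."""
    return PATTERN.findall(text)

print(chunks("aaaa"))    # ['aa', 'aa']
print(chunks("aaaaa"))   # ['aaa', 'aa']
print(chunks("aaaaaa"))  # ['aaa', 'aaa']
```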
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases. However, it failed in cases like aaaa, where it matches aaa and leaves a single a unmatched instead of giving (aa)(aa).
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))