How to make optional capturing groups be matched first - regex

For example I want to match three values, required text, optional times and id, and the format of id is [id=100000], how can I match data correctly when text contains spaces.
my reg: (?<text>[\s\S]+) (?<times>\d+)? (\[id=(?<id>\d+)])?
example source text: hello world 1 [id=10000]
In this example, all of source text are matched in text

The problem with your pattern is that matches any whitespace and non whitespace one and unlimited times, which captures everything without getting the other desired capture groups. Also, with a little help with the positive lookahead and alternate (|) , we can make the last 2 capture groups desired optional.
The final pattern (?<text>[a-zA-Z ]+)(?=$|(?<times>\d+)? \[id=(?<id>\d+)])
Group text will match any letter and spaces.
The lookahead avoid consuming characters and we should match either the string ended, or have a number and [id=number]
Said that, regex101 with further explanation and some examples

You could use:
:\s*(?<text>[^][:]+?)\s*(?<times>\d+)? \[id=(?<id>\d+)]
Explanation
: Match literally
\s* Match optional whitespace chars
(?<text> Group text
[^][:]+? match 1+ occurrences of any char except [ ] :
) Close group text
\s* Match optional whitespace chars
(?<times>\d+)? Group times, match 1+ digits
\[id= Match [id=
(?<id>\d+) Group id, match 1+ digirs
] Match literally
Regex demo

Related

What is the proper regex for capturing everything after "String" and between two delimeters ('=' and and non alphanumeric))

Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]
You have a pattern with 2 times .* which will first match till the end of the line/string (depending on if the dot matches a newline) and it will backtrack to match the last occurrence where this part of the pattern Id=(.*\w) can match.
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
Regex demo
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Regex demo
Note that you don't have to escape the \/ only when the delimiters of the regex are forward slashes, but there is no language tagged.
You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*

Match group followed by group with different ending

For example, let's say I have a list of words:
words.txt
accountable
accountant
accountants
accounted
I want to match "accountant\naccountants"
I've tried /(\n\w+){2}s/, but \w+ seems to be perfectly matching different things.
My RegEx also matches the following undesirable texts:
action
actionables
actionable
actions
Am I reaching out too far in what regex can do?
You could for example use a capture group, and match a newline followed by a backreference to the same captured text and an s char.
If the first word can also be at the start of the string, instead of being preceded by a newline, you can use an anchor ^ instead.
^(\w+)\n\1s$
^ Start of string
(\w+) Capture group 1, match 1+ word chars
\n\1s Match a newline, backreference \1 to match the same text as group 1 and an s char
$ End of string
Regex demo

Regular expression matching and remove spaces

Please how can I get the address using regex:
Address 123 Mayor Street, LAG Branch ABC
used (?<=Address(\s))(.*(?=\s)) but it includes the spaces after "Address". Trying to get an expression that extracts the address without the spaces. (There are a couple of spaces after "Address" before "123")
Thanks!
The pattern (?<=Address(\s))(.*(?=\s)) that you tried asserts Address followed by a single whitespace char to the left, and then matches the rest of the line asserting a whitespace char to the right.
For the example data, that will match right before the last whitespace char in the string, and the match will also contain all the whitespace chars that are present right after Address
One option to match the bold parts in the question is to use a capture group.
\bAddress\s+([^,]+,\s*\S+)
The pattern matches:
\bAddress\s+ Match Address followed by 1+ whitespace chars
( Capture group 1
[^,]+, Match 1+ occurrences of any char except , and then match ,
\s*\S+ Match optional whitespace chars followed by 1+ non whitespace chars
) Close group 1
.NET regex demo (Click on the Table tab to see the value for group 1)
Note that \s and [^,] can also match a newline
A variant with a positive lookbehind to get a match only:
(?<=\bAddress\s+)[^,\s][^,]+,\s*\S+
.NET Regex demo

Regex to pull first two fields from a comma separated file

I want to pull the second string in a commma delimited list where the first value is numeric and the second is alpha.
I'm using \d[^,]+(?=,) to pull the numeric value in the first field and just need help with pulling the second value from the "Name" column.
Here's part of a sample file that I'm trying to extract data from:
Address Number,Name,Employee Master Exist(Y/N),Auto-Deposit Exists(Y/N),Supplier Master Exists(Y/N),Supplier Master Created,ACH Account Exists(Y/N),ACH Account Created,ACH Same as Auto-deposit(Y/N)
//line break here is for clarity and does not exist in file//
4398,Presley Elvis Aaron,Y,N,Y,N,Y,N,N
10154,Shepard Alan Barrett,Y,Y,Y,N,Y,N,N
You could make use of a capturing group if you want to match the second string by first matching 1+ digits and a comma.
Then capture in a group matching 1+ chars a-zA-Z and match the trailing comma.
^\d+,([a-zA-Z]+(?: [a-zA-Z]+)*),
^ Start of string
\d+, Match 1+ digits and a comma (Or use (\d+), if the digits should also be a group)
( Capture group 1
[a-zA-Z]+ Match 1+ chars a-zA-Z
(?: [a-zA-Z]+)* Repeat matching the same as previous preceded by a space
), Close capturing group and match trailing comma
Regex demo
To get a bit broader match you could use this pattern to match at least a single char a-zA-Z
\d+,([a-zA-Z ]*[a-zA-Z][a-zA-Z ]*),
Regex demo
Note that this part in your pattern \d[^,]+ matches not only digits, but 1 digit followed by 1+ times any char except a comma which would for example also match 4a$ .
You could try this regex:
^\d+,([^,]+),
This will look for lines:
starting with one or more digits
followed by a comma
capture anything that is not a comma
followed by a comma
See it at Regex 101
If not all lines contain a name, then change the + to a *:
^\d+,([^,]*),
See alternative regex

Repeated capturing group PCRE

Can't get why this regex (regex101)
/[\|]?([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures all the input, while this (regex101)
/[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures only |Func
Input string is |Func(param1, param2, param32, param54, param293, par13am, param)|
Also how can i match repeated capturing group in normal way? E.g. i have regex
/\(\(\s*([a-z\_]+){1}(?:\s+\,\s+(\d+)*)*\s*\)\)/gui
And input string is (( string , 1 , 2 )).
Regex101 says "a repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations...". I've tried to follow this tip, but it didn't helped me.
Your /[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g regex does not match because you did not define a pattern to match the words inside parentheses. You might fix it as \|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?, but all the values inside parentheses would be matched into one single group that you would have to split later.
It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer.
What you may do is get mutliple matches with preg_match_all capturing the initial delimiter.
So, to match the second string, you may use
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\()\K\w+
See the regex demo.
Details:
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\() - either the end of the previous match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*,\s*), or 1+ | symbols (\|+), followed with 1+ alphanumeric chars (captured into Group 1, ([a-z0-9A-Z]+)) and a ( symbol (\()
\K - omit the text matched so far
\w+ - 1+ word chars.