Match group followed by group with different ending - regex

For example, let's say I have a list of words:
words.txt
accountable
accountant
accountants
accounted
I want to match "accountant\naccountants"
I've tried /(\n\w+){2}s/, but \w+ seems to be perfectly matching different things.
My RegEx also matches the following undesirable texts:
action
actionables
actionable
actions
Am I reaching out too far in what regex can do?

You could for example use a capture group, and match a newline followed by a backreference to the same captured text and an s char.
If the first word can also be at the start of the string, instead of being preceded by a newline, you can use an anchor ^ instead.
^(\w+)\n\1s$
^ Start of string
(\w+) Capture group 1, match 1+ word chars
\n\1s Match a newline, backreference \1 to match the same text as group 1 and an s char
$ End of string
Regex demo

Related

How to make optional capturing groups be matched first

For example I want to match three values, required text, optional times and id, and the format of id is [id=100000], how can I match data correctly when text contains spaces.
my reg: (?<text>[\s\S]+) (?<times>\d+)? (\[id=(?<id>\d+)])?
example source text: hello world 1 [id=10000]
In this example, all of source text are matched in text
The problem with your pattern is that matches any whitespace and non whitespace one and unlimited times, which captures everything without getting the other desired capture groups. Also, with a little help with the positive lookahead and alternate (|) , we can make the last 2 capture groups desired optional.
The final pattern (?<text>[a-zA-Z ]+)(?=$|(?<times>\d+)? \[id=(?<id>\d+)])
Group text will match any letter and spaces.
The lookahead avoid consuming characters and we should match either the string ended, or have a number and [id=number]
Said that, regex101 with further explanation and some examples
You could use:
:\s*(?<text>[^][:]+?)\s*(?<times>\d+)? \[id=(?<id>\d+)]
Explanation
: Match literally
\s* Match optional whitespace chars
(?<text> Group text
[^][:]+? match 1+ occurrences of any char except [ ] :
) Close group text
\s* Match optional whitespace chars
(?<times>\d+)? Group times, match 1+ digits
\[id= Match [id=
(?<id>\d+) Group id, match 1+ digirs
] Match literally
Regex demo

What is the proper regex for capturing everything after "String" and between two delimeters ('=' and and non alphanumeric))

Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]
You have a pattern with 2 times .* which will first match till the end of the line/string (depending on if the dot matches a newline) and it will backtrack to match the last occurrence where this part of the pattern Id=(.*\w) can match.
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
Regex demo
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Regex demo
Note that you don't have to escape the \/ only when the delimiters of the regex are forward slashes, but there is no language tagged.
You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*

Regular expression matching and remove spaces

Please how can I get the address using regex:
Address 123 Mayor Street, LAG Branch ABC
used (?<=Address(\s))(.*(?=\s)) but it includes the spaces after "Address". Trying to get an expression that extracts the address without the spaces. (There are a couple of spaces after "Address" before "123")
Thanks!
The pattern (?<=Address(\s))(.*(?=\s)) that you tried asserts Address followed by a single whitespace char to the left, and then matches the rest of the line asserting a whitespace char to the right.
For the example data, that will match right before the last whitespace char in the string, and the match will also contain all the whitespace chars that are present right after Address
One option to match the bold parts in the question is to use a capture group.
\bAddress\s+([^,]+,\s*\S+)
The pattern matches:
\bAddress\s+ Match Address followed by 1+ whitespace chars
( Capture group 1
[^,]+, Match 1+ occurrences of any char except , and then match ,
\s*\S+ Match optional whitespace chars followed by 1+ non whitespace chars
) Close group 1
.NET regex demo (Click on the Table tab to see the value for group 1)
Note that \s and [^,] can also match a newline
A variant with a positive lookbehind to get a match only:
(?<=\bAddress\s+)[^,\s][^,]+,\s*\S+
.NET Regex demo

Regex Extract a string between two words containing a particular string

I have the below string
abc-12d-ef-oy-5678-xyz--**--20190120075439322am--**--ghi-66d-ef-oy-8877-sdf--**--sfdfdsgfg--**--20190120075765487am
It is kind of multi character delimited string, delimited by '--**--' I am trying to extract the first and second words which has the -oy- tag in it. This is a column in a table. I am using the regex_extract method but i am not able extract the string which contains a string and ends with a string.
Here is one pattern that i tried .*(.*oy.*)--
If the -oy- can not be at the start or at the end, you could use this pattern to match the 2 hyphen delimited strings with -oy-:
[a-z0-9]+(?:-[a-z0-9]+)*-oy(?:-[a-z0-9]+)+
Regex details
[a-z0-9]+ Match 1+ times a-z0-9
(?: Non capturing group
-[a-z0-9]+ Match - and 1+ times a-z0-9
)* Close group and repeat 0+ times
-oy Match literally
(?:-[a-z0-9]+)+ Repeat 1+ times a group which will match - and 1+ times a-z0-9
You can extend the character class [A-Za-z0-9] to allow what you want to match like uppercase chars.
Regex demo | Java demo
If the matches should be between delimiters, you could use a positive lookbehind and positive lookahead and an alternation:
(?<=^|--\\*\\*--)[a-z0-9]+(?:-[a-z0-9]+)*-oy(?:-[a-z0-9]+)+(?=--\\*\\*--|$)
See a Java demo
You can use this regex which will match string containing -oy- and capture them in group1 and group2.
^.*?(\w+(?:-\w+)*-oy-\w+(?:-\w+)*).*?(\w+(?:-\w+)*-oy-\w+(?:-\w+)*)
This regex basically matches two strings delimiter separated containing -oy- using this (\w+(?:-\w+)*-oy-\w+(?:-\w+)*) to capture the text.
Demo
Are you able to select values from capture groups?
(?:--\*\*--|^)(.*?-oy-.*?)(?:--\*\*--|$)
?: - Non-capture group, matches the delimiter, begin of line, or end of line but does not create a capture group
*? - Lazy match so you only grab the contents of the field
https://regex101.com/r/aUAvcx/1
--- Second stab at this follows ---
This is convoluted. Hopefully you can use Lookahead and Lookbehind. The last problem I had was the final record was being "Greedy" and sucking up the field before it too. So I had to add an exclusion in the capture group for your delimiter.
See if this works for you.
(?<=--\*\*--|^)((?:(?:(?!--\*\*--).)*)-oy-(?:(?:(?!--\*\*--).)*))(?=--\*\*--|$)
https://regex101.com/r/aUAvcx/3
Basically the (?: are so we are not getting too many capture groups to work with.
There are three parts to this:
The lookbehind - Make sure the field is framed by the delimiter (or start of line)
The capture group - Grab the contents of the field, making sure a delimiter isn't sucked up into it
The lookahead - Make sure the field is framed by the delimiter (or end of line)
As far as the capture group goes, I check the left and right side of the -oy- to make sure the delimiter isn't there.

Repeated capturing group PCRE

Can't get why this regex (regex101)
/[\|]?([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures all the input, while this (regex101)
/[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures only |Func
Input string is |Func(param1, param2, param32, param54, param293, par13am, param)|
Also how can i match repeated capturing group in normal way? E.g. i have regex
/\(\(\s*([a-z\_]+){1}(?:\s+\,\s+(\d+)*)*\s*\)\)/gui
And input string is (( string , 1 , 2 )).
Regex101 says "a repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations...". I've tried to follow this tip, but it didn't helped me.
Your /[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g regex does not match because you did not define a pattern to match the words inside parentheses. You might fix it as \|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?, but all the values inside parentheses would be matched into one single group that you would have to split later.
It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer.
What you may do is get mutliple matches with preg_match_all capturing the initial delimiter.
So, to match the second string, you may use
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\()\K\w+
See the regex demo.
Details:
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\() - either the end of the previous match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*,\s*), or 1+ | symbols (\|+), followed with 1+ alphanumeric chars (captured into Group 1, ([a-z0-9A-Z]+)) and a ( symbol (\()
\K - omit the text matched so far
\w+ - 1+ word chars.