Regex for allowing optional string at end of expression without interfering capture groups [duplicate] - regex

I'm attempting to write a regex expression that has the following logic
"(.*):(.*)" responds with "(.*)\/(.*)" delayed "(.*)"|"(.*):(.*)" responds with "(.*)\/(.*)"
So either of the following two string could be valid:
"GET:route" responds with "sever/mock"
"GET:route" responds with "sever/mock" delayed "3000"
But without re-writing the first part of the expression twice.
So essentially I want something like "(.*):(.*)" responds with "(.*)\/(.*)"( delayed "(.*)")?.
The error I'm currently running into is that the 4th capture group also swallows the text meant for the 5th capture group, which prevents me from separating the last two inputs into separate variables in my code.

Try this:
"(.*?):(.*?)" responds with "(.*?)\/(.*?)"(?: delayed "(.*?)")?
"(.*?):(.*?)" - Groups 1 and 2; this matches "GET:route" in your 1st example. .*? matches any character except line terminators, as few times as possible, so .*?: matches up to the first : and then the : itself.
responds with - matches the literal text responds with (including the spaces around it).
"(.*?)\/(.*?)" - Groups 3 and 4; this matches "sever/mock".
(?: delayed "(.*?)")? - an optional non-capturing group:
(?: - opens the non-capturing group.
delayed - matches the literal text delayed (with its leading space).
"(.*?)" - Group 5; this matches "3000" in the 2nd example.
)? - closes the non-capturing group and makes it optional.
See regex demo
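For anyone who wants to check this outside a regex tester, here is a minimal sketch in Python's built-in re module. The variable names (GREEDY, LAZY, plain, delayed) are only for illustration; the two patterns and the two sample strings are the ones from the question and the answer.

```python
import re

# Greedy version from the question: each .* grabs as much as it can.
GREEDY = re.compile(r'"(.*):(.*)" responds with "(.*)\/(.*)"( delayed "(.*)")?')

# Lazy version from the answer, with the optional part in a non-capturing group.
LAZY = re.compile(r'"(.*?):(.*?)" responds with "(.*?)\/(.*?)"(?: delayed "(.*?)")?')

plain = '"GET:route" responds with "sever/mock"'
delayed = '"GET:route" responds with "sever/mock" delayed "3000"'

# With greedy quantifiers, Group 4 runs on to the last quote in the line.
print(GREEDY.search(delayed).group(4))

# With lazy quantifiers, every group stops at the first delimiter that works.
print(LAZY.search(plain).groups())
print(LAZY.search(delayed).groups())
```

With the lazy pattern, Group 5 comes back as None when the delayed part is missing, so the calling code can test for that instead of maintaining two expressions.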

Related

Powershell - Parsing Cisco "Show run" text with regular expression [duplicate]

Suppose I have the following regex that matches a string with a semicolon at the end:
\".+\";
It will match any string except an empty one, like the one below:
"";
I tried using this:
\".+?\";
But that didn't work.
My question is, how can I make the .+ part of the pattern optional, so the user doesn't have to put any characters in the string?
To make the .+ optional, you could do:
\"(?:.+)?\";
(?:...) is called a non-capturing group. It groups part of the pattern for matching but does not capture the matched text. Adding ? after the non-capturing group makes the whole group optional.
Alternatively, you could do:
\".*?\";
.* would match any character zero or more times greedily. Adding ? after the * forces the regex engine to do a shortest possible match.
As an alternative:
\".*\";
Try it here: https://regex101.com/r/hbA01X/1
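A quick way to compare the variants is to run them all against the empty literal. This is a rough sketch in Python's re module; the dictionary labels and the helper name accepts_empty are made up for the illustration, and the second sample string is my own assumption, added to show where the greedy and lazy forms differ.

```python
import re

candidates = {
    "original .+": r'\".+\";',
    "lazy .+?": r'\".+?\";',
    "optional group (?:.+)?": r'\"(?:.+)?\";',
    "lazy .*?": r'\".*?\";',
    "greedy .*": r'\".*\";',
}

def accepts_empty(pattern):
    # True if the pattern matches the whole empty string literal.
    return re.fullmatch(pattern, '"";') is not None

for label, pattern in candidates.items():
    print(label, accepts_empty(pattern))

# Two literals on one line: the greedy form spans both, the lazy form stops early.
two = '"a"; "b";'
print(re.search(r'\".*?\";', two).group(0))
print(re.search(r'\".*\";', two).group(0))
```

Note that .+? still requires at least one character, which is why the attempt in the question did not help: the ? there only makes the + lazy, it does not make it optional.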

RegExp: If-Clause for capturing group possible?

tl;dr:
I am searching for a way to match the closing character sequence based upon the style of the opening sequence syntax in PHP with PCRE-style regular expressions.
The task
I am writing a module to capture all translatable strings from written PHP code. One responsibility of this module will be to also capture any translation context stated within the code. This context is provided as part of an options array.
In PHP (afair starting with version 5.4), there are two different styles possible to define an array:
a) array(...)
b) [...]
I now want to write a regular expression that is able to recognize both styles. The pattern should be able to correctly match the ending character sequence depending on the style chosen to start the array.
Unfortunately, I was not able to find any documentation on how to apply the IF-statement to a given capturing group.
In theory it should look something like this:
/ ... (array\(|\[) ... (?(?=\1==\[)\]|\)) ... /
(Note: "..." in the line above should indicate that the regex pattern is longer than stated here. This should only serve as an example for what I am trying to achieve)
The (?(?=\1==\[)\]|\)) part, translated into plain language: if the content of the first capturing group is an opening square bracket, then the pattern should match a closing square bracket; otherwise a closing round bracket is required.
Is it possible to achieve something like this? Any help is greatly appreciated!
Thanks in advance
Chris
The regex answer is
(?:array(\()|\[).*?(?(1)\)|])
See the regex demo
Details
(?:array(\()|\[) - a non-capturing group matching either array( while capturing ( into Group 1, or [ char
.*? - any 0 or more chars other than line break chars as few as possible
(?(1)\)|]) - a conditional construct: if Group 1 is matched (the ( char is in the group memory buffer) the ) must match at the current position, else ].
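The question is about PCRE in PHP, but the same conditional syntax happens to be accepted by Python's re module, so the behaviour can be sketched there. The helper name first_array_literal and the sample strings are assumptions for the illustration only.

```python
import re

# Group 1 only participates when the literal starts with "array(".
ARRAY_LITERAL = re.compile(r"(?:array(\()|\[).*?(?(1)\)|])")

def first_array_literal(code):
    """Return the first array literal found in code, or None."""
    m = ARRAY_LITERAL.search(code)
    return m.group(0) if m else None

samples = [
    "t('Save', array('context' => 'menu'))",
    "t('Save', ['context' => 'menu'])",
    "array('context' => 'menu']",  # mismatched closer, should not match
]
for s in samples:
    print(first_array_literal(s))
```

Because .*? is lazy, the match ends at the first closer of the right kind, so nested arrays would need a different approach.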
If you want to capture the values using the same capturing group, you could also use a branch reset group (?| to refer to group 1 for the value.
To get the values between the opening and closing parenthesis or square brackets, you could use a negated character class [^ to match any char except the listed in the character class.
(?|array(\([^()]*\))|(\[[^][]*]))
Explanation
(?| Branch reset group
array match literally
( Capture group 1
\([^()]*\) Match (...)
) Close group 1
| Or
( Capture group 1 again (the branch reset restarts the numbering in each alternative)
\[[^][]*] Match [...]
) Close group 1
) close branch reset group
Regex demo
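Branch reset groups are a PCRE feature. Python's built-in re module does not have them, so the sketch below swaps in a different technique: two ordinary alternatives numbered 1 and 2, with the code picking whichever group took part in the match. The function name array_body is hypothetical, and the brackets inside the character class are escaped here only for readability.

```python
import re

# Two ordinary alternatives; unlike (?|...), the groups are numbered 1 and 2.
ARRAY_BODY = re.compile(r"array(\([^()]*\))|(\[[^\[\]]*\])")

def array_body(text):
    """Return the bracketed part of the first array literal, or None."""
    m = ARRAY_BODY.search(text)
    if m is None:
        return None
    # m.lastindex is the number of the group that took part in the match.
    return m.group(m.lastindex)

print(array_body("array('context' => 'menu')"))
print(array_body("['context' => 'menu']"))
```

In PHP with PCRE, the branch reset version gives the same value directly in Group 1 for both styles, which is the convenience it is there for.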

Regexp to match multi-line string [duplicate]

I have this regexp:
^(?<FOOTER_TYPE>[ a-zA-Z0-9-]+)?(?<SEPARATOR>:)?(?<FOOTER>(?<=:)(.|[\r\n](?![\r\n]))*)?
Which I'm using to match text like:
BREAKING CHANGE: test
my multiline
string.
This is not matched
You can see the result here https://regex101.com/r/gGroPK/1
However, why is there a Group 4 at the end?
You will need to make last group non-capturing:
^(?<FOOTER_TYPE>[ a-zA-Z0-9-]+)?(?<SEPARATOR>:)?(?<FOOTER>(?<=:)(?:.|[\r\n](?![\r\n]))*)?
Make note of:
(?:.|[\r\n](?![\r\n]))*)?
(?: at the start makes this repeated group non-capturing.
Updated Demo
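It is Group 4 because the fourth opening parenthesis in your pattern starts this group:
(.|[\r\n](?![\r\n]))*
It translates to "either any character matched by the dot, or a line break that is not followed by another line break", repeated zero or more times.
In the example you have, the matched text ends with a dot:
string.
A repeated capturing group keeps only the text of its last iteration, so Group 4 ends up holding that final dot.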
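To see the extra group appear and disappear, here is a small sketch in Python. Python spells named groups (?P<name>...) rather than (?<name>...), so the patterns are adapted accordingly; the sample text assumes a blank line before the last sentence, which is my reading of the example in the question.

```python
import re

# Question's pattern: the inner (...) is a fourth, unnamed capturing group.
capturing = re.compile(
    r"^(?P<FOOTER_TYPE>[ a-zA-Z0-9-]+)?(?P<SEPARATOR>:)?"
    r"(?P<FOOTER>(?<=:)(.|[\r\n](?![\r\n]))*)?"
)

# Answer's pattern: the inner group is (?:...), so only three groups remain.
non_capturing = re.compile(
    r"^(?P<FOOTER_TYPE>[ a-zA-Z0-9-]+)?(?P<SEPARATOR>:)?"
    r"(?P<FOOTER>(?<=:)(?:.|[\r\n](?![\r\n]))*)?"
)

text = "BREAKING CHANGE: test\nmy multiline\nstring.\n\nThis is not matched"

m_cap = capturing.match(text)
m_non = non_capturing.match(text)

print(capturing.groups, non_capturing.groups)  # number of groups in each pattern
print(repr(m_cap.group(4)))                    # last iteration of the repeated group
print(repr(m_non.group("FOOTER")))
```

The FOOTER group is the same in both cases; only the leftover fourth group goes away.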

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is this: there is a string that can look like post, or post put, or even get put post. All of these must be matched. Preferably, deviations like [space]post or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.
If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to recurse the subpatterns.
^(post|put|delete|get)( (?!\1)(?1))*$
See the regex demo
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
- a space
(?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the one captured into Group 1 due to backreference \1
(?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
In order to avoid matching strings like get post post, you also need to add a negative lookahead into Group 1 so that the subroutine call is aware that we do not want to match the same value that was captured into Group 1.
^((post|put|delete|get)(?!.*\2))( (?1))*$
See the regex demo
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.
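If the engine is not PCRE-based, subroutine calls may not be available. Python's built-in re module, for example, has none, so the sketch below swaps in plain string interpolation to avoid typing the alternation twice and gives each word its own "not repeated later" lookahead. The names METHODS, UNIQUE_METHODS and is_valid are hypothetical.

```python
import re

METHODS = "post|put|delete|get"  # written once, reused twice below

# Every word is followed by a lookahead that rejects a later repeat of that word.
UNIQUE_METHODS = re.compile(
    rf"^({METHODS})(?!.*\1)(?: ({METHODS})(?!.*\2))*$"
)

def is_valid(s):
    """True if s is a space-separated list of distinct method names."""
    return UNIQUE_METHODS.match(s) is not None

for s in ["post", "post put", "get put post", "post post", "get post post", " post", "get "]:
    print(repr(s), is_valid(s))
```

Inside the repeated group, \2 refers to the word captured in the current iteration, which is what lets the lookahead check each word in turn.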

Capture filename parts: Why doesn't this regexp work?

I'm fairly new to regexp and I'm missing something about capturing groups.
Let's suppose I have a filepath like that
test.orange.john.edn
I want to capture two groups:
test.orange.john (which is the body)
edn (which is the extension)
I used this (and variants of it, taking the $ outside, etc.)
^([a-z]*.)*.([a-z]*$)
But it captures ed only
What did I miss? I do not understand why n is not captured and the body too...
I found answers on the web to capture the extension but I do not understand the problem there.
Thanks
The ^([a-z]*.)*.([a-z]*$) regex is very inefficient as there are lots of unnecessary backtracking steps here.
The start of string is matched, and then [a-z]*. is matched 0+ times. That means, the engine matches as many [a-z] as possible (i.e. it matches test up to the first dot), and then . matches the dot (but only because . matches any character!). So, this ([a-z]*.)* matches test.orange.john.edn only capturing edn since repeating capturing groups only keep the last captured value.
You already have edn in Group 1 at this step. Now, .([a-z]*$) should allocate a substring for the . (any character) pattern. Backtracking goes back and finds n - now, Group 1 only contains ed.
For your task, you should escape the last . to match a literal dot and perhaps, the best expression is
^(.*)\.(.*)$
See demo
It will match all the string up to the end with the first (.*), and then will backtrack to find the last . symbol (so, Group 1 will have all text from the beginning till the last .), and then capturing the rest of the string into Group 2.
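If a dot does not have to be present (i.e. if a file name has no extension), make the first group lazy and add an optional group. With a greedy first group, (.*) would consume the whole string and the optional group would never take part in the match:
^(.*?)(?:\.([^.]*))?$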
See another demo
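Here is a short sketch of the body/extension split in Python. WITH_EXT is the ^(.*)\.(.*)$ expression from above. OPTIONAL_EXT is an assumed variant for the no-extension case that uses a lazy first group and [^.]* for the extension, since a greedy first group would take the whole name and leave nothing for the optional part. The names WITH_EXT, OPTIONAL_EXT and split_name are made up for the example.

```python
import re

# Greedy first group backtracks to the last dot in the name.
WITH_EXT = re.compile(r"^(.*)\.(.*)$")

# Assumed variant: lazy body, optional extension that contains no dots.
OPTIONAL_EXT = re.compile(r"^(.*?)(?:\.([^.]*))?$")

def split_name(name):
    """Return (body, extension); extension is None when there is no dot."""
    m = OPTIONAL_EXT.match(name)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(WITH_EXT.match("test.orange.john.edn").groups())
print(split_name("test.orange.john.edn"))
print(split_name("README"))
```

For real file paths, a dedicated path library is usually the safer choice; the regex is mainly useful for seeing how the groups behave.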
You can try with:
^([a-z.]+)\.([a-z]+)$
online example