Regex - match until group that may or may not occur - regex

I have the following text:
:3:Start!##$%^&*():31:Start!##$%^&*():31:End!##$%^&*():3:End
and the following regex:
(:3:Start)(.*)(:31:Start.*:31:End)?(.*)(:3:End)
Why is group 3 not found even though it exists? The same happens even if I make group 2 non-greedy:
(:3:Start)(.*?)(:31:Start.*:31:End)?(.*)(:3:End)
How can I capture a group with an optional subgroup if it occurs in the middle of the text?

You may achieve what you need if you enclose the (.*?) and (:31:Start.*:31:End) groups in an optional non-capturing group (quantified with a greedy ? quantifier) and make the (:31:Start.*:31:End) group itself obligatory inside it:
(:3:Start)(?:(.*?)(:31:Start.*:31:End))?(.*)(:3:End)
See the regex demo. It will work like this:
(:3:Start) - captures the :3:Start string into Group 1
(?:(.*?)(:31:Start.*:31:End))? - an optional non-capturing group that attempts to match, once, a sequence of patterns:
(.*?) - Group 2: any 0 or more chars other than line break chars, as few as possible
(:31:Start.*:31:End) - Group 3: :31:Start, then any 0 or more chars other than line break chars, as many as possible, then :31:End
(.*) - Group 4: any 0 or more chars other than line break chars, as many as possible
(:3:End) - captures the :3:End string into Group 5
Why doesn't your pattern work?
See your pattern demo: the !##$%^&*():31:Start!##$%^&*():31:End!##$%^&*() substring is captured into Group 4, matched by the (.*) pattern. This happens because (.*?)(:31:Start.*:31:End)? first skips the .*? pattern: it is lazy (non-greedy), so the engine does not even attempt to consume anything with it the first time around, moves on to the subsequent patterns, and only comes back to it when those fail to match. Then (:31:Start.*:31:End)? matches an empty string right after the :3:Start substring, because it is optional and :31:Start is not found at that position. The rest of the pattern finds a match, so the engine never has to backtrack, and no optional text is captured into your expected group.
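The difference between the two patterns can be checked with a short script. This is a minimal sketch using Python's re module (the question does not name an engine, so Python is an assumption here); the variable names and the second, shorter input string are made up for illustration.

```python
import re

text = ":3:Start!##$%^&*():31:Start!##$%^&*():31:End!##$%^&*():3:End"

# Pattern from the question: the optional Group 3 follows a lazy .*?
original = re.compile(r"(:3:Start)(.*?)(:31:Start.*:31:End)?(.*)(:3:End)")
# Pattern from the answer: .*? and Group 3 share one optional non-capturing group
fixed = re.compile(r"(:3:Start)(?:(.*?)(:31:Start.*:31:End))?(.*)(:3:End)")

m_original = original.search(text)
m_fixed = fixed.search(text)

print(m_original.group(3))  # None: Group 3 matched nothing, Group 4 took the text
print(m_fixed.group(3))     # :31:Start!##$%^&*():31:End

# Hypothetical input without the optional part: Groups 2 and 3 stay unset
m_plain = fixed.search(":3:Start!##:3:End")
print(m_plain.groups())     # (':3:Start', None, None, '!##', ':3:End')
```

Note that in Python an optional group that did not participate in the match is reported as None rather than as an empty string.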

Related

Regex to validate subtract equations like "abc-b=ac"

I've stumbled upon a regex question.
How can I validate a subtraction equation like this?
A string minus another string equals whatever remains (all the terms are just plain strings, not sets, so ab and ba are different strings).
Pass
abc-b=ac
abcde-cd=abe
ab-a=b
abcde-a=bcde
abcde-cde=ab
Fail
abc-a=c
abcde-bd=ace
abc-cd=ab
abcde-a=cde
abc-abc=
abc-=abc
Here's what I tried; you may play around with it:
https://regex101.com/r/lTWUCY/1/
Disclaimer: I see that some of the comments were deleted, so let me start by saying that, though short (in code-golf terms), the following answer is not the most efficient in terms of the steps involved. Given the nature of the question and its "puzzle" aspect, it will probably do fine. For a more efficient solution, I'd like to redirect you to this answer.
Here is my attempt:
^(.*)(.+)(.*)-\2=(?=.)\1\3$
See the online demo
^ - Start line anchor.
(.*) - A 1st capture group with 0+ non-newline characters, right up to;
(.+) - A 2nd capture group with 1+ non-newline characters, right up to;
(.*) - A 3rd capture group with 0+ non-newline characters, right up to;
-\2= - A hyphen followed by a backreference to our 2nd capture group and a literal "=".
(?=.) - A positive lookahead to assert that the position is followed by at least one character other than a newline.
\1\3 - Backreferences to what was captured in the 1st and 3rd capture groups.
$ - End line anchor.
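To see the pattern at work against the Pass and Fail lists from the question, here is a minimal sketch in Python (an assumption, since the question only links to regex101); the helper name is_valid is made up for illustration.

```python
import re

# Pattern from the answer above; Python's re supports the backreferences
# and the lookahead it relies on
SUBTRACTION = re.compile(r"^(.*)(.+)(.*)-\2=(?=.)\1\3$")

should_pass = ["abc-b=ac", "abcde-cd=abe", "ab-a=b", "abcde-a=bcde", "abcde-cde=ab"]
should_fail = ["abc-a=c", "abcde-bd=ace", "abc-cd=ab", "abcde-a=cde", "abc-abc=", "abc-=abc"]


def is_valid(equation: str) -> bool:
    """Return True if the left side minus the middle term equals the right side."""
    return SUBTRACTION.match(equation) is not None


for equation in should_pass + should_fail:
    print(equation, is_valid(equation))
```

The engine has to try many splits of the left-hand side before it finds one whose middle part also appears after the hyphen, which is the inefficiency the disclaimer refers to.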
EDIT:
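I guess a slightly more restrictive version could be the following, where (?1) reuses the pattern of the 1st capture group (this needs an engine with subroutine support, such as PCRE):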
^([a-z]*)([a-z]+)((?1))-\2=(?=.)\1\3$
You may use this more efficient regex. A lookahead at the start matches the text on the right-hand side of -, i.e. the substring between - and =, and captures it in group #1. Then, in the main body of the regex, we just check for the presence of capture group #1 and capture the text before and after \1 in two separate groups.
^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$
RegEx Demo
RegEx Details:
^: Start
(?=[^-]+-([^=]+)=.): Lookahead to make sure we have an expression of the form pqr-pq=r and, more importantly, to capture the substring between - and = in capture group #1. The . after = is there to disallow an empty string after =.
([^-]*?): Match 0 or more non-- characters in capture group #2
\1: Back-reference to group #1 to make sure we match the same value as in capture group #1
([^-]*): Match 0 or more non-- characters in capture group #3
-: Match a -
[^=]+: Match 1 or more non-= characters
=: Match a =
\2\3: Back-reference to groups #2 and #3, which together are the result of the subtraction
$: End
$: End
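The role of each group in the lookahead-based pattern is easiest to see by printing the captures. A minimal sketch in Python follows (an assumption about the engine; the name LOOKAHEAD is made up).

```python
import re

# Lookahead-based pattern from the answer above
LOOKAHEAD = re.compile(r"^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$")

m = LOOKAHEAD.match("abcde-cd=abe")
# Group 1: text between - and =, groups 2 and 3: text before and after it
print(m.groups())  # ('cd', 'ab', 'e')

# Inputs from the Fail list are rejected
rejected = [s for s in ["abc-a=c", "abc-abc=", "abc-=abc"] if LOOKAHEAD.match(s) is None]
print(rejected)
```

Because group #1 is fixed by the lookahead before the main body runs, the engine only has to locate that substring on the left-hand side instead of trying every possible split.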

Separate string by recognizing first digit with regex

I'm using ([^\d]+)\s?(.+) to split a string at the first digit that appears in it.
Example: Test123 --> Group1: Test, Group2: 123 # that works
but
Example: Test --> Group1: Tes, Group2: t # I expect: Group1: Test, Group 2: [empty]
How do I edit the regex so that it meets my expectation?
If you need to match up to the first digit if there is one, you may use
^(.*?)\s*(\d.*)?$
See the regex demo
^ - start of string
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible (since *? is a lazy quantifier)
\s* - 0+ whitespaces
(\d.*)? - Group 2: an optional capturing group matching 1 or 0 occurrences of a digit and then any 0+ chars other than line break chars, as many as possible (* is a greedy quantifier)
$ - end of string.
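As a quick check of the pattern above, here is a minimal sketch in Python (an assumption, since the question does not name a language); the third input, with a space before the digits, is made up to exercise the \s* part.

```python
import re

# Pattern from the answer above
SPLIT_AT_DIGIT = re.compile(r"^(.*?)\s*(\d.*)?$")

results = {s: SPLIT_AT_DIGIT.match(s).groups() for s in ["Test123", "Test", "Test 123"]}
for s, groups in results.items():
    print(s, groups)
# Test123 ('Test', '123')
# Test ('Test', None)
# Test 123 ('Test', '123')
```

When there is no digit, Python reports the unmatched optional Group 2 as None rather than as an empty string.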
Your regex almost works
Problem: The problem lies in your second capturing group, (.+), which requires at least one character. To make a match, it grabs the 't' at the end of 'Test', since it must contain at least one character.
Solution: Replace your second capturing group with (.*), which matches zero or more characters. It does not need to contain any characters to make a match, and it will still grab any number of characters after 'Test'.
Here is your new working regex:
([^\d]+)\s?(.*)
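The effect of swapping (.+) for (.*) can be confirmed side by side. This is a minimal sketch in Python (an assumed engine; the variable names are made up).

```python
import re

original = re.compile(r"([^\d]+)\s?(.+)")  # second group needs at least one char
relaxed = re.compile(r"([^\d]+)\s?(.*)")   # second group may be empty

before = original.match("Test").groups()
after = relaxed.match("Test").groups()
with_digits = relaxed.match("Test123").groups()

print(before)       # ('Tes', 't'): the first group gives back one char
print(after)        # ('Test', '')
print(with_digits)  # ('Test', '123')
```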

Regular expression to exclude group with 0 and more occurrence issue

I need to extract 1234567 from the URLs below:
http://www.test.in/some--wonders-1234567---2
http://www.test.in/some--wonders-1234567
I tried the regex .*\-([0-9]+)(?:-{2,}2)?
but for the first URL it returned 2, even though that 2 is inside the non-capturing group.
Please give me a solution. I have been digging into this for a long time and can't come up with any ideas.
Try this one:
.*?\-([0-9]+)(?:-{2,}2|$)
It sets lazy mode for the first .* pattern. You can also remove it altogether with the same effect:
\-([0-9]+)(?:-{2,}2|$)
If your regex engine supports variable-length negative lookbehinds (many do not), you can do it this way:
(?<!\d+-+)\d+
It gives you any non-empty digit string that is not preceded by digits followed by minuses.
A big advantage is that you don't have to use groups here: the regex itself returns what you want.
You could match a - followed by one or more digits, which you capture in a group ([0-9]+). This group will contain the value you want to extract.
Then comes an optional part (?:-{2,}[0-9]+)? that would match ---2, followed by $ asserting the end of the line.
-(\d+)(?:-{2,}\d+)?$
Explanation
- Match literally
(\d+) Capture one or more digits in a group
(?: Non capturing group
-{2,} Match 2 or more times -
\d+ Match one or more digits
)? close non capturing group and make it optional
$ Assert position at the end of the line
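To compare the attempt from the question with the end-anchored pattern, here is a minimal sketch in Python (an assumption, as the question does not name an engine); the variable names are made up.

```python
import re

urls = [
    "http://www.test.in/some--wonders-1234567---2",
    "http://www.test.in/some--wonders-1234567",
]

# Attempt from the question: the greedy .* runs to the last "-digits" in the string
attempt = re.compile(r".*\-([0-9]+)(?:-{2,}2)?")
# End-anchored pattern: the optional suffix must be consumed before $
anchored = re.compile(r"-(\d+)(?:-{2,}\d+)?$")

from_attempt = [attempt.search(u).group(1) for u in urls]
from_anchored = [anchored.search(u).group(1) for u in urls]

print(from_attempt)   # ['2', '1234567']
print(from_anchored)  # ['1234567', '1234567']
```

The attempt returns 2 for the first URL because the leading .* backtracks only as far as the final -2, so the capture group itself matches the trailing digit and the optional non-capturing group is never used.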

Find all lines using regular expression

There is a text like this (many lines)
1. sdfsdf werwe werwemax45 rwrwerwr
2. 34348878 max max44444445666 sdf
3. 4353424 23423eedf max55 dfdg dfgdf
4. max45
5. 4324234234sdfsdf maxx34534
Using regular expressions, I need to match every line and, where a line contains a word of the form max<digits> (actual digits rather than the literal <digits>), capture that word in a group.
So I've tried this regular expression:
^.*?\b(max\d+)\b.*?$
But it finds only the lines containing max... and ignores the others.
Then I've tried
^.*?\b(max\d+)?\b.*?$
It finds all lines, but without the matching group containing max....
The issue can be "debugged" with a slightly modified pattern, ^(.*?)\b(max\d+)?\b(.*?)$, in which the rest of the pattern is wrapped in separate capturing groups. You can see that the lines are all matched by the Group 3 pattern, the last .*?. This happens because the first .*? is skipped (since it is a lazy pattern), then (max\d+)? matches an empty string at the start of the line (no line begins with max + digits, but if one did, you would get it captured), and the last .*? captures the whole line.
You can fix it by wrapping the first part in an optional non-capturing group and capturing max\d+ in an obligatory capturing group inside it:
^(?:.*?\b(max\d+)\b)?.*?$
Or even without ?$ at the end, since .* will match greedily up to the end of the line:
^(?:.*?\b(max\d+)\b)?.*
See the regex demo
Details
^ - start of string (with m option, start of a line)
(?:.*?\b(max\d+)\b)? - an optional non-capturing group:
.*? - any 0+ chars, other than line break chars as few as possible
\b - a word boundary
(max\d+) - Group 1 (obligatory, will be tried once): max and 1+ digits
\b - a word boundary
.* - rest of the line
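Running both patterns over the sample lines shows the difference. This is a minimal sketch in Python (an assumed engine); re.MULTILINE plays the role of the m option mentioned above, and the variable names are made up.

```python
import re

text = "\n".join([
    "1. sdfsdf werwe werwemax45 rwrwerwr",
    "2. 34348878 max max44444445666 sdf",
    "3. 4353424 23423eedf max55 dfdg dfgdf",
    "4. max45",
    "5. 4324234234sdfsdf maxx34534",
])

# Second attempt from the question: the optional group matches empty at line start
broken = re.compile(r"^.*?\b(max\d+)?\b.*?$", re.MULTILINE)
# Fixed pattern: the lazy prefix and the capture share one optional group
fixed = re.compile(r"^(?:.*?\b(max\d+)\b)?.*", re.MULTILINE)

broken_captures = [m.group(1) for m in broken.finditer(text)]
fixed_captures = [m.group(1) for m in fixed.finditer(text)]

print(broken_captures)  # [None, None, None, None, None]
print(fixed_captures)   # [None, 'max44444445666', 'max55', 'max45', None]
```

Lines 1 and 5 yield None with the fixed pattern as well, because werwemax45 has no word boundary before max and maxx34534 has no digit right after max.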

Repeated capturing group PCRE

I can't understand why this regex (regex101)
/[\|]?([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures all the input, while this (regex101)
/[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures only |Func
Input string is |Func(param1, param2, param32, param54, param293, par13am, param)|
Also, how can I match a repeated capturing group in the normal way? E.g., I have the regex
/\(\(\s*([a-z\_]+){1}(?:\s+\,\s+(\d+)*)*\s*\)\)/gui
And the input string is (( string , 1 , 2 )).
Regex101 says "a repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations...". I've tried to follow this tip, but it didn't help me.
Your /[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g regex does not match the whole input because you did not define a pattern to match the words inside the parentheses. You might fix it as \|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?, but all the values inside the parentheses would be matched into one single group that you would have to split later.
It is not possible to get an arbitrary number of captures with a PCRE regex: with a repeated capturing group, only the last captured value is stored in the group buffer.
What you may do is get multiple matches with preg_match_all, capturing the initial delimiter.
So, to match the second string, you may use
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\()\K\w+
See the regex demo.
Details:
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\() - either the end of the previous match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*,\s*), or 1+ | symbols (\|+), followed with 1+ alphanumeric chars (captured into Group 1, ([a-z0-9A-Z]+)) and a ( symbol (\()
\K - omit the text matched so far
\w+ - 1+ word chars.
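Python's standard re module supports neither \G nor \K, so the pattern above cannot be run there as written. As a rough illustration of the other route mentioned in this answer (capture everything inside the parentheses in one group and split it afterwards), here is a minimal sketch in Python; the variable names are made up, and the first line shows that Python's re follows the same "last iteration only" rule for repeated groups.

```python
import re

source = "|Func(param1, param2, param32, param54, param293, par13am, param)|"

# A repeated capturing group keeps only its last iteration
last_only = re.match(r"(?:(\w+),?)+", "a,b,c").group(1)
print(last_only)  # c

# Single-group pattern from the answer: the whole parameter list lands in Group 2
CALL = re.compile(r"\|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?")
m = CALL.search(source)

name = m.group(1)
params = re.split(r"\s*,\s*", m.group(2))  # split the captured list afterwards

print(name)    # Func
print(params)  # ['param1', 'param2', 'param32', 'param54', 'param293', 'par13am', 'param']
```

The two-step approach trades a single regex for a match plus a split, which keeps it portable to engines without \G and \K.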