Problem getting nested groups in Regex


Given the following text:
//[&][$][*]\n81723&8992%9892*2343%8734
I need to get:
1. &
2. $
3. *
4. 81723&8992%9892*2343%8734
The first line defines the delimiters that separate the numbers on the second line.
There is an undefined number of delimiters.
I made this regex:
//(?:\[([^\]]+)\])+\n(.+)
But only 2 groups are obtained. The first is the last delimiter and the second is the string containing the numbers. I tried but I couldn't get all the delimiters.
I'm not good at regex, but I think the first group is being overwritten on every iteration of (?:\[([^\]]+)\])+ and I can't solve this.
Any help?
Regards
Victor

That's not a nested group you're dealing with, it's a repeated group. And you're right: when a capturing group is controlled by a quantifier, it gets repopulated on every iteration, so the final value is whatever was captured the last time around.
What you're trying to do isn't possible in any regex flavor I'm familiar with.
Here's a fuller explanation: Repeating a Capturing Group vs. Capturing a Repeated Group
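A common workaround is to capture the repeated group as a whole and then split it in a second pass. The following is only a minimal Python sketch of that idea, assuming the delimiter line and the numbers are separated by a real newline character; `parse_header` is a hypothetical helper name, not something from the question.

```python
import re

def parse_header(text):
    """Return (delimiters, numbers) for input like '//[&][$][*]\\n8172...'."""
    # Pass 1: capture the whole run of bracketed delimiters as ONE group,
    # so nothing is overwritten by the quantifier.
    m = re.match(r"//((?:\[[^\]]+\])+)\n(.+)", text)
    if m is None:
        raise ValueError("input does not match the expected layout")
    header, numbers = m.groups()
    # Pass 2: pull every delimiter out of the captured run.
    delimiters = re.findall(r"\[([^\]]+)\]", header)
    return delimiters, numbers

delims, numbers = parse_header("//[&][$][*]\n81723&8992%9892*2343%8734")
print(delims)   # ['&', '$', '*']
print(numbers)  # 81723&8992%9892*2343%8734
```

Because the second pass uses findall, the number of delimiters does not have to be known in advance.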

The best thing I see that you could do (with regex) would be something like this:
(?:\[([^\]]+)\])?(?:\[([^\]]+)\])? #....etc....# \n(.+)

You can’t write something like (foo)+, match it against "foofoofoo", and expect to get three groups back. You only get one group per open paren. That means you need more groups than you’ve written.

The following regex works for javascript:
(\[.+\])(\[.+\])(\[.+\])\\n(.*)
This assumes your & $ * will have values.

Related

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference, and it's of the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!

The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
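As a rough illustration, the same three-group pattern and replacement could be applied in Python like this (`format_phone` is a hypothetical helper name; input that is not exactly ten digits is returned unchanged):

```python
import re

def format_phone(digits):
    # Three capture groups: area code, prefix, line number.
    return re.sub(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", digits)

formatted = format_phone("1234567890")
print(formatted)  # (123) 456-7890
```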

Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there a more elegant way to reference each character as its own capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group. Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred. If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups. Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not remove the validity of your question, nor does it necessarily place the solution outside the realm of many regular expression implementations. Various regex implementations (i.e. language syntax and regex libraries) resolve this limitation by exposing objects that give access to repeated captures. (C# with the .NET regex library is one example: match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
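Python's standard re module is one implementation that does not expose per-iteration captures. A possible workaround, sketched here under that assumption, is to wrap each repetition in an outer capturing group so the whole run is kept and can be split afterwards. The sample input "abcd123" is made up for illustration.

```python
import re

# The inner groups are non-capturing; the outer groups capture the whole run.
pattern = re.compile(r"((?:[a-z])+)((?:\d){3})")
m = pattern.search("abcd123")

letters = list(m.group(1))    # every letter matched by the repetition
digits = list(m.group(2))     # every digit matched by the repetition
group_count = pattern.groups  # numbering stays fixed at two groups

print(letters)      # ['a', 'b', 'c', 'd']
print(digits)       # ['1', '2', '3']
print(group_count)  # 2
```

The group numbers stay fixed no matter how many letters are matched, which is exactly what the plain replacement syntax needs.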

How can different quantifiers make regex behave differently?

Playing around with a question asked earlier (put on hold, but I wanted to fiddle with it ;) I stumbled across a peculiarity I'd like to ask this knowledgeable community about. Namely - why do these two regexes give different results?
(\b\w+(?:\s+\w+)+)(?:.*?(\1))(?:.*?(\1))?(?:.*?(\1))?
vs.
(\b\w+(?:\s+\w+)+)(?:.*?(\1)){1,3}
First at regex101 - Second at regex101
What I wanted to do, was to have this regex:
(\b\w+(?:\s+\w+)+)(?:.*?(\1))+
detect repeated word sequences - regex101. (a word followed by at least one more. Then anything up to a repetition of the identified sequence, then this last part possibly repeated any number of times. I.e. one or more repetitions.)
What it did was find a sequence that repeated itself later in the document, but it skipped to the last one. OK, though I consider myself somewhat comfortable around regexes, I know greedy vs. lazy can be confusing. And I wanted it to catch all repetitions.
So I tried to force it by repeating the second part instead of using a quantifier:
(\b\w+(?:\s+\w+)+)(?:.*?(\1))(?:.*?(\1))
and then it worked like expected - regex101.
That made me try the two regexes first mentioned, that in my opinion should yield the same result, but they don't. So, again - What makes them give different results?

Your original pattern, (\b\w+(?:\s+\w+)+)(?:.*?(\1))+, is going to skip to the last repeated sub-pattern because you are telling it to do that with that last + - you are quantifying a capture group, which means that (?:.*?(\1))+ will not stop when it first hits "my cat is black", it'll keep repeating itself until the longest match is found, at which point all intermediate matches of the capture group are discarded.
Generally speaking, don't quantify capture groups, capture quantified groups.
I think what you want is simply this:
(\b\w+(?:\s+\w+)+).*?(\1)
https://regex101.com/r/OzDdCs/7
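To try the simplified pattern outside regex101, here is a small Python sketch. The sample sentence is an assumption built around the "my cat is black" phrase mentioned above, not the text from the original demo.

```python
import re

text = "my cat is black and my cat is black too"

# Group 1 is a run of at least two words; group 2 is its first repetition.
m = re.search(r"(\b\w+(?:\s+\w+)+).*?(\1)", text)

phrase = m.group(1)
print(phrase)      # my cat is black
print(m.group(0))  # my cat is black and my cat is black
```

Once the repeated phrase is known, further occurrences can be located with ordinary string searching rather than by quantifying the capture group.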

When you repeat a capture group, only the last "capture" is put in the back reference.
For example /A(B)+/ used on the string "ABBB" would put the last "B" in capture group $1.
But /A(B)(B)(B)/ has 3 capture groups and thus will have a "B" in $1 & $2 & $3
That's why in those 2 regex examples you showed, the first will also mark that 2nd "my cat is black".
But the second regex example won't.
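The A(B)+ versus A(B)(B)(B) comparison can be checked directly. A minimal Python sketch:

```python
import re

repeated = re.match(r"A(B)+", "ABBB")
separate = re.match(r"A(B)(B)(B)", "ABBB")

print(repeated.groups())  # ('B',)
print(repeated.span(1))   # (3, 4), i.e. the last B
print(separate.groups())  # ('B', 'B', 'B')
```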

Regex - orderless extraction of string

I have 2 strings which are 2 records
string1 = "abc/BS-QANTAS\\/DS-12JUL15\\dfd"
string2 = "/DS-10JUN15\\/BS-AIRFRANCE\\dfdsfsdf"
BS is booking airline
DS is Date
I want to use a single regex and extract the booking source & date. Please let me know if it is feasible.
I have tried lookaheads and still couldn't achieve this.
The target language is Splunk, not JavaScript.
Whatever the language may be, please post your answer and I'll give it a try in Splunk.

You mentioned that you've tried lookahead, what about lookbehind?
(?<=BS-|DS-)(\w+)
Tested at Regex101

Here's a more scalable (and more readable, IMO) alternative to miroxlav's answer:
(?:\/BS-(?P<source>\w+)|\/DS-(?P<date>\w+)|[^\/\v]+)+
I'm assuming the fields you're interested in always start with a slash. That allows me to use [^/]+ to safely consume the junk between/around them.
demo
This is effectively three regexes in one, wrapped in a group to give each one a chance to match in turn, and applied multiple times. If the first alternative matches, you're looking at a "source airline" field, and the name is captured in the group named "source". If the second alternative matches, you're looking at the date, which is captured in the "date" group.
But, because the fields aren't in a predetermined order, the regex has to match the whole string to be sure of matching both fields (in fact, I should have used start and end anchors--^ and $--to enforce that; I've added them below). The third alternative, [^/]+, allows it to consume the parts that the first two can't, thus making an overall match possible. Here's the updated regex:
^(?:\/BS-(?P<source>\w+)|\/DS-(?P<date>\w+)|[^\/\v]+)+$
...and the updated demo. As noted in the comment, the \v is there only because I'm combining your two examples into one multiline string and doing two matches. You shouldn't need it in real life.
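For anyone who wants to test the anchored pattern outside the demo, here is a rough Python sketch. It drops the \v (each record is matched separately) and the unnecessary slash escapes; `extract` is a hypothetical helper name, and the doubled backslashes in the question's strings are assumed to be escaped single backslashes.

```python
import re

FIELDS = re.compile(r"^(?:/BS-(?P<source>\w+)|/DS-(?P<date>\w+)|[^/]+)+$")

def extract(record):
    """Return (source, date), or None if the record does not match at all."""
    m = FIELDS.match(record)
    if m is None:
        return None
    # A field that never appeared in the record comes back as None.
    return m.group("source"), m.group("date")

string1 = "abc/BS-QANTAS\\/DS-12JUL15\\dfd"
string2 = "/DS-10JUN15\\/BS-AIRFRANCE\\dfdsfsdf"

print(extract(string1))  # ('QANTAS', '12JUL15')
print(extract(string2))  # ('AIRFRANCE', '10JUN15')
```

One caveat: because [^/]+ sits inside a repeated group, a record that cannot match (for example one containing a slash that starts neither field) can trigger heavy backtracking, so this form is best kept to short records.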

This gives you both strings filled either in match groups airline1+date1 or in airline2+date2:
((BS-(?<airline1>\w+).*DS-(?<date1>[\w]+))|(DS-(?<date2>[\w]+).*BS-(?<airline2>\w+)))
Since there are only 2 groups, I used simple permutation.
If a field occurs more than once, this regex takes the last occurrence. If you need the earliest one (using lookbehind), let me know.

Regex - Get string between two words that doesn't contain word

I've been looking around and could not make this happen. I am not a total noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.

Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" does not appear anywhere after this position.
.*? is a non-greedy match of all characters up to the next "END". It's needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).
Update:
After thinking about it a bit more: the solution above matches up to the first "END". If this is not wanted (because you are only excluding START from the content), then use the greedy version
START(?!.*START).*END
this will match till the last "END".
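A quick Python check of the lazy version, as a sketch only. The second input string is an assumption added to show one limitation: the lookahead rejects every START that has another START anywhere after it, so only the last START...END pair in the input can match.

```python
import re

pattern = re.compile(r"START(?!.*START).*?END")

# The example from the question: the first START is rejected by the lookahead.
single = pattern.findall("abcSTARTabcSTARTabcENDabc")
print(single)  # ['STARTabcEND']

# With several well-formed pairs, only the last one is found.
multi = pattern.findall("abcSTARTdefENDghiSTARTjlkENDopq")
print(multi)   # ['STARTjlkEND']
```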

START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']

The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks exceeding the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this by adding S to [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S.

May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
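The difference between the greedy and the lazy quantifier is easy to see in a small Python sketch (the suggestion above was made for .NET, and the sample string here is made up to mimic the "START END END" case):

```python
import re

text = "xSTART a END b ENDy"

# Greedy: runs on to the last END that has no START in between.
greedy = re.findall(r"START(?:(?!START).)*END", text)
# Lazy: stops at the first END after START.
lazy = re.findall(r"START(?:(?!START).)*?END", text)

print(greedy)  # ['START a END b END']
print(lazy)    # ['START a END']
```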

[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped, and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, which means it fails on something such as STARTSTAEND. The lookahead approach is clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way, if you are using it for search and replace, you can use a replacement such as BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have written (?:x) for every group except the middle one, which marks those as non-capturing groups. You could also capture the START/END tokens if you wanted to move them around or what-have-you.
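The same replacement can be sketched in Python, where the replacement string refers to the group as \1 rather than $1. The non-capturing groups around the literal tokens are dropped here because they change nothing.

```python
import re

source = "abcSTARTdefSTARTghiENDjkl"

# Group 1 holds the text between the last START and the END that follows it.
result = re.sub(r"START((?!.*START).*)END", r"BEGIN\1FINISH", source)
print(result)  # abcSTARTdefBEGINghiFINISHjkl
```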
See the Java regex documentation for full details on Java regexes.

Regex multiple matches and $1, $2 variables (quick and easy!)

I need to extract numeric values from strings like "£17,000 - £35,000 dependent on experience"
([0-9]+k?[.,]?[0-9]+)
That string is just an example; I can have 17k, 17.000, 17 or 17,000. In every string there can be 0, 1 or 2 numbers (never more than 2), and they can be anywhere in the string, separated by anything else. I just need to extract them, putting the first one extracted in one place and the second in another.
I came up with the regex above, but it gives me two matches, both in the $1 group (don't mind the k?[,.], it's correct). I need to have 17,000 in $1 and 35,000 in $2. How can I accomplish this? Using 2 different regexes would also be fine.

Using regex
With every opening round bracket you create a new capturing group. So to have a second capturing group $2, you need to match the second number with another part of your regex that is within brackets, and of course you need to match the part between the two numbers.
([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+)
See here on Regexr
But it could be that Solr has regex functions that put all matches into an array, which might be easier to use.
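Whether Solr offers such a function is not confirmed here. Purely to illustrate the "all matches into an array" idea, this is how it might look in Python; `first_two_numbers` is a hypothetical helper name, and missing numbers are padded with None.

```python
import re

# The number pattern from the question, without the capturing brackets.
NUMBER = re.compile(r"[0-9]+k?[.,]?[0-9]+")

def first_two_numbers(text):
    """Return the first two numbers found, padding with None if fewer exist."""
    found = NUMBER.findall(text)[:2]
    found += [None] * (2 - len(found))
    return tuple(found)

low, high = first_two_numbers("£17,000 - £35,000 dependent on experience")
print(low, high)  # 17,000 35,000
```

This avoids having to describe the text between the two numbers in the regex at all.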

Match the entire dollar range with 2 capture groups rather than matching every dollar amount with one capture group:
([0-9]+k?[.,]?[0-9]+) - ([0-9]+k?[.,]?[0-9]+)
However, I'm worried (yeah, I'm minding it :p) about that regex as it will match some strange things:
182k,938 - 29.233333
Both of these will be matched. The regex can definitely be improved if you can give more information on your input types.

What about something along the lines of
[£]?([0-9]+k?[.,]?[0-9]+) - [£]([0-9]+k?[.,]?[0-9]+)
This should now give you two groups.
Edit: Might need to clean up the spaces too
