Clarification regarding nested groups in regex


I have the following string: "Jan 1987". I want to split it into two groups:
The first group should match the whole string
The second group should match only the year
The expression (.+(\d+)) creates the first group, but the second group matches only the last digit. If I add a space, like this: (.+ (\d+)), the second group correctly matches the whole year.
Can someone explain why?
Thanks in advance.

Yes. The .+ term is greedy: it first consumes the whole string and then
backtracks one character at a time until the rest of the pattern can match.
Since \d+ needs only one digit, backtracking stops as soon as the last digit
has been given back, so the second group contains just that digit.
Adding the space tells the engine that it has to find a space followed by
at least one digit. There is only one such place in the sample, so .+ stops
after the month and the whole year lands in the second group.
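The backtracking described above can be checked with a short sketch. It assumes Python's built-in re module; the question does not say which engine is in use, and the lazy variant at the end is an extra illustration that is not part of the original answer.

```python
import re

text = "Jan 1987"

# Greedy .+ takes the whole string, then gives back a single character
# so that \d+ can match: group 2 is only the last digit.
greedy = re.match(r"(.+(\d+))", text).groups()

# The literal space fixes where .+ has to stop, so \d+ gets the full year.
spaced = re.match(r"(.+ (\d+))", text).groups()

# A lazy .+? grows only as far as needed, so \d+ also gets the full year.
lazy = re.match(r"(.+?(\d+))", text).groups()

print(greedy)  # ('Jan 1987', '7')
print(spaced)  # ('Jan 1987', '1987')
print(lazy)    # ('Jan 1987', '1987')
```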

The actual regex you most likely want to be using here is something like this:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4})
If you want the entire match, it is already the whole string here. Depending on the regex tool or language you are using, you can also access it as the zeroth capture group. The 4-digit year is available in the first capture group.
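As a minimal sketch of that access pattern, assuming Python's re module (the names MONTH_YEAR, whole and year are only illustrative):

```python
import re

# Non-capturing month alternation, capturing 4-digit year.
MONTH_YEAR = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4})"
)

m = MONTH_YEAR.fullmatch("Jan 1987")
whole = m.group(0)  # the entire match (zeroth group)
year = m.group(1)   # the first capture group

print(whole)  # Jan 1987
print(year)   # 1987
```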

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?

Your regex was almost right:
(^\d+\w?)\s([\w*\s?]+).\s([\w*\s?]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups: in both you used a group inside a group, ((\w*\s?)+), where a character class inside a group, ([\w*\s?]+), matches the same text here and gives you the proper group content. (Inside a character class, * and ? are literal characters rather than quantifiers, so [\w\s]+ is the cleaner way to write it.)
With your previous syntax, the inner group could match an empty substring, since both quantifiers allow a zero-length match (* is zero to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, its last repetition matched an empty string, and a repeated group keeps only its last match.

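For this you'll need a non-capturing group, which has the form (?:regex), in the places where you currently see your "null results". This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Note that the city is now matched but no longer captured; write that part as ((?:\w*\s?)+) as well if you need it as its own group.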
A basic example of the difference between a capturing group and a non-capturing group is ([^s]+) (?:[^s]+): the first group is captured into "Group 1", while the second one is matched but not captured at all.
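The difference can be seen with a small sketch, assuming Python's re module and a made-up sample string "hello world" (neither comes from the original answer):

```python
import re

sample = "hello world"  # hypothetical input containing no letter 's'

# First group capturing, second group non-capturing.
mixed = re.match(r"([^s]+) (?:[^s]+)", sample)

# Both groups capturing, for comparison.
both = re.match(r"([^s]+) ([^s]+)", sample)

print(mixed.group(0))  # hello world  (the whole match is the same)
print(mixed.groups())  # ('hello',)
print(both.groups())   # ('hello', 'world')
```

Both patterns match the same text; only the number of reported groups changes.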

Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Or take the part before the comma that ends on 2 or more uppercase characters, and then match optional non word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
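Both patterns from this answer can be tried on the sample address with a short sketch. It assumes Python's re module; the variable names are illustrative only.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Variant 1: rely on the comma, using a negated character class.
with_comma = re.compile(
    r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)

# Variant 2: street part ends on a word of 2 or more uppercase letters.
with_uppercase = re.compile(
    r"^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)

results = [p.match(address).groups() for p in (with_comma, with_uppercase)]
for house, street, city, state, zip_code in results:
    print(house, "|", street, "|", city, "|", state, "|", zip_code)
```

For this one sample both variants yield the same five groups; other address formats may well need a different pattern.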

Regex with optional word inside string

I am trying to match the following line with a regex (each word is separated by one space):
Firstpartstring thisisoptional secondpartstring
I expect each string to be matched as a group:
Group 1. Firstpartstring
Group 2. thisisoptional
Group 3. secondpartstring
This is what I tried so far:
(.*?)\s(thisisoptional)?\s(.*)
Only problem is, if "thisisoptional" does not exist inside the string, I don't get any results.
I expect:
Group 1. Firstpartstring
Group 2.
Group 3. secondpartstring
Please check this demo: https://regex101.com/r/YBlYXm/1
Can anyone get me in the right direction?
Thanks

Your problem is that your regex requires two whitespace characters (\s), which does not match your input if thisisoptional is missing. The easy fix is to include the second space in your second capturing group:
(.*?)\s(thisisoptional\s)?(.*)
This matches anything, followed by an optional thisisoptional, followed by anything. Note that group 2 then includes the trailing space.
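A quick check of this fix, sketched with Python's re module (an assumption; the question itself uses regex101):

```python
import re

pattern = re.compile(r"(.*?)\s(thisisoptional\s)?(.*)")

with_word = pattern.match(
    "Firstpartstring thisisoptional secondpartstring"
).groups()
without_word = pattern.match("Firstpartstring secondpartstring").groups()

# Group 2 carries the trailing space when the optional word is present,
# and is None when the word is absent.
print(with_word)     # ('Firstpartstring', 'thisisoptional ', 'secondpartstring')
print(without_word)  # ('Firstpartstring', None, 'secondpartstring')
```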

The space before the optional word should be made optional as well; otherwise it would require two spaces between the first and the last word to match:
(.*?)(?:\s(thisisoptional))?\s(.*)
https://regex101.com/r/YBlYXm/2
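With the space moved into an optional non-capturing group, group 2 holds just the word. A minimal sketch, assuming Python's re module and a hypothetical helper name split_line:

```python
import re

PATTERN = re.compile(r"(.*?)(?:\s(thisisoptional))?\s(.*)")

def split_line(line):
    """Return (first, optional, second); optional is None when absent."""
    m = PATTERN.match(line)
    return m.groups() if m else None

present = split_line("Firstpartstring thisisoptional secondpartstring")
absent = split_line("Firstpartstring secondpartstring")

print(present)  # ('Firstpartstring', 'thisisoptional', 'secondpartstring')
print(absent)   # ('Firstpartstring', None, 'secondpartstring')
```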

Can't you just group all non-whitespace characters with (\S+) and then remove the middle one if you get three matches?
Example of this regex running: https://regex101.com/r/IIyM5Z/1
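The token-based idea could look roughly like this in Python. The function name split_tokens and the error handling are assumptions made for illustration, not part of the suggestion above.

```python
import re

def split_tokens(line):
    """Split on whitespace and treat the middle of three tokens as optional."""
    tokens = re.findall(r"\S+", line)
    if len(tokens) == 3:
        first, optional, second = tokens
    elif len(tokens) == 2:
        first, second = tokens
        optional = None
    else:
        raise ValueError(f"expected 2 or 3 words, got {len(tokens)}")
    return first, optional, second

three = split_tokens("Firstpartstring thisisoptional secondpartstring")
two = split_tokens("Firstpartstring secondpartstring")
print(three)
print(two)
```

This sketch accepts any middle word, not only thisisoptional, so it is looser than the regex-only answers.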

regex match combined with something before a string if exists

I am trying to extract substrings from strings like the following.
Test strings:
cat_zoo_New_York_US
dog_zoo_South_Carolina
dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
Expected substrings:
cat, New_York
dog, South_Carolina
dolphin, Montreal
pokemon, d
The regex pattern I have tried is:
([\w]+)(?:(_zoo_|_home_))(((?!(_US|_Canada|_K2-155))\w)+)
I don't think it is very concise, and it returns other substrings besides the ones I need. Do you have any other suggestions?
Thanks!
Update (03/15/2018), after The fourth bird's answer:
First of all, I like the idea of using both ([^_]+) and (?:) for different parts of the sample strings. But let me extend the sample strings a little:
cat_zoo_New_York_US
dog_zoo_South_Carolina
yellow_dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
pokemon_home_zoo_d_K2-155
I actually want to use anchor strings such as 'zoo', 'home' or 'home_zoo' to separate the characters before and after, while also matching (and discarding) the last part with the country (or whatever place ID is specified). This makes the question a bit less general (I like the idea of using _, but let me make it trickier so that I learn more).
Two questions here:
1. What is the function of (?=) and .* in (?=(?:_US|_Canada|_K2-155|$)).*$? It seems that if I use (?:_US|_Canada|_K2-155|$) instead, it still works.
2. Since I extended the anchor strings to support _, I used:
(.*?)(?:_*)(?:home_zoo|zoo|home)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
It seems OK, but if I use:
(.*?)(?:_*)(?:home|zoo|home_zoo)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
it first matches home for the last sample string. Is there a greedy way to handle this without specifying the order of the alternatives?
Again, I don't like making a long list of anchor strings, but I have no other idea how to make it more general without doing so.
Thanks again!

You could try it like this:
^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$
This will capture 2 groups. You could, for example, use this in a replacement with group1, group2.
First capture the part before the first underscore in group 1, like cat. Then match the second part between underscores, like _zoo_ or _home_.
From that point, capture in a group until a lookahead (?= finds one of your values or the end of the string.
That would match:
^ Beginning of the string
([^_]+) Capture in a group one or more characters that are not an _ (group 1)
_[^_]+_ Match an _, then one or more characters that are not an _, followed by an _
(.*?) Capture in a group any character zero or more times, non-greedy (group 2)
(?= Positive lookahead that asserts that what is on the right side is
(?: Non-capturing group
_US|_Canada|_K2-155|$ one of your values or the end of the string
) Close the non-capturing group
) Close the lookahead
.*$ Match any character zero or more times until the end of the string
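The breakdown above can be tried against the four original sample strings. This is a sketch assuming Python's re module; the replacement string "\1, \2" mirrors the "group1, group2" idea mentioned in the answer.

```python
import re

PATTERN = re.compile(r"^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$")

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

# The lookahead only checks what follows without consuming it;
# the trailing .*$ then consumes the rest so the whole line is replaced.
replaced = [PATTERN.sub(r"\1, \2", s) for s in samples]
for line in replaced:
    print(line)
# cat, New_York
# dog, South_Carolina
# dolphin, Montreal
# pokemon, d
```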
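Edit: After the updated question, perhaps this will suit your requirements:
^(.*?)_(?:home_zoo|zoo|home)_(.*?)(?=(?:_US|_Canada|_K2-155|$))
This matches any character zero or more times, non-greedy, with (.*?), then an underscore, then a non-capturing group (?:home_zoo|zoo|home) followed by another underscore to separate the characters before and after. Alternatives are tried from left to right, so home_zoo must be listed before home; otherwise home matches first and zoo ends up in the second group.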
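One way to avoid ordering the anchor strings by hand is to build the alternation programmatically, longest alternative first. The sketch below assumes Python's re module and assumes that an underscore is consumed right after the anchor word, so that group 2 does not start with _; build_pattern and its parameters are hypothetical names.

```python
import re

def build_pattern(anchors, suffixes):
    """Compile a pattern with the longest anchor strings tried first."""
    ordered = sorted(anchors, key=len, reverse=True)
    anchor_alt = "|".join(re.escape(a) for a in ordered)
    suffix_alt = "|".join("_" + re.escape(s) for s in suffixes)
    return re.compile(rf"^(.*?)_(?:{anchor_alt})_(.*?)(?=(?:{suffix_alt}|$))")

pattern = build_pattern(["home", "zoo", "home_zoo"], ["US", "Canada", "K2-155"])

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "yellow_dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
    "pokemon_home_zoo_d_K2-155",
]
extracted = [pattern.match(s).groups() for s in samples]

# For comparison: listing home before home_zoo lets home win.
wrong_order = re.compile(
    r"^(.*?)_(?:home|zoo|home_zoo)_(.*?)(?=(?:_US|_Canada|_K2-155|$))"
)
mismatch = wrong_order.match("pokemon_home_zoo_d_K2-155").groups()

print(extracted)
print(mismatch)  # ('pokemon', 'zoo_d')
```

Regex alternation is ordered rather than longest-match in backtracking engines like this one, which is why sorting by length is used here.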

Well, I tried a more straightforward approach. If your data is more complex than the sample that you gave above, this may fail. Otherwise, for the above text, it works fine.
Here is the expression that I used:
^([^_]*)_[^_]*_(.*)_.*$
Basically what I did was:
1. ^([^_]*) groups the first character run, which contains no _, starting at the beginning of the line.
2. _ matches the _ that follows the above group.
3. [^_]* matches an arbitrary-length string that has no _ in it.
4. _ matches the next _.
5. (.*) groups the next arbitrary-length string.
6. _ matches another _ afterwards.
7. .*$ matches the rest of the string.
Then replace the match with \1, \2 (first group, second group).
If you are using vim, you can achieve the same thing with the following command:
:%s/^\([^_]*\)_[^_]*_\(.*\)_.*$/\1, \2/g
UPDATE
^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)((?:_US)|(?:_Canada)|(?:_K2-155))*$
You can find the new fiddle here: https://regex101.com/r/qQ2dE4/273
What is the difference between this one and the previous one?
Now I cheat a little: I look for adjectives that modify the state name, like South_ or New_. You can add more here, like East_, West_, Old_ or whatever else occurs in your data.
There are cases where the country is missing from the data. It also looks like the last token on the very last line does not follow a pattern. So I explicitly listed those options in the expression, like US, Canada etc. You may need to add more exceptional cases here as well.
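The updated expression can be checked against the four original samples with a short sketch, assuming Python's re module. Only groups 1 and 2 are used; the inner groups for the adjective and the suffix are ignored.

```python
import re

UPDATED = re.compile(
    r"^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)"
    r"((?:_US)|(?:_Canada)|(?:_K2-155))*$"
)

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

pairs = []
for s in samples:
    m = UPDATED.match(s)
    pairs.append((m.group(1), m.group(2)))

print(pairs)
# [('cat', 'New_York'), ('dog', 'South_Carolina'),
#  ('dolphin', 'Montreal'), ('pokemon', 'd')]
```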

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is that there is a string that can look like post, or post put, or even get put post. All of these must be matched. Preferably, deviations like [space]post or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.

If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to reuse the subpattern of group n.
^(post|put|delete|get)( (?!\1)(?1))*$
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
    a literal space
    (?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the text captured into Group 1 (via the backreference \1)
    (?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
In order to avoid matching strings like get post post, you also need to add a negative lookahead into Group 1, so that the subroutine call is aware that we do not want to match a value that occurs again later in the string.
^((post|put|delete|get)(?!.*\2))( (?1))*$
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.
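Python's built-in re module does not support subroutine calls like (?1), so the sketch below swaps in a different technique: the alternation is written once and interpolated twice with an f-string, and a single negative lookahead at the start rejects any string in which a whole word occurs twice. The names WORDS, UNIQUE_WORDS and is_valid are illustrative only.

```python
import re

WORDS = "post|put|delete|get"

# (?!.*\b(\w+)\b.*\b\1\b) fails if any whole word is repeated later on.
UNIQUE_WORDS = re.compile(
    rf"^(?!.*\b(\w+)\b.*\b\1\b)(?:{WORDS})(?: (?:{WORDS}))*$"
)

def is_valid(text):
    """True for space-separated, non-repeating words from WORDS."""
    return UNIQUE_WORDS.match(text) is not None

checks = {
    "post": is_valid("post"),
    "post put": is_valid("post put"),
    "get put post": is_valid("get put post"),
    "post post": is_valid("post post"),
    "get post post": is_valid("get post post"),
    " post": is_valid(" post"),
    "get ": is_valid("get "),
}
for text, ok in checks.items():
    print(repr(text), ok)
```

The first three inputs are accepted and the remaining four are rejected. This is not equivalent to the PCRE pattern token for token; it only reproduces the same accept/reject behavior for these inputs.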

Regex matching problem

I don't know how to write such a regex. I will start with example.
My bad regex:
(\d*),?(\d*\.?\d*)-?(\d*\.?\d*),?([0-1]?),?([0-1]?),?([^\/]*)
Matches that are OK:
1,2-3,1,1,asdf
1,2-3,1,1
1,2-3,1
1,2-3
1,2
1
But unfortunately this will also be matched and I don't want it to be:
asdf
1,asdf
Ideally, I would like something like: match a group only if the previous groups matched.
I know that a positive lookbehind should probably be used, but if I'm not wrong, it would have to go right in front of each group except the first, and the regex would be large and smelly after that. Also, it would probably be variable-length.
Is there any elegant way to do that?
EDIT
I want to match all lines given below Matches that are OK.
I would like to match \d* to first group. Then, if there was a match to \d* followed by ,, I would like to match (\d*\.?\d*) to second group. After that, if there was a match in first group followed by , and match in second group followed by - I would like to match another (\d*\.?\d*)... etc. to the end of Regex.

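You're not very clear in your question, but from the examples I think this is what you need (each optional part is nested inside the previous one, so it can only match if everything before it matched):
^\d(,\d(-\d(,\d(,\d(,[a-z]+)?)?)?)?)?$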
It matches:
1,2-3,1,1,asdf
1,2-3,1,1
1,2-3,1
1,2-3
1,2
1
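The nested-optional-group idea can be verified against all the examples from the question. This sketch assumes Python's re module and uses one nesting level per optional part, so that 1,2 (without the -3 part) is accepted as well; like the answer, it simplifies the question's decimal numbers to single digits.

```python
import re

# Each optional part sits inside the previous one, so a later part
# can only match when every earlier part has matched.
NESTED = re.compile(r"^\d(,\d(-\d(,\d(,\d(,[a-z]+)?)?)?)?)?$")

should_match = [
    "1,2-3,1,1,asdf",
    "1,2-3,1,1",
    "1,2-3,1",
    "1,2-3",
    "1,2",
    "1",
]
should_not_match = ["asdf", "1,asdf"]

accepted = [s for s in should_match + should_not_match if NESTED.match(s)]
print(accepted)
```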
