Capture filename parts: Why doesn't this regexp work? - regex

I'm fairly new to regexp and I'm missing something about capturing groups.
Let's suppose I have a filepath like this:
test.orange.john.edn
I want to capture two groups:
test.orange.john (which is the body)
edn (which is the extension)
I used this (and variants of it, taking the $ outside, etc.)
^([a-z]*.)*.([a-z]*$)
But it captures ed only.
What did I miss? I do not understand why n is not captured, nor why the body is not captured...
I found answers on the web for capturing the extension, but I do not understand what the problem is here.
Thanks

The ^([a-z]*.)*.([a-z]*$) regex is very inefficient as there are lots of unnecessary backtracking steps here.
The start of string is matched, and then [a-z]*. is matched 0+ times. That means, the engine matches as many [a-z] as possible (i.e. it matches test up to the first dot), and then . matches the dot (but only because . matches any character!). So, this ([a-z]*.)* matches test.orange.john.edn only capturing edn since repeating capturing groups only keep the last captured value.
You already have edn in Group 1 at this step. Now, .([a-z]*$) should allocate a substring for the . (any character) pattern. Backtracking goes back and finds n - now, Group 1 only contains ed.
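The backtracking described above can be observed directly. Here is a minimal sketch using Python's re module (the question does not name a language, so Python is an assumption here). With the original pattern, Group 1 ends up as ed and Group 2 as an empty string:

```python
import re

# The pattern from the question, unchanged: both dots are unescaped.
pattern = re.compile(r'^([a-z]*.)*.([a-z]*$)')

m = pattern.match('test.orange.john.edn')

# A repeated capturing group keeps only its last iteration, and the
# trailing unescaped "." forces that iteration to give one character back.
body, extension = m.groups()
print(repr(body), repr(extension))
```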
For your task, you should escape the last . to match a literal dot and perhaps, the best expression is
^(.*)\.(.*)$
It will match the whole string up to the end with the first (.*), then backtrack to find the last . symbol (so Group 1 will hold all text from the beginning up to the last .), and then capture the rest of the string into Group 2.
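If a dot does not have to be present (i.e. if a file name has no extension), make the extension part an optional group. The first group then has to be lazy and the extension must not contain a dot, otherwise the greedy (.*) consumes the whole name and the optional group never gets a chance to match:
^(.*?)(?:\.([^.]*))?$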
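To check both cases, here is a small Python sketch (Python is an assumption, the question names no language). The optional-extension pattern in it uses a lazy first group and a dot-free extension, because a greedy (.*) followed by an optional group would swallow the whole name:

```python
import re

# Greedy body, literal dot, then the extension (pattern from the answer).
split_last_dot = re.compile(r'^(.*)\.(.*)$')

# Variant for names without an extension: lazy body and a dot-free
# extension, so the optional group still gets a chance to match.
optional_ext = re.compile(r'^(.*?)(?:\.([^.]*))?$')

with_ext = split_last_dot.match('test.orange.john.edn').groups()
no_dot = split_last_dot.match('README')
opt_with_ext = optional_ext.match('test.orange.john.edn').groups()
opt_no_ext = optional_ext.match('README').groups()

print(with_ext)      # ('test.orange.john', 'edn')
print(no_dot)        # None, because the literal dot is required
print(opt_with_ext)  # ('test.orange.john', 'edn')
print(opt_no_ext)    # ('README', None)
```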

You can try with:
^([a-z.]+)\.([a-z]+)$

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regex expression was almost good:
(^\d+\w?)\s([\w*\s?]+).\s([\w*\s?]+)\s([A-Z]{2})\s(\d{5})
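What I changed are the second and third groups: in both you used a group inside a group ((\w*\s?)+), whereas a character class inside a group ([\w*\s?]+) matches the same text here and gives you the proper group content. Note that inside a character class * and ? are literal characters rather than quantifiers, so [\w\s]+ would be the tidier way to write it.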
With your previous syntax, the inner group would be able to match an empty substring, since both quantifiers allow for a zero-length match (* is 0 to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, the last occurrence would match an empty string and only keep that.
For this you'll need to use a non-capturing group, which is of the form (?:regex), where you currently see your "null results". This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: ([^s]+) (?:[^s]+):
See how the first group is captured into "Group 1" and the second one is not captured at all?
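The effect on the group list can be checked with a short Python sketch (Python is an assumption; the question was tested on regex101). It runs the question's pattern and the non-capturing version against the sample address and compares how many groups come back:

```python
import re

address = '5173B 63rd Ave NE, Lake Forest Park WA 98155'

# Pattern from the question: the repeated inner groups are capturing.
capturing = re.compile(
    r'(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})')

# Same pattern with the repeated groups made non-capturing.
non_capturing = re.compile(
    r'(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})')

before = capturing.match(address).groups()
after = non_capturing.match(address).groups()

print(len(before), before)  # six groups, two of them empty leftovers
print(len(after), after)    # four groups: number, street, state, zip
```

Note that in the non-capturing version the city is matched but no longer captured at all, so it would still need its own capturing group if it is wanted in the result.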
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Or take the part before the comma that ends on 2 or more uppercase characters, and then match optional non word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
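Both patterns can be tried on the sample address with a few lines of Python (an assumed choice of language; the variable names are only illustrative). In this sketch each pattern returns the five wanted parts:

```python
import re

address = '5173B 63rd Ave NE, Lake Forest Park WA 98155'

# Variant 1: rely on the comma and use a negated character class.
by_comma = re.compile(
    r'^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$')

# Variant 2: the street part ends on a word of 2+ uppercase characters.
by_uppercase = re.compile(
    r'^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$')

parts_comma = by_comma.match(address).groups()
parts_upper = by_uppercase.match(address).groups()

number, street, city, state, zip_code = parts_comma
print(number, street, city, state, zip_code, sep=' | ')
print(parts_comma == parts_upper)
```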

RegEx - double condition to find some string

I'd like to find the word RADU3_ or RADU3- in a sentence that begins with xlink:href= and ends with .svg
How can I do this?
I've tried the following, but it does not give the result I'm expecting.
(?=\wxlink:href=|\wsvg\b)|\bRADU3_|\bRADU3-
Only the last line in the example is a good result (RADU3_):
ProductionGraphics\GP1RADU3-11_HeatingFurnaceF1.svg
PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses
xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"
Example...
Not sure exactly how you want to use it, but the pattern below finds the string. I put the RADU3 part in a group that matches RADU3 followed by - or _ ([_-])
(xlink:href=.*)(RADU3[_-]*)(.*\.svg)
Edit, handle multiple occurrences
If a string might contain the pattern several times, add ? after a quantifier to make it lazy, so that each match stops at the nearest .svg instead of the last one
(RADU3[_-]*?)(.*?\.svg?)
The above could be used in a replace expression like
\1someotherword\3
Where \2 is the second group that is replaced
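As a rough illustration of that replace expression, here is a Python sketch (Python is an assumption, and the first sample line is written without the ** highlighting). Only the xlink:href line should change:

```python
import re

lines = [
    r'ProductionGraphics\GP1RADU3-11_HeatingFurnaceF1.svg',
    r'PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses',
    r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"',
]

# Three groups: text before, the RADU3 part, text up to ".svg".
pattern = re.compile(r'(xlink:href=.*)(RADU3[_-]*)(.*\.svg)')

# Keep groups 1 and 3, swap group 2 for another word.
replaced = [pattern.sub(r'\1someotherword\3', line) for line in lines]

for line in replaced:
    print(line)
```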
If you want to make sure that the string starts with xlink:href= and ends with \.svg you could use anchors to assert the start ^ and the end $ of the string.
Use 1 capturing group to make sure xlink:href= comes before RADU3 followed by an underscore or a hyphen. Then you could match it and in the replacement use that capturing group followed by your replacement.
You could use a positive lookahead to assert that the string ends with \.svg
That will match:
^(xlink:href=.*)\bRADU3[_-](?=.*\.svg$)
^ Assert the start of the string
(xlink:href=.*) Capturing group, match xlink:href= and everything up until the last occurrence of what follows
\bRADU3[_-] Word boundary to prevent matching part of a larger word. Match RADU3 followed by an underscore or hyphen
(?=.*\.svg$) Positive lookahead to assert the string ends with .svg
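The breakdown above can be tried out with a short Python sketch (an assumed language; the replacement text NEW_ is made up for illustration). One thing worth checking on real data: the sample line in the question ends with a closing quote after .svg, and with the $ anchor that line is not matched, so the sketch also uses an unquoted copy of it:

```python
import re

pattern = re.compile(r'^(xlink:href=.*)\bRADU3[_-](?=.*\.svg$)')

# Hypothetical unquoted variant of the sample line.
unquoted = r'xlink:href=C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg'
# The sample line as given in the question, with surrounding quotes.
quoted = r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"'

# Group 1 is put back, so only "RADU3_" is replaced.
replaced = pattern.sub(r'\g<1>NEW_', unquoted)
quoted_match = pattern.search(quoted)

print(replaced)
print(quoted_match)  # None: the string ends with '"', not with ".svg"
```

If the closing quote has to be tolerated, the end of the lookahead would need adjusting, for example to allow an optional quote before $.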
It sounds like you only want the word (substring) if it is in a specific context?
In your case, you can restart the regex midways if you want to have starting and ending conditions (multiple conditions) for a string, but at the same time only want to use these conditions as "if-statements" and not as part of the result.
The following uses this method, and utilizes restarts (\K) in order to only extract the substring you are looking for.
# The string has to start with "xlink:href="
xlink:href=
# Fetch everything up to our match, and then restart the regex
.*\K
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(?=(.*\.svg))
If you want the entire string matching our rules you are looking for something like this:
#The string has to start with "xlink:href"
^(xlink:href=).*
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(\w+\.svg)
#Get everything after .svg too
.*
If you only want the ending " after the .svg, you'd want to modify the last part where I just take everything after .svg
You can play around with what I have come up with at regex101 (no affiliation, just love their site): https://regex101.com/r/g0v07V/3/

regex match combined with something before a string if exists

I am trying to get substrings from strings such as the following.
Test strings:
cat_zoo_New_York_US
dog_zoo_South_Carolina
dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
returned sub strings:
cat, New_York
dog, South_Carolina
dolphin, Montreal
pokemon, d
the Regex pattern I have tried is
([\w]+)(?:(_zoo_|_home_))(((?!(_US|_Canada|_K2-155))\w)+)
which I don't think is very concise and it returns other sub-strings besides what I need. Do you have any other suggestions?
Thanks!
Some updates
after @The fourth bird's answer (03/15/2018).
First of all, I like the idea of utilizing both ([^_]+) and the (?:) for different parts of the sample strings. But let me extend the sample strings a little more.
cat_zoo_New_York_US
dog_zoo_South_Carolina
yellow_dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
pokemon_home_zoo_d_K2-155
I actually want to use the anchor strings such as 'zoo', 'home' or 'home_zoo' to separate the characters before and after, together with matching (and discarding) the last part, the country (or whatever place ID is specified), which makes this question a bit less general (I like the idea of using _, but let me make it more tricky to learn better).
Two questions here:
1. What is the function of (?=) and .* in (?=(?:_US|_Canada|_K2-155|$)).*$? It seems that if I use (?:_US|_Canada|_K2-155|$), it is still OK...
2. Since I extended the anchor string a little to let it support _, I used:
(.*?)(?:_*)(?:home_zoo|zoo|home)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
It seems OK, but if I use:
(.*?)(?:_*)(?:home|zoo|home_zoo)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
it will first match home for the last sample string. Is there a greedy algorithm to catch this without specifying the order of the pattern strings?
Well again, I don't like to make a long list of anchor strings, but I don't have other ideas to make it more general without doing so.
Thanks again!
You could try it like this:
^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$
This will capture 2 groups. You could for example use this in a replacement with group1, group2.
First capture the first part ending on an underscore in group 1 like cat_. Then match the second part ending with an underscore like zoo_ or home_.
From that point capture in a group until you encounter one of your values using a lookahead (?= or the end of the string.
That would match:
^ Begin of the string
([^_]+) Match in a capturing group not an _ one or more times (group 1)
_[^_]+_ match _ then not an _ one or more times followed by _
(.*?) Capture in a group any character zero or more times non greedy (group 2)
(?= Positive lookahead that asserts what is on the right side is
(?: Non capturing group
_US|_Canada|_K2-155|$ your values or end of the string
) Close group
) Close group
.*$ Match any character zero or more times till the end of the string
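The breakdown above can be checked against the four sample strings from the question. Here is a minimal Python sketch (the language is an assumption), using the pattern in a replacement with group 1 and group 2:

```python
import re

pattern = re.compile(
    r'^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$')

samples = [
    'cat_zoo_New_York_US',
    'dog_zoo_South_Carolina',
    'dolphin_zoo_Montreal_Canada',
    'pokemon_home_d_K2-155',
]

# Replace the whole string with "group1, group2".
results = [pattern.sub(r'\1, \2', s) for s in samples]

for r in results:
    print(r)
```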
Edit: After the updated question, perhaps this will suit your requirements:
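^(.*?)_(?:home_zoo|zoo|home)_(.*?)(?=(?:_US|_Canada|_K2-155|$))
This will match any character zero or more times non greedy (.*?), then an underscore, a non capturing group (?:home_zoo|zoo|home) and another underscore to separate the characters before and after. The longer alternative home_zoo is listed first because the engine tries the alternatives from left to right and keeps the first one that leads to a match.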
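To see why the order of the alternatives matters, here is a hedged Python sketch (the language is an assumption) that runs the extended sample strings through two orderings. Both patterns have an underscore after the anchor group so that group 2 does not start with one:

```python
import re

tail = r'_(.*?)(?=(?:_US|_Canada|_K2-155|$))'

# Longest anchor first, then the shorter ones.
longest_first = re.compile(r'^(.*?)_(?:home_zoo|zoo|home)' + tail)
# Same alternatives, but "home" is tried before "home_zoo".
home_first = re.compile(r'^(.*?)_(?:home|zoo|home_zoo)' + tail)

samples = [
    'cat_zoo_New_York_US',
    'dog_zoo_South_Carolina',
    'yellow_dolphin_zoo_Montreal_Canada',
    'pokemon_home_d_K2-155',
    'pokemon_home_zoo_d_K2-155',
]

good = [longest_first.match(s).groups() for s in samples]
last_reordered = home_first.match(samples[-1]).groups()

for pair in good:
    print(pair)
print(last_reordered)  # "zoo" leaks into group 2 with this ordering
```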
Well, I tried a more straightforward approach. If your data is more complex than the sample that you gave above, this may fail. Otherwise, for the above text, it works fine.
Here is the expression that I used:
^([^_]*)_[^_]*_(.*)_.*$
1 23 45 67
Basically what I did was:
Group the first char stream, which does not contain _, starting at the beginning of the line.
Then there is an _ following the above group
Follows an arbitrary length string, which does not have _'s in it
Then comes an _
Group the next arbitrary length string
Then comes an _ afterwards
Rest of the string
replace it with \1, \2 (first group, second group).
If you are using vim, you can also achieve the same thing in vim with the following command:
:%s/^\([^_]*\)_[^_]*_\(.*\)_.*$/\1, \2/g
UPDATE
^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)((?:_US)|(?:_Canada)|(?:_K2-155))*$
You can find the new fiddle here: https://regex101.com/r/qQ2dE4/273
What is the difference between this one and the previous one?
Now I cheat a little: I look for adjectives that modify the state name, like South_ or New_. You can add more here, like East_, West_, Old_ or whatever, if there is such a case in your data.
There are cases where country is skipped in data. Plus looks like that last token on the very last line does not follow up a pattern. So, I explicitly listed those options in the expression, like US, Canada etc. You may need to add more exceptional cases in here as well.

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is that there is a string that can look like post, or post put, or even get put post. All of these must be matched. Preferably, deviations like [space]post or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.
If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to recurse the subpatterns.
^(post|put|delete|get)( (?!\1)(?1))*$
^^^^
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
- a space
(?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the one captured into Group 1 due to backreference \1
(?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
In order to avoid matching strings like get post post, you also need to add a negative lookahead into Group 1 so that the subroutine call is aware that we do not want to match the same value that was captured into Group 1.
^((post|put|delete|get)(?!.*\2))( (?1))*$
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.
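Subroutine calls such as (?1) are a PCRE feature; Python's built-in re module does not support them. As a rough equivalent, and not the pattern from this answer, the word list can be written once in a string and reused, with a single lookahead rejecting any repeated word. The names VERBS and is_valid below are made up for this sketch:

```python
import re

# Hypothetical helper: the verb list is written once and reused.
VERBS = r'(?:post|put|delete|get)'

# The lookahead rejects any whole word that appears twice; the rest is
# the "verb, then space-separated verbs" shape from the question.
pattern = re.compile(rf'^(?!.*\b(\w+)\b.*\b\1\b){VERBS}(?: {VERBS})*$')


def is_valid(s: str) -> bool:
    """Return True if s is a space-separated list of distinct verbs."""
    return pattern.match(s) is not None


for candidate in ['post', 'post put', 'get put post',
                  'post post', 'get post post', ' post', 'get ']:
    print(repr(candidate), is_valid(candidate))
```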

Regex - Capturing group with Alternation vs. Character classes

I am working on problem 6 on regexone.com and I am not able to understand how grouping works with 'alternation'.
This is the test string:
The quick brown fox...
The task is to capture The quick brown fox... without the extra whitespace which can be done with ^\s*([\w\s.]*)\s*$.
But ^\s*(\w|\s|\.)*\s*$ captures a group '.' ([27-28]) Why? As a result, ^\s*((\w|\s|\.)*)\s*$ captures two groups - The quick brown fox... [6-28] and '.' ([27-28]).
How does grouping work? What are the differences in working with alternation and character classes besides that character classes match by characters whereas alternation matches by words (my basic understanding)?
P.S.: How should I search for documentation like info on such problems when I don't even know what are they called?
^\s*(\w|\s|\.)*\s*$ captures a group '.' ([27-28]) Why?
The reason is that capturing groups store the text they match in a kind of a buffer or stack. The * quantifier makes the regex engine repeat capturing unlimited times and writes to that buffer each alphanumeric, or whitespace, or dot, each time rewriting the value in the buffer.
The ^\s*((\w|\s|\.)*)\s*$ has 2 capturing groups, thus it captures your whole text into Group 1 (with the outer (...)), and the second capturing group is the one that stores the characters from the alternation matched one by one, with only the last symbol remaining in the 2nd buffer.
The solution would be using a non-capturing group for alternations and a capturing group for all the found submatches: ^\s*((?:\w|\s|\.)*)\s*$.
Mind it is very inefficient! Use character classes wherever possible (i.e. ([\w\s.]*)).
Each capture group captures the string that matched that group. ((\w|\s|\.)*) matches The quick..., so it sets the captured string correctly. But (\w|\s|\.) matches many times, once for each character; the captured string is then the last match, which is the . at the end of the text.
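The three variants can be compared side by side. Here is a minimal Python sketch (the language is an assumption, as is the exact test string: six leading spaces and no trailing whitespace, which matches the [6-28] and [27-28] positions quoted in the question):

```python
import re

text = '      The quick brown fox...'  # six leading spaces, assumed

# Repeated capturing group: only the last iteration is kept.
repeated_group = re.match(r'^\s*(\w|\s|\.)*\s*$', text)
# Non-capturing alternation wrapped in one capturing group.
wrapped = re.match(r'^\s*((?:\w|\s|\.)*)\s*$', text)
# Character class inside one capturing group.
char_class = re.match(r'^\s*([\w\s.]*)\s*$', text)

print(repr(repeated_group.group(1)), repeated_group.span(1))
print(repr(wrapped.group(1)), wrapped.span(1))
print(repr(char_class.group(1)), char_class.span(1))
```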