Regex - Capturing group with Alternation vs. Character classes - regex

I am working on problem 6 on regexone.com and I am not able to understand how grouping works with alternation.
This is the test string:
The quick brown fox...
The task is to capture The quick brown fox... without the extra whitespace which can be done with ^\s*([\w\s.]*)\s*$.
But ^\s*(\w|\s|\.)*\s*$ captures a group '.' ([27-28]). Why? As a result, ^\s*((\w|\s|\.)*)\s*$ captures two groups: The quick brown fox... ([6-28]) and '.' ([27-28]).
How does grouping work? What are the differences between alternation and character classes, besides character classes matching by characters whereas alternation matches by words (my basic understanding)?
P.S.: How should I search for documentation on problems like this when I don't even know what they are called?

^\s*(\w|\s|\.)*\s*$ captures a group '.' ([27-28]) Why?
The reason is that a capturing group stores the text it matches in a kind of buffer. The * quantifier makes the regex engine repeat the group as many times as possible, and each repetition (a word character, a whitespace character, or a dot) overwrites the value in that buffer.
The ^\s*((\w|\s|\.)*)\s*$ regex has 2 capturing groups: the outer (...) captures your whole text into Group 1, and the inner group stores the characters matched by the alternation one by one, so only the last character remains in Group 2.
The solution is to use a non-capturing group for the alternation and a capturing group around all the repeated submatches: ^\s*((?:\w|\s|\.)*)\s*$.
Mind that repeating an alternation like this is very inefficient! Use character classes wherever possible (i.e. ([\w\s.]*)).

Each capture group captures the string that matched that group. ((\w|\s|\.)*) matches The quick..., so it sets the captured string correctly. But (\w|\s|\.) matches many times, once for each character; the captured string is then the last match, which is the . at the end of the text.
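The behaviour described in these answers can be checked with a short Python sketch. The input string with six leading spaces is an assumption, chosen so that the spans line up with the [6-28] and [27-28] positions quoted in the question.

```python
import re

# Assumed input: six leading spaces, matching the [6-28] span quoted in the question.
text = "      The quick brown fox..."

# Repeated capturing group: each iteration overwrites the previous capture.
repeated = re.match(r"^\s*(\w|\s|\.)*\s*$", text)
print(repr(repeated.group(1)), repeated.span(1))

# The outer group keeps the full text; the inner group still keeps only the last character.
nested = re.match(r"^\s*((\w|\s|\.)*)\s*$", text)
print(nested.groups())

# A non-capturing alternation, or a character class, leaves a single group.
non_capturing = re.match(r"^\s*((?:\w|\s|\.)*)\s*$", text)
char_class = re.match(r"^\s*([\w\s.]*)\s*$", text)
print(non_capturing.groups(), char_class.groups())
```

The last two patterns return the same single group; the character class version simply avoids the per-character alternation.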

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regex was almost right:
(^\d+\w?)\s([\w\s]+).\s([\w\s]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups: in both you used a group inside a group, ((\w*\s?)+), whereas a character class inside a group, ([\w\s]+), matches the same things and gives you the proper group content. Note that * and ? are literal characters inside a character class, so they are not needed there.
With your previous syntax, the inner group would be able to match an empty substring, since both quantifiers allow for a zero-length match (* is 0 to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, the last occurrence would match an empty string and only keep that.
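A minimal Python sketch of that effect, using only the street fragment of the example address. The shortened patterns and the [\w\s]+ class are simplifications made for illustration, not the full address regex.

```python
import re

street = "63rd Ave NE,"  # fragment of the example address

# The inner group can match the empty string, so its final iteration is empty.
nested = re.match(r"((\w*\s?)+),", street)
print(nested.groups())

# A character class inside a single group has no inner group that can go empty.
flat = re.match(r"([\w\s]+),", street)
print(flat.groups())
```

In both cases Group 1 holds the street text; only the nested form produces the extra, empty inner group.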
For this you'll need to use a non-capturing group, which is of the form (?:regex), where you currently see your "null results". This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: in ([^s]+) (?:[^s]+), the first group is captured into "Group 1" and the second one is not captured at all.
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Or take the part before the comma that ends with 2 or more uppercase characters, and then match optional non-word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
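As a rough check, the comma-based pattern from this answer can be run against the example address in Python. The variable names for the five fields are illustrative labels only.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"
pattern = re.compile(r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$")

match = pattern.match(address)
# Field names below are illustrative labels, not part of the regex.
house, street, city, state, zip_code = match.groups()
print(house, street, city, state, zip_code, sep=" | ")
```

This relies on the comma being present; an address without one would not match and `match` would be None.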

Is there a way to capture a pattern, skip a set of random subsequent characters, then continue to capture another pattern using regex?

I have an example of text like
PART I BLABLABLABLALBLA GROUP 2
I need to capture PART I and GROUP 2 using the same regex, skipping the set of characters in the middle.
You can do a capture group for each part you want, and use .* for the middle.
/(PART I).*(GROUP 2)/
and then use $1 and $2 (or the equivalent in your programming language) to get the matched parts.
(Upgraded my comment to an answer, since OP said it worked.)
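In Python, for instance, the equivalent of $1 and $2 is the match object's group method. A small sketch using the sample text from the question:

```python
import re

text = "PART I BLABLABLABLALBLA GROUP 2"
match = re.search(r"(PART I).*(GROUP 2)", text)

# group(1) and group(2) play the role of $1 and $2.
first, second = match.group(1), match.group(2)
print(first, "/", second)
```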

Adjustment to this code to stop after finding two words

In my haste to get this working, I failed to ask how to stop after the second word in my original post, "Grab first 4 characters of two words RegEx".
If I have Awesome Sauce Today, I would like to get AwesSauc.
The code in my first post captures the first 4 characters of every word and combines them, so Awesome Sauce Today becomes AwesSaucToda. I want it to stop capturing after the second word. In my example, Today would be ignored, but the first 4 characters of the first two words would still be captured to create the new word AwesSauc.
You may still use the Replace Text action and use
Pattern: (?s)^\P{L}*(\p{L}{1,4})\p{L}*\P{L}+(\p{L}{1,4}).*
Replacement text: $1$2
The difference between this solution and the previous one is threefold. The pattern is anchored at the start with ^. Instead of \W (which matches any non-word char), I am using \P{L}, which matches any non-letter char (adjust as you see fit). To match the beginnings of the first and second words, I am now using 2 capturing groups ((\p{L}{1,4})...(\p{L}{1,4})), hence the two backreferences in the replacement pattern. The (?s) modifier makes . match any char, including a newline. The .* at the end is necessary to remove the rest of the string after the required text has been captured into the 2 groups.
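The same idea can be sketched in Python, with one caveat: the built-in re module does not support \p{L}, so the sketch below substitutes [^\W\d_] for a letter and [\W\d_] for a non-letter. That substitution and the helper name first_two_prefixes are assumptions made for illustration; they are not part of the Replace Text action.

```python
import re

# Python's built-in re module has no \p{L}; [^\W\d_] (a word character that is
# neither a digit nor an underscore) is used here as an approximate stand-in.
LETTER = r"[^\W\d_]"
NON_LETTER = r"[\W\d_]"

pattern = re.compile(
    rf"(?s)^{NON_LETTER}*({LETTER}{{1,4}}){LETTER}*{NON_LETTER}+({LETTER}{{1,4}}).*"
)

def first_two_prefixes(text: str) -> str:
    """Hypothetical helper: join up to four leading letters of the first two words."""
    return pattern.sub(r"\1\2", text)

print(first_two_prefixes("Awesome Sauce Today"))
```

Input with fewer than two words does not match the pattern and is returned unchanged.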

RegEx for capturing everything except numbers and one word

I am quite stuck with a regex I can't get to work. It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
I have tried something like (?!\d|fiktiv).* on my sample string 123456788daswqrt fiktiv
https://regex101.com/r/kU8mF3/1
However this does match the fiktiv at the end as well.
One possibility is to use a negated character class, written by putting a ^ at the start of the [] brackets. The pattern below matches as many non-digit characters as possible, up to the whitespace that precedes the word fiktiv.
The matched text is saved in capturing group 1 for later use.
([^\d]+)\s+fiktiv
Testing could be done here:
https://regex101.com/
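A quick way to try this locally is a short Python check against the sample string from the question:

```python
import re

sample = "123456788daswqrt fiktiv"

# [^\d]+ first grabs every non-digit, then backtracks until \s+fiktiv can match.
match = re.search(r"([^\d]+)\s+fiktiv", sample)
print(match.group(1))
```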
It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
So, you want to remove any character that is not a digit (that is, \D or [^0-9] pattern) and not a fiktiv char sequence.
You may use a regex with a capturing group and alternation:
(fiktiv)|[^0-9]
and replace with the contents of Group 1 using the $1 backreference, which restores fiktiv in the result when it was matched and is empty otherwise.
C# implementation:
Regex.Replace(input, "(fiktiv)|[^0-9]", "$1")
Also, see Use RegEx in SQL with CLR Procs.
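For comparison, here is a Python sketch of the same capture-and-restore technique. A replacement function is used instead of a "$1"-style string so that an unmatched Group 1 is handled explicitly; that choice is mine and is not part of the C# answer.

```python
import re

sample = "123456788daswqrt fiktiv"

# Keep "fiktiv" when Group 1 matched; any other non-digit match is replaced by "".
cleaned = re.sub(r"(fiktiv)|[^0-9]", lambda m: m.group(1) or "", sample)
print(cleaned)  # 123456788fiktiv
```

Digits are never matched by either alternative, so they pass through untouched.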

Capture filename parts: Why doesn't this regexp work?

I'm fairly new to regexp and I am missing something about capturing groups.
Let's suppose I have a filepath like that
test.orange.john.edn
I want to capture two groups:
test.orange.john (which is the body)
edn (which is the extension)
I used this (and variants of it, taking the $ outside, etc.)
^([a-z]*.)*.([a-z]*$)
But it captures only ed.
What did I miss? I do not understand why neither the n nor the body is captured...
I found answers on the web to capture the extension but I do not understand the problem there.
Thanks
The ^([a-z]*.)*.([a-z]*$) regex is very inefficient as there are lots of unnecessary backtracking steps here.
The start of string is matched, and then [a-z]*. is matched 0+ times. That means, the engine matches as many [a-z] as possible (i.e. it matches test up to the first dot), and then . matches the dot (but only because . matches any character!). So, this ([a-z]*.)* matches test.orange.john.edn only capturing edn since repeating capturing groups only keep the last captured value.
You already have edn in Group 1 at this step. Now the . in .([a-z]*$) needs a character to match, but the engine is already at the end of the string. It backtracks and gives up the n, so Group 1 now contains only ed, the . matches n, and ([a-z]*$) matches the empty string.
For your task, you should escape the last . to match a literal dot and perhaps, the best expression is
^(.*)\.(.*)$
It will match the whole string up to the end with the first (.*), then backtrack to find the last . symbol (so Group 1 will hold all text from the beginning up to the last .), and then capture the rest of the string into Group 2.
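A small Python sketch comparing the original attempt with the escaped-dot pattern on the file name from the question. The result of the original attempt is printed rather than asserted, since it depends on how the engine backtracks.

```python
import re

path = "test.orange.john.edn"

# Original attempt: the repeated group keeps only its last iteration, and the
# unescaped dots match any character.
broken = re.match(r"^([a-z]*.)*.([a-z]*$)", path)
print(broken.groups())

# Escaped dot with a greedy (.*): the engine backtracks to the last literal dot.
fixed = re.match(r"^(.*)\.(.*)$", path)
body, extension = fixed.groups()
print(body, extension)
```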
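If a dot does not have to be present (i.e. if a file name has no extension), add an optional group. The first quantifier then has to be lazy, because a greedy (.*) would consume the whole string and leave the optional group unmatched, and the extension part should exclude dots so that the split happens at the last one:
^(.*?)(?:\.([^.]*))?$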
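How an optional extension group behaves can be checked with the Python sketch below. It contrasts a greedy first group with a lazy one; the helper name split_name and the README test input are hypothetical, added only for illustration.

```python
import re

# Greedy first group: (.*) takes the whole string, so the optional group is skipped.
greedy = re.match(r"^(.*)(?:\.(.*))?$", "test.orange.john.edn")
print(greedy.groups())

# Lazy first group plus a dot-free extension: splits at the last dot when there is one.
lazy = re.compile(r"^(.*?)(?:\.([^.]*))?$")

def split_name(name):
    """Hypothetical helper: return (body, extension); extension is None if absent."""
    return lazy.match(name).groups()

print(split_name("test.orange.john.edn"))
print(split_name("README"))
```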
You can try with:
^([a-z.]+)\.([a-z]+)$