I am trying to match the word group or match the absence of the word group
http://rubular.com/r/TKJPFvnzZ0
I can match a space, but I would like it to actually match nothing. I am struggling to find the correct syntax.
Match group 3 should contain either the word group or an empty string.
Thanks!
Not sure if I understood you correctly, but would this solve your problem?
^I post a "(.*?)" to the "(.*?)"(?: (group))? which the entire world can see$
It basically says that the word group is optional.
The ?: inside the parentheses marks that group as a "non-capturing group", which means we enclose that part of the expression in parentheses only to group it, not to capture its content for later use. The word group itself is enclosed in plain parentheses because we do want to capture that match as a group.
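As a minimal Python sketch of the optional group: the two sample sentences are made up for illustration, since the original test strings are not shown here. Note that Python's re module reports a group that did not take part in the match as None rather than as an empty string, so the sketch falls back to '' explicitly.

```python
import re

# Pattern from the answer above: the word "group" is optional.
PATTERN = re.compile(
    r'^I post a "(.*?)" to the "(.*?)"(?: (group))? which the entire world can see$'
)

def third_group(sentence):
    """Return capture group 3, or '' when the optional word is absent."""
    match = PATTERN.match(sentence)
    if match is None:
        raise ValueError("sentence does not match the pattern")
    # An optional group that did not participate is None in Python.
    return match.group(3) or ""

# Hypothetical sample sentences, not taken from the question.
with_group = 'I post a "message" to the "wall" group which the entire world can see'
without_group = 'I post a "message" to the "wall" which the entire world can see'

print(repr(third_group(with_group)))     # 'group'
print(repr(third_group(without_group)))  # ''
```

Other engines may report the missing group differently, so the fallback is specific to this sketch.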
I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches the whole string and all the groups I need, but according to the match information there are extra groups (3 and 4) with null values. I've looked but can't find what is causing this. Can anyone help me understand?
Your regex was almost right:
(^\d+\w?)\s([\w*\s?]+).\s([\w*\s?]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups: in both you used a group inside a group, ((\w*\s?)+), where a character class inside a group, ([\w*\s?]+), matches the same text here and gives you the proper group content. Note that inside a character class * and ? are literal characters rather than quantifiers, so ([\w\s]+) is the cleaner way to write it.
With your previous syntax, the inner group could match an empty substring, since both quantifiers allow a zero-length match (* is zero or more and ? is zero or one). Since this group was repeated one or more times with +, and a repeated capture group keeps only its last iteration, the last iteration matched an empty string and that is all the group kept.
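The "only the last repetition is kept" behaviour can be checked with a small Python sketch. It is deliberately simplified: the inner group uses \w+ instead of the question's \w* so that no repetition is empty, and the sample text is just the city part of the address.

```python
import re

city = "Lake Forest Park"

# Capture group repeated with +: the inner group is overwritten on every
# repetition, so only the last word survives in group 2.
nested = re.fullmatch(r'((\w+\s?)+)', city)
print(nested.groups())   # ('Lake Forest Park', 'Park')

# Character class inside a single group: one capture holding all of the text.
flat = re.fullmatch(r'([\w\s]+)', city)
print(flat.groups())     # ('Lake Forest Park',)
```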
For this you'll need a non-capturing group, which has the form (?:regex), in the places where you currently see your "null results". This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: ([^s]+) (?:[^s]+):
See how the first group is captured into "Group 1" and the second one is not captured at all?
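The difference can also be seen in a short Python sketch; the input "hello world" is an arbitrary example string chosen for this illustration.

```python
import re

match = re.match(r'([^s]+) (?:[^s]+)', 'hello world')

print(match.group(0))   # hello world  (the whole match includes both parts)
print(match.groups())   # ('hello',)   (only the capturing group is reported)
```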
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Or take the part before the comma that ends in two or more uppercase characters, and then match optional non-word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
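Both variants can be tried against the address from the question with a short Python sketch. The variable names are only illustrative, and a single sample address says nothing about how well the patterns cope with other address formats.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Variant 1: relies on the comma between street and city.
comma_based = re.compile(
    r'^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$'
)
# Variant 2: the street part ends with a word of 2+ uppercase letters.
uppercase_based = re.compile(
    r'^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$'
)

for pattern in (comma_based, uppercase_based):
    number, street, city, state, zip_code = pattern.match(address).groups()
    print(number, street, city, state, zip_code, sep=" | ")
```

For this address both patterns yield the same five groups: house number, street, city, state and zip code.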
I have the following regular expression on nginx:
^(?<subdomain>.+)\.test\.com$
If parentheses are for grouping, then how does it match 'something.test.com' or 'foobar.test.com'?
I was expecting it to match only the word 'subdomain'. I think I am not understanding the ? and <> symbols. Also, I can't see the use of the .+ at the end.
(?<name>.+) is a named capture group. The only part of this group that is actually a pattern is the .+; the ?<name> part just gives the group its name.
The benefit to using named capture groups is that you can reference them by name rather than number, so in this case "something" or "foobar" can be referenced using the subdomain capture group.
The .+ at the end just means to match one or more of any character except newline characters.
This should help you visualize it better
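For experimenting outside nginx, here is a minimal Python sketch. Python's re module spells named groups (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly; the host names are the ones from the question plus one hypothetical non-matching host.

```python
import re

# nginx form: ^(?<subdomain>.+)\.test\.com$
# Python's re uses (?P<name>...) for the same named group.
HOST_RE = re.compile(r'^(?P<subdomain>.+)\.test\.com$')

def subdomain_of(host):
    """Return the text captured by the 'subdomain' group, or None."""
    match = HOST_RE.match(host)
    return match.group('subdomain') if match else None

print(subdomain_of('something.test.com'))  # something
print(subdomain_of('foobar.test.com'))     # foobar
print(subdomain_of('example.org'))         # None (hypothetical non-match)
```

The name subdomain never has to appear in the input; it is only the label under which the text matched by .+ is stored.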
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
used to match string
123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23
How come this works on regexr.com but Python 3.5.1 can't find a match?
r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'
can match up to
123 FEX-1-80 Online N2K-C2248TP
but the second hyphen - in group(4) is not matched
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
Just a comment, not really an answer, but for the sake of clarity I have posted it as an answer.
If you are relatively new to regular expressions, consider using verbose mode. With it, your expression becomes much more readable:
(1[0-9]{2})\s+ # three digits, the first one needs to be 1
(\w+(?:-\w+)+)\s+ # a word character (wc), followed by - and wcs
(\w+)\s+ # another word
(\w+(?:-\w+)+)\s+ # same expression as above
(\w+) # another word
Also, check whether your second and fourth expressions could be rewritten as [\w-]+. It is not the same as yours and will match other substrings as well, but in general try to avoid nested parentheses.
Concerning your question, the second string cannot be matched as you made all of your expressions mandatory (and group 5 is missing in the second example, so it will fail).
See a demo on regex101.com.
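In Python, the verbose form can be compiled with the re.VERBOSE flag. The following is a minimal sketch; the name LINE_RE is just an illustrative choice.

```python
import re

# re.VERBOSE ignores unescaped whitespace and treats "#" as a comment start.
LINE_RE = re.compile(r'''
    (1[0-9]{2})\s+       # three digits, the first one must be 1
    (\w+(?:-\w+)+)\s+    # word characters followed by one or more -word parts
    (\w+)\s+             # another word
    (\w+(?:-\w+)+)\s+    # same expression as group 2
    (\w+)                # another word
''', re.VERBOSE)

line = '123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23'
print(LINE_RE.match(line).groups())
# ('123', 'FEX-1-80', 'Online', 'N2K-C2248TP-1GE', 'SSDFDFWFw23r23')
```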
This regular expression matches the full input string:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
This one doesn't:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))
The latter is missing a + after the last non-capturing group, and it's missing the \s+(\w+) at the end that matches the SSDFDFWFw23r23 at the end of the input string.
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
I'm not sure I follow. A non-capturing group is really just there to group a part of a regular expression.
(?:-\w+) or just -\w+ will both match a hyphen (-) followed by one or more "word" characters (\w+). It doesn't matter whether that regular expression is in a non-capturing group or not. If you want to match repetitions of that pattern, you can put the + quantifier after the non-capturing group, e.g. (?:-\w+)+. That pattern will match a string like -foo-bar-baz.
So the reason your second regular expression doesn't match the repeated pattern is that it lacks the + quantifier.
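The effect of the missing + can be reproduced with a short Python sketch. The two patterns below are the expressions from the question cut off after group 4, so that only the quantifier differs.

```python
import re

line = '123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23'

with_plus = r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)'
without_plus = r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'

# With the + quantifier, group 4 takes every "-word" part.
print(re.match(with_plus, line).group(4))      # N2K-C2248TP-1GE
# Without it, the non-capturing group matches exactly once.
print(re.match(without_plus, line).group(4))   # N2K-C2248TP

# The same effect in isolation:
print(re.match(r'(?:-\w+)+', '-foo-bar-baz').group(0))  # -foo-bar-baz
print(re.match(r'(?:-\w+)', '-foo-bar-baz').group(0))   # -foo
```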
This text
"dhdhd89(dd)"
Matched against this regex
.+?(?:\()
..returns "dhdhd89(".
Why is the start parenthesis included in the capture?
Two different tools, as well as the .NET Regex class, return the same result. So I gather there is something I don't understand about this.
The way I read my regex is:
Match any character, at least one occurrence, as few as possible.
The matched string should be followed by a start parenthesis, which should not be included in the capture.
I can find a workaround, but I still want to know what is going on.
Just turn the non-capturing group into a positive lookahead assertion.
.+?(?=\()
.+? is a non-greedy match of one or more characters, and (?=\() requires that an opening parenthesis follows. An assertion does not consume any characters; it only asserts whether a match is possible at that position. A non-capturing group, by contrast, does consume the characters it matches.
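The question was asked about .NET, but the behaviour is the same in Python's re module, so the difference can be sketched there using the text from the question:

```python
import re

text = 'dhdhd89(dd)'

# Non-capturing group: the "(" is still consumed, so it is part of the match.
consumed = re.search(r'.+?(?:\()', text)
print(consumed.group(0))   # dhdhd89(

# Lookahead: the "(" must follow, but it is not consumed.
asserted = re.search(r'.+?(?=\()', text)
print(asserted.group(0))   # dhdhd89
```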
You can just use this negation-based regex to capture only the text before a literal (:
^([^(]+)
When you use:
.+?(?:\()
The regex engine does match the ( after the initial text; it just doesn't return it to you in a capture group.
You haven't defined any capture groups, so I guess you are displaying the whole match (group 0). You can do:
(.+?)(?:\()
and the string you want is in group 1.
Or use a lookahead, as @AvinashRaj said.
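A minimal Python sketch of the capture-group variant and of the negated character class, again using the text from the question:

```python
import re

text = 'dhdhd89(dd)'

# Capture only the part before "(" and read it from group 1.
match = re.search(r'(.+?)(?:\()', text)
print(match.group(0))   # dhdhd89(  (whole match)
print(match.group(1))   # dhdhd89   (capture group 1)

# Negated character class: everything up to, but not including, the first "(".
print(re.search(r'^([^(]+)', text).group(1))   # dhdhd89
```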
I have a response:
MS1:111980613994
124 MS2:222980613994124
I have the following regex:
MS\d:(\d(?:\r?\n?)){15}
According to Regex, the "(?:\r?\n?)" part should let it match for the group but exclude it from the capture (so I get a contiguous value from the group).
Problem is that for "MS1:xxx" it matches the [CR][LF] and includes it in the group. It should be excluded from the capture ...
Help please.
The (?:...) syntax does not mean that the enclosed pattern will be excluded from any capture groups that enclose the (?:...).
It only means that the group formed by (?:...) will be a non-capturing group, as opposed to a new capture group.
Put another way:
(?:...) only groups
(...) has two functions: it both groups and captures.
Capture groups capture all of the text matched by the pattern they enclose, even the parts that are matched by nested groups (whether they are capturing or not).
An example
With the regex...
.*(l.*(o.*o).*l).*
...there are two capture groups. If we match this against hello world we get the following captures:
1: lo worl
2: o wo
Note that the text captured by group 2 is also captured by group 1.
If we change the inner group to be non-capturing...
.*(l.*(?:o.*o).*l).*
...group 1's capture will not be changed (when matched against the same string), but there is no longer a group 2:
1: lo worl
As you can see, if a non-capturing group is enclosed by a capture group, that enclosing capture group will capture the characters matched by the non-capturing group.
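The two results above can be reproduced with a short Python sketch (Python is only an assumption here, since the question does not name a language):

```python
import re

text = 'hello world'

both_capturing = re.fullmatch(r'.*(l.*(o.*o).*l).*', text)
print(both_capturing.groups())        # ('lo worl', 'o wo')

inner_non_capturing = re.fullmatch(r'.*(l.*(?:o.*o).*l).*', text)
print(inner_non_capturing.groups())   # ('lo worl',)
```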
What are they for?
The purpose of non-capturing groups is not to exclude content from other capturing groups, but rather to act as a way to group operations without also capturing.
For example, if you want to repeat a substring, you might write (?:substring)*.
How do I solve my real problem?
If you really want to ignore embedded \rs and \ns your best bet is to strip them out in a second step. You don't say what language you're using, but something equivalent to this (Python) should work:
s = re.sub(r'[\r\n]', '', s)
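Put together, the two-step approach might look like the following sketch. The pattern wraps the repeated part in one outer capture group, which is an adaptation for this illustration rather than the regex from the question, and the sample response assumes the line break is a CR LF pair.

```python
import re

# Sample response from the question; the CR LF line break is an assumption.
response = 'MS1:111980613994\r\n124 MS2:222980613994124'

# Step 1: capture 15 digits, tolerating a line break after any digit.
raw_values = re.findall(r'MS\d:((?:\d\r?\n?){15})', response)

# Step 2: strip the embedded line breaks from each captured value.
values = [re.sub(r'[\r\n]', '', raw) for raw in raw_values]

print(values)   # ['111980613994124', '222980613994124']
```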
Perhaps what you mean to do here is place the [CR][LF] matching part outside of the captured group, something like: MS\d:(\d){15}(?:\r?\n?)
So far as I know, you'll have to use 2 regexes. One is "MS\d:(\d(?:\r?\n?)){15}", the other is used to remove the line breaks from the matches.
Please refer to "Regular expression to skip character in capture group".
How about MS\d:(?:(\d)\r?\n?){15}