Regex with non-capturing group using stringr in R

I am trying to use non-capturing groups with the str_extract function from the stringr package. Here is an example:
library(stringr)
txt <- "foo"
str_extract(txt,"(?:f)(o+)")
This returns
"foo"
while I expect it to return only
"oo"
like in this post: https://stackoverflow.com/a/14244553/3750030
How do I use non-capturing groups in R so that the group is used for matching but its content is left out of the returned value?

The regex (?:f)(o+) does not capture the f, but it still matches it.
Capturing means storing the matched text so it can be back-referenced, either to match the same text again later in the string or to reuse it in a replacement.
like in this post: https://stackoverflow.com/a/14244553/3750030
You misunderstood that answer. Non-capturing does not mean non-matching. In that answer the wanted text is captured in $1 (group 1) because the non-capturing group before it does not get a group number. str_extract returns the whole match rather than group 1, which is why you get foo.
If you want to match only B when it is preceded by A, use a positive lookbehind like this.
Regex: (?<=f)(o+)
Explanation:
(?<=f) asserts that f is present immediately before the following token, but does not include it in the match.
(o+) matches one or more o and captures them as group 1 (here $1) if the previous condition is true.
Regex101 Demo
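The difference between the two patterns can be checked in any engine with lookbehind support. Below is a minimal sketch in Python's re module (not stringr, so the function names differ from the question); the variable names are only illustrative.

```python
import re

txt = "foo"

# Non-capturing group: f is not captured, but it is still part of the match.
m = re.search(r"(?:f)(o+)", txt)
whole_match = m.group(0)   # the full match, including f
first_group = m.group(1)   # only what (o+) captured

# Positive lookbehind: f must precede the match but is not consumed.
lookbehind_match = re.search(r"(?<=f)o+", txt).group(0)

print(whole_match, first_group, lookbehind_match)
```

So there are two ways to get oo: read group 1 instead of the whole match, or keep f out of the match with a lookbehind.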

Related

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is this: there is a string that can look like post, or post put, or even get put post. All of these must be matched. Preferably, deviations like [space]post or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.
If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to recurse the subpatterns.
^(post|put|delete|get)( (?!\1)(?1))*$
                              ^^^^
See the regex demo
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
- a space
(?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the one captured into Group 1 due to backreference \1
(?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
To avoid matching strings like get post post, you also need to add a negative lookahead inside Group 1, so that each subroutine call also rejects a word that occurs again later in the string.
^((post|put|delete|get)(?!.*\2))( (?1))*$
See the regex demo
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.
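Subroutine calls such as (?1) are a PCRE feature and are not available in Python's built-in re module. As a rough equivalent under that assumption, the sketch below swaps in a different technique: the alternation is written once in a string (WORDS, a hypothetical name) and interpolated twice, and a single negative lookahead at the start rejects any string in which a word occurs twice.

```python
import re

WORDS = "post|put|delete|get"  # written once, interpolated below

pattern = re.compile(
    rf"^(?!.*\b(\w+)\b.*\b\1\b)"       # fail if any whole word appears twice
    rf"(?:{WORDS})(?: (?:{WORDS}))*$"  # one word, then space-separated words
)

def is_valid(s: str) -> bool:
    """Return True if s is a space-separated list of distinct allowed words."""
    return pattern.match(s) is not None

for candidate in ["post", "post put", "get put post", "post post", "get post post", " post", "get "]:
    print(repr(candidate), is_valid(candidate))
```

This is not the PCRE pattern above, only a way to get the same accept/reject behaviour on these inputs without repeating the word list by hand.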

Regex optional with two matching groups

Make a regex that matches this:
aaGrest with matching groups [aa, G, rest]
bb with matching groups [bb]
I am trying to make the Grest part optional, but this doesn't work:
^([a-z]{2}[a-z]?)[(P|G)(.*)]?
P.S.: don't complicate stuff or downvote!
Try this:
^([a-z]{2}[a-z]?)(?:(P|G)(.*))?
See live demo.
This uses a non-capturing group (syntax (?:...)) to group the optional part without forming a capturing group, which keeps the capturing groups numbered as 1, 2 and 3.
Square brackets form a character class, which isn't what you intended.
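A quick check of the corrected pattern, sketched in Python's re module (the question does not name a language, so this is an assumption). When the optional part does not take part in the match, groups 2 and 3 come back as None.

```python
import re

pattern = re.compile(r"^([a-z]{2}[a-z]?)(?:(P|G)(.*))?")

# The optional non-capturing group matches "Grest" here.
with_rest = pattern.match("aaGrest").groups()

# Here the optional group is skipped, so groups 2 and 3 are unset.
without_rest = pattern.match("bb").groups()

print(with_rest)
print(without_rest)
```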

Trouble with non-capturing groups in regular expression

I'm attempting to capture the 6 digit number in the following:
ObjectID: !nrtdms:0:!session:slonwswtest1:!database:TEST:!folder:ordinary,486150:
I tried the following regex:
\d+(?::$)
attempting to use a non-capturing group to strip the colon out of the returned match, but it returns the colon as in:
486150:
Any ideas what I'm doing wrong?
You want a positive lookahead:
\d+(?=:$)
A non-capturing group is simply a group that cannot be accessed via a backreference; whatever it matches is still part of the overall match.
Alternatively, you can use
(\d+):$
and obtain the 1st match group.
You should use a positive lookahead rather than a non-capturing group
\d+(?=:$)
Non-capturing groups are groups that will not create a capture (to be used in backreferences or extracted from the match result). Nonetheless they will match the expression.
What you're looking for is lookahead - to test the expression but exclude it from the match:
\d+(?=:$)
Probably your regex tool is returning the complete match since you don't have any capture group there. Try to enclose the \d+ in a capture group, and find the way to get capture group 1 in your regex tool.
Alternatively, you can also use positive look-ahead:
\d+(?=:$)
And given that you want to capture 6 digits, you can make that explicit:
\d{6}
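The three variants from the answers above can be compared side by side. This is a small sketch in Python's re module (the question does not say which tool is used, so that choice is an assumption).

```python
import re

s = "ObjectID: !nrtdms:0:!session:slonwswtest1:!database:TEST:!folder:ordinary,486150:"

# Non-capturing group: the colon is matched, so it is part of the result.
with_colon = re.search(r"\d+(?::$)", s).group(0)

# Lookahead: the colon is required but not consumed.
lookahead = re.search(r"\d+(?=:$)", s).group(0)

# Capture group: the colon is matched, but group 1 holds only the digits.
captured = re.search(r"(\d+):$", s).group(1)

print(with_colon, lookahead, captured)
```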

How does (?:\;jsessionid=[^\?#]*)? in a regular expression work?

Suppose I have this text to match:
http://localhost:8080/start.jsp;jsessionid=9E4CDB636248C9610F57704E5E07F782?whatever=true&somethingelse=true
Using this regular expression:
^(.*?start\.jsp)(?:\;jsessionid=[^\?#]*)?(\?[^#]*)?(#.*)?$
The resulting groups are:
http://localhost:8080/start.jsp
?whatever=true&somethingelse=true
A. Why isn't group number 2 this: ;jsessionid=9E4CDB636248C9610F57704E5E07F782?
What does this part ?:\ at the beginning of second group do?
B. And also, how can I create an expression to extract the same groups as for the example above, if my options are begin.jsp and start.jsp (not just start.jsp) before the jsessionid part?
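A. (?: ) is the syntax for a non-capturing group. As the name says, it matches but doesn't capture, so the jsessionid part gets no group number and group 2 is the query string. The \ before ; is just an escape, and it isn't needed there.
B. Put the alternatives in a non-capturing group inside the first group: (.*?(?:start|begin)\.jsp)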
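Here is a short sketch of the combined pattern in Python's re module (an assumption, since the question does not name a language). The begin.jsp URL is a made-up example used only to exercise the second alternative.

```python
import re

pattern = re.compile(
    r"^(.*?(?:start|begin)\.jsp)(?:\;jsessionid=[^\?#]*)?(\?[^#]*)?(#.*)?$"
)

url = ("http://localhost:8080/start.jsp;jsessionid=9E4CDB636248C9610F57704E5E07F782"
       "?whatever=true&somethingelse=true")
start_groups = pattern.match(url).groups()

# Hypothetical URL to show that begin.jsp is accepted as well.
begin_groups = pattern.match("http://localhost:8080/begin.jsp;jsessionid=ABC?x=1").groups()

print(start_groups)
print(begin_groups)
```

The jsessionid part is consumed by the non-capturing group, so it never shows up in the group list; the fragment group is None because neither URL has a # part.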

Regex Exclude Character From Group

I have a response:
MS1:111980613994
124 MS2:222980613994124
I have the following regex:
MS\d:(\d(?:\r?\n?)){15}
According to Regex, the "(?:\r?\n?)" part should let it match for the group but exclude it from the capture (so I get a contiguous value from the group).
Problem is that for "MS1:xxx" it matches the [CR][LF] and includes it in the group. It should be excluded from the capture ...
Help please.
The (?:...) syntax does not mean that the enclosed pattern will be excluded from any capture groups that enclose the (?:...).
It only means that the group formed by (?:...) will be a non-capturing group, as opposed to a new capture group.
Put another way:
(?:...) only groups
(...) has two functions: it both groups and captures.
Capture groups capture all of the text matched by the pattern they enclose, even the parts that are matched by nested groups (whether they are capturing or not).
An example
With the regex...
.*(l.*(o.*o).*l).*
...there are two capture groups. If we match this against hello world we get the following captures:
1: lo worl
2: o wo
Note that the text captured by group 2 is also captured by group 1.
If we change the inner group to be non-capturing...
.*(l.*(?:o.*o).*l).*
...group 1's capture will not be changed (when matched against the same string), but there is no longer a group 2:
1: lo worl
As you can see, if a non-capturing group is enclosed by a capture group, that enclosing capture group will capture the characters matched by the non-capturing group.
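The hello world example can be reproduced directly. A minimal check in Python, the language used further down in this answer:

```python
import re

text = "hello world"

# Two capture groups: the inner one is nested inside the outer one.
nested = re.match(r".*(l.*(o.*o).*l).*", text).groups()

# Inner group made non-capturing: group 1 is unchanged, group 2 is gone.
outer_only = re.match(r".*(l.*(?:o.*o).*l).*", text).groups()

print(nested)
print(outer_only)
```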
What are they for?
The purpose of non-capturing groups is not to exclude content from other capturing groups, but rather to act as a way to group operations without also capturing.
For example, if you want to repeat a substring, you might write (?:substring)*.
How do I solve my real problem?
If you really want to ignore embedded \rs and \ns your best bet is to strip them out in a second step. You don't say what language you're using, but something equivalent to this (Python) should work:
s = re.sub(r'[\r\n]', '', s)
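Putting the two steps together might look like the sketch below. It makes two assumptions that are not in the question: the response uses \r\n as its line break, and the capture group is moved around the whole repetition (rather than around a single digit) so that group 1 holds all 15 digits plus any embedded line breaks.

```python
import re

# Sample response from the question, assuming a CR LF line break.
response = "MS1:111980613994\r\n124 MS2:222980613994124"

# Step 1: capture 15 digits, allowing an optional line break after each one.
raw_values = re.findall(r"MS\d:((?:\d\r?\n?){15})", response)

# Step 2: strip the line breaks out of each captured value.
values = [re.sub(r"[\r\n]", "", v) for v in raw_values]

print(values)
```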
Perhaps what you mean to do here is place the [CR][LF] matching part outside of the captured group, something like: MS\d:(\d){15}(?:\r?\n?)
So far as I know, you'll have to use 2 regexes. One is "MS\d:(\d(?:\r?\n?)){15}", the other is used to remove the line breaks from the matches.
Please refer to "Regular expression to skip character in capture group".
How about MS\d:(?:(\d)\r?\n?){15}