Regex optional with two matching groups

Make a regex that matches this:
aaGrest with matching groups [aa, G, rest]
bb with matching groups [bb]
I am trying to make the Grest part optional, but this doesn't work:
^([a-z]{2}[a-z]?)[(P|G)(.*)]?
PS: don't complicate things or downvote!

Try this:
^([a-z]{2}[a-z]?)(?:(P|G)(.*))?
This uses a non-capturing group (syntax (?:...)) to group the optional part without forming a capturing group, which keeps the capturing groups numbered as 1, 2 and 3.
Square brackets form a character class, which isn't what you intended.
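A quick way to check the group numbering is to run the pattern against both inputs from the question. This is only a sketch in Python's re module (the question doesn't name a language), and the helper name groups is made up for the illustration:

```python
import re

# Pattern from the answer: the optional tail sits in a non-capturing group,
# so only (P|G) and (.*) receive group numbers 2 and 3.
pattern = re.compile(r"^([a-z]{2}[a-z]?)(?:(P|G)(.*))?")

def groups(text):
    """Return the tuple of capture groups, or None if nothing matches."""
    m = pattern.match(text)
    return m.groups() if m else None

print(groups("aaGrest"))  # ('aa', 'G', 'rest')
print(groups("bb"))       # ('bb', None, None)
```

When the optional part is absent, groups 2 and 3 still exist but are unset (None in Python).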

Related

Regex with non-capturing group using stringr in R

I am trying to use non-capturing groups with the str_extract function from the stringr package. Here is an example:
library(stringr)
txt <- "foo"
str_extract(txt,"(?:f)(o+)")
This returns
"foo"
while I expect it to return only
"oo"
like in this post: https://stackoverflow.com/a/14244553/3750030
How do I use non-capturing groups in R to remove the content of the groups from the returned value while still using it for matching?
The regex (?:f)(o+) does not capture the f, but it still matches it.
Capturing means storing the matched text in memory so that it can be back-referenced, either to repeat the match later in the same string or to reuse the captured text in a replacement.
You wrote: like in this post: https://stackoverflow.com/a/14244553/3750030
You misunderstood that answer. Non-capturing does not mean non-matching. There, the text ends up in $1 (group 1) because no capturing group comes before it.
If you want to match only B when it comes right after A, use a positive lookbehind like this.
Regex: (?<=f)(o+)
Explanation:
(?<=f) checks that f is present immediately before the following token, but does not include it in the match.
(o+) matches one or more o and captures them as a group (here $1), provided the lookbehind condition holds.
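The difference between the whole match and a capture group is easy to see in code. The question is about R's stringr, so take this Python sketch only as an illustration of the regex behaviour; the variable names are invented:

```python
import re

txt = "foo"

# Non-capturing group: f is still part of the overall match.
with_group = re.search(r"(?:f)(o+)", txt)
print(with_group.group(0))  # foo  (whole match)
print(with_group.group(1))  # oo   (first capture group)

# Lookbehind: f must precede the match but is not consumed.
with_lookbehind = re.search(r"(?<=f)(o+)", txt)
print(with_lookbehind.group(0))  # oo
```

A function that returns the whole match, as str_extract does in the question, corresponds to group(0) here, which is why the lookbehind version is the one that yields "oo".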

Regex Optional Match

I have this regex pattern which I made myself (I'm a noob though, and made it through following tutorials):
^([a-z0-9\p{Greek}].*)\s(Ε[0-9\p{Greek}]+|Θ)\s[\(]([a-z1-9\p{Greek}]+.*)[\)]\s-\s([a-z0-9\p{Greek}]+$)
And I'm trying to match the following sentences:
ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ Ε2 (Ε.Β.Δ.) - ΔΗΜΗΤΡΙΟΥ
ΠΡΟΓΡΑΜΜΑΤΙΣΜΟΣ 1 Θ (ΑΜΦ) - ΜΑΣΤΟΡΟΚΩΣΤΑΣ
ΕΙΣΑΓΩΓΗ ΣΤΗΝ ΠΛΗΡΟΦΟΡΙΚΗ Θ (ΑΜΦ) - ΒΟΛΟΓΙΑΝΝΙΔΗΣ
And so on.
This pattern splits the string into 4 parts.
For example, for the string:
ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ Ε2 (Ε.Β.Δ.) - ΔΗΜΗΤΡΙΟΥ
The first match is: ΠΡΟΓΡΑΜΜΑΤΙΣΤΙΚΕΣ ΕΦΑΡΜ ΣΤΟ ΔΙΑΔΙΚΤΥΟ (Subject's Name)
Second match is: Ε2 (Class)
Third match is: Ε.Β.Δ. (Room)
And the fourth match is: ΔΗΜΗΤΡΙΟΥ (Teacher)
Now in some entries E*/Θ is not defined, and I want to get the 3 matches without the E*/Θ. How should I modify my pattern so that (Ε[0-9\p{Greek}]+|Θ) is an optional match?
I tried ? so far, but because my pattern has a \s on each side of that group, it requires 2 whitespaces to get 3 matches and I only have one in my string.
I think you need to do two things:
Make .* lazy (i.e. .*?)
Enclose \s(Ε[0-9\p{Greek}]+|Θ) in an optional non-capturing group: (?:\s(Ε[0-9\p{Greek}]+|Θ))?
The regex will look like
^([a-z0-9\p{Greek}].*?)(?:\s(Ε[0-9\p{Greek}]+|Θ))?\s[\(]([a-z1-9\p{Greek}]+.*)[\)]\s-\s([a-z0-9\p{Greek}]+)$
If you do not make the first .* lazy, it will consume the text that the optional second group is meant to capture. Making it lazy ensures that whenever some text can be matched by the second capturing group, that group is set.
Note you call capture groups matches, which is wrong. Matches are whole texts matched by the entire regular expression and captures are just substrings matched by parts of regexp enclosed in unescaped round brackets. See more on capture groups at regular-expressions.info.
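To see the effect of the lazy quantifier, here is a rough Python sketch. Python's built-in re module has no \p{Greek}, so the code block range \u0370-\u03FF stands in for it; that substitution, the GREEK constant and the parse helper are assumptions made for the illustration, not part of the original pattern.

```python
import re

# Assumption: \u0370-\u03FF (Greek and Coptic block) approximates \p{Greek}.
GREEK = r"\u0370-\u03FF"

lazy = re.compile(
    rf"^([a-z0-9{GREEK}].*?)(?:\s(Ε[0-9{GREEK}]+|Θ))?"
    rf"\s[\(]([a-z1-9{GREEK}]+.*)[\)]\s-\s([a-z0-9{GREEK}]+)$"
)
# Same pattern with a greedy first group, for comparison.
greedy = re.compile(lazy.pattern.replace(".*?", ".*", 1))

def parse(line, pattern=lazy):
    """Return (subject, class or None, room, teacher), or None on no match."""
    m = pattern.match(line)
    return m.groups() if m else None

with_class = "ΠΡΟΓΡΑΜΜΑΤΙΣΜΟΣ 1 Θ (ΑΜΦ) - ΜΑΣΤΟΡΟΚΩΣΤΑΣ"
without_class = "ΠΡΟΓΡΑΜΜΑΤΙΣΜΟΣ 1 (ΑΜΦ) - ΜΑΣΤΟΡΟΚΩΣΤΑΣ"

print(parse(with_class))          # class group is 'Θ'
print(parse(without_class))       # class group is None
print(parse(with_class, greedy))  # greedy first group swallows ' Θ'
```

With the greedy variant the line still matches, but the class ends up inside the first group and the optional group stays unset.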
You can use something like:
(Ε[0-9\p{Greek}]+|Θ)?
The whole group will be optional (?).

Trouble with non-capturing groups in regular expression

I'm attempting to capture the 6 digit number in the following:
ObjectID: !nrtdms:0:!session:slonwswtest1:!database:TEST:!folder:ordinary,486150:
I tried the following regex:
\d+(?::$)
attempting to use a non-capturing group to strip the colon out of the returned match, but it returns the colon as in:
486150:
Any ideas what I'm doing wrong?
You want a positive lookahead:
\d+(?=:$)
A non-capturing group is simply a group that cannot be accessed via a backreference; whatever it matches is still part of the overall match.
Alternatively, you can use
(\d+):$
and read capture group 1.
You should use a positive lookahead rather than a non-capturing group:
\d+(?=:$)
Non-capturing groups are groups that will not create a capture (to be used in backreferences or extracted from the match result). Nonetheless they will match the expression.
What you're looking for is lookahead - to test the expression but exclude it from the match:
\d+(?=:$)
Probably your regex tool is returning the complete match since you don't have any capture group there. Try to enclose the \d+ in a capture group, and find the way to get capture group 1 in your regex tool.
Alternatively, you can also use positive look-ahead:
\d+(?=:$)
And given that you want to capture 6 digits, you can make that explicit:
\d{6}
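The three variants discussed in these answers can be compared side by side. The sketch below uses Python's re module purely as an example tool; the variable names are made up:

```python
import re

line = ("ObjectID: !nrtdms:0:!session:slonwswtest1:"
        "!database:TEST:!folder:ordinary,486150:")

# Non-capturing group: the colon is consumed, so it is in the match.
non_capturing = re.search(r"\d+(?::$)", line).group(0)

# Lookahead: the colon is required but not consumed.
lookahead = re.search(r"\d+(?=:$)", line).group(0)

# Capture group: the whole match includes the colon, group 1 does not.
captured = re.search(r"(\d+):$", line).group(1)

print(non_capturing)  # 486150:
print(lookahead)      # 486150
print(captured)       # 486150
```

Which of the last two is more convenient depends on whether your tool lets you pick a capture group or only returns the whole match.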

How does (?:\;jsessionid=[^\?#]*)? in a regular expression work?

Suppose I have this text to match:
http://localhost:8080/start.jsp;jsessionid=9E4CDB636248C9610F57704E5E07F782?whatever=true&somethingelse=true
Using this regular expression:
^(.*?start\.jsp)(?:\;jsessionid=[^\?#]*)?(\?[^#]*)?(#.*)?$
The resulting groups are:
http://localhost:8080/start.jsp
?whatever=true&somethingelse=true
A. Why isn't group number 2 this: ;jsessionid=9E4CDB636248C9610F57704E5E07F782?
What does this part ?:\ at the beginning of second group do?
B. And also, how can I create an expression to extract the same groups as for the example above, if my options are begin.jsp and start.jsp (not just start.jsp) before the jsessionid part?
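A. (?: ) is the syntax for a non-capturing group. As the name says, it doesn't capture its match, so it gets no group number and the query string becomes group 2. The backslash after ?: simply escapes the semicolon.
B. Put a non-capturing group with an alternation in the first group: (.*?(?:start|begin)\.jsp)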
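Putting the answer to B into the full expression, a sketch in Python could look like the following. The second URL and the split_url helper are invented for the example, and the unneeded backslashes before ; and ? inside the classes are dropped:

```python
import re

pattern = re.compile(
    r"^(.*?(?:start|begin)\.jsp)"   # group 1: everything up to the page name
    r"(?:;jsessionid=[^?#]*)?"      # session id: matched, but not captured
    r"(\?[^#]*)?"                   # group 2: query string
    r"(#.*)?$"                      # group 3: fragment
)

def split_url(url):
    """Return (base, query, fragment); missing parts are None."""
    m = pattern.match(url)
    return m.groups() if m else None

print(split_url(
    "http://localhost:8080/start.jsp;jsessionid=9E4CDB636248C9610F57704E5E07F782"
    "?whatever=true&somethingelse=true"
))
# ('http://localhost:8080/start.jsp', '?whatever=true&somethingelse=true', None)

# Hypothetical URL using begin.jsp and a fragment instead of a query string.
print(split_url("http://localhost:8080/begin.jsp;jsessionid=ABC#top"))
# ('http://localhost:8080/begin.jsp', None, '#top')
```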

Regex Exclude Character From Group

I have a response:
MS1:111980613994
124 MS2:222980613994124
I have the following regex:
MS\d:(\d(?:\r?\n?)){15}
According to Regex, the "(?:\r?\n?)" part should let it match for the group but exclude it from the capture (so I get a contiguous value from the group).
Problem is that for "MS1:xxx" it matches the [CR][LF] and includes it in the group. It should be excluded from the capture ...
Help please.
The (?:...) syntax does not mean that the enclosed pattern will be excluded from any capture groups that enclose the (?:...).
It only means that the group formed by (?:...) will be a non-capturing group, as opposed to a new capture group.
Put another way:
(?:...) only groups
(...) has two functions: it both groups and captures.
Capture groups capture all of the text matched by the pattern they enclose, even the parts that are matched by nested groups (whether they are capturing or not).
An example
With the regex...
.*(l.*(o.*o).*l).*
...there are two capture groups. If we match this against hello world we get the following captures:
1: lo worl
2: o wo
Note that the text captured by group 2 is also captured by group 1.
If we change the inner group to be non-capturing...
.*(l.*(?:o.*o).*l).*
...group 1's capture will not be changed (when matched against the same string), but there is no longer a group 2:
1: lo worl
As you can see, if a non-capturing group is enclosed by a capture group, that enclosing capture group will capture the characters matched by the non-capturing group.
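The hello world example can be reproduced as follows. The answer itself is language-neutral, so this Python version is just one way to check it; the variable names are made up:

```python
import re

text = "hello world"

# Two capture groups: the inner one is nested inside the outer one.
both = re.fullmatch(r".*(l.*(o.*o).*l).*", text).groups()

# Inner group made non-capturing: group 1 is unchanged, group 2 is gone.
outer_only = re.fullmatch(r".*(l.*(?:o.*o).*l).*", text).groups()

print(both)        # ('lo worl', 'o wo')
print(outer_only)  # ('lo worl',)
```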
What are they for?
The purpose of non-capturing groups is not to exclude content from other capturing groups, but rather to act as a way to group operations without also capturing.
For example, if you want to repeat a substring, you might write (?:substring)*.
How do I solve my real problem?
If you really want to ignore embedded \rs and \ns your best bet is to strip them out in a second step. You don't say what language you're using, but something equivalent to this (Python) should work:
import re
s = re.sub(r'[\r\n]', '', s)  # drop every CR and LF from s
Perhaps what you mean to do here is place the [CR][LF] matching part outside of the captured group, something like: MS\d:(\d){15}(?:\r?\n?)
So far as I know, you'll have to use 2 regexes. One is "MS\d:(\d(?:\r?\n?)){15}", the other is used to remove the line breaks from the matches.
Please refer to "Regular expression to skip character in capture group".
How about MS\d:(?:(\d)\r?\n?){15}
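For what it's worth, here is how the two-step approach could look in Python. The question doesn't say which language is in use, so treat this as a sketch: the response string is reconstructed from the question assuming a CR LF at the line break, and the outer capture group around the repetition is my own variation on the patterns above.

```python
import re

# Assumption: the line break in the response is "\r\n".
response = "MS1:111980613994\r\n124 MS2:222980613994124"

# Step 1: capture all 15 digits plus any embedded line breaks in one group.
pattern = re.compile(r"MS\d:((?:\d\r?\n?){15})")

# Step 2: strip the line breaks from each captured value.
values = [re.sub(r"[\r\n]", "", m.group(1)) for m in pattern.finditer(response)]
print(values)  # ['111980613994124', '222980613994124']

# In Python's re, a capture group inside a repetition keeps only its
# last iteration, so this yields a single digit rather than all 15.
last_only = re.search(r"MS\d:(?:(\d)\r?\n?){15}", response).group(1)
print(last_only)  # 4
```

Some regex engines expose every iteration of a repeated group, so whether the last pattern is enough on its own depends on the tool.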