Optional groups in a regular expression - regex

For some reason I fail to write a correct regular expression to match the following strings:
a/b/c/d:d0/d1/d2/d3/:e;f'
a/b/c/d:d0/:e;f'
a/b/c/d:d0/:e'
a/b/c/d:d0'
In each string the c and e should be extracted. As you see, e is optional and the last string doesn't contain it. In that case the regular expression should still match and return the c.
This is the expression that I came up with, but it does not support an optional e:
a\/b\/(?<the_c>\w*)\/.*?\/:(?<the_e>\w*)
I thought to make the last part optional, but then it just doesn't find the e at all:
a\/b\/(?<the_c>\w*)\/.*?(?:\/:(?<the_e>\w*))?
(The added parts are the opening (?: and the closing )? around the last part.)
Here is a link to test out this example: https://regex101.com/r/C2Jkhq/1
What's wrong with my regex here?

You can use
a\/b\/(?<the_c>\w*)\/(?:.*\/:(?<the_e>\w*))?
Details:
a\/b\/ - a/b/ string
(?<the_c>\w*) - zero or more word chars captured into "the_c" group
\/ - a / char
(?:.*\/:(?<the_e>\w*))? - an optional sequence (that is tried at least once) matching:
.* - any zero or more chars other than line break chars as many as possible
\/: - /: string
(?<the_e>\w*) - zero or more word chars captured into "the_e" group.
See this regex demo.
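As a quick check, here is a minimal Python sketch comparing the pattern from the question with the one above. It assumes Python's re module, where named groups are written (?P<name>...) and / needs no escaping; the helper name extract is made up for illustration, and the sample strings are the ones from the question without the trailing quote. With the lazy .*? the optional group is allowed to match nothing right away, so the_e stays empty; moving a greedy .* inside the optional group forces the engine to try the /: part first.

```python
import re

# Pattern from the question: lazy .*? followed by an optional group.
ORIGINAL = re.compile(r"a/b/(?P<the_c>\w*)/.*?(?:/:(?P<the_e>\w*))?")
# Pattern from the answer: greedy .* moved inside the optional group.
FIXED = re.compile(r"a/b/(?P<the_c>\w*)/(?:.*/:(?P<the_e>\w*))?")

# Sample strings from the question (trailing quote dropped).
samples = [
    "a/b/c/d:d0/d1/d2/d3/:e;f",
    "a/b/c/d:d0/:e;f",
    "a/b/c/d:d0/:e",
    "a/b/c/d:d0",
]


def extract(pattern, text):
    """Return (the_c, the_e); the_e is None when the optional part did not match."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group("the_c"), m.group("the_e")


for s in samples:
    print(s, "original:", extract(ORIGINAL, s), "fixed:", extract(FIXED, s))
```

With the fixed pattern the first three strings give ("c", "e") and the last one gives ("c", None), while the original pattern returns None for the_e in every case.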

Related

RegEx: how to not match a repetition

I have the following strings:
test_abc123_firstrow
test_abc1564_secondrow
test_abc123_abc234_thirdrow
test_abc1663_fourthrow
test_abc193_abc123_fifthrow
I want to get the abc + following number of each row.
But just the first one if it has more than one.
My current pattern looks like this: ([aA][bB][cC]\w\d+[a-z]*)
But this doesn't match only the first one.
If somebody could help how I can implement that, that would be great.
You can use
^.*?([aA][bB][cC]\d+[a-z]*)
Note the removed \w, it matches letters, digits and underscores, so it looks redundant in your pattern.
The ^.*? added at the start anchors the match at the beginning of the string, so only the first occurrence is captured. Details:
^ - start of string
.*? - any zero or more chars other than line break chars as few as possible
([aA][bB][cC]\d+[a-z]*) - Capturing group 1: a or A, b or B, c or C, then one or more digits and then zero or more lowercase ASCII letters.
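A small Python sketch of how this could be applied to the rows from the question, assuming Python's re module (the question does not name a language) and a made-up helper name first_abc:

```python
import re

# Anchored at the start; the lazy .*? stops at the first "abc" + digits.
FIRST_ABC = re.compile(r"^.*?([aA][bB][cC]\d+[a-z]*)")

rows = [
    "test_abc123_firstrow",
    "test_abc1564_secondrow",
    "test_abc123_abc234_thirdrow",
    "test_abc1663_fourthrow",
    "test_abc193_abc123_fifthrow",
]


def first_abc(row):
    """Return the first abc+number token in the row, or None if there is none."""
    m = FIRST_ABC.match(row)
    return m.group(1) if m else None


results = [first_abc(r) for r in rows]
print(results)
# ['abc123', 'abc1564', 'abc123', 'abc1663', 'abc193']
```

Because the pattern is anchored with ^ and matched once per row, the second token in the third and fifth rows is never reached.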
Use the following regex:
^.*?([aA][bB][cC]\d+)
Use ^ to begin at the start of the input
.*? matches zero or more characters (except line breaks) as few times as possible (lazy approach)
The rest is then captured in the capturing group as expected.
Demo

golang regex get the string including the search character

I am extracting a piece of string from a string (link):
https://arteptweb-vh.akamaihd.net/i/am/ptweb/100000/100000/100095-000-A_0_VO-STE%5BANG%5D_AMM-PTWEB_XQ.1V7rLEYkPH.smil/master.m3u8
The desired output should be 100000/100000/100095-000-A_
I am using the regex ^.*?(/[i,na,fm,d]([,/]?)(/am/ptweb/|.+=.+,))([^_]*).*?$ in the Golang flavor, and I only get group 4 with the following output: 100000/100000/100095-000-A
However I want the underscore after A.
Bit stuck on this, any help on this is appreciated.
You can use
(/(i|na|fm|d)(/am/ptweb/|.+=.+,))([^_]*_?)
See the regex demo.
Details:
(/(i|na|fm|d)(/am/ptweb/|.+=.+,)) - Group 1:
/ - a / char
(i|na|fm|d) - Group 2: i, na, fm or d
(/am/ptweb/|.+=.+,) - Group 3: /am/ptweb/ or one or more chars as many as possible (other than line break chars), =, one or more chars as many as possible (other than line break chars) and a , char
([^_]*_?) - Group 4: zero or more chars other than _ and then an optional _.
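The question is about Go's regexp package, but the pattern only uses groups, alternation and character classes, so a rough check can be sketched in Python's re module as well (an assumption on my side that both engines treat this pattern the same way). The URL is the one from the question:

```python
import re

# Same pattern as above; group 4 now includes the optional trailing underscore.
PATTERN = re.compile(r"(/(i|na|fm|d)(/am/ptweb/|.+=.+,))([^_]*_?)")

url = (
    "https://arteptweb-vh.akamaihd.net/i/am/ptweb/100000/100000/"
    "100095-000-A_0_VO-STE%5BANG%5D_AMM-PTWEB_XQ.1V7rLEYkPH.smil/master.m3u8"
)

m = PATTERN.search(url)
wanted = m.group(4) if m else None
print(wanted)
# 100000/100000/100095-000-A_
```

In Go the same group would be read from the slice returned by FindStringSubmatch at index 4.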
You can match the underscore after the A like:
^.*?(/(?:[id]|na|fm)([,/]?)(/am/ptweb/|.+=.+,))([^_]*_).*$
See a regex demo
A few notes about the pattern that you tried:
This notation is a character class [i,na,fm,d] which should be a grouping (?:[id]|na|fm)
In this group ([,/]?) you optionally capture either , or / so in theory it could match a string that has /i//am/ptweb/
The last part .*?$ does not have to be non greedy as it is the last part of the pattern
This part [^_]* can also match spaces and newlines

Matching words & partial colon-delimited words within parentheses (excluding parentheses)

I am trying to extract stock symbols from a body of text. These matches usually come in the following forms:
(<symbol>) => (VOO)
(<market>:<symbol>) => (NASDAQ:C)
In the sample cases shown above, I'd like to match VOO and C, skipping everything else. This regex gets me halfway there:
(?<=\()(.*?)(?=\))
With this, I match what's included within the parentheses, but the logic that ignores "noise" like NASDAQ: eludes me. I'd love to learn how to conditionally specify this pattern/logic.
Any ideas? Thanks!
You can use
[A-Z]+(?=\))
See the regex demo.
Details:
[A-Z]+ - one or more uppercase ASCII letters
(?=\)) - a positive lookahead that matches a location that is immediately followed with a ) char.
Alternatively, you can use the following to capture the values into Group 1:
\((?:[^():]*:)?([A-Z]+)\)
See this regex demo. Details:
\( - a ( char
(?:[^():]*:)? - an optional sequence of any zero or more chars other than (, ) and : and then a : char
([A-Z]+) - Group 1: one or more uppercase ASCII letters
\) - a ) char.
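To compare the two patterns, here is a minimal Python sketch using re.findall. The sample sentences are made up for illustration, and the question does not say which regex flavor is used, so Python is an assumption here.

```python
import re

# Variant 1: uppercase letters directly followed by ")".
LOOKAHEAD = re.compile(r"[A-Z]+(?=\))")
# Variant 2: full parenthesized form, optional "market:" prefix, symbol in group 1.
GROUPED = re.compile(r"\((?:[^():]*:)?([A-Z]+)\)")

# Hypothetical sample text, not from the question.
text = "Funds like Vanguard (VOO) and banks like Citigroup (NASDAQ:C) were mentioned."

print(LOOKAHEAD.findall(text))  # ['VOO', 'C']
print(GROUPED.findall(text))    # ['VOO', 'C']

# The lookahead variant does not check the opening parenthesis,
# so it also picks up uppercase words that merely end a parenthetical.
stray = "(see ABC)"
print(LOOKAHEAD.findall(stray))  # ['ABC']
print(GROUPED.findall(stray))    # []
```

If the text can contain parentheticals like the last one, the capturing variant is the stricter choice.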

Regex that checks for the validity of options, disallowing duplicates

I'm trying to find a regex to check for the validity of options that are supplied with a command.
Say that -a, -b and -c are valid options. They may be combined, for example as -ac or -abc. Order doesn't matter, so -ba is also valid.
I thought this regex would do the trick:
^-[abc]{1,3}$
But it has a downside. This regex also accepts duplicates, i.e. -abb.
How do I modify this regex to disallow duplicates?
You may use this regex with a capture group and a negative lookahead:
^-((?!.*\1)[abc]){1,3}$
RegEx Demo
RegEx Details:
^: Start
-: Match a -
(: Start capture group #1
(?!.*\1): Negative lookahead to make sure we don't have repeat of what we have in capture group #1 anywhere in the input
[abc]: Match a or b or c
){1,3}: End capture group #1. Repeat this group 1 to 3 times
$: End
You could list all the alternatives, but if it is a long character class, you can check that on the right side there is no char that is already captured using a capture group and a backreference.
^-(?![abc]*?([abc])[abc]*?\1)[abc]{1,3}$
^ Start of string
- Match a hyphen
(?! Negative lookahead, assert that at the right is not
[abc]*?([abc])[abc]*?\1 Match optional chars a, b or c (as few as possible) and then capture 1 char in group 1. Then match optional chars a, b or c again followed by the same captured char, i.e. a duplicate somewhere to the right
) Close lookahead
[abc]{1,3} Match 1-3 times a b or c
$ End of string
Regex demo
Or a short version using only non whitespace chars, as the character class can only match 3 chars.
^-(?!\S*(\S)\S*\1)[abc]{1,3}$
Regex demo
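A short Python sketch to try the lookahead-based variant on a few option strings. It assumes Python's re module and a made-up helper name is_valid; the question does not say which engine is used. As far as I know, Python rejects the first pattern, ^-((?!.*\1)[abc]){1,3}$, at compile time because \1 refers to a group that is still open, so the sketch uses the second one.

```python
import re

# Negative lookahead rejects any input where some option letter appears twice.
VALID_OPTS = re.compile(r"^-(?![abc]*?([abc])[abc]*?\1)[abc]{1,3}$")


def is_valid(opts):
    """Return True when opts is '-' followed by 1-3 distinct letters from a, b, c."""
    return VALID_OPTS.match(opts) is not None


for candidate in ["-a", "-ac", "-abc", "-ba", "-abb", "-aba", "-abcd", "-d", "-"]:
    print(candidate, is_valid(candidate))
```

The first four candidates are accepted; the ones with duplicates, too many letters, an unknown letter or no letter at all are rejected.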

How do I make this regular expression not match anything after forward slash /

I have this regular expression:
/^www\.example\.(com|co(\.(in|uk))?|net|us|me)\/?(.*)?[^\/]$/g
It matches:
www.example.com/example1/something
But doesn't match
www.example.com/example1/something/
But the problem is that it also matches the following, which I do not want:
www.example.com/example1/something/otherstuff
I just want it to stop when a slash is encountered after "something". If there is no slash after "something", it should continue matching any character except line breaks.
I am a new learner for regex. So, I get confused easily with those characters
You may use this regex:
^www\.example\.(?:com|co(?:\.(?:in|uk))?|net|us|me)(?:\/[^\/]+){2}$
RegEx Demo
This will match the following URL:
www.example.co.uk/example1/something
You can use
^www\.example\.(?:com|co(?:\.(?:in|uk))?|net|us|me)\/([^\/]+)\/([^\/]+)$
See the regex demo
The (.*)? part in your pattern matches any zero or more chars, so it won't stop even after encountering two slashes. The \/([^\/]+)\/([^\/]+) part in the new pattern will match two parts after slash, and capture each part into a separate group (in case you need to access those values).
Details:
^ - start of string
www\.example\. - www.example. string
(?:com|co(?:\.(?:in|uk))?|net|us|me) - com, co.in, co.uk, co, net, us, me strings
\/ - a / char
([^\/]+) - Group 1: one or more chars other than /
\/ - a / char
([^\/]+) - Group 2: one or more chars other than /
$ - end of string.
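For reference, a minimal Python sketch of the capturing pattern applied to the URLs from the question. The original regex is written as a JavaScript-style literal, so using Python's re module here is an assumption made for illustration, as is the helper name split_path; / needs no escaping in Python.

```python
import re

# Exactly two non-empty path segments after the domain, each captured.
URL = re.compile(
    r"^www\.example\.(?:com|co(?:\.(?:in|uk))?|net|us|me)/([^/]+)/([^/]+)$"
)


def split_path(url):
    """Return the two path segments as a tuple, or None if the URL is rejected."""
    m = URL.match(url)
    return m.groups() if m else None


for u in [
    "www.example.com/example1/something",
    "www.example.co.uk/example1/something",
    "www.example.com/example1/something/",
    "www.example.com/example1/something/otherstuff",
]:
    print(u, split_path(u))
```

The first two URLs yield ('example1', 'something'); the ones with a trailing slash or a third segment yield None.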