Optional groups in regex


I am trying to write a regex for these cases:
1) Any title
2) Any title (2016)
3) Any title (2016 Any text)
I need the text before the parentheses (title) and the year inside them (year). Expected results for the cases:
1)
Title: Any title
Year: null
2)
Title: Any title
Year: 2016
3)
Title: Any title
Year: 2016
I wrote this regex:
(.*)(?:\s(\((\d+)(?:\D*)?)\))?
But it doesn't work.

You could go for:
^ # start of the string
(?P<title>[^\n()]+) # anything not a (, ) or newline -> "title"
(?: # non-capturing group
\( # ( literally
(?P<year>\d{4}) # four digits -> "year"
[^)]*\) # anything not ), followed by )
)? # make the whole group optional
See a demo on regex101.com.
If named captured groups are not supported (you did not specify any programming language), just omit the ?P<> part and use the group numbers 1 and 2 instead.
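As a minimal sketch, assuming Python (the question does not name a language), the pattern above can be compiled in verbose mode so the comments stay next to the pattern. The names TITLE_YEAR and parse are hypothetical. Note that the title group also captures the space before the opening parenthesis, so the sketch strips it.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments are allowed.
TITLE_YEAR = re.compile(
    r"""
    ^                       # start of the string
    (?P<title>[^\n()]+)     # anything not a (, ) or newline -> "title"
    (?:                     # non-capturing group
        \(                  # ( literally
        (?P<year>\d{4})     # four digits -> "year"
        [^)]*\)             # anything not ), followed by )
    )?                      # make the whole group optional
    """,
    re.VERBOSE,
)


def parse(line):
    """Return (title, year); year is None when no "(YYYY ...)" part follows."""
    m = TITLE_YEAR.match(line)
    if m is None:
        return None
    # The title group includes the space before "(", so strip it.
    return m.group("title").strip(), m.group("year")


for text in ["Any title", "Any title (2016)", "Any title (2016 Any text)"]:
    print(parse(text))
```

An unmatched optional named group comes back as None in Python, which corresponds to the "Year: null" case in the question.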

Related

Using regex to parse data between delimiter and ending at a specified substring

I'm trying to parse out the names from a bunch of semi-unpredictable strings. More specifically, I'm using ruby, but I don't think that should matter much. This is a contrived example but some example strings are:
Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars
The regex I've come up with is
([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i
but in the case of "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I'm receiving Chicago Bears TUNE as the second match. I'm trying to remove "tune in" so it's in its own group.
I thought that by adding (?:[ -:]*tune)? it would separate the ending portion of the expression the same way that having vs in the middle was able to, but that doesn't seem to be the case. If I remove the ? at the end, it matches correctly for the above example, but it no longer matches for Eagles vs Bears.
If anyone could help me, I would greatly appreciate it if you could break down your regex piece by piece.

You can capture the second group up to a -, : or tune preceded with zero or more whitespaces or till end of the line while making the second group pattern lazy:
([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))
See the regex demo.
Details:
([\w .]*) - Group 1: zero or more word, space or . chars as many as possible
vs - a vs string
([\w .]*?) - Group 2: zero or more word, space or . chars as few as possible
(?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
\s* - zero or more whitespaces
(?:[:-]|tune|$) - : or -, tune or end of a line.
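A rough Python sketch of the lookahead approach above, tried on the example strings from the question. The question uses Ruby; Python is used here only for illustration, and TEAMS and teams are hypothetical names. The case-insensitive flag is needed because the inputs contain both "vs" and "VS", and "TUNE" in upper case. Group 1 can pick up a leading space (for instance after "NFL Matchup:"), so both groups are stripped.

```python
import re

# Group 2 is lazy and stops right before optional whitespace followed by
# ":" or "-", the word "tune", or the end of the line.
TEAMS = re.compile(r"([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))", re.IGNORECASE)


def teams(text):
    """Return the two team names, or None when there is no ' vs ' in the text."""
    m = TEAMS.search(text)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()


examples = [
    "Eagles vs Bears",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
    "Philadelphia Eagles vs Chicago Bears - NFL Match",
    "Phil.Eagles vs Chic.Bears",
    "3agles vs B3ars",
]
for s in examples:
    print(teams(s))
```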

You can use the following regular expression which I have expressed in free-spacing mode to make it self-documenting (search for "Free-Spacing Mode" at the link).
rgx = /
(?: |\A) # match space or beginning of string
(?<team1> # begin capture group team1
(?<team> # begin capture group team
(?<word> # begin capture group word
(?:\p{Lu}|\d) # word begins with an uppercase letter or digit
(?:\p{Ll}|\d)+ # ...followed by 1+ lowercase letters or digits
) # end capture group word
(?: # begin non-capture group
[ .] # match a space or period
\g<word> # match another word
)* # end non-capture group and execute 0+ times
) # end capture group team
) # end capture group team1
[ ]+ # match one or more spaces
(?:VS|vs) # match literal
[ ]+ # match one or more spaces
(?<team2> # begin capture group team2
\g<team> # match the second team name
) # end capture group team2
(?: # begin non-capture group
[ ] # match a space
(?: # begin non-capture group
(?:-[ ])? # optionally match literal
TUNE[ ]IN # match literal
| # or
-[ ]NFL[ ]Match # match literal
) # end inner non-capture group
)? # end outer non-capture group and make it optional
\z # match end of string
/x # free-spacing regex definition mode
examples = [
"Eagles vs Bears",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
"Philadelphia Eagles vs Chicago Bears - NFL Match",
"Phil.Eagles vs Chic.Bears",
"3agles vs B3ars"
]
examples.map do |s|
m = s.match(rgx)
[m[:team1], m[:team2]]
end
#=> [["Eagles", "Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Phil.Eagles", "Chic.Bears"],
# ["3agles", "B3ars"]]
See Regexp#match and MatchData#[].
Note that \g<word> and \g<team> effectively copy the code contained in the capture groups word and team, respectively. These are called "Subexpression Calls". For additional information search for that term at Regexp. There are two advantages to using subexpression calls: less code is needed and the opportunities for coding errors are reduced.

A regex capturing multiple groups starting with a pattern

I am trying to figure out a regex that would capture multiple groups in a string where each group is defined as follows:
The group's title starts with ${{
An optional string may follow
The group's title ends with }}
Optional content may follow the title
An example would be
'${{an optional title}} some optional content'
Here are some examples of inputs and expected results
Input 1: '${{}} some text '
Result 1: ['${{}} some text ']
Input 2: '${{title1}} some text1 ${{title 2}} some text2'
Result 2: ['${{title1}} some text1 ', '${{title 2}} some text2']
Input 3 (no third group as the second ending curly bracket is missing)
'${{title1}} some text1 ${{}} some text2 ${{title2} some text3'
Result 3 ['${{title1}} some text1 ', '${{}} some text2 ${{title2} some text3']
Input 4 (a group with empty content immediately followed by another group)
'${{title1}}${{}} some text2'
Result 4 ['${{title1}}', '${{}} some text2']
Any suggestions will be appreciated!

You can achieve that with Lookaheads. Try the following pattern:
\$\{\{.*?\}\}.*?(?=\$\{\{.*?\}\}|$)
Demo.
Breakdown:
\$\{\{.*?\}\} # Matches a "group" (i.e., "${{}}") containing zero or more chars (lazy).
.*? # Matches zero or more characters after the "group" (lazy).
(?= # Start of a positive Lookahead.
\$\{\{.*?\}\} # Ensure that the match is either followed by a "group"...
| # Or...
$ # ..is at the end of the string.
) # Close the Lookahead.
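To check the pattern against the four inputs from the question, here is a short sketch. It assumes Python, since the question does not state a language, and GROUPS and split_groups are hypothetical names.

```python
import re

# A "${{...}}" title, then lazily any text up to the next complete
# "${{...}}" title or the end of the string.
GROUPS = re.compile(r"\$\{\{.*?\}\}.*?(?=\$\{\{.*?\}\}|$)")


def split_groups(text):
    """Return every group (title plus trailing content) found in text."""
    return GROUPS.findall(text)


print(split_groups("${{title1}} some text1 ${{title 2}} some text2"))
print(split_groups("${{title1}}${{}} some text2"))
```

Because . does not match a newline by default, this sketch only covers single-line input; multi-line content would need a flag such as re.DOTALL, which the original answer does not discuss.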

Match movie filenames with optional parts with regex

I have a film title in the following format
(Studio Name) - Film Title Part-1 - Animation (2014).mp4
The " - Animation" part is optional, meaning I can have a title such as this
(Studio Name) - Film Title Part-1 (2014).mp4
With this regex
^\((?P<studio>.+)\) - (?P<title>.+)(?P<genre>-.+)\((?P<year>\d{4})\)
I get the following results
studio = Studio Name
title = Film Title Part-1
genre = - Animation
year = 2014
I have tried the following to make the "- Animation" optional by changing the regex to
^\((?P<studio>.+)\) - (?P<title>.+)(?:(?P<genre>-.+)?)\((?P<year>\d{4})\)
but I end up with the following results
studio = Studio Name
title = Film Title Part-1 - Animation
genre =
year = 2014
I am using Python, the code that I am executing to process the regex is
pattern = re.compile(REGEX)
matched = pattern.search(film)

You can omit the non-capturing group around the genre, change the first .+ to a negated character class [^()]+ matching any char except parentheses, and make the .+ in group title non-greedy to allow matching the optional genre group.
For the genre, you could match .+, or make the match more specific if you only want to match a single word.
^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)
Regex demo
Explanation
^ Start of string
\((?P<studio>[^()]+)\) Named group studio match any char except parenthesis between ( and )
- Match literally
(?P<title>.+?) Named group title, match any char except a newline as few times as possible
(?P<genre>- \w+ )? Named group genre, match - space, 1+ word chars and space
\((?P<year>\d{4})\) named group year, match 4 digits between ( and )
If you want to match the whole line:
^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)\.mp4$
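Plugged into the Python code from the question, the first pattern could be used roughly as follows. FILM and parse_film are hypothetical names. The lazy title group keeps the space that precedes the genre or the year, and the genre group keeps its leading "- ", so the sketch trims both.

```python
import re

FILM = re.compile(
    r"^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)"
)


def parse_film(name):
    """Return the named parts of a film filename, or None if it does not match."""
    m = FILM.search(name)
    if m is None:
        return None
    genre = m.group("genre")  # None when the optional part is absent
    return {
        "studio": m.group("studio"),
        "title": m.group("title").strip(),
        "genre": genre.strip("- ") if genre else None,
        "year": m.group("year"),
    }


print(parse_film("(Studio Name) - Film Title Part-1 - Animation (2014).mp4"))
print(parse_film("(Studio Name) - Film Title Part-1 (2014).mp4"))
```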

regex optional word

I am trying to find a regex that will match each of the following cases from a set of ldap objectclass definitions - they're just strings really.
The variations in the syntax are tripping my regex up and I don't seem to be able to find a balance between the greedy nature of the match and the optional word "MAY".
( class1-OID NAME 'class1' SUP top STRUCTURAL MUST description MAY ( brand $ details $ role ) )
DESIRED OUTPUT: description
ACTUAL GROUP1: description
ACTUAL GROUP1 with ? on the MAY group: description MAY
( class2-OID NAME 'class2' SUP top STRUCTURAL MUST groupname MAY description )
DESIRED OUTPUT: groupname
ACTUAL GROUP1: groupname
ACTUAL GROUP1 with ? on the MAY group: groupname MAY description
( class3-OID NAME 'class3' SUP top STRUCTURAL MUST ( code $ name ) )
DESIRED OUTPUT: code $ name
ACTUAL GROUP1: no match
ACTUAL GROUP1 with ? on the MAY group: code $ name
( class4-OID NAME 'class4' SUP top STRUCTURAL MUST ( code $ name ) MAY ( group $ description ) )
DESIRED OUTPUT: code $ name
ACTUAL GROUP1: code $ name
ACTUAL GROUP1 with ? on the MAY group: code $ name
Using this:
MUST \(?([\w\$\-\s]+)\)?\s*(?:MAY) (Regex101)
matches lines 1, 2 and 4, but doesn't match the 3rd one with no MAY statement.
Adding an optional "?" to the MAY group results in a good match for 3 and 4, but then the 1st and 2nd lines act greedily and run on into MAY (line 1) or the remainder of the string (line 2).
It feels like I need the regex to consider MAY as optional but also that if MAY is found it should stop - I don't seem to be able to find that balance.

If you can use a regex with two capturing groups you may use
MUST\s+(?:\(([^()]+)\)|(\S+))\s*(?:MAY)?
See the regex demo
Details
MUST - a word MUST
\s+ - 1+ whitespaces
(?:\(([^()]+)\)|(\S+)) - two alternatives:
\( - (
([^()]+) - Group 1: 1+ chars other than ( and )
\) - a ) char
| - or
(\S+) - Group 2: one or more non-whitespace chars
\s* - 0+ whitespaces
(?:MAY)? - an optional word MAY
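Since only one of the two capturing groups participates in any given match, the caller has to pick whichever one is set. A minimal sketch, assuming Python (the question does not name a language) and using the hypothetical names MUST_ATTRS and must_attributes:

```python
import re

MUST_ATTRS = re.compile(r"MUST\s+(?:\(([^()]+)\)|(\S+))\s*(?:MAY)?")


def must_attributes(definition):
    """Return the MUST attribute text, whether parenthesised or a single word."""
    m = MUST_ATTRS.search(definition)
    if m is None:
        return None
    # Group 1 is set for "MUST ( a $ b )", group 2 for "MUST word".
    value = m.group(1) if m.group(1) is not None else m.group(2)
    return value.strip()


lines = [
    "( class1-OID NAME 'class1' SUP top STRUCTURAL MUST description MAY ( brand $ details $ role ) )",
    "( class2-OID NAME 'class2' SUP top STRUCTURAL MUST groupname MAY description )",
    "( class3-OID NAME 'class3' SUP top STRUCTURAL MUST ( code $ name ) )",
    "( class4-OID NAME 'class4' SUP top STRUCTURAL MUST ( code $ name ) MAY ( group $ description ) )",
]
for line in lines:
    print(must_attributes(line))
```

The strip() call removes the spaces just inside the parentheses that Group 1 would otherwise include.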

js regex replace match in certain paragraph

---
title: test
date: 2018/10/17
description: some thing
---
I want to replace what's behind date if it's between ---, in this case 2018/10/17. How to do that with regex in JS?
So far I've tried:
/(?<=---\n)[\s\S]*date.+(?=\n)/
but it only works when date is the first line after ---

It is possible though imo not advisable:
(^---)((?:(?!^---)[\s\S])+?^date:\s*)(.+)((?:(?!^---)[\s\S])+?)(^---)
This needs to be replaced by $1$2substitution$4$5, see a demo on regex101.com.
Broken down this reads
(^---) # capture --- -> group 1
(
(?:(?!^---)[\s\S])+? # capture anything not --- up to date:
^date:\s*
)
(.+) # capture anything after date
(
(?:(?!^---)[\s\S])+?) # same pattern as above
(^---) # capture the "closing block"
Please consider using the afore-mentioned two-step approach as this regex is not going to be readable in a couple of weeks (and the JS engine does not support a verbose mode).
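For comparison only, here is how the same five-group pattern might look in Python, whose re module does offer a verbose mode. This is an illustration rather than a JS solution, and FRONT_MATTER_DATE and replace_date are hypothetical names. The multiline flag is required so that ^ matches at the start of each line.

```python
import re

FRONT_MATTER_DATE = re.compile(
    r"""
    (^---)                   # group 1: the opening ---
    ((?:(?!^---)[\s\S])+?    # group 2: anything that is not a --- line ...
     ^date:\s*)              #          ... up to and including the date key
    (.+)                     # group 3: the current date value
    ((?:(?!^---)[\s\S])+?)   # group 4: same pattern, up to the closing ---
    (^---)                   # group 5: the closing ---
    """,
    re.MULTILINE | re.VERBOSE,
)


def replace_date(text, new_date):
    """Swap the date value in the first ----delimited block; others are untouched."""
    return FRONT_MATTER_DATE.sub(
        lambda m: m.group(1) + m.group(2) + new_date + m.group(4) + m.group(5),
        text,
        count=1,
    )


doc = "---\ntitle: test\ndate: 2018/10/17\ndescription: some thing\n---"
print(replace_date(doc, "2020/01/01"))
```

A function is used as the replacement so that the new value is inserted literally, without any backslash or group-reference interpretation.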

Without using a positive lookbehind, you could use 2 capturing groups and use those in the replacement like $1replacement$2
(^---[\s\S]+?date: )\d{4}\/\d{2}\/\d{2}([\s\S]+?^---)
Regex demo
Explanation
( Capturing group
^---[\s\S]+?date: Match --- at the start of a line, then any character 1+ times (non-greedy), then date: followed by a space
) Close first capturing group
\d{4}\/\d{2}\/\d{2} Match a date like pattern (Note that this does not validate a date itself)
( Capturing group
[\s\S]+?^--- Match any character 1+ times (non-greedy), then assert the start of a line and match ---
) Close capturing group
const regex = /(^---[\s\S]+?date: )\d{4}\/\d{2}\/\d{2}([\s\S]+?^---)/gm;
const str = `---
title: test
date: 2018/10/17
description: some thing
---`;
const subst = `$1replacement$2`;
const result = str.replace(regex, subst);
console.log(result);

I'm not sure Javascript supports look behind at all, but if your environment supports it, you can try this regex:
/(?<=---[\s\S]+)(?<=date: )[\d/]+(?=[\s\S]+---)/
It looks behind for '---' followed by anything, then it looks behind for 'date: ' before it matches digits or slash one or more times, followed by a look ahead for anything followed by '---'.
Now you can easily replace the match with a new date.
