Matching and Managing Optional Groups in Regex Python - python-2.7

I am trying to match the following string
(Studio) - Film (Year) - Segment Number
The string must follow this order: Studio, then Film, then an optional Year, and finally an optional Segment + Number.
Studio and Film must be present
Year can be optional
Segment type (case insensitive), if present must be followed by a number ranging from 1 through 8
The types of segments are those listed in the alternation of the regex below (cd, disc, disk, dvd, part, pt, scene).
The regex should match the following strings
(studio) - film (1994) - CD 3
(studio) - film (1994)
(studio) - film - CD 3
I have tried the following Regex
\((?P<STUDIO>.+)\) - (?P<TITLE>.+) \((?P<YEAR>\d{4})\)?( - (?P<SEGMENT>(?i)(cd|disc|disk|dvd|part|pt|scene) \b[1-8]\b))?
Which gives the following for the string (studio) - film (1994) - cd 3
Named groups
STUDIO studio
TITLE film
SEGMENT cd 3
YEAR 1994
and for (studio) - film (1994)
Named groups
STUDIO studio
TITLE film
SEGMENT None
YEAR 1994
So it works as the segment is optional.
However, when I make the year optional by using the following regex:
\((?P<STUDIO>.+)\) - (?P<TITLE>.+)?( \((?P<YEAR>\d{4})\))??( - (?P<SEGMENT>(?i)(cd|disc|disk|dvd|part|pt|scene) \b[1-8]\b))?
I end up with this result:
Named groups
STUDIO studio
TITLE film (1994)
SEGMENT None
YEAR None
And if I remove the year altogether, so the string is like (studio) - film - cd 3, I get the following:
Named groups
STUDIO studio
TITLE film - cd 3
SEGMENT None
YEAR None
What I need is:
Named groups
STUDIO studio
TITLE film
SEGMENT cd 3 or None
YEAR 1994 or None

You might write the pattern as
(?i)^\((?P<STUDIO>[^()]*)\) - (?P<TITLE>.+?)?(?: \((?P<YEAR>\d{4})\))?(?: - (?P<SEGMENT>cd|disc|disk|dvd|part|pt|scene) [1-8])?$
^ Start of string
\( Match (
(?P<STUDIO>[^()]*) Group STUDIO Match chars other than ( and )
\) - Match ) -
(?P<TITLE>.+?)? Optional group TITLE Match 1+ chars, as few as possible
(?: Non capture group
 \((?P<YEAR>\d{4})\) Match a space, then ( followed by 4 digits in group YEAR and )
)? Close non capture group and make it optional
(?: Non capture group
 -  Match " - " (space, hyphen, space) literally
(?P<SEGMENT>cd|disc|disk|dvd|part|pt|scene) Group SEGMENT Match any of the alternatives (the type only; the number that follows is matched outside the group)
 [1-8] Match a space and a digit 1-8
)? Close non capture group and make it optional
$ End of string
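As a quick check, the pattern above can be run against the three sample strings from the question. This is a minimal sketch for Python 3; the `parse` helper is a hypothetical name introduced here, and the inline `(?i)` is expressed as the `re.IGNORECASE` flag instead.

```python
import re

# The answer's pattern, with (?i) replaced by the re.IGNORECASE flag.
PATTERN = re.compile(
    r"^\((?P<STUDIO>[^()]*)\) - (?P<TITLE>.+?)?"
    r"(?: \((?P<YEAR>\d{4})\))?"
    r"(?: - (?P<SEGMENT>cd|disc|disk|dvd|part|pt|scene) [1-8])?$",
    re.IGNORECASE,
)


def parse(name):
    """Return the named groups as a dict, or None if the string does not match."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None


for sample in (
    "(studio) - film (1994) - CD 3",
    "(studio) - film (1994)",
    "(studio) - film - CD 3",
):
    print(parse(sample))
```

Note that, as written, SEGMENT holds only the segment type (for example CD). If the number is wanted in the same group, the ` [1-8]` part would have to be moved inside it.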

Related

Using regex to parse data between delimiter and ending at a specified substring

I'm trying to parse out the names from a bunch of semi-unpredictable strings. More specifically, I'm using ruby, but I don't think that should matter much. This is a contrived example but some example strings are:
Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars
The regex I've come up with is
/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i
but in the case of "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I'm receiving Chicago Bears TUNE as the second match. I'm trying to remove "tune in" so it's in its own group.
I thought that by adding (?:[ -:]*tune)? it would separate the ending portion of the expression the same way that having vs in the middle was able to, but that doesn't seem to be the case. If I remove the ? at the end, it matches correctly for the above example, but it no longer matches for Eagles vs Bears
If anyone could help me, I would greatly appreciate it if you could break down your regex piece by piece.
You can make the second group lazy and stop it with a lookahead, so that it captures only up to a -, a : or the word tune (each optionally preceded by whitespace), or up to the end of the line. Keep the case-insensitive flag from the question:
([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))
See the regex demo.
Details:
([\w .]*) - Group 1: zero or more word, space or . chars as many as possible
vs - a vs string
([\w .]*?) - Group 2: zero or more word, space or . chars as few as possible
(?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
\s* - zero or more whitespaces
(?:[:-]|tune|$) - : or -, tune or end of a line.
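The question uses Ruby, but the same pattern can be tried in Python as a rough sketch. The `teams` helper is a hypothetical name; the `.strip()` calls are an addition, because group 1 can start with the space that follows "Matchup:".

```python
import re

# Lazy second group, terminated by a lookahead for "-", ":", "tune" or end of line.
TEAMS = re.compile(r"([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))", re.IGNORECASE)


def teams(text):
    """Return (team1, team2) with surrounding whitespace trimmed, or None."""
    m = TEAMS.search(text)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()


examples = [
    "Eagles vs Bears",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
    "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
    "Philadelphia Eagles vs Chicago Bears - NFL Match",
    "Phil.Eagles vs Chic.Bears",
    "3agles vs B3ars",
]
for text in examples:
    print(teams(text))
```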
You can use the following regular expression which I have expressed in free-spacing mode to make it self-documenting (search for "Free-Spacing Mode" at the link).
rgx = /
(?: |\A) # match space or beginning of string
(?<team1> # begin capture group team1
(?<team> # begin capture group team
(?<word> # begin capture group word
(?:\p{Lu}|\d) # word begins with an uppercase letter or digit
(?:\p{Ll}|\d)+ # ...followed by 1+ lowercase letters or digits
) # end capture group word
(?: # begin non-capture group
[ .] # match a space or period
\g<word> # match another word
)* # end non-capture group and execute 0+ times
) # end capture group team
) # end capture group team1
[ ]+ # match one or more spaces
(?:VS|vs) # match literal
[ ]+ # match one or more spaces
(?<team2> # begin capture group team2
\g<team> # match the second team name
) # end capture group team2
(?: # begin non-capture group
[ ] # match a space
(?: # begin non-capture group
(?:-[ ])? # optionally match literal
TUNE[ ]IN # match literal
| # or
-[ ]NFL[ ]Match # match literal
) # end inner non-capture group
)? # end outer non-capture group and make it optional
\z # match end of string
/x # free-spacing regex definition mode
examples = [
"Eagles vs Bears",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
"Philadelphia Eagles vs Chicago Bears - NFL Match",
"Phil.Eagles vs Chic.Bears",
"3agles vs B3ars"
]
examples.map do |s|
m = s.match(rgx)
[m[:team1], m[:team2]]
end
#=> [["Eagles", "Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Phil.Eagles", "Chic.Bears"],
# ["3agles", "B3ars"]]
See Regexp#match and MatchData#[].
Note that \g<word> and \g<team> effectively copy the code contained in the capture groups word and team, respectively. These are called "Subexpression Calls". For additional information search for that term at Regexp. There are two advantages to using subexpression calls: less code is needed and the opportunities for coding errors are reduced.
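Python's standard `re` module has no subexpression calls, but a similar reuse can be approximated by building the pattern from string fragments. The sketch below is an adaptation, not a translation of the Ruby answer: `WORD`, `TEAM`, `TEAMS` and `split_teams` are hypothetical names, and the Unicode classes \p{Lu} and \p{Ll} are simplified to ASCII `[A-Z]` and `[a-z]`.

```python
import re

# Reusable fragments standing in for Ruby's \g<word> and \g<team> calls.
WORD = r"[A-Z\d][a-z\d]+"           # uppercase letter or digit, then 1+ lowercase letters or digits
TEAM = rf"{WORD}(?:[ .]{WORD})*"    # one word, then 0+ further words joined by a space or period

TEAMS = re.compile(
    rf"""
    (?:[ ]|\A)                      # a space or the start of the string
    (?P<team1>{TEAM})
    [ ]+(?:VS|vs)[ ]+
    (?P<team2>{TEAM})
    (?:[ ](?:(?:-[ ])?TUNE[ ]IN|-[ ]NFL[ ]Match))?   # optional trailer
    \Z                              # end of the string
    """,
    re.VERBOSE,
)


def split_teams(text):
    """Return (team1, team2), or None when the text does not match."""
    m = TEAMS.search(text)
    return (m.group("team1"), m.group("team2")) if m else None


print(split_teams("NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN"))
```

Composing the fragments with f-strings keeps each sub-pattern defined once, which is the same benefit the subexpression calls give in Ruby.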

Splunk - regex extract fields from source

I am trying to extract the job name and region from a Splunk source using regex.
Below is the format of my sample source:
/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log
With the regex below, I am able to extract the job name:
(?<logdir>\/[\W\w]+\/[\W\w]+\/)(?<date>[^\/]+)\/job_(?<jobname>.+)_\d+
Here is the match so far :
Full match 0-53 /home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414
Group `logdir` 0-19 /home/app/abc/logs/
Group `date` 19-27 20200817
Group `jobname` 32-47 DAILY_HR_REPORT
I also need USA (the region) from the source. Can you please suggest how to get it?
The region will always appear after the number field (44414), which can vary in number of digits.
Ex: 123, 1234, 56789
Thank you in advance.
You could make the pattern a bit more specific about what you would allow to match as [\W\w]+ and .+ will cause more backtracking to fit the rest of the pattern.
Then for the region you can add a named group at the end (?<region>[^\W_]+) matching one or more times any word character except an underscore.
(?<logdir>\/(?:[^\/]+\/)*)(?<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))\/job_(?<jobname>\w+)_\d+_(?<region>[^\W_]+)_log
In parts
(?<logdir> Group logdir
\/(?:[^\/]+\/)* match / and optionally repeat any char except / followed by matching the / again
) Close group
(?<date> Group date
(?:19|20)\d{2} Match a year starting with 19 or 20
(?:0?[1-9]|1[012]) Match a month
(?:0[1-9]|[12]\d|3[01]) Match a day
) Close group
\/job_ Match /job_
(?<jobname>\w+) Group jobname, match 1+ word chars
_\d+_ Match 1+ digits between underscores
(?<region>[^\W_]+) Group region Match 1+ occurrences of a word char except _
_log Match literally
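To try the pattern outside Splunk, here is a small Python sketch. It assumes Python's `re` syntax, which needs `(?P<name>...)` instead of `(?<name>...)`, and drops the escaping of `/`, which Python does not require. `SOURCE` and `fields` are names made up for this example.

```python
import re

# Same pattern as above, rewritten with Python's (?P<name>...) group syntax.
SOURCE = re.compile(
    r"(?P<logdir>/(?:[^/]+/)*)"
    r"(?P<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))"
    r"/job_(?P<jobname>\w+)_\d+_(?P<region>[^\W_]+)_log"
)

sample = "/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log"
fields = SOURCE.search(sample).groupdict()
print(fields)
```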

Match movie filenames with optional parts with regex

I have a film title in the following format
(Studio Name) - Film Title Part-1** - Animation** (2014).mp4
The part in bold (between the ** markers) is optional, meaning I can have a title such as this
(Studio Name) - Film Title Part-1 (2014).mp4
With this regex
^\((?P<studio>.+)\) - (?P<title>.+)(?P<genre>-.+)\((?P<year>\d{4})\)
I get the following results
studio = Studio Name
title = Film Title Part-1
genre = - Animation
year = 2014
I have tried the following to make the "- Animation" optional by changing the regex to
^\((?P<studio>.+)\) - (?P<title>.+)(?:(?P<genre>-.+)?)\((?P<year>\d{4})\)
but I end up with the following results
studio = Studio Name
title = Film Title Part-1 - Animation
genre =
year = 2014
I am using Python, the code that I am executing to process the regex is
pattern = re.compile(REGEX)
matched = pattern.search(film)
You can omit the non capturing group around the genre, change the first .+ to a negated character class [^()]+ matching any char except parentheses, and make the .+ in group title non greedy to allow matching the optional genre group.
For the genre, you could match .+, or make the match more specific if you only want to match a single word.
^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)
Explanation
^ Start of string
\((?P<studio>[^()]+)\) Named group studio match any char except parenthesis between ( and )
 -  Match " - " (space, hyphen, space) literally
(?P<title>.+?) Named group title, match any char except a newline, as few as possible
(?P<genre>- \w+ )? Named group genre, match - space, 1+ word chars and space
\((?P<year>\d{4})\) named group year, match 4 digits between ( and )
If you want to match the whole line:
^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)\.mp4$
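A short sketch of how the first pattern might be used with the code from the question. The `parse_film` helper is hypothetical; it also trims the trailing space that the lazy title group picks up and the "- " prefix of the genre, which is an addition on top of the answer.

```python
import re

FILM = re.compile(
    r"^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)"
)


def parse_film(name):
    """Return studio, title, genre and year, or None when the name does not match."""
    m = FILM.search(name)
    if m is None:
        return None
    genre = m.group("genre")
    return {
        "studio": m.group("studio"),
        "title": m.group("title").strip(),               # drop the trailing space
        "genre": genre.strip("- ") if genre else None,   # "- Animation " -> "Animation"
        "year": m.group("year"),
    }


print(parse_film("(Studio Name) - Film Title Part-1 - Animation (2014).mp4"))
print(parse_film("(Studio Name) - Film Title Part-1 (2014).mp4"))
```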

regex optional word

I am trying to find a regex that will match each of the following cases from a set of ldap objectclass definitions - they're just strings really.
The variations in the syntax are tripping my regex up and I don't seem to be able to find a balance between the greedy nature of the match and the optional word "MAY".
( class1-OID NAME 'class1' SUP top STRUCTURAL MUST description MAY ( brand $ details $ role ) )
DESIRED OUTPUT: description
ACTUAL GROUP1: description
ACTUAL GROUP1 with ? on the MAY group: description MAY
( class2-OID NAME 'class2' SUP top STRUCTURAL MUST groupname MAY description )
DESIRED OUTPUT: groupname
ACTUAL GROUP1: groupname
ACTUAL GROUP1 with ? on the MAY group: groupname MAY description
( class3-OID NAME 'class3' SUP top STRUCTURAL MUST ( code $ name ) )
DESIRED OUTPUT: code $ name
ACTUAL GROUP1: no match
ACTUAL GROUP1 with ? on the MAY group: code $ name
( class4-OID NAME 'class4' SUP top STRUCTURAL MUST ( code $ name ) MAY ( group $ description ) )
DESIRED OUTPUT: code $ name
ACTUAL GROUP1: code $ name
ACTUAL GROUP1 with ? on the MAY group: code $ name
Using this:
MUST \(?([\w\$\-\s]+)\)?\s*(?:MAY) (Regex101)
matches lines 1, 2 and 4, but doesn't match the 3rd one with no MAY statement.
Adding an optional "?" to the MAY group results in a good match for 3 and 4, but then the 1st and 2nd lines act greedily and run on into MAY (line 1) or the remainder of the string (line 2).
It feels like I need the regex to consider MAY as optional but also that if MAY is found it should stop - I don't seem to be able to find that balance.
If you can use a regex with two capturing groups you may use
MUST\s+(?:\(([^()]+)\)|(\S+))\s*(?:MAY)?
See the regex demo
Details
MUST - a word MUST
\s+ - 1+ whitespaces
(?:\(([^()]+)\)|(\S+)) - two alternatives:
\( - (
([^()]+) - Group 1: 1+ chars other than ( and )
\) - a ) char
| - or
(\S+) - Group 2: one or more non-whitespace chars
\s* - 0+ whitespaces
(?:MAY)? - an optional word MAY
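Because the value lands in either group 1 or group 2, the calling code has to pick whichever one matched. A minimal Python sketch, with `must_attrs` as a hypothetical helper name and the whitespace trimming added for convenience:

```python
import re

MUST = re.compile(r"MUST\s+(?:\(([^()]+)\)|(\S+))\s*(?:MAY)?")


def must_attrs(definition):
    """Return the MUST attribute list from an objectclass definition, or None."""
    m = MUST.search(definition)
    if m is None:
        return None
    # Group 1 is set for the parenthesised form, group 2 for a single word.
    return (m.group(1) or m.group(2)).strip()


lines = [
    "( class1-OID NAME 'class1' SUP top STRUCTURAL MUST description MAY ( brand $ details $ role ) )",
    "( class2-OID NAME 'class2' SUP top STRUCTURAL MUST groupname MAY description )",
    "( class3-OID NAME 'class3' SUP top STRUCTURAL MUST ( code $ name ) )",
    "( class4-OID NAME 'class4' SUP top STRUCTURAL MUST ( code $ name ) MAY ( group $ description ) )",
]
for line in lines:
    print(must_attrs(line))
```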

Optional groups in regex

I am trying to write a regex for these cases:
1) Any title
2) Any title (2016)
3) Any title (2016 Any text)
I need the text before the parenthesis(title) and the year inside(year). Example for the cases:
1)
Title: Any title
Year: null
2)
Title: Any title
Year: 2016
3)
Title: Any title
Year: 2016
I do this regex:
(.*)(?:\s(\((\d+)(?:\D*)?)\))?
But it doesn't work.
You could go for:
^ # start of the string
(?P<title>[^\n()]+) # anything not a (, ) or newline -> "title"
(?: # non-capturing group
\( # ( literally
(?P<year>\d{4}) # four digits -> "year"
[^)]*\) # anything not ), followed by )
)? # make the whole group optional
See a demo on regex101.com.
If named capture groups are not supported (you did not specify any programming language), just omit the ?P<name> part and use the group numbers 1 and 2 instead.
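Since the question names no language, here is one possible way to run the pattern, sketched in Python with `re.VERBOSE`. The `title_and_year` helper is hypothetical, and the `.strip()` is an addition that removes the space left before the opening parenthesis.

```python
import re

TITLE_YEAR = re.compile(
    r"""
    ^
    (?P<title>[^\n()]+)     # anything but parentheses or a newline
    (?:
        \(
        (?P<year>\d{4})     # four digits
        [^)]*\)             # any remaining text up to the closing parenthesis
    )?
    """,
    re.VERBOSE,
)


def title_and_year(text):
    """Return (title, year); year is None when no parenthesised year is present."""
    m = TITLE_YEAR.search(text)
    if m is None:
        return None
    return m.group("title").strip(), m.group("year")


for text in ("Any title", "Any title (2016)", "Any title (2016 Any text)"):
    print(title_and_year(text))
```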