Match movie filenames with optional parts with regex - regex

I have a film title in the following format
(Studio Name) - Film Title Part-1** - Animation** (2014).mp4
The part in BOLD is optional, meaning I can have a title such as this
(Studio Name) - Film Title Part-1 (2014).mp4
With this regex
^\((?P<studio>.+)\) - (?P<title>.+)(?P<genre>-.+)\((?P<year>\d{4})\)
I get the following results
studio = Studio Name
title = Film Title Part-1
genre = - Animation
year = 2014
I have tried the following to make the "- Animation" optional by changing the regex to
^\((?P<studio>.+)\) - (?P<title>.+)(?:(?P<genre>-.+)?)\((?P<year>\d{4})\)
but I end up with the following results
studio = Studio Name
title = Film Title Part-1 - Animation
genre =
year = 2014
I am using Python, the code that I am executing to process the regex is
pattern = re.compile(REGEX)
matched = pattern.search(film)

You can omit the non-capturing group around the genre, change the first .+ to a negated character class [^()]+ that matches any char except parentheses, and make the .+ in group title non-greedy so that the optional genre group can match.
For the genre, you could match .+, or make the match more specific if you only want to match a single word.
^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)
Regex demo
Explanation
^ Start of string
\((?P<studio>[^()]+)\) Named group studio match any char except parenthesis between ( and )
- Match literally
(?P<title>.+?) Named group title, match any char except a newline, as few times as possible
(?P<genre>- \w+ )? Optional named group genre, match - space, 1+ word chars and space
\((?P<year>\d{4})\) named group year, match 4 digits between ( and )
If you want to match the whole line:
^\((?P<studio>[^()]+)\) - (?P<title>.+?)(?P<genre>- \w+ )?\((?P<year>\d{4})\)\.mp4$
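As a minimal sketch of how the whole-line pattern behaves in Python, the snippet below runs it against both filenames from the question. The helper name parse_film is made up for illustration, and it also strips the trailing space that the lazy title group keeps.

```python
import re

# Whole-line pattern from the answer above; the lazy title lets the
# optional genre group take part in the match.
FILM_RE = re.compile(
    r"^\((?P<studio>[^()]+)\) - (?P<title>.+?)"
    r"(?P<genre>- \w+ )?\((?P<year>\d{4})\)\.mp4$"
)


def parse_film(name):
    """Return the named groups as a dict (trimmed), or None if nothing matches."""
    m = FILM_RE.search(name)
    if m is None:
        return None
    # An optional group that did not take part in the match is None.
    return {k: (v.strip() if v is not None else None)
            for k, v in m.groupdict().items()}


with_genre = parse_film("(Studio Name) - Film Title Part-1 - Animation (2014).mp4")
without_genre = parse_film("(Studio Name) - Film Title Part-1 (2014).mp4")
print(with_genre)
print(without_genre)
```

Without the strip() call, title would be "Film Title Part-1 " with a trailing space, because the space before the genre or the year is consumed by the title group.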

Related

Multiline Regex with opening and closing word

I have to admit, I'm a beginner when it comes to regular expressions.
I have an app written in C# that looks for certain Regex expressions in text files. I'm not sure how to explain my problem so I will go straight to example.
My text:
DeviceNr : 30
DeviceClass = ABC
UnitNr = 1
Reference = 29
PhysState = ENABLED
LogState = OPERATIVE
DevicePlan = 702
Manufacturer = CDE
Model = EFG
ready
DeviceNr : 31
DeviceClass = ABC
UnitNr = 9
Reference = 33
PhysState = ENABLED
LogState = OPERATIVE
Manufacturer = DDD
Model = XYZ
Description = something here
ready
I need to match a multiline block of text that starts with the word "DeviceNr", ends with "ready", and contains "DeviceClass = ABC" and "Model = XYZ". I can only assume that these lines appear in this exact order; I cannot assume anything about what is between them, not even the number of other lines. I tried the regex below, but it matched the whole text instead of only the DeviceNr : 31 block.
DeviceNr : ([0-9]+)(?:.*?\n)*? DeviceClass = ABC(?:.*?\n)*? Model = XYZ(?:.*?\n)*?ready\n\n
If you know that "DeviceClass = ABC" and "Model = XYZ" are present and in that order, you can use a negative lookahead on a per-line basis: after the DeviceNr line, first match all lines that do not start with DeviceClass =, then match the line that does.
Do the same for Model = and ready.
^\s*DeviceNr : ([0-9]+)(?:\r?\n(?!\s*DeviceClass =).*)*\r?\n\s*DeviceClass = ABC\b(?:\r?\n(?!\s*Model =).*)*\r?\n\s*Model = XYZ\b(?:\r?\n(?!\s*ready).*)*\r?\n\s*ready\b
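^ Start of a line (enable the multiline option, RegexOptions.Multiline in C#, so that ^ can match at the start of every block)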
\s*DeviceNr : ([0-9]+) Match DeviceNr : and capture 1+ digits 0-9 in group 1
(?: Non capture group
\r?\n(?!\s*DeviceClass =).* Match a newline, assert that the next line does not start with DeviceClass = (after optional whitespace), then match the rest of that line
)* Close non capture group and optionally repeat, as you don't know how many lines there are
\r?\n\s*DeviceClass = ABC\b Match a newline, optional whitespace chars and DeviceClass = ABC
(?:\r?\n(?!\s*Model =).*)*\r?\n\s*Model = XYZ\b The previous approach also for Model =
(?:\r?\n(?!\s*ready).*)*\r?\n\s*ready\b And the same approach for ready
Regex demo
Note that \s can also match a newline. If you want to prevent that, you can also use [^\S\r\n] to match a whitespace char without a newline.
Regex demo
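The question is about C#, but the same pattern can be tried out quickly in Python. This is only a sketch under the assumption that the input uses \n or \r\n line endings and that multiline mode is enabled so ^ can match at the start of the second block; the sample text is the one from the question.

```python
import re

SAMPLE = """\
DeviceNr : 30
DeviceClass = ABC
UnitNr = 1
Reference = 29
PhysState = ENABLED
LogState = OPERATIVE
DevicePlan = 702
Manufacturer = CDE
Model = EFG
ready
DeviceNr : 31
DeviceClass = ABC
UnitNr = 9
Reference = 33
PhysState = ENABLED
LogState = OPERATIVE
Manufacturer = DDD
Model = XYZ
Description = something here
ready
"""

# Same pattern as above, split over several raw strings for readability.
BLOCK_RE = re.compile(
    r"^\s*DeviceNr : ([0-9]+)"
    r"(?:\r?\n(?!\s*DeviceClass =).*)*\r?\n\s*DeviceClass = ABC\b"
    r"(?:\r?\n(?!\s*Model =).*)*\r?\n\s*Model = XYZ\b"
    r"(?:\r?\n(?!\s*ready).*)*\r?\n\s*ready\b",
    re.MULTILINE,  # let ^ match at the start of every line
)

match = BLOCK_RE.search(SAMPLE)
device_nr = match.group(1) if match else None
print(device_nr)
```

The attempt starting at DeviceNr : 30 fails because the line Model = EFG can neither be skipped (the lookahead blocks it) nor matched as Model = XYZ, so the first successful match is the DeviceNr : 31 block.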
The issue is that you want to match 'DeviceNr : 31' followed by 'DeviceClass = ABC' (possibly with some intervening characters), followed by 'Model = XYZ' (again possibly with some intervening characters), followed by 'ready' (again possibly with some intervening characters), while making sure that none of those intervening characters are actually the start of another 'DeviceNr' section.
So to match arbitrary intervening characters with the above enforcement, we can use the following regex expression that uses a negative lookahead assertion:
(?:(?!DeviceNr)[\s\S])*?
(?: - Start of a non-capturing group
(?!DeviceNr) - Asserts that the next characters of the input are not 'DeviceNr'
[\s\S] - Matches a whitespace or non-whitespace character, i.e. any character
) end of the non-capturing group
*? non-greedily match 0 or more characters as long as the next input does not match 'DeviceNr'
Then it's a simple matter to use the above regex repeatedly as follows:
DeviceNr : (\d+)\n(?:(?!DeviceNr)[\s\S])*?DeviceClass = ABC\n(?:(?!DeviceNr)[\s\S])*?Model = XYZ\n(?:(?!DeviceNr)[\s\S])*?ready
See Regex Demo
Capture Group 1 will have the DeviceNr value.
Important Note
The above regex is quite expensive in terms of the number of steps required for execution since it must check the negative lookahead assertion at just about every character position once it has matched DeviceNr : (\d+).
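A rough Python sketch of this second approach, building the full pattern from the repeated sub-pattern. The name GAP and the shortened sample text are illustrative assumptions, not part of the original answer.

```python
import re

# Shortened, hypothetical version of the sample text from the question.
SAMPLE = (
    "DeviceNr : 30\n"
    "DeviceClass = ABC\n"
    "UnitNr = 1\n"
    "Model = EFG\n"
    "ready\n"
    "DeviceNr : 31\n"
    "DeviceClass = ABC\n"
    "UnitNr = 9\n"
    "Model = XYZ\n"
    "Description = something here\n"
    "ready\n"
)

# Any characters, as few as possible, never stepping onto the start of "DeviceNr".
GAP = r"(?:(?!DeviceNr)[\s\S])*?"

PATTERN = re.compile(
    r"DeviceNr : (\d+)\n" + GAP
    + r"DeviceClass = ABC\n" + GAP
    + r"Model = XYZ\n" + GAP
    + r"ready"
)

match = PATTERN.search(SAMPLE)
device_nr = match.group(1) if match else None
print(device_nr)
```

Because the lookahead stops every gap at the next DeviceNr, the attempt that starts at DeviceNr : 30 cannot reach Model = XYZ in the following block and fails, leaving the DeviceNr : 31 block as the only match.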

RegEx - Return pattern to the right of a text string for URL

I'm looking to return the URL string to the right of a specific set of text using RegEx:
URL:
www.websitename/countrycode/websitename/contact/thank-you/whitepaper/countrycode/whitepapername.pdf
What I would like to just return:
/whitepapername.pdf
I've tried using ^\w+"countrycode"(\w.*) but the match won't recognize countrycode.
In Google Data Studio, I want to create a new field that removes the beginning of the URL using the REGEXP_REPLACE function.
Ideally using:
REGEXP_REPLACE(Page,......)
The REGEXP_REPLACE function below does the trick, capturing all (.*) the characters after the last countrycode, where Page represents the respective field:
REGEXP_REPLACE(Page, ".*(countrycode)(.*)$", "\\2")
Alternatively - Adapting the RegEx by The fourth bird to Google Data Studio:
REGEXP_REPLACE(Page, "^.*/countrycode(/[^/]+\\.\\w+)$", "\\1")
You could use a capturing group and replace with group 1. You could match /countrycode literally or use the pattern to match 2 times chars a-z with an underscore in between like /[a-z]{2}_[a-z]{2}
In the replacement use group 1 \\1
^.*/countrycode(/[^/]+\.\w+)$
Regex demo
Or using a country code pattern from the comments:
^.*/[a-z]{2}_[a-z]{2}(/[^/]+\.\w+)$
Regex demo
The second pattern in parts
^ Start of string
.*/ Match until the last occurrence of a forward slash
[a-z]{2}_[a-z]{2} Match the country code part, an underscore between 2 times 2 chars a-z
( Capture group 1
/[^/]+ Match a forward slash, then match 1+ occurrences of any char except / using a negated character class
\.\w+ Match a dot and 1+ word chars
) Close group
$ End of string
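Both patterns can be checked outside Data Studio with a small Python sketch. Python's re module is a different engine from the one Data Studio uses, so treat this only as a way to test the logic; the second URL with an en_us style code is a made-up example.

```python
import re

LITERAL = re.compile(r"^.*/countrycode(/[^/]+\.\w+)$")
BY_CODE = re.compile(r"^.*/[a-z]{2}_[a-z]{2}(/[^/]+\.\w+)$")

# URL from the question, with the literal text "countrycode".
url = ("www.websitename/countrycode/websitename/contact/thank-you/"
       "whitepaper/countrycode/whitepapername.pdf")
tail = LITERAL.sub(r"\1", url)

# Hypothetical URL that uses a real-looking country code instead.
coded_url = ("www.example.com/en_us/example/contact/thank-you/"
             "whitepaper/en_us/whitepapername.pdf")
coded_tail = BY_CODE.sub(r"\1", coded_url)

print(tail)
print(coded_tail)
```

The greedy .* backtracks to the last place where the country code is followed by a single path segment ending in an extension, so only the part after the final country code is kept. A string that does not match is returned unchanged by sub.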

Regex search of a BibTeX file to extract cite key and title in capture groups

The problem I have is that I would like to extract only the cite key and the title from a BibTeX library file using capture groups.
My Data file looks like this.
#article{Wang2017,
author = {Wang, Yunsen and Kogan, Alexander},
file = {:/2017/2017{_}Designing Privacy-Preserving Blockchain based Accounting Information Systems.pdf:pdf},
keywords = {accounting information systems,blockchain,continuous auditing},
title = {{Designing Privacy-Preserving Blockchain based Accounting Information Systems}},
year = {2017}
}
For the extraction of the cite key I used the following regex:
#\w+{([\w:-]+)
For the extraction of the title I used the following regex:
title = {{(.*?)}}
Both work. But I'm not able to combine both into one regex command so that cite key is capture group 1 and title is capture group 2
You can find the example file and the already used regex command using the following link.
https://regex101.com/r/v4cIe6/1
My expected result would be one command to extract cite key and title at once and have it in different capture groups.
If a negative lookahead is supported, you could repeat all the lines that do not start with title. If the line does, match it followed by space, =, space and {{ and capture the title in capturing group 2
#\w+{([\w:-]+).*(?:\r?\n(?!title\b).*)*\r?\ntitle = {{(.*?)}}
Explanation
#\w+{([\w:-]+) The pattern to match the cite key
.* Match the rest of the line (any char except a newline, 0+ times)
(?: Non capturing group
\r?\n(?!title\b).* Match a newline, assert that the next line does not start with title, then match the rest of that line
)* Close Non capturing group and repeat 0+ times
\r?\ntitle = Match a newline, then title =
{{(.*?)}} The pattern to match the title, capture in group 2 matching between {{ and }}
Regex demo
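A short Python sketch of the combined pattern, run against the entry from the question. The braces are escaped here only for clarity, and the snippet assumes the title line starts at the beginning of a line, as in the sample.

```python
import re

ENTRY = """\
#article{Wang2017,
author = {Wang, Yunsen and Kogan, Alexander},
file = {:/2017/2017{_}Designing Privacy-Preserving Blockchain based Accounting Information Systems.pdf:pdf},
keywords = {accounting information systems,blockchain,continuous auditing},
title = {{Designing Privacy-Preserving Blockchain based Accounting Information Systems}},
year = {2017}
}
"""

# Group 1: cite key, group 2: title.
PATTERN = re.compile(
    r"#\w+\{([\w:-]+).*"           # entry type, cite key, rest of the first line
    r"(?:\r?\n(?!title\b).*)*"     # skip lines that do not start with "title"
    r"\r?\ntitle = \{\{(.*?)\}\}"  # the title line
)

pairs = PATTERN.findall(ENTRY)
print(pairs)
```

With two capture groups, findall returns one (cite key, title) tuple per entry, which is convenient when the file holds many entries.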

Regex pattern in vbscript to match Text with multiple line

I have a long string containing serial numbers (Slno.). I want to split the string into sentences at each serial number.
Sample text:
1. Able to click new button and proceed to ONB-002 dialogue.
2. - Partner connection name **(text field empty)(MANDATORY)**
- GS1 company prefix **(text field empty)(MANDATORY)**
I tried using a VBScript regex to match a pattern, but it matches only the first line of the string (1. text), not the second one.
^\d+\.\s(-?).*[\r\n].[\r\n\*+]*.*|^\d+\.\s(-?).*[\r\n]
And while splitting the string, for Slno. 2 I want to get the line below it as well, which I am having difficulty with.
Please assist me.
Set regex = CreateObject("VBScript.RegExp")
With regex
.Pattern = "^\d+\.\s(-?).*[\r\n].[\r\n\*+]*.*|^\d+\.\s(-?).*[\r\n]"
.Global = True
End With
Set matches = regex.Execute(txt)
What I am looking for is a regex pattern that matches
1. Able to click new button and proceed to ONB-002 dialogue.
&
2. - Partner connection name **(text field empty)(MANDATORY)**
- GS1 company prefix **(text field empty)(MANDATORY)**
as separate sentence or group.
If I am not mistaken, to get the 2 separate parts including the line after you could use:
^\d+\..*(?:\r?\n(?!\d+\.).*)*
Explanation
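^ Start of a line (set .MultiLine = True on the RegExp object so that ^ matches at each line, not only at the start of the string)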
\d+\. Match 1+ digits followed by a dot
.* Match any character except a newline 0+ times
(?: Non capturing group
\r?\n(?!\d+\.).* Match a newline and use a negative lookahead to assert what is on the right is not 1+ digits followed by a dot
)* Close non capturing group and repeat 0+ times
Regex demo
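The question uses VBScript, but the pattern itself can be checked with a small Python sketch. It assumes multiline mode, so that ^ matches at the start of every numbered item.

```python
import re

TEXT = (
    "1. Able to click new button and proceed to ONB-002 dialogue.\n"
    "2. - Partner connection name **(text field empty)(MANDATORY)**\n"
    "- GS1 company prefix **(text field empty)(MANDATORY)**"
)

# A numbered line, plus every following line that does not start a new number.
ITEM_RE = re.compile(r"^\d+\..*(?:\r?\n(?!\d+\.).*)*", re.MULTILINE)

items = ITEM_RE.findall(TEXT)
for item in items:
    print(repr(item))
```

The pattern has no capture groups, so findall returns the full text of each item; the second item keeps its continuation line.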

Extract different variations of hyphenated personal names with regex

I need to extract names after titles but I need to include hyphenated names too which can come in different variations.
The script below fails to pick up hyphenated names.
text = 'This is the text where Lord Lee-How and Sir Alex Smith are mentioned.\
Dame Ane Paul-Law is mentioned too. And just Lady Ball.'
names = re.compile(r'(Lord|Baroness|Lady|Baron|Dame|Sir) ([A-Z][a-z]+)[ ]?([A-Z][a-z]+)?')
names_with_titles = list(set(names.findall(text)))
print(names_with_titles)
The current output is:
[('Lord', 'Lee', ''), ('Sir', 'Alex', 'Smith'), ('Dame', 'Ane', 'Paul'), ('Lady', 'Ball', '')]
The desired output should be:
[('Lord', 'Lee-How', ''), ('Sir', 'Alex', 'Smith'), ('Dame', 'Ane', 'Paul-Law'), ('Lady', 'Ball', '')]
I managed to extract hyphenated names with this pattern -
hyph_names = re.compile(r'(Lord|Baroness|Lady|Baron|Dame|Sir) ([A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+)')
But I cannot figure out how to combine the two. Will appreciate your help!
You may add a (?:-[A-Z][a-z]+)? optional group to the name part patterns:
(Lord|Baroness|Lady|Baron|Dame|Sir)\s+([A-Z][a-z]+(?:-[A-Z][a-z]+)?)(?:\s+([A-Z][a-z]+(?:-[A-Z][a-z]+)?))?
See the regex demo
Details
(Lord|Baroness|Lady|Baron|Dame|Sir) - capturing group #1: one of the titles
\s+ - one or more whitespace chars
([A-Z][a-z]+(?:-[A-Z][a-z]+)?) - capturing group #2:
[A-Z][a-z]+ - an uppercase letter followed by 1+ lowercase ones
(?:-[A-Z][a-z]+)? - an optional non-capturing group matching a hyphen and then an uppercase letter followed by 1+ lowercase ones
(?:\s+([A-Z][a-z]+(?:-[A-Z][a-z]+)?))? - an optional non-capturing group:
\s+ - 1+ whitespace chars
([A-Z][a-z]+(?:-[A-Z][a-z]+)?) - capturing group #3 with the same pattern as in group #2.
You may build it in Python 3.7 like
title = r'(Lord|Baroness|Lady|Baron|Dame|Sir)'
name = r'([A-Z][a-z]+(?:-[A-Z][a-z]+)?)'
rx = rf'{title}\s+{name}(?:\s+{name})?'
In older versions,
rx = r'{0}\s+{1}(?:\s+{1})?'.format(title, name)
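Putting it together, here is a sketch that builds the pattern with str.format and runs it on the text from the question. The text is written with implicit string concatenation here, and findall is used directly, without the set() from the question, so the order of the results is stable.

```python
import re

title = r'(Lord|Baroness|Lady|Baron|Dame|Sir)'
name = r'([A-Z][a-z]+(?:-[A-Z][a-z]+)?)'
# title, first name part, optional second name part
rx = re.compile(r'{0}\s+{1}(?:\s+{1})?'.format(title, name))

text = ('This is the text where Lord Lee-How and Sir Alex Smith are mentioned. '
        'Dame Ane Paul-Law is mentioned too. And just Lady Ball.')

names_with_titles = rx.findall(text)
print(names_with_titles)
```

findall returns one tuple of three groups per match, and a group that did not take part in the match shows up as an empty string.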