Capturing string parts in RegEx - regex

I would like to map different parts of a string, some of them are optionally presented, some of them are always there. I'm using the Calibre's built in function (based on Python regex), but it is a general question: how can I do it in regex?
Sample strings:
!!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf
The strings' structure is the following:
[importance markings if any, it can be '!' or '!!'][title][ISBN-10 if available]by[author]([publication date and other metadata]).[file type]
Finally I created this regular expression, but it is not perfect, because if ISBN presented the title will contain the ISBN part too...
(?P<title>[A-Za-z0-9].+(?P<isbn>[0-9]{10})|([A-Za-z0-9].*))\sby\s.*?(?P<author>[A-Z0-9].*)(?=\s\()
Here is my sandbox: https://regex101.com/r/K2FzpH/1
I really appreciate any help!

Instead of using an alteration, you could use:
^!*(?P<title>[A-Za-z0-9].+?)(?:\s+(?P<isbn>[0-9]{10}))?\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()
^ Start of the string
!* Match optional exclamation marks
(?P<title>[A-Za-z0-9].+?) Named group title, match of the ranges in the character class followed by matching as least as possible chars
(?:\s+(?P<isbn>[0-9]{10}))? Optionally match 1+ whitespace chars and named group isbn which matches 10 digits
\s+by\s+ Match by between 1 or more whitspace chars
(?P<author>[A-Z0-9][^(]+) Named group author Match either A-Z or 0-9 followed by 1+ times any char except (
(?=\s\() Positive lookahead to assert ( directly to the right
Regex demo

Related

Using regex to match groups that may not contain given text patterns

I'd like to use regex to extract birth dates and places, as well as (when they're defined) death dates and places, from a collection of encyclopedia entries. Here are some examples of such entries which illustrate the patterns that I'm trying to codify:
William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, England—died April 23, 1850, Rydal Mount, Westmorland), English poet...
Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane Morris-Goodall, (born April 3, 1934, London, England), British ethologist...
Kenneth Wartinbee Spence, (born May 6, 1907, Chicago, Illinois, U.S.—died January 12, 1967, Austin, Texas), American psychologist...
I was hoping that the following regex pattern would identify the desired capture groups:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(?:—died )?(\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
But sadly, it does not. (Use https://regexr.com/ to view the results.)
Note: When I restructure the pattern around the phrase —died in the following way, the pattern does produce the expected results for entries like the first and third above (those with given death dates/places), but it obviously does not work in all cases.
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
What am I missing?
In general, you can mark the whole --died section as optional:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)
https://regexr.com/765fc
You can use the negative lookahead assertion (?!) to match groups that do not contain a given text pattern. The syntax for this is (?!pattern) where "pattern" is the text you want to exclude from the match. For example, if you want to match all groups of characters that do not contain the letter "a", you would use the regex "(?!a).*" This will match any group of characters that does not have an "a" in it.

Regex deselect anything in the line before match

I need to deselect anything in the line before a certain character. My example looks like this.
28.01.2023 11:00
GERMANY
OWVA2
07.01.2023 06:00
JAPAN
940-750
Laden
28.01.2023 12:00
FRANCE
ANDY
-
07.01.2023 09:30
NORWAY and SWEDEN
-
07.01.2023 09:30
SPAIN
I already have a Regex that selects anything in the line that follows a date (dd.MM.yyyy followed by HH:mm). Now I want to select what follows the date ONLY if the match isn't followed by - or NUMBER-NUMBER in the NEXT line. So basically deselect anything in the line before - or NUMBER-NUMBER. So in this case I want to select Germany, France and Spain. I don't want to select Japan and NORWAY and SWEDEN because they are followed by - or NUMBER-NUMBER.
My regex looks like this:
(?<=\d+.\d+.\d+ \d+:\d+[\r\n]+)([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})+(?![\r\n](\d+)?-(\d+)?)
Sorry for this weird example but that is the best I have. Thanks in advance
Currently your negative lookahead only prevents matching the last character of the line, because if it's not matched then the match isn't followed by a linefeed and a number dash number anymore.
You can fix that by matching the lookahead before the content you want to capture :
(?<=\d+.\d+.\d+ \d+:\d+[\r\n]+)(?!.*[\r\n](\d+)?-(\d+)?)([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})+
You can test it here.

Matching location with regular expressions | Python

I scraped several articles from a websites. Now I am trying to extract the location of the news. The location are written either capitalized with just the capital of the country (e.g. "BRUSSELS-") or in some cases along with the country (e.g. "BRUSELLS, Belgium-")
This is a sample of the articles:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...]
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Monday and wounded]
The regular expression I used is this one:
text_open = open("Training_News_6.csv")
text_read = text_open.read()
pattern = ("[A-Z]{1,}\w+\s\—")
result = re.findall(pattern,text_read)
print(result)
The reason why I used the score sign (-) is because is a recurrent pattern that links to the location.
However, this regular expression manage to extract "BRUSSELS -" but when it comes to "KABUL, Afghanistan -" it only extract the last part, namely "Afghanistan -".
In the second case I would like to extract the whole location: the capital and the country. Any idea?
You may use
([A-Z]+(?:\W+\w+)?)\s*—
See the regex demo
Details:
([A-Z]+(?:\W+\w+)?) - Capture Group 1 (the contents of which will be returned as the result of re.findall) capturing
[A-Z]+ - 1 or more ASCII uppercase letters
(?:\W+\w+)? - 1 or 0 occurrences (due to ? quantifier) of 1+ non-word chars (\W+) and 1+ word chars (\w+)
\s* - 0+ whitespaces
— - a — symbol
Python demo:
import re
rx = r"([A-Z]+(?:\W+\w+)?)\s*—"
s = "|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016 \n, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] \n[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016 \n, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo"
print(re.findall(rx, s)) # => ['BRUSSELS', 'KABUL, Afghanistan']
One thing you could do is adding , and \s to your first chars selection, and then stripping all whitespaces and commas from the left.,[A-Z,\s]{1,}\w+\s\—
Or even something simpler, like this:,(.+)\—. $1 would be your match, containing extra symbols. Another option which might work: ,\s*([A-Za-z]*[,\s]*[A-Za-z]*)\s\— or simplified versions: ,\s*([A-Za-z,\s]*)\s\—. Once again $1 is your match.

Regex to find capitalized words and code string?

I'm looking to change a simple dash to an em dash in some obituaries we receive. But it's only after the City of Death where that em dash should go.
The text looks like this:
#M_DeathNoticeHed:Alex <\n>Ornelas
#M_DeathNoticeBod:ALAMO <\!-> Alex Ornelas <\n>, 25, died Tuesday, Aug. <\n>16, 2016 at Alamo. Me<\h>morial Funeral Home of <\n>San Juan is in charge of ar<\h>rangements.
#M_DeathNoticeHed:Almaquire Cadena
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Almaquire <\n>Cadena , 87, died Tues<\h>day, Aug. 16, 2016 at Pax <\n>Villa Hospice, in McAllen, <\n>TX. Sanchez Funeral Home <\n> of Rio Grande City is in <\n>charge of arrangements.
#M_DeathNoticeHed:AnaRose <\n>Collazi
#M_DeathNoticeBod:MISSION <\!-> AnaRose <\n>Collazo , 44, died Wednes<\h>day, Aug. 17, 2016 at Mis<\h>sion Regional Medical Cen<\h>ter in Mission. Virgil Wilson <\n>Mortuary of Mission is in <\n>charge of arrangements.
#M_DeathNoticeHed:Andy Garza
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday, <\n>Aug. 16, 2016 at Chicago, <\n>IL. Rodriguez Funeral <\n>Home of Roma is in <\n>charge of arrangements.
Notice that after every "#M_DeathNoticeBod: CITY" is "<\!->" which symbolizes the dash that I need changed to an em dash.
My regex code is not getting the "<\!->" selected along with the preceding city and "#M_DeathNoticeHed:".
#M_DeathNoticeBod:([^A-Za-z]*?[A-Z][A-Za-z]*)([^A-Za-z]*?[A-Z][A-Za-z]*) [<\!->]
It is also not selecting cities with 3 names in it like "RIO GRANDE CITY". I'm selecting this because the dash appears in other spots in the file that I do not want replaced.
If I can select that section I can replace the dash here.
If the lines you care about always start with "#M_DeathNoticeBod:" followed by the city of death, followed by the <!-> you wish to replace, I think something simple would do the job:
(#M_DeathNoticeBod:.*)<\\!->
Capture group 1 will contain everything up until that first "<\!->", so you if you're doing a search and replace you can just replace each occurrence of that regex with the contents of group 1 (usually denoted by '\1') followed by an em dash.
This regex should do:
#M_DeathNoticeBod:([A-Z ]*) (<\\!->)
I think this is what you are actually looking for:
(?<=#M_DeathNoticeBod:).+<\\!->
To explain things, the first portion, (?<=#M_DeathNoticeBod:) within the parenthesis is the positive lookbehind that doesn't participate in matching but ensures the trailing portion shall always be preceded by that expression.
I the trailing portion .+ should capture any city name containing any character sequence followed by your <!-> delimiter, which is captured by <\\!-> regex.

Regex with negative lookahead across multiple lines

For the past few hours I've been trying to match address(es) from the following sample data and I can't get it to work:
medicalHistory None
address 24 Lewin Street, KUBURA,
NSW, Australia
email MaryBeor#spambob.com
address 16 Yarra Street,
LAWRENCE, VIC, Australia
name Mary Beor
medicalHistory None
phone 00000000000000000000353336907
birthday 26-11-1972
My plan was to find anything that starts with "address", is followed by any space followed by characters, numbers commas and newlines and ends with newline followed by a character. I came up with the following (and many variations of it):
address\s+([0-9a-zA-Z, \n\t]+)(?!\n\w)
Unfortunately that matches the following:
address 24 Lewin Street, KUBURA,
NSW, Australia
email MaryBeor
and
address 16 Yarra Street,
LAWRENCE, VIC, Australia
name Mary Beor
medicalHistory None
phone 00000000000000000000353336907
birthday 26
instead of
address 24 Lewin Street, KUBURA,
NSW, Australia
and
address 16 Yarra Street,
LAWRENCE, VIC, Australia
Can you please tell me what I'm doing wrong?
I would do it this way:
address\s+((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+
See it here on Regexr.
This ((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+ is the important part, where I say, match the next character from [0-9a-zA-Z, \r\n\t], if (?![\r\n]+\w) is not following. This is matching what you expect.
In both your cases the regex stopped matching because of a character that is not included in your character class. If you want to go that way than you would need to combine a lazy quantifier and a positive lookahead:
address\s+([0-9a-zA-Z, \n\r\t]+?)(?=\r\w)
[0-9a-zA-Z, \n\r\t]+? is matching as less as possible till the condition (?=\r\w) is true.
See it here at Regexr
The problem with your regex is that + is greedy and goes until it finds a character out of that group, the # in the first case and - in the second.
Another approach is to use a non-greedy quantifier and a positive look-ahead for a newline followed by a word-character, like (python version):
re.findall(r'address\s+.*?(?=\n\w)', s, re.DOTALL)
It yields:
['address 24 Lewin Street, KUBURA, \n NSW, Australia',
'address 16 Yarra Street, \n LAWRENCE, VIC, Australia']