Matching location with regular expressions | Python - regex

I scraped several articles from a website. Now I am trying to extract the location of each news item. The locations are written either in capitals with just the capital of the country (e.g. "BRUSSELS-") or, in some cases, along with the country (e.g. "BRUSSELS, Belgium-").
This is a sample of the articles:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...]
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Monday and wounded]
The regular expression I used is this one:
text_open = open("Training_News_6.csv")
text_read = text_open.read()
pattern = ("[A-Z]{1,}\w+\s\—")
result = re.findall(pattern,text_read)
print(result)
I used the dash sign (-) because it is a recurrent pattern that directly follows the location.
However, this regular expression manages to extract "BRUSSELS -", but when it comes to "KABUL, Afghanistan -" it only extracts the last part, namely "Afghanistan -".
In the second case I would like to extract the whole location: the capital and the country. Any idea?

You may use
([A-Z]+(?:\W+\w+)?)\s*—
See the regex demo
Details:
([A-Z]+(?:\W+\w+)?) - Capture Group 1 (the contents of which will be returned as the result of re.findall) capturing
[A-Z]+ - 1 or more ASCII uppercase letters
(?:\W+\w+)? - 1 or 0 occurrences (due to ? quantifier) of 1+ non-word chars (\W+) and 1+ word chars (\w+)
\s* - 0+ whitespaces
— - a — symbol
Python demo:
import re
rx = r"([A-Z]+(?:\W+\w+)?)\s*—"
s = "|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016 \n, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] \n[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016 \n, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo"
print(re.findall(rx, s)) # => ['BRUSSELS', 'KABUL, Afghanistan']

One thing you could do is add , and \s to your first character class, and then strip all whitespace and commas from the left:
[A-Z,\s]{1,}\w+\s\—
Or even something simpler, like this:
(.+)\—
$1 would be your match, containing extra symbols. Another option which might work:
,\s*([A-Za-z]*[,\s]*[A-Za-z]*)\s\—
or a simplified version:
,\s*([A-Za-z,\s]*)\s\—
Once again, $1 is your match.

Related

Using regex to match groups that may not contain given text patterns

I'd like to use regex to extract birth dates and places, as well as (when they're defined) death dates and places, from a collection of encyclopedia entries. Here are some examples of such entries which illustrate the patterns that I'm trying to codify:
William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, England—died April 23, 1850, Rydal Mount, Westmorland), English poet...
Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane Morris-Goodall, (born April 3, 1934, London, England), British ethologist...
Kenneth Wartinbee Spence, (born May 6, 1907, Chicago, Illinois, U.S.—died January 12, 1967, Austin, Texas), American psychologist...
I was hoping that the following regex pattern would identify the desired capture groups:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(?:—died )?(\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
But sadly, it does not. (Use https://regexr.com/ to view the results.)
Note: When I restructure the pattern around the phrase —died in the following way, the pattern does produce the expected results for entries like the first and third above (those with given death dates/places), but it obviously does not work in all cases.
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
What am I missing?
In general, you can mark the whole —died section as optional:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)
https://regexr.com/765fc
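As a rough check of the optional-group pattern above, here is a minimal Python sketch run against two of the sample entries from the question. The helper name `extract` and the dictionary keys are made up for illustration. Because the whole —died section is itself a capture group, the death date and place end up in groups 4 and 5 rather than 3 and 4.

```python
import re

# Pattern from the answer above: the entire "—died ..." part is one optional group.
pattern = re.compile(
    r"\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)"
)

entries = [
    "William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, England"
    "—died April 23, 1850, Rydal Mount, Westmorland), English poet...",
    "Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane "
    "Morris-Goodall, (born April 3, 1934, London, England), British ethologist...",
]


def extract(entry):
    """Hypothetical helper: return birth/death fields, or None if nothing matches."""
    m = pattern.search(entry)
    if m is None:
        return None
    # Group 3 is the whole optional "—died ..." section; it is not needed here.
    birth_date, birth_place, _died_section, death_date, death_place = m.groups()
    return {
        "birth_date": birth_date,
        "birth_place": birth_place,
        "death_date": death_date,    # None when there is no "—died" section
        "death_place": death_place,  # None when there is no "—died" section
    }


results = [extract(e) for e in entries]
for r in results:
    print(r)
```

For entries without a death section, the optional groups simply come back as None, so the caller can tell the two cases apart.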
You can use a negative lookahead assertion to match text that does not contain a given pattern. The syntax is (?!pattern), where "pattern" is the text you want to exclude; the assertion only checks the text immediately to the right of the current position and consumes nothing. For example, to match whole strings that do not contain the letter "a", anchor the lookahead and let it scan ahead: ^(?!.*a).*$ (a bare (?!a).* is not enough, because it only requires that the match does not begin with "a").
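To illustrate how a negative lookahead behaves in Python's re module, the sketch below compares an anchored lookahead that scans the whole string with a bare (?!a) that only inspects the next character. The word list is invented for the example.

```python
import re

words = ["banana", "cherry", "kiwi", "apple"]

# Anchored lookahead: from the start of the string, assert that no "a"
# appears anywhere ahead, then match the whole string.
no_a = re.compile(r"^(?!.*a).*$")
without_a = [w for w in words if no_a.match(w)]

# Bare lookahead: only checks that the match does not START with "a",
# so strings containing "a" still produce a match.
bare = re.compile(r"(?!a).*")
bare_results = [bare.search(w).group(0) for w in words]

print(without_a)
print(bare_results)
```

The second list shows that the bare form still matches "banana" in full and merely skips the leading "a" of "apple".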

Capturing string parts in RegEx

I would like to map different parts of a string; some of them are optional, some of them are always present. I'm using Calibre's built-in function (based on Python regex), but it is a general question: how can I do it in regex?
Sample strings:
!!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf
The strings' structure is the following:
[importance markings if any, it can be '!' or '!!'][title][ISBN-10 if available]by[author]([publication date and other metadata]).[file type]
In the end I created this regular expression, but it is not perfect: if an ISBN is present, the title will contain the ISBN part too...
(?P<title>[A-Za-z0-9].+(?P<isbn>[0-9]{10})|([A-Za-z0-9].*))\sby\s.*?(?P<author>[A-Z0-9].*)(?=\s\()
Here is my sandbox: https://regex101.com/r/K2FzpH/1
I really appreciate any help!
Instead of using an alternation, you could use:
^!*(?P<title>[A-Za-z0-9].+?)(?:\s+(?P<isbn>[0-9]{10}))?\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()
^ Start of the string
!* Match optional exclamation marks
(?P<title>[A-Za-z0-9].+?) Named group title: match one character from the ranges in the character class, followed by as few characters as possible
(?:\s+(?P<isbn>[0-9]{10}))? Optionally match 1+ whitespace chars and named group isbn, which matches 10 digits
\s+by\s+ Match by between 1 or more whitespace chars
(?P<author>[A-Z0-9][^(]+) Named group author: match either A-Z or 0-9, followed by 1+ times any char except (
(?=\s\() Positive lookahead to assert a whitespace char followed by ( directly to the right
Regex demo
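A small Python sketch of the named-group pattern above, applied to three of the sample file names from the question. Calibre's own template handling is not shown; this only exercises the regex with Python's re module.

```python
import re

pattern = re.compile(
    r"^!*(?P<title>[A-Za-z0-9].+?)"
    r"(?:\s+(?P<isbn>[0-9]{10}))?"
    r"\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()"
)

names = [
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf",
]

# groupdict() returns the named groups; isbn is None when it is absent.
parsed = [pattern.match(name).groupdict() for name in names]
for fields in parsed:
    print(fields)
```

Because the title group is lazy and the ISBN group is tried before the "by" separator, the ISBN no longer leaks into the title.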

Excluding text at the beginning of a string

I'm new to using RegEx and I'm still stumbling around a bit, so I'm sorry if this is a basic question. I'm trying to extract a string from between two parentheses and I can't seem to figure out how to exclude the first part from my match.
This is my regex pattern:
(.+?)(?= -)
I want to extract a birth date, for example, excluding the "b." and the trailing "-". Here's a sample set:
( b. circa 1883 - d. Mar 03, 1960 )
( b. May 21, 1887 - d. Jan 24, 1979 )
( b. May 28, 1902 Zembin, BELARUS - d. Dec 22, 1998 Florida, USA )
( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA - d. May 17, 1969 New York, New York, USA )
My regex matches ( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA (for example) but also includes "( b. " prefix, which I want to exclude.
The regex also matches the following text, which I would like to exclude as well:
Husband of Sarah Wilder (August 2000
Also, I cannot get the following string to match, presumably because of the dot and space in St. Louis.
( b. Jun 28, 1920 St. Louis, Missouri, USA )
I've been banging my head for several hours and just can't quite get the rest of it. Any help or guidance would be very much appreciated. I've already gotten a lot of help from reading many of the posts here.
Thanks so much!
Assuming that your data always contains a hyphen followed by d., you can try this: (?<=b\. )(.*) - d\.
(?<=b\. ) matches the b. text without it being added to the matching text.
(.*) is a capturing group that contains the match. It captures everything until the terminating - d. is hit. Note that the . characters must be escaped to match correctly as they are regex special characters.
If it always starts with ( b. and ends with - d. <something> ), you can simply do
(?<=^\( b\. ).*(?= - d\..*\))
This matches any characters (.*) that have <start of line>( b. in front of them ((?<=^\( b\. )) and - d. <something>) behind them ((?= - d\..*\))). https://regex101.com/r/vB2fmP/1
Or, if you don't mind using a capture group:
^\( b\. (.*) - d\..*\)$
^ start of line
\( b\. open parenthesis, space, b, dot, space
( ) capture group
.* any char, any number of occurrences
 - d\..*\) space, hyphen, space, d, dot, then any char any number of occurrences, then close parenthesis
$ end of line
Capture group 1 is the value you need (personally I prefer this one).
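A minimal Python sketch of the capture-group approach, written with d\. (no space between "d" and the dot) so that it lines up with the "- d." text in the sample data. Lines without a death part, such as the St. Louis one, are not matched by this variant.

```python
import re

# Anchored pattern: "( b. " ... " - d." ... ")" with the birth part in group 1.
pattern = re.compile(r"^\( b\. (.*) - d\..*\)$")

lines = [
    "( b. circa 1883 - d. Mar 03, 1960 )",
    "( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA - d. May 17, 1969 New York, New York, USA )",
    "Husband of Sarah Wilder (August 2000",
    "( b. Jun 28, 1920 St. Louis, Missouri, USA )",
]

births = []
for line in lines:
    m = pattern.match(line)
    if m:  # skip lines that do not follow the "( b. ... - d. ... )" shape
        births.append(m.group(1))

print(births)
```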
To prevent capturing the leading ( b. you could prefix your regex with \(\s*b\.\s* which will match the ( and the b. surrounded by zero or more whitespace characters \s*.
Then from that point you would capture your values in a group (.*?) and you could update your positive lookahead (?= (?:\-|\))) to include a whitespace with either a - or a ).
\(\s*b\.\s*(.*?)(?= (?:\-|\)))
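The lookahead variant above can be tried out in Python roughly as follows. Unlike the patterns that require "- d.", it should also handle entries that have only a birth part, since the lookahead accepts either " -" or " )".

```python
import re

# Skip "( b. ", capture lazily, and stop before " -" or " )".
pattern = re.compile(r"\(\s*b\.\s*(.*?)(?= (?:\-|\)))")

lines = [
    "( b. May 21, 1887 - d. Jan 24, 1979 )",
    "( b. May 28, 1902 Zembin, BELARUS - d. Dec 22, 1998 Florida, USA )",
    "( b. Jun 28, 1920 St. Louis, Missouri, USA )",
    "Husband of Sarah Wilder (August 2000",
]

# None marks lines where the pattern does not match at all.
births = []
for line in lines:
    m = pattern.search(line)
    births.append(m.group(1) if m else None)

print(births)
```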
You can do this by making two passes through the search string. On the first pass you capture all text inside brackets, and on the second you clean up your results by removing the unwanted expressions. You don't say what language you are using, so I will use PHP.
$want = "/\(.+?\)/";
$dontWant = "/(b\.|-)/"; // escape the dot with a backslash; the hyphen needs no escaping
$desiredResult = array();
$result = preg_match_all($want, $searchText, $matches); // Get all text inside brackets
if (count($matches[0]) > 0) { // $matches[0] holds all the matches
    foreach ($matches[0] as $match) { // Loop through the matches
        $desiredResult[] = preg_replace($dontWant, "", $match); // Remove unwanted text
    }
}
You can adjust this to whatever language you are using.

Trouble with extracting times from schedule using regex

I am programming a function that will extract the different times from a schedule using regular expressions in Python. Here is an example of the schedule that I got from a website using BeautifulSoup:
Interactive talk with discussion17:00-18:00 Documentary ‘Occupy Gezi’
We present to you Taksim Gezi park boycott with all ways; day and
night, with good sides and bad sides18.00 - 19:00 Poet Maria van
Daalen ‘Haitian Vodoo’, poet from Querido publishers19:00
Food20:30-22:30
As shown above, the input text has starting times with and without ending times. There is also inconsistency with using either “:” or “.” when separating the hours from the minutes.
Using regex101, I have made the following (very ugly) regular expression that seems to work on all different times: \d\d[:|.]\d\d(\s*.\s*\d\d[:|.]\d\d)?
To search the text on Python I use the following code:
import re

def extract_times(string):
    list_of_times = re.findall(r'\d\d[:|.]\d\d(\s*.\s*\d\d[:|.]\d\d)?', string)
    return list_of_times
However, when I put the example text from above in this function, it returns this:
['-18:00', ' - 19:00', '', '-22:30']
I expected something like ['17:00-18:00'], ['19:00'].
What have I done wrong?
Use this one: \d{1,2}[:.]([\d\s-]+[:.])?\d{2}
Explanation
\d{1,2} one or two digits, to match 1:00 and 01:00
[:.] to match 18:00 and 18.00
([\d\s-]+[:.])? optional group: one or more digits, whitespace chars or dashes, followed by : or . (this covers the end of the start time and the hour of the end time)
\d{2} 2 digits
In your sample text, the following will match (use full match) :
Match 1 17:00-18:00
Match 2 18.00 - 19:00
Match 3 19:00
Match 4 20:30-22:30
Demo
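In Python, re.findall returns the contents of the capture group rather than the full match whenever the pattern contains exactly one group, which is why the question's code printed only the end times. A minimal sketch, assuming the pattern from this answer with its group made non-capturing via (?:...):

```python
import re

# Same pattern as in the answer, but with a non-capturing group so that
# re.findall returns the full match instead of the group's contents.
TIME_PATTERN = re.compile(r"\d{1,2}[:.](?:[\d\s-]+[:.])?\d{2}")


def extract_times(text):
    """Return every start time or start-end range found in the text."""
    return TIME_PATTERN.findall(text)


schedule = (
    "Interactive talk with discussion17:00-18:00 Documentary ‘Occupy Gezi’\n"
    "We present to you Taksim Gezi park boycott with all ways; day and\n"
    "night, with good sides and bad sides18.00 - 19:00 Poet Maria van\n"
    "Daalen ‘Haitian Vodoo’, poet from Querido publishers19:00\n"
    "Food20:30-22:30"
)

times = extract_times(schedule)
print(times)
```

An alternative with the same effect is to keep the capturing group and collect m.group(0) from re.finditer.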

Regex to find capitalized words and code string?

I'm looking to change a simple dash to an em dash in some obituaries we receive. But it's only after the City of Death where that em dash should go.
The text looks like this:
#M_DeathNoticeHed:Alex <\n>Ornelas
#M_DeathNoticeBod:ALAMO <\!-> Alex Ornelas <\n>, 25, died Tuesday, Aug. <\n>16, 2016 at Alamo. Me<\h>morial Funeral Home of <\n>San Juan is in charge of ar<\h>rangements.
#M_DeathNoticeHed:Almaquire Cadena
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Almaquire <\n>Cadena , 87, died Tues<\h>day, Aug. 16, 2016 at Pax <\n>Villa Hospice, in McAllen, <\n>TX. Sanchez Funeral Home <\n> of Rio Grande City is in <\n>charge of arrangements.
#M_DeathNoticeHed:AnaRose <\n>Collazi
#M_DeathNoticeBod:MISSION <\!-> AnaRose <\n>Collazo , 44, died Wednes<\h>day, Aug. 17, 2016 at Mis<\h>sion Regional Medical Cen<\h>ter in Mission. Virgil Wilson <\n>Mortuary of Mission is in <\n>charge of arrangements.
#M_DeathNoticeHed:Andy Garza
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday, <\n>Aug. 16, 2016 at Chicago, <\n>IL. Rodriguez Funeral <\n>Home of Roma is in <\n>charge of arrangements.
Notice that after every "#M_DeathNoticeBod: CITY" is "<\!->" which symbolizes the dash that I need changed to an em dash.
My regex code is not getting the "<\!->" selected along with the preceding city and "#M_DeathNoticeHed:".
#M_DeathNoticeBod:([^A-Za-z]*?[A-Z][A-Za-z]*)([^A-Za-z]*?[A-Z][A-Za-z]*) [<\!->]
It is also not selecting cities with 3 names in it like "RIO GRANDE CITY". I'm selecting this because the dash appears in other spots in the file that I do not want replaced.
If I can select that section I can replace the dash here.
If the lines you care about always start with "#M_DeathNoticeBod:" followed by the city of death, followed by the "<\!->" you wish to replace, I think something simple would do the job:
(#M_DeathNoticeBod:.*)<\\!->
Capture group 1 will contain everything up to the "<\!->" (because .* is greedy, that is the last "<\!->" on the line; use .*? if a line can contain more than one), so if you're doing a search and replace you can just replace each occurrence of that regex with the contents of group 1 (usually denoted by '\1') followed by an em dash.
This regex should do:
#M_DeathNoticeBod:([A-Z ]*) (<\\!->)
I think this is what you are actually looking for:
(?<=#M_DeathNoticeBod:).+<\\!->
To explain things, the first portion within the parentheses, (?<=#M_DeathNoticeBod:), is a positive lookbehind that doesn't participate in matching but ensures the trailing portion is always preceded by that expression.
The trailing portion, .+, captures any city name containing any character sequence, followed by your "<\!->" delimiter, which is matched by the <\\!-> regex.
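For a search-and-replace done in Python, the idea from these answers might look like the sketch below. It assumes the "<\!->" code should be replaced by a literal em dash character (U+2014) and that city names consist only of capital letters and spaces. The function name is made up for illustration.

```python
import re

EM_DASH = "\u2014"

# Group 1: the tag plus an all-caps city name (one or more words).
# The "<\!->" code is matched only when it directly follows that city.
pattern = re.compile(r"(#M_DeathNoticeBod:[A-Z ]+) <\\!->")


def replace_city_dash(text):
    """Hypothetical helper: swap the dash code after the city for an em dash."""
    return pattern.sub(lambda m: m.group(1) + " " + EM_DASH, text)


sample = (
    r"#M_DeathNoticeHed:Andy Garza" "\n"
    r"#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday" "\n"
    r"Some other line with a dash code <\!-> that must stay as it is"
)

fixed = replace_city_dash(sample)
print(fixed)
```

Dash codes elsewhere in the file are left alone because the pattern only matches right after "#M_DeathNoticeBod:" and the city name.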